Goal-based local AI planner Beta

Can your machine run local AI?

Pick a goal, choose your hardware, and get practical model recommendations for VRAM, context, and speed.

Local AI VRAM Calculator & GPU Planner uses planning estimates, not guarantees. Real performance depends on runtime, drivers, cooling, and background load.

Your Setup

Hardware type

GPU

GPU specs come from a curated TechPowerUp GPU Database snapshot. Apple Silicon presets use unified memory and Apple-published bandwidth.

Primary goal

Favors code-capable models, longer context, and usable reply speed for agent loops.

Hardware Configuration

GPU VRAM

Number of GPUs

Counts identical GPUs and multiplies VRAM and bandwidth.

Total RAM

8 GB per stick.

Number of RAM sticks

Used to show the RAM split across sticks.

Advanced Options

Memory bandwidth GB/s

Model quantization

Context size

No explicit context-window metadata is available for the current LLM set, so the full selector remains visible.

Minimum input TPS

Target input TPS

Minimum output TPS

Target output TPS

Input:output ratio

Performance margin

Desktop VRAM reserve

K/V cache

Custom Model Data

Paste a public Hugging Face model URL or repo ID. Public Hub metadata is used for the planning estimate.

Hugging Face model

Public Hub metadata and config.json are used for the estimate when available.

Using the curated model snapshot.

Recommended Models

Why these models?

These recommendations are ranked for Coding Agent using your selected hardware, Q4_K_M quantization, and 65,536 tokens of context. The planner favors models that fit, leave practical headroom, and meet the speed target for the selected use case.

Recommendations rank fit, quality, context support, quantization, and estimated speed. Coding Agent rankings also favor models that are likely to work well with agentic coding tools. Benchmark scores are used when available, with size-based fallbacks for unbenchmarked models.

Hardware Profile

Selected GPU: Manual VRAM / unified memory entry
Specs: Using 8 GB of selected GPU VRAM or unified memory.
Local AI tier: Constrained local AI tier
Hardware read: Manual mode can judge capacity, but not bus width, bandwidth, power, or runtime support.
Source: Manual entry, no GPU source selected.

TTFT vs TPS

TTFT, or time to first token, is the delay you feel while a model reads a long prompt, chat history, documents, or tool output before it starts answering. It matters most for short, repeated operations, where the wait for the response to start is the annoying part.

TPS is the generation speed after the model starts writing. It matters more when the answer is long or when a coding agent is producing, editing, and explaining larger chunks of code. The effective reply TPS shown above combines prompt-processing time and generation time, so long-context workflows can feel slower than the raw output TPS suggests.
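A minimal sketch of how prompt-processing time and generation time can be combined into one effective reply speed. The planner's exact formula is not shown on this page, so the function names and the example numbers below are assumptions for illustration only.

```python
def time_to_first_token(input_tokens: int, input_tps: float) -> float:
    """Seconds spent reading the prompt before the first output token appears."""
    return input_tokens / input_tps


def effective_reply_tps(input_tokens: int, output_tokens: int,
                        input_tps: float, output_tps: float) -> float:
    """Output tokens divided by total wall-clock time (prompt read + generation)."""
    prompt_seconds = time_to_first_token(input_tokens, input_tps)
    generation_seconds = output_tokens / output_tps
    return output_tokens / (prompt_seconds + generation_seconds)


# Hypothetical workload: 8,000-token prompt read at 400 tok/s,
# followed by a 500-token reply generated at 25 tok/s.
wait_seconds = time_to_first_token(8000, 400.0)
reply_tps = effective_reply_tps(8000, 500, 400.0, 25.0)
print(f"wait before first token: {wait_seconds:.1f} s")
print(f"effective reply speed:   {reply_tps:.1f} tok/s")
```

In this example the raw output speed is 25 tok/s, but half of the total time goes to reading the prompt, so the reply feels like 12.5 tok/s.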

Storage Notes

Storage is not part of the score because it does not usually determine whether a model fits or how fast tokens generate after the model is loaded.

It still matters for practical capacity. Keep room for model weights, quantized variants, image checkpoints, embeddings, document indexes, and temporary downloads. A small local setup can consume tens of gigabytes quickly; image workflows and multiple model families can push into hundreds of gigabytes.

CPU, RAM, GPU, and VRAM

Local AI performance is usually limited by where the model fits, then by how fast the machine can move data through it.

CPU

The CPU coordinates the workload. It matters for loading models, tokenization, background services, document indexing, and CPU-only inference.

A better CPU helps the whole machine feel responsive, but it will not fix a model that is too large for your GPU VRAM.

System RAM

System RAM holds the operating system, apps, model files while loading, document indexes, and any model layers that spill out of VRAM.

More RAM helps with multitasking and larger local workflows, but RAM is much slower than VRAM for GPU inference.

GPU + VRAM

The GPU does the heavy math. VRAM is the memory attached to the GPU, and it is usually the hard limit for local LLMs and image generation.

If the model fits in VRAM, it is usually much faster. If it does not, performance can fall off quickly.

Part	What it helps with	What it does not solve
CPU	General responsiveness, CPU inference, indexing, orchestration.	Making oversized GPU models fit in VRAM.
System RAM	Multitasking, loading models, RAG indexes, overflow from VRAM.	Replacing VRAM for fast GPU inference.
GPU	Fast inference, image generation, parallel AI workloads.	Running larger models if the card lacks enough VRAM.
VRAM	Fitting models, context, and image workloads close to the GPU.	Fixing weak cooling, power limits, or slow storage.

Related Notes

How I Set Up Tailscale for Secure Tunneling and Accessing Private LLMs

A practical setup for reaching Ollama on a home PC through Tailscale.

From AMD to NVIDIA: Switching from the RX 6700 XT to the RTX 4070 Ti

A real-world GPU upgrade comparison with power, platform, and feature tradeoffs.

AMD RX 6700 XT Overclocking: Unlocking Max Performance

Notes on power limits, thermals, and careful GPU tuning from an older build.

Methodology and Sources

Plain-English version

VRAM decides what fits close to the GPU.
Memory bandwidth has a large effect on token speed.
Longer context uses more memory through the K/V cache.
Quantization trades some quality for lower memory use.
Scores are planning estimates, not performance guarantees.
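The bandwidth point can be made concrete with a common rule of thumb: during generation, each new token requires reading roughly all active model weights once, so memory bandwidth sets a ceiling on output speed. The sketch below is that rule of thumb, not necessarily the planner's own formula; the `efficiency` factor and the example numbers are assumptions.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, weights_read_gb: float,
                       efficiency: float = 1.0) -> float:
    """Rough upper bound on output tokens per second for bandwidth-bound decoding.

    bandwidth_gb_s:  memory bandwidth in GB/s
    weights_read_gb: gigabytes of weights read per generated token
                     (for MoE models, only the active experts)
    efficiency:      assumed fraction of peak bandwidth actually achieved
    """
    return bandwidth_gb_s * efficiency / weights_read_gb


# Hypothetical card with 400 GB/s bandwidth and a 5 GB quantized model.
ideal_tps = decode_tps_ceiling(400.0, 5.0)
realistic_tps = decode_tps_ceiling(400.0, 5.0, efficiency=0.6)
print(f"ideal ceiling: {ideal_tps:.0f} tok/s, at 60% efficiency: {realistic_tps:.0f} tok/s")
```

Real runtimes land below the ideal ceiling, which is one reason the results are treated as planning estimates.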

Model fit estimates use a simplified inference VRAM formula: quantized model weights plus K/V cache plus runtime overhead. The model-weight and K/V cache structure follows Wei Ming Thor's ApX Machine Learning guide to calculating LLM VRAM requirements.
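The simplified formula above can be sketched as follows. The function names, the 4.5 bits-per-weight figure, and the 1 GB overhead default are illustrative assumptions, not values taken from the planner.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Quantized weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8.0


def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     kv_cache_gb: float, overhead_gb: float = 1.0) -> float:
    """Quantized weights + K/V cache + runtime overhead (overhead is an assumed constant)."""
    return weights_gb(params_billions, bits_per_weight) + kv_cache_gb + overhead_gb


def fits(required_gb: float, vram_gb: float, desktop_reserve_gb: float = 0.0) -> bool:
    """True when the estimate fits in VRAM after setting aside a desktop reserve."""
    return required_gb <= vram_gb - desktop_reserve_gb


# Hypothetical 8B model at about 4.5 bits per weight with a 1 GB K/V cache.
required = estimate_vram_gb(8.0, 4.5, kv_cache_gb=1.0)
print(f"estimated VRAM: {required:.1f} GB")
print("fits in 8 GB with a 1 GB desktop reserve:", fits(required, 8.0, 1.0))
```

For MoE models, `params_billions` would be the total parameter count, since total parameters drive memory fit.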

K/V cache estimates use model architecture metadata from Hugging Face when available: layers, K/V heads, head dimension, context length, and cache bytes per element. The K/V quantization choices are also checked against Sam McLeod's Ollama K/V cache quantization notes. Treat the result as a planning heuristic because exact usage varies by backend, batching, memory fragmentation, and model implementation.
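A minimal sketch of the standard K/V cache size calculation from the architecture fields listed above. The architecture values in the example are hypothetical, and the 1-byte figure for an 8-bit cache ignores the small per-block scale overhead that real backends add.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """K/V cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_element


GIB = 1024 ** 3

# Hypothetical architecture: 32 layers, 8 K/V heads, head dimension 128,
# at 65,536 tokens of context.
fp16_gib = kv_cache_bytes(32, 8, 128, 65536, bytes_per_element=2.0) / GIB
q8_gib = kv_cache_bytes(32, 8, 128, 65536, bytes_per_element=1.0) / GIB
print(f"f16 cache: {fp16_gib:.1f} GiB, 8-bit cache: {q8_gib:.1f} GiB")
```

With these example values, halving the cache bytes per element halves the cache from 8 GiB to 4 GiB, which is why K/V quantization matters at long context.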

Model recommendations are ranked by a heuristic rating that combines model fit, estimated speed against the selected input and output TPS targets, context support, quantization penalty, and model quality.

Model quality uses task-specific benchmark values when they are present: LiveBench Coding for coding agents and tool use, LiveBench Reasoning for RAG, UGI NatInt for chat, and UGI Writing for creative writing and roleplay. Curated model scores come from the public LiveBench release CSVs and the UGI leaderboard CSV, with LiveBench category scores averaged across the relevant task columns. Models without benchmark values fall back to a size-based quality estimate using the natural log of effective parameter count.

For the Coding Agent goal, the ranking also favors models that are more likely to work well with agentic coding tools such as OpenCode, Claude, and Codex.

MoE models can include separate total and active parameter counts: total parameters drive memory fit, while active parameters influence generation speed, and the geometric mean of total and active parameters is used for size-based quality fallback scoring. For a plain-English overview of that Dense versus MoE tradeoff, see Maximilian Schwarzmuller's Mixture of Experts versus Dense LLMs explainer.
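The size-based fallback described above can be sketched like this. Measuring parameters in billions and using the raw logarithm without further scaling are assumptions; the planner may normalize the value differently.

```python
import math
from typing import Optional


def effective_params_billions(total_b: float, active_b: Optional[float] = None) -> float:
    """Dense models use their total size; MoE models use the geometric mean
    of total and active parameter counts."""
    if active_b is None:
        return total_b
    return math.sqrt(total_b * active_b)


def fallback_quality(total_b: float, active_b: Optional[float] = None) -> float:
    """Natural log of the effective parameter count (in billions, by assumption)."""
    return math.log(effective_params_billions(total_b, active_b))


# Hypothetical models: a dense 8B model and an MoE with 32B total, 2B active.
dense_score = fallback_quality(8.0)
moe_score = fallback_quality(32.0, 2.0)
print(f"dense 8B: {dense_score:.3f}, MoE 32B/2B: {moe_score:.3f}")
```

Under this sketch the 32B-total, 2B-active MoE scores the same as a dense 8B model, because the geometric mean of 32 and 2 is 8.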