Goal-based local AI planner Beta

Can your machine run local AI?

Pick a goal, choose your hardware, and get practical model recommendations for VRAM, context, and speed.

Local AI VRAM Calculator & GPU Planner uses planning estimates, not guarantees. Real performance depends on runtime, drivers, cooling, and background load.

Your Setup

GPU specs come from a curated TechPowerUp GPU Database snapshot. Apple Silicon presets use unified memory and Apple-published bandwidth.

The Coding Agent goal favors code-capable models, longer context, and usable reply speed for agent loops.

Hardware Configuration

Counts identical GPUs and multiplies VRAM and bandwidth.

8 GB per stick.

Used to show the RAM split across sticks.

Advanced Options

Custom Model Data

Paste a public Hugging Face model URL or repo ID. Public Hub metadata and config.json are used for the planning estimate when available.

Using the curated model snapshot.

Recommended Models

Why these models?

These recommendations are ranked for Coding Agent using your selected hardware, Q4_K_M quantization, and 65,536 tokens of context. The planner favors models that fit, leave practical headroom, and meet the speed target for the selected use case.

Recommendations are ranked by fit, quality, context support, quantization, and estimated speed. Coding Agent rankings also favor models that are likely to work well with agentic coding tools. Benchmark scores are used when available, with size-based fallbacks for unbenchmarked models.

Hardware Profile

Selected GPU: Manual VRAM / unified memory entry
Specs: Using 8 GB of selected GPU VRAM or unified memory.
Local AI tier: Constrained local AI tier
Hardware read: Manual mode can judge capacity, but not bus width, bandwidth, power, or runtime support.
Source: Manual entry, no GPU source selected.

TTFT vs TPS

TTFT, or time to first token, is the delay you feel when a model has to read a lot of prompt, chat history, documents, or tool output before it starts answering. It matters most for short, repeated operations where waiting for the response to start is the annoying part.

TPS is the generation speed after the model starts writing. It matters more when the answer is long or when a coding agent is producing, editing, and explaining larger chunks of code. The effective reply TPS shown above combines prompt-processing time and generation time, so long-context workflows can feel slower than the raw output TPS suggests.
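
As a rough illustration of how prompt-processing time and generation time can combine into one reply speed, here is a minimal sketch. The function name, its parameters, and the exact formula (output tokens divided by total wall time) are assumptions for illustration; the planner's own calculation may differ.

```python
def effective_reply_tps(prompt_tokens, output_tokens, prompt_tps, output_tps):
    """Output tokens per second over the whole reply, including prompt processing.

    Hypothetical helper: assumes the reply time is simply
    prompt-processing time plus generation time.
    """
    if prompt_tps <= 0 or output_tps <= 0:
        raise ValueError("token speeds must be positive")
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")

    time_to_first_token_s = prompt_tokens / prompt_tps  # reading the prompt
    generation_s = output_tokens / output_tps           # writing the answer
    return output_tokens / (time_to_first_token_s + generation_s)


# Illustrative numbers only: an 8,000-token prompt read at 1,000 tokens/s,
# followed by a 400-token answer written at 40 tokens/s.
speed = effective_reply_tps(8000, 400, prompt_tps=1000, output_tps=40)
print(round(speed, 1))  # 22.2, well below the raw 40 tokens/s
```

With an empty prompt the result equals the raw output TPS, which is why the gap only shows up in long-context workflows.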

Storage Notes

Storage is not part of the score because it does not usually determine whether a model fits or how fast tokens generate after the model is loaded.

It still matters for practical capacity. Keep room for model weights, quantized variants, image checkpoints, embeddings, document indexes, and temporary downloads. A small local setup can consume tens of gigabytes quickly; image workflows and multiple model families can push into hundreds of gigabytes.

CPU, RAM, GPU, and VRAM

Local AI performance is usually limited by where the model fits, then by how fast the machine can move data through it.

CPU

The CPU coordinates the workload. It matters for loading models, tokenization, background services, document indexing, and CPU-only inference.

A better CPU helps the whole machine feel responsive, but it will not fix a model that is too large for your GPU VRAM.

System RAM

System RAM holds the operating system, apps, model files while loading, document indexes, and any model layers that spill out of VRAM.

More RAM helps with multitasking and larger local workflows, but RAM is much slower than VRAM for GPU inference.

GPU + VRAM

The GPU does the heavy math. VRAM is the memory attached to the GPU, and it is usually the hard limit for local LLMs and image generation.

If the model fits in VRAM, inference is usually much faster. If it does not, layers spill into system RAM and performance can fall off quickly.

| Part | What it helps with | What it does not solve |
| --- | --- | --- |
| CPU | General responsiveness, CPU inference, indexing, orchestration. | Making oversized GPU models fit in VRAM. |
| System RAM | Multitasking, loading models, RAG indexes, overflow from VRAM. | Replacing VRAM for fast GPU inference. |
| GPU | Fast inference, image generation, parallel AI workloads. | Running larger models if the card lacks enough VRAM. |
| VRAM | Fitting models, context, and image workloads close to the GPU. | Fixing weak cooling, power limits, or slow storage. |

Related Notes

Methodology and Sources

Plain-English version

  • VRAM decides what fits close to the GPU.
  • Memory bandwidth has a large effect on token speed.
  • Longer context uses more memory through the K/V cache.
  • Quantization trades some quality for lower memory use.
  • Scores are planning estimates, not performance guarantees.
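
The bandwidth point in the list above can be illustrated with a common rule of thumb: generating one token requires reading roughly all active model weights from memory once, so memory bandwidth sets an upper bound on token speed. The sketch below assumes that rule; the function name and the `efficiency` factor are hypothetical and are not taken from the planner.

```python
def estimate_decode_tps(bandwidth_gb_s, weights_read_per_token_gb, efficiency=0.6):
    """Rough bandwidth-bound estimate of generation speed in tokens/s.

    Assumptions (illustrative, not the planner's formula):
    - each generated token reads the active weights once;
    - only a fraction of peak bandwidth (efficiency) is achieved in practice.
    """
    if weights_read_per_token_gb <= 0:
        raise ValueError("weights_read_per_token_gb must be positive")
    return bandwidth_gb_s * efficiency / weights_read_per_token_gb


# Illustrative numbers only: 400 GB/s of memory bandwidth, 8 GB of weights.
print(estimate_decode_tps(400, 8, efficiency=0.5))  # 25.0
```

For a MoE model, passing the size of the active weights rather than the total weights reflects the idea that active parameters influence generation speed.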

Model fit estimates use a simplified inference VRAM formula: quantized model weights plus K/V cache plus runtime overhead. The model-weight and K/V cache structure follows Wei Ming Thor's ApX Machine Learning guide to calculating LLM VRAM requirements.
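
The three-term structure can be sketched as follows. The function names, the default overhead, and the headroom fraction are assumptions chosen for illustration; the bits-per-weight figure for a given quantization also varies by format and model, so it is left as an input.

```python
def weights_gb(params_billion, bits_per_weight):
    """Quantized weight size in decimal GB: parameters times bytes per weight."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb, overhead_gb=1.0):
    """Weights plus K/V cache plus runtime overhead (overhead default is a guess)."""
    return weights_gb(params_billion, bits_per_weight) + kv_cache_gb + overhead_gb


def fits(required_gb, available_gb, headroom_fraction=0.1):
    """True if the estimate fits while leaving a hypothetical safety margin."""
    return required_gb <= available_gb * (1 - headroom_fraction)


# Illustrative numbers only: an 8B-parameter model at 4 bits per weight
# with 1 GB of K/V cache and 1 GB of overhead.
required = estimate_vram_gb(8, 4, kv_cache_gb=1.0)
print(required)             # 6.0
print(fits(required, 8.0))  # True
```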

K/V cache estimates use model architecture metadata from Hugging Face when available: layers, K/V heads, head dimension, context length, and cache bytes per element. The K/V quantization choices are also checked against Sam McLeod's Ollama K/V cache quantization notes. Treat the result as a planning heuristic because exact usage varies by backend, batching, memory fragmentation, and model implementation.
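
A minimal sketch of the usual K/V cache size calculation from those metadata fields is shown below. The function name is hypothetical and the architecture numbers in the example are illustrative, not tied to a specific model. Quantized cache formats also store scaling data, so treating them as exactly 1 byte or 0.5 bytes per element slightly understates real usage.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_element):
    """Size of the K/V cache for one sequence.

    The factor of 2 covers the separate key and value tensors stored
    per layer, per K/V head, per token.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_element


# Illustrative architecture: 32 layers, 8 K/V heads, head dimension 128,
# 65,536 tokens of context, 16-bit cache (2 bytes per element).
size = kv_cache_bytes(32, 8, 128, 65536, 2)
print(size / 1024**3)  # 8.0 (GiB)
```

Halving the bytes per element halves the cache, which is why K/V quantization matters most at long context lengths.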

Model recommendations are ranked by a heuristic rating that combines model fit, estimated speed against the selected input and output TPS targets, context support, quantization penalty, and model quality.

Model quality uses task-specific benchmark values when they are present: LiveBench Coding for coding agents and tool use, LiveBench Reasoning for RAG, UGI NatInt for chat, and UGI Writing for creative writing and roleplay. Curated model scores come from the public LiveBench release CSVs and the UGI leaderboard CSV, with LiveBench category scores averaged across the relevant task columns. Models without benchmark values fall back to a size-based quality estimate using the natural log of effective parameter count.

For the Coding Agent goal, the ranking also favors models that are more likely to work well with agentic coding tools such as OpenCode, Claude, and Codex.

MoE models can include separate total and active parameter counts: total parameters drive memory fit, while active parameters influence generation speed, and the geometric mean of total and active parameters is used for size-based quality fallback scoring. For a plain-English overview of that Dense versus MoE tradeoff, see Maximilian Schwarzmuller's Mixture of Experts versus Dense LLMs explainer.
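
The size-based fallback can be sketched as below. The function names are hypothetical, and how the planner scales or normalizes the logarithm into a final quality score is not stated, so this sketch returns the raw natural log only.

```python
import math


def effective_params_billion(total_billion, active_billion=None):
    """Effective parameter count used for the size-based fallback.

    Dense models use their total count. MoE models use the geometric
    mean of total and active parameters.
    """
    if active_billion is None:
        return total_billion
    return math.sqrt(total_billion * active_billion)


def fallback_quality(total_billion, active_billion=None):
    """Natural log of the effective parameter count (unscaled)."""
    return math.log(effective_params_billion(total_billion, active_billion))


# Illustrative numbers only: a MoE model with 64B total and 4B active
# parameters is treated like a 16B dense model for fallback scoring.
print(effective_params_billion(64, 4))  # 16.0
print(fallback_quality(64, 4) == fallback_quality(16))  # True
```

Because of the logarithm, doubling the parameter count adds a fixed increment to the score rather than doubling it.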