Can your machine run local AI?
Pick a goal, choose your hardware, and get practical model recommendations for VRAM, context, and speed.
Local AI VRAM Calculator & GPU Planner uses planning estimates, not guarantees. Real performance depends on runtime, drivers, cooling, and background load.
Hardware Profile
- Selected GPU: Manual VRAM / unified memory entry
- Specs: Using 8 GB of selected GPU VRAM or unified memory.
- Local AI tier: Constrained local AI tier
- Hardware read: Manual mode can judge capacity, but not bus width, bandwidth, power, or runtime support.
- Source: Manual entry, no GPU source selected.
CPU, RAM, GPU, and VRAM
Local AI performance is usually limited by where the model fits, then by how fast the machine can move data through it.
| Part | What it helps with | What it does not solve |
|---|---|---|
| CPU | General responsiveness, CPU inference, indexing, orchestration. | Making oversized GPU models fit in VRAM. |
| System RAM | Multitasking, loading models, RAG indexes, overflow from VRAM. | Replacing VRAM for fast GPU inference. |
| GPU | Fast inference, image generation, parallel AI workloads. | Running larger models if the card lacks enough VRAM. |
| VRAM | Fitting models, context, and image workloads close to the GPU. | Fixing weak cooling, power limits, or slow storage. |
Related Notes
How I Set Up Tailscale for Secure Tunneling and Accessing Private LLMs
A practical setup for reaching Ollama on a home PC through Tailscale.
From AMD to NVIDIA: Switching from the RX 6700 XT to the RTX 4070 Ti
A real-world GPU upgrade comparison with power, platform, and feature tradeoffs.
AMD RX 6700 XT Overclocking: Unlocking Max Performance
Notes on power limits, thermals, and careful GPU tuning from an older build.
Methodology and Sources
Plain-English version
- VRAM decides what fits close to the GPU.
- Memory bandwidth has a large effect on token speed.
- Longer context uses more memory through the K/V cache.
- Quantization trades some quality for lower memory use.
- Scores are planning estimates, not performance guarantees.
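The bandwidth point can be made concrete with a common rule of thumb, which is not necessarily the speed model this planner uses: during generation, each new token requires reading roughly all active weights once, so memory bandwidth divided by the bytes read per token gives a rough upper bound on tokens per second. The function name and the example numbers below are hypothetical.

```python
def max_decode_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on generation speed in tokens per second.

    Assumption: every generated token streams all active weights from
    memory once. Real throughput is lower because of compute limits,
    K/V cache reads, and runtime overhead.
    """
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Hypothetical card with 450 GB/s of bandwidth and 4.5 GB of quantized weights.
print(max_decode_tps(450.0, 4.5))  # 100.0
```

Treat the result as a ceiling for comparing hardware, not as a prediction.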
Model fit estimates use a simplified inference VRAM formula: quantized model weights plus K/V cache plus runtime overhead. The model-weight and K/V cache structure follows Wei Ming Thor's ApX Machine Learning guide to calculating LLM VRAM requirements.
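A minimal sketch of that three-term sum, assuming weights are sized as parameter count times bits per weight and that runtime overhead is a flat allowance. The function names, the 1 GB overhead default, and the example model are illustrative assumptions, not values taken from the calculator.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    kv_cache_gb: float,
    overhead_gb: float = 1.0,  # assumed flat allowance, not a measured value
) -> float:
    """Quantized weights plus K/V cache plus runtime overhead."""
    return weights_gb(params_billion, bits_per_weight) + kv_cache_gb + overhead_gb


def fits(required_gb: float, available_gb: float) -> bool:
    """True when the estimate fits in the available VRAM or unified memory."""
    return required_gb <= available_gb


# Hypothetical 8B-parameter model at about 4.5 bits per weight with 1 GB of K/V cache.
need = estimate_vram_gb(8, 4.5, kv_cache_gb=1.0)
print(f"{need:.1f} GB needed, fits in 8 GB: {fits(need, 8.0)}")
```

Keeping the three terms separate makes it easy to see which one to shrink when a model does not fit: a lower-bit quantization, a shorter context, or a lighter runtime.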
K/V cache estimates use model architecture metadata from Hugging Face when available: layers, K/V heads, head dimension, context length, and cache bytes per element. The K/V quantization choices are also checked against Sam McLeod's Ollama K/V cache quantization notes. Treat the result as a planning heuristic because exact usage varies by backend, batching, memory fragmentation, and model implementation.
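The listed fields combine into the usual K/V cache size estimate: one key tensor and one value tensor per layer, each holding K/V heads times head dimension elements per token. The sketch below assumes a single sequence (batch size 1) and a uniform cache element size; the example architecture is hypothetical, and treating an 8-bit cache as exactly 1 byte per element is a simplification.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: float = 2.0,  # 2.0 corresponds to an fp16 cache
) -> float:
    """Estimated K/V cache size in bytes for one sequence."""
    # The factor 2 covers the key tensor and the value tensor in each layer.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical architecture: 32 layers, 8 K/V heads, head dimension 128.
fp16 = kv_cache_bytes(32, 8, 128, context_len=8192)
q8 = kv_cache_bytes(32, 8, 128, context_len=8192, bytes_per_elem=1.0)
print(f"fp16 cache: {fp16 / 2**30:.2f} GiB")  # 1.00 GiB
print(f"8-bit cache: {q8 / 2**30:.2f} GiB")   # 0.50 GiB
```

Because the size grows linearly with context length, doubling the context doubles the cache, which is why long-context settings can push an otherwise comfortable model past the available VRAM.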
Model recommendations are ranked by a heuristic rating that combines model fit, estimated speed against the selected input and output TPS targets, context support, quantization penalty, and model quality.
Model quality uses task-specific benchmark values when they are present: LiveBench Coding for coding agents and tool use, LiveBench Reasoning for RAG, UGI NatInt for chat, and UGI Writing for creative writing and roleplay. Curated model scores come from the public LiveBench release CSVs and the UGI leaderboard CSV, with LiveBench category scores averaged across the relevant task columns. Models without benchmark values fall back to a size-based quality estimate using the natural log of effective parameter count. For the Coding Agent goal, the ranking also favors models that are more likely to work well with agentic coding tools such as OpenCode, Claude, and Codex.
MoE models can include separate total and active parameter counts. Total parameters drive memory fit, active parameters influence generation speed, and the geometric mean of total and active parameters is used for size-based quality fallback scoring. For a plain-English overview of the dense versus MoE tradeoff, see Maximilian Schwarzmuller's Mixture of Experts versus Dense LLMs explainer.
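The size-based fallback can be sketched as follows, assuming parameter counts are expressed in billions and that the raw logarithm is used without further scaling. The source does not state how the result is normalized or weighted in the final rating, so the function names and units here are assumptions.

```python
import math
from typing import Optional


def effective_params_b(total_b: float, active_b: Optional[float] = None) -> float:
    """Effective parameter count in billions.

    Dense models use the total count. MoE models use the geometric mean
    of total and active parameters.
    """
    if active_b is None:
        return total_b
    return math.sqrt(total_b * active_b)


def size_quality(total_b: float, active_b: Optional[float] = None) -> float:
    """Fallback quality estimate: natural log of the effective parameter count."""
    return math.log(effective_params_b(total_b, active_b))


# Hypothetical MoE model with 32B total and 8B active parameters.
print(effective_params_b(32, 8))      # 16.0
print(round(size_quality(32, 8), 3))  # 2.773
```

Under this scheme the hypothetical 32B-total, 8B-active MoE model scores the same as a dense 16B model, which places it between its active and total sizes.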