AI Model VRAM Calculator
Calculate the GPU VRAM required to run an LLM locally or on a cloud GPU. Supports the common quantization levels: FP32, FP16, INT8, INT4, and GGUF variants.
GPU VRAM Requirements for LLM Inference
Running large language models requires GPUs with enough VRAM to hold the model weights, the KV cache, and runtime state. VRAM capacity is usually the first constraint you hit when deploying LLMs.
VRAM Formula
Total VRAM = Model weights + KV cache + Runtime overhead
Model weights (GB) = Parameters (in billions) × bytes per parameter
- FP32: 4 bytes → 7B model = 28 GB
- FP16/BF16: 2 bytes → 7B model = 14 GB
- INT8: 1 byte → 7B model = 7 GB
- INT4/GGUF-Q4: 0.5 bytes → 7B model = 3.5 GB
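The weights formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code; the function name `weights_gb` and the `BYTES_PER_PARAM` table are made up for this example, and the byte widths are the nominal ones from the list.

```python
# Nominal bytes per parameter for each precision listed above.
# (Hypothetical lookup table for illustration; real quantized formats
# may carry a little extra per-block metadata.)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB: parameters (in billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


if __name__ == "__main__":
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"7B @ {precision}: {weights_gb(7, precision):.1f} GB")
```

This covers the weights term only; KV cache and runtime overhead still have to be added to get total VRAM.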
GPU VRAM Reference
| GPU | VRAM | Max Model (FP16) | Max Model (INT4) |
|---|---|---|---|
| RTX 4060 Ti | 16 GB | ~7B | ~30B |
| RTX 4090 | 24 GB | ~12B | ~46B |
| A100 40 GB | 40 GB | ~20B | ~70B |
| A100 80 GB | 80 GB | ~40B | ~140B |
| H100 80 GB | 80 GB | ~40B | ~140B |
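The table's figures sit at or slightly below a simple ceiling: VRAM divided by bytes per parameter. The sketch below computes that ceiling. The name `max_params_billion` and the `reserve_fraction` parameter are assumptions for illustration (the table does not state how much headroom it reserves).

```python
def max_params_billion(vram_gb: float, bytes_per_param: float,
                       reserve_fraction: float = 0.0) -> float:
    """Largest model (in billions of parameters) whose weights fit in VRAM.

    reserve_fraction is a hypothetical share of VRAM held back for
    KV cache and runtime overhead; 0.0 gives the weights-only ceiling.
    """
    usable_gb = vram_gb * (1.0 - reserve_fraction)
    return usable_gb / bytes_per_param


if __name__ == "__main__":
    for name, vram in (("RTX 4090", 24), ("A100 80 GB", 80)):
        fp16 = max_params_billion(vram, 2.0)
        int4 = max_params_billion(vram, 0.5)
        print(f"{name}: FP16 ceiling {fp16:.0f}B, INT4 ceiling {int4:.0f}B")
```

In practice a model at the ceiling leaves nothing for KV cache, so a non-zero reserve gives a more realistic limit.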
KV Cache Scaling
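KV cache grows linearly with context length and batch size. At 128K context, a 70B model needs 40+ GB for the KV cache alone, even for a single sequence. This is why inference servers like vLLM use PagedAttention, which allocates the cache in fixed-size blocks so that less memory is lost to fragmentation and over-reservation.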
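The 40+ GB figure can be reproduced with the usual KV cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The configuration below (80 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumption chosen to resemble a 70B model with grouped-query attention; it is not taken from this page.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed 70B-class configuration (illustrative, not from the source).
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          context_tokens=128 * 1024)
    print(f"{size / 1024**3:.1f} GiB for one 128K-token sequence")
```

Doubling either the context length or the batch size doubles the result, which is the linear scaling described above.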
Recommended Deployment Paths
- ≤8 GB VRAM: GGUF Q4 models via Ollama (7B class, personal use)
- 16–24 GB: 7B FP16, 13B INT8 models, small production deployments
- 40–80 GB: 70B INT4, 34B FP16, serious inference workloads
- Multi-GPU: 70B+ models in FP16, distributed inference
Frequently Asked Questions
How much VRAM does Llama 3 70B need?
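At FP16, Llama 3 70B requires ~140 GB of VRAM for the weights alone, which means at least 2× A100 80 GB or 2× H100 80 GB, or 4× A100 40 GB. At INT4 quantization (GGUF Q4), it fits in ~35–40 GB of VRAM, so 2× RTX 4090 (48 GB total) is workable, while a single A100 40 GB is workable only at short context lengths.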
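A rough GPU count follows from the total VRAM formula (weights + KV cache + overhead) divided by per-GPU memory, rounded up. In the sketch below the 4 GB KV cache and 2 GB overhead values are placeholder assumptions, and the even split across GPUs ignores any extra memory that multi-GPU parallelism itself consumes.

```python
import math


def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float) -> float:
    """Total VRAM = model weights + KV cache + runtime overhead."""
    return weights_gb + kv_cache_gb + overhead_gb


def gpus_needed(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum GPU count, assuming memory splits evenly across devices."""
    return math.ceil(total_gb / vram_per_gpu_gb)


if __name__ == "__main__":
    # Placeholder KV cache and overhead values, chosen for illustration only.
    fp16_total = total_vram_gb(weights_gb=140, kv_cache_gb=4, overhead_gb=2)
    int4_total = total_vram_gb(weights_gb=35, kv_cache_gb=4, overhead_gb=2)
    print("70B FP16 on 80 GB GPUs:", gpus_needed(fp16_total, 80))
    print("70B INT4 on 24 GB GPUs:", gpus_needed(int4_total, 24))
```

Longer contexts or larger batches raise the KV cache term and can push the count up by one.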
What is the minimum GPU to run Llama 3 8B?
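Llama 3 8B at FP16 requires ~16 GB of VRAM for the weights alone. 16 GB cards such as the RTX 4080 or A4000 therefore leave no room for KV cache, so a 24 GB card such as the RTX 3090 is the practical minimum at FP16. At INT4/GGUF Q4, the model requires only ~4–5 GB and runs on consumer GPUs like the RTX 3060 12 GB.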
How does quantization affect VRAM requirements?
FP16 uses 2 bytes per parameter. INT8 uses 1 byte (2× reduction). INT4 uses 0.5 bytes (4× reduction). A 7B model needs ~14 GB at FP16 but only ~3.5 GB at INT4, at the cost of slight accuracy degradation.
How does context length affect VRAM?
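Longer context windows increase KV (key-value) cache memory, which grows linearly with context length. For a 7B model, expect roughly 1 GB of KV cache at 4K context, 8 GB at 32K, and 30+ GB at 128K. Exact figures depend on the model's layer count, number of KV heads, and cache precision. At long contexts the cache can rival or exceed the weights themselves, and the effect is larger still for big models.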
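The same relationship can be inverted to ask how much context fits in the VRAM left after loading the weights. The sketch below derives a per-token cost from the "~1 GB at 4K" figure above, which is a rough assumption rather than a measured value; `max_context_tokens` is a hypothetical helper name.

```python
def max_context_tokens(free_vram_bytes: int, kv_bytes_per_token: int,
                       batch_size: int = 1) -> int:
    """Longest context (tokens per sequence) whose KV cache fits in free VRAM."""
    return free_vram_bytes // (kv_bytes_per_token * batch_size)


if __name__ == "__main__":
    # Per-token cost implied by ~1 GB of KV cache at 4K context (approximate).
    kv_bytes_per_token = 1_000_000_000 // 4096
    free = 8_000_000_000  # 8 GB left over after weights and overhead
    print(max_context_tokens(free, kv_bytes_per_token))                # one sequence
    print(max_context_tokens(free, kv_bytes_per_token, batch_size=4))  # four in parallel
```

Serving several sequences at once divides the available context per sequence, which is why batch size and context length have to be budgeted together.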