AI Model VRAM Calculator

Calculate the GPU VRAM required to run any LLM locally or on a cloud GPU. Supports FP32, FP16, INT8, and INT4 precision, including GGUF quantized variants.

GPU VRAM Requirements for LLM Inference

Running large language models requires GPUs with enough VRAM to hold the model weights and runtime state. VRAM capacity is usually the first constraint you hit when deploying LLMs.

VRAM Formula

Total VRAM = Model weights + KV cache + Runtime overhead

Model weights (GB) = Parameters (B) × bytes per parameter

›FP32: 4 bytes → 7B model = 28 GB
›FP16/BF16: 2 bytes → 7B model = 14 GB
›INT8: 1 byte → 7B model = 7 GB
›INT4/GGUF-Q4: ~0.5 bytes → 7B model ≈ 3.5 GB (GGUF Q4 files run somewhat larger because they also store per-block scale factors)
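
The weights formula above can be sketched in a few lines of Python. The names `BYTES_PER_PARAM` and `weights_gb` are illustrative and not part of the calculator itself. The sketch assumes decimal units (1 billion parameters at 1 byte each = 1 GB) and ignores quantization metadata.

```python
# Bytes per parameter for the precisions listed above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,  # idealized; real Q4 formats add a small per-block overhead
}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB = parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


if __name__ == "__main__":
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"7B @ {precision}: {weights_gb(7, precision):.1f} GB")
```

This covers the weights term only. KV cache and runtime overhead still have to be added to get the total.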

GPU VRAM Reference

GPU	VRAM	Max Model (FP16)	Max Model (INT4)
RTX 4060 Ti	16 GB	~7B	~30B
RTX 4090	24 GB	~12B	~46B
A100 40 GB	40 GB	~20B	~70B
A100 80 GB	80 GB	~40B	~140B
H100 80 GB	80 GB	~40B	~140B
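
The "Max Model" columns look close to what you get by inverting the weights formula: divide available VRAM by bytes per parameter. A minimal sketch, assuming that reading of the table; `max_params_billion` and its `reserve_gb` parameter are hypothetical names introduced here to show how reserving memory for KV cache and overhead shrinks the result.

```python
def max_params_billion(vram_gb: float, bytes_per_param: float,
                       reserve_gb: float = 0.0) -> float:
    """Upper bound on model size (billions of parameters) for a given GPU.

    reserve_gb is memory held back for KV cache and runtime overhead
    (an assumption of this sketch, not a value taken from the table).
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return usable_gb / bytes_per_param


if __name__ == "__main__":
    # Weights-only upper bound on a 24 GB card at FP16 (2 bytes per parameter).
    print(max_params_billion(24, 2.0))                  # 12.0
    # Same card with 4 GB held back for KV cache and overhead.
    print(max_params_billion(24, 2.0, reserve_gb=4.0))  # 10.0
```

Treat the weights-only figure as a ceiling. In practice the usable model size is smaller once context length and batch size are taken into account.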

KV Cache Scaling

KV cache grows linearly with both context length and batch size. At 128K context, a 70B model can need 40+ GB for the KV cache alone at FP16. This is why inference servers like vLLM use PagedAttention, which allocates the KV cache in fixed-size blocks and cuts the memory lost to fragmentation.
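
The scaling can be made concrete with the standard per-token KV cache size for a transformer: two tensors (K and V) per layer, each of size KV heads × head dimension. The sketch below is an estimate, not the calculator's own implementation. The function name is illustrative, and the example configuration (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption modeled on a Llama-3-70B-style architecture with grouped-query attention.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


if __name__ == "__main__":
    # Assumed 70B-class config: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    size = kv_cache_bytes(80, 8, 128, context_tokens=128 * 1024)
    print(f"{size / 2**30:.1f} GiB")  # 40.0 GiB
```

Models without grouped-query attention keep one KV head per attention head, so their cache is several times larger at the same context length.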

Recommended Deployment Paths

›≤8 GB VRAM: GGUF Q4 models via Ollama (7B class, personal use)
›16–24 GB: 7B FP16, 13B INT8 models, small production deployments
›40–80 GB: 70B INT4, 34B FP16, serious inference workloads
›Multi-GPU: 70B+ models in FP16, distributed inference

Key Terms

VRAM (Video RAM)

Memory on a GPU used to store model weights, activations, and KV cache during LLM inference. VRAM is the primary constraint when running large language models locally.

Quantization

A technique to reduce model memory usage by representing weights in lower precision (INT8, INT4, GGUF-Q4). Quantization trades a small accuracy loss for significant VRAM reduction.

Frequently Asked Questions

How much VRAM does Llama 3 70B need?

At FP16, Llama 3 70B needs ~140 GB of VRAM for weights alone, which means at least 2× A100 80 GB or 2× H100 80 GB. At INT4 quantization (GGUF Q4), the weights fit in ~35–40 GB, so 2× RTX 4090 works, and a single A100 40 GB is workable only with a short context.
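
The GPU counts in this answer come from dividing total memory by per-GPU VRAM and rounding up. A rough sketch of that arithmetic, using the page's formula (weights + KV cache + runtime overhead); `total_vram_gb` and `gpus_needed` are hypothetical helper names, and the sketch assumes the model shards evenly across GPUs with no per-GPU duplication.

```python
import math


def total_vram_gb(weights_gb: float, kv_cache_gb: float = 0.0,
                  overhead_gb: float = 0.0) -> float:
    """Total VRAM = model weights + KV cache + runtime overhead."""
    return weights_gb + kv_cache_gb + overhead_gb


def gpus_needed(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum GPU count, assuming memory splits evenly across devices."""
    return math.ceil(total_gb / vram_per_gpu_gb)


if __name__ == "__main__":
    fp16_weights = 70 * 2.0   # 140 GB
    int4_weights = 70 * 0.5   # 35 GB
    print(gpus_needed(total_vram_gb(fp16_weights), 80))  # 2 (80 GB cards)
    print(gpus_needed(total_vram_gb(int4_weights), 24))  # 2 (24 GB cards)
```

Passing non-zero `kv_cache_gb` and `overhead_gb` values shows how quickly a configuration that fits on paper runs out of headroom.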

What is the minimum GPU to run Llama 3 8B?

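Llama 3 8B at FP16 needs ~16 GB of VRAM for weights alone, so a 24 GB card such as the RTX 3090 is the practical minimum; 16 GB cards like the RTX 4080 or RTX A4000 leave no headroom for the KV cache. At INT4/GGUF Q4, it needs only ~4–5 GB and runs on consumer GPUs like the RTX 3060 12 GB.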

How does quantization affect VRAM requirements?

FP16 uses 2 bytes per parameter. INT8 uses 1 byte (2× reduction). INT4 uses 0.5 bytes (4× reduction). A 7B model needs ~14 GB at FP16 but only ~3.5 GB at INT4, at the cost of slight accuracy degradation.

How does context length affect VRAM?

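Longer context windows increase KV (key-value) cache memory, which grows linearly with context length. As a rough guide for a 7B model: ~1 GB at 4K context, ~8 GB at 32K, and 30+ GB at 128K. Exact figures depend on the model's layer count, number of KV heads, and cache precision. At long contexts the KV cache can exceed the size of the weights themselves.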
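
Because KV cache size is proportional to the number of cached tokens, a known figure at one context length can be extrapolated to another. A minimal sketch, taking the ~1 GB at 4K figure above as the reference point; `scale_kv_cache_gb` is an illustrative name, and the result is only as accurate as the reference value.

```python
def scale_kv_cache_gb(known_gb: float, known_context: int,
                      target_context: int) -> float:
    """Linearly extrapolate KV cache size from one context length to another."""
    return known_gb * target_context / known_context


if __name__ == "__main__":
    reference_gb, reference_ctx = 1.0, 4 * 1024   # ~1 GB at 4K context
    for ctx in (32 * 1024, 128 * 1024):
        estimate = scale_kv_cache_gb(reference_gb, reference_ctx, ctx)
        print(f"{ctx // 1024}K context: ~{estimate:.0f} GB")
```

The same proportionality holds for batch size: serving four concurrent sequences at a given context length needs roughly four times the cache of one.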
