DeepSeek R1 VRAM Calculator
Calculate VRAM for DeepSeek R1 (671B MoE) and its distilled variants (1.5B–70B). The full R1 requires massive multi-GPU setups; distilled versions run on consumer hardware.
DeepSeek R1: Full Model vs Distilled Variants
DeepSeek R1 is a reasoning-first LLM from DeepSeek AI, trained with reinforcement learning to excel at math, code, and logical reasoning. DeepSeek reports that the full 671B MoE model matches or beats GPT-4o on math and coding benchmarks, but the distilled variants are what most engineers will actually run.
VRAM by Model Size at INT4
| Model | Params | VRAM (INT4) | Minimum GPU |
|---|---|---|---|
| R1 full | 671B MoE | ~335 GB | 8× A100 80GB |
| R1-Distill-70B | 70B dense | ~38 GB | A100 40GB (tight) |
| R1-Distill-32B | 32B dense | ~18 GB | RTX 4090 24GB (tight) |
| R1-Distill-14B | 14B dense | ~8 GB | RTX 3070 8GB (very tight, short context only) |
| R1-Distill-7B | 7B dense | ~4.5 GB | Any 6GB+ GPU |
| R1-Distill-1.5B | 1.5B dense | ~1 GB | CPU-only feasible |
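The figures in the table follow roughly from parameter count times bytes per weight. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes. The `overhead` fraction for runtime buffers is a hypothetical knob, since the table does not state what overhead it includes.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, precision: str, overhead: float = 0.0) -> float:
    """Estimate VRAM in GB (10^9 bytes) needed for model weights.

    `overhead` is a hypothetical fraction added for runtime buffers;
    it is an illustrative knob, not a value taken from the table.
    """
    weights_gb = params_billions * BYTES_PER_WEIGHT[precision]
    return weights_gb * (1.0 + overhead)


for name, params in [("R1 full", 671), ("R1-Distill-70B", 70), ("R1-Distill-7B", 7)]:
    print(f"{name}: {weight_vram_gb(params, 'int4'):.1f} GB of weights at INT4")
```

Weights alone come to about 35 GB for the 70B distill and 3.5 GB for the 7B. The larger figures in the table leave room for KV cache and runtime buffers.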
Why the Distilled Models Are Remarkable
The distilled variants inherit R1's chain-of-thought reasoning style through knowledge distillation. DeepSeek reports that R1-Distill-Qwen-7B outscores GPT-4o on math benchmarks such as AIME 2024 and MATH-500, and at INT4 it fits on a consumer GPU such as an RTX 3070.
MoE Memory Note
The full R1 671B is a Mixture of Experts (MoE) model: only ~37B parameters are active per token, but all 671B must be resident in VRAM because any expert can be selected for any token. INT4 cuts the weight footprint from ~1.3 TB (FP16) to ~335 GB, which still requires serious multi-GPU hardware.
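The gap between resident and active parameters can be worked through numerically. The sketch below uses only the 671B and ~37B figures quoted above and assumes decimal gigabytes; the variable names are illustrative.

```python
TOTAL_PARAMS_B = 671   # all experts; every one must be resident in VRAM
ACTIVE_PARAMS_B = 37   # approximate parameters actually used per token


def gigabytes(params_billions: float, bytes_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes)."""
    return params_billions * bytes_per_weight


resident_fp16 = gigabytes(TOTAL_PARAMS_B, 2.0)   # 2 bytes per FP16 weight
resident_int4 = gigabytes(TOTAL_PARAMS_B, 0.5)   # 4 bits per INT4 weight
active_int4 = gigabytes(ACTIVE_PARAMS_B, 0.5)    # weights touched per token
active_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"Resident at FP16: {resident_fp16 / 1000:.2f} TB")
print(f"Resident at INT4: {resident_int4:.1f} GB")
print(f"Touched per token at INT4: {active_int4:.1f} GB ({active_share:.1%} of weights)")
```

The sparse activation lowers compute and memory bandwidth per token, not the amount of VRAM the model occupies.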
Key Terms
VRAM (Video RAM)
Memory on a GPU used to store model weights, activations, and KV cache during LLM inference. VRAM is the primary constraint when running large language models locally.
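Weights are only part of the budget; the KV cache grows with context length. A rough sketch of the standard KV cache size formula follows. The layer count, KV head count, and head dimension are assumed values for a 7B-class model with grouped-query attention, so check the model's own configuration before relying on them.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB (10^9 bytes) for a single sequence.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Assumed, illustrative architecture values (not taken from this page).
ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM = 28, 4, 128

for tokens in (8_192, 32_768):
    size = kv_cache_gb(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, tokens)
    print(f"{tokens} tokens: {size:.2f} GB of FP16 KV cache")
```

Reasoning models tend to produce long outputs, so the context length you plan for matters as much as the weight size when a GPU is listed as a tight fit.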
Quantization
A technique to reduce model memory usage by representing weights in lower precision (INT8, INT4, GGUF-Q4). Quantization trades a small accuracy loss for significant VRAM reduction.
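To make the trade-off concrete, here is a deliberately simplified sketch of symmetric round-to-nearest INT4 quantization with a single scale for the whole tensor. Production formats such as GGUF Q4 use per-group scales and other refinements, so treat this as an illustration of the idea rather than any particular library's method.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in the signed INT4 range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest weight maps to code 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the INT4 codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.21, 0.05, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.3f}", f"worst error={worst_error:.3f}")
```

Each weight is stored in 4 bits instead of 16, and the rounding error per weight is bounded by half the scale.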
Frequently Asked Questions
How much VRAM does DeepSeek R1 671B need?
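The full DeepSeek R1 is a 671B Mixture of Experts model. At INT4, its weights need ~335 GB of VRAM. By capacity alone that fits on five 80 GB GPUs (A100 or H100, 400 GB total), but an 8× 80GB node is the practical choice: it leaves headroom for KV cache and activations, and tensor-parallel setups commonly use power-of-two GPU counts. In practice, most users run the distilled variants (7B–70B), which offer strong reasoning on far more modest hardware.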
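The GPU count can be sketched as a capacity check followed by rounding up. The `usable_fraction` headroom and the power-of-two rounding are assumptions made for illustration; the actual constraint depends on the serving stack and the model's attention head count.

```python
import math


def min_gpus_by_capacity(model_gb: float, gpu_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    `usable_fraction` is a hypothetical allowance for memory the
    runtime reserves on each GPU.
    """
    return math.ceil(model_gb / (gpu_gb * usable_fraction))


def next_power_of_two(n: int) -> int:
    """Round up to a power of two, a common (assumed) tensor-parallel size."""
    return 1 << (n - 1).bit_length()


by_capacity = min_gpus_by_capacity(335, 80)
practical = next_power_of_two(by_capacity)
print(f"By capacity: {by_capacity} GPUs, rounded for tensor parallelism: {practical} GPUs")
```

With ~335 GB of weights and 80 GB cards this gives 5 GPUs by capacity and 8 after rounding.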
What are the DeepSeek R1 distilled models?
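DeepSeek released smaller distilled versions trained from R1: R1-Distill-Qwen-1.5B, 7B, 14B, and 32B, plus R1-Distill-Llama-8B and 70B. The 7B distill needs about 4.5 GB at GGUF Q4, so it fits on a 6 GB GPU and runs comfortably on 8 GB. The 70B distill runs at INT4 on an A100 40GB (a tight fit at ~38 GB), with reasoning quality that approaches, but does not match, the full 671B model on DeepSeek's reported benchmarks.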
Is DeepSeek R1 better than GPT-4 for reasoning?
DeepSeek reports that R1 matches or exceeds GPT-4o on AIME 2024, Codeforces, and MATH-500, at a reported fraction of the training cost. For open-weight local deployment, R1-Distill-70B at INT4 is among the strongest reasoning models you can serve for under $2/hr in cloud GPU cost.
How do I run DeepSeek R1 locally?
Use Ollama: `ollama run deepseek-r1:7b` (the 7B distill, ~4.5 GB VRAM) or `ollama run deepseek-r1:70b` (the 70B distill at Q4, ~40 GB VRAM including runtime overhead). The full 671B model needs a multi-GPU cluster; serve it with vLLM using tensor parallelism.
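Once a model is pulled, it can also be queried from a script through Ollama's local HTTP API. The sketch below assumes a default install listening on `localhost:11434` and assumes the reply wraps its reasoning in `<think>` tags, as R1 builds commonly do. The helper names `build_request`, `split_reasoning`, and `ask` are hypothetical.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reasoning(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> block (if present) from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def ask(model: str, prompt: str) -> tuple[str, str]:
    """Send the prompt and return (reasoning, answer). Needs a running server."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        body = json.load(resp)
    return split_reasoning(body["response"])
```

With the server running, `ask("deepseek-r1:7b", "What is 17 * 23?")` would return the model's reasoning trace and its final answer as separate strings.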
Related Tools
AI VRAM
Calculate the GPU VRAM required to run any LLM locally or on cloud GPU. Supports all quantization levels — FP32, FP16, INT8, INT4, and GGUF variants.
GPU Cloud Cost
Compare GPU cloud rental costs across RunPod, Lambda Labs, and Vast.ai. Calculate monthly spend for LLM inference, fine-tuning, and ML training workloads.
Llama 3 70B VRAM
Calculate the exact GPU VRAM needed to run Meta Llama 3 70B. At FP16 it needs 140 GB — but INT4 quantization brings it down to ~35 GB, fitting on a single A100 40 GB.
Related Guides
How to Run LLMs on Kubernetes: GPU Setup Guide (2026)
A practical guide to deploying GPU nodes on Kubernetes, configuring the NVIDIA device plugin, sizing VRAM for LLM inference, and running vLLM or Ollama as a scalable serving stack.
GPU Cloud Providers for AI/ML in 2026: RunPod, Vast.ai, Lambda Labs, and More
A practical comparison of GPU cloud providers for AI/ML workloads in 2026 — pricing, availability, setup complexity, and when to self-host instead.
How Much VRAM Do You Need to Run LLMs? A Practical Guide
Calculate exactly how much GPU memory you need to run Llama 3, Mistral, Gemma, and other LLMs at different quantization levels. FP16, INT4, GGUF explained.