Running large language models on Kubernetes gives you autoscaling, multi-tenant scheduling, and a reproducible deployment pipeline — but GPU setup has enough sharp edges that most teams waste days on avoidable config issues. This guide is the reference I wish I'd had.
Before picking instance types, use the AI Model VRAM Calculator to determine exactly how much VRAM your model requires at your chosen quantization level. Then use the GPU Hosting Cost Calculator to compare providers.
Prerequisites
- ›Kubernetes cluster (1.29+)
- ›Nodes with NVIDIA GPUs (see sizing section below)
- ›NVIDIA drivers installed on the host OS (not in-container)
- ›
containerdas the container runtime (Docker is deprecated in K8s)
Step 1: Install the NVIDIA Container Toolkit on GPU Nodes
The container runtime needs to know how to mount GPU devices. Run this on every GPU node:
# Ubuntu 22.04 / 24.04
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Configure containerd to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
Verify it worked:
sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v2 \
docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-test nvidia-smi
Step 2: Deploy the NVIDIA Device Plugin
The device plugin is what exposes nvidia.com/gpu as a schedulable resource in Kubernetes:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
Or via Helm for production (recommended — gives you config options):
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
--set failOnInitError=false \
--set deviceListStrategy=envvar \
--set deviceIDStrategy=uuid
Verify GPUs are visible to the scheduler:
kubectl get nodes -o json | jq '.items[].status.capacity | with_entries(select(.key | startswith("nvidia")))'
You should see "nvidia.com/gpu": "1" (or however many GPUs are on the node).
Step 3: Label GPU Nodes
Label nodes by GPU type so you can target specific hardware with node selectors:
kubectl label node gpu-node-01 nvidia.com/gpu-product=A100-SXM4-80GB
kubectl label node gpu-node-02 nvidia.com/gpu-product=RTX-4090
The NVIDIA GPU Operator (optional, but worth it for managed fleets) can auto-populate these labels from the GPU hardware itself.
Step 4: Size Your VRAM for the Model
VRAM is the hard constraint. A model that doesn't fit in VRAM will either fail to load or thrash to system RAM, making inference unusably slow.
Rough VRAM requirements by model size (FP16, no quantization):
| Model Parameters | VRAM Required | Example GPU |
|---|---|---|
| 7B | ~14 GB | RTX 3090 (24 GB), A100 40 GB |
| 13B | ~26 GB | A100 40 GB, 2x RTX 3090 |
| 34B | ~68 GB | A100 80 GB, 2x A100 40 GB |
| 70B | ~140 GB | 2x A100 80 GB, 4x A100 40 GB |
| 405B | ~810 GB | 10x A100 80 GB |
With INT4 quantization (AWQ/GPTQ), divide by roughly 4. With INT8, divide by 2. Always leave 10–15% headroom for KV cache and activation memory.
Use the AI Model VRAM Calculator for precise estimates with different quantization schemes and batch sizes.
Step 5: Deploy vLLM for High-Throughput Inference
vLLM is the standard for production LLM serving. It implements PagedAttention for efficient KV cache management, giving 5–24x higher throughput than naive HuggingFace inference.
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama3-70b
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: vllm-llama3-70b
template:
metadata:
labels:
app: vllm-llama3-70b
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
nodeSelector:
nvidia.com/gpu-product: A100-SXM4-80GB
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.4
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
- --model
- meta-llama/Meta-Llama-3-70B-Instruct
- --tensor-parallel-size
- "2"
- --max-model-len
- "8192"
- --gpu-memory-utilization
- "0.90"
- --port
- "8000"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "2"
memory: "128Gi"
requests:
nvidia.com/gpu: "2"
cpu: "8"
memory: "64Gi"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
name: vllm-llama3-70b
namespace: inference
spec:
selector:
app: vllm-llama3-70b
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
Key parameters to tune:
- ›
--tensor-parallel-size: number of GPUs to shard the model across (must matchnvidia.com/gpulimit) - ›
--gpu-memory-utilization: fraction of VRAM to use for KV cache (0.90 is a good default) - ›
--max-model-len: max context window — longer contexts consume more KV cache VRAM
Step 6: Deploy Ollama for Smaller Models / Dev Use
Ollama is simpler to operate than vLLM and great for smaller models (7B–13B) or development environments where you want easy model switching:
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: inference
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
resources:
limits:
nvidia.com/gpu: "1"
memory: "32Gi"
requests:
nvidia.com/gpu: "1"
cpu: "4"
memory: "16Gi"
env:
- name: OLLAMA_MODELS
value: /models
volumeMounts:
- name: ollama-models
mountPath: /models
volumes:
- name: ollama-models
persistentVolumeClaim:
claimName: ollama-models-pvc
Pull a model after deployment:
kubectl exec -it deploy/ollama -n inference -- ollama pull llama3.2:latest
Step 7: Add a GPU Taint to Prevent Non-GPU Workloads
GPU nodes are expensive. Prevent non-GPU pods from being scheduled on them:
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule
Only pods with the matching toleration (shown in the Deployment manifests above) will be scheduled on GPU nodes. This is critical for cost control — without it, small API pods and DaemonSets will land on your $3/hr GPU nodes.
Step 8: Handle GPU Scheduling for Multi-Tenant Clusters
If multiple teams share GPU nodes, use namespace-level ResourceQuota to control allocation:
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
namespace: team-ml
spec:
hard:
requests.nvidia.com/gpu: "4"
limits.nvidia.com/gpu: "4"
This prevents any one team from monopolizing all available GPUs.
Common Pitfalls
GPU not visible after device plugin install: Check that the NVIDIA driver is loaded on the host with nvidia-smi. The device plugin reads from the driver, not from containers.
OOMKilled despite having GPU VRAM: GPU VRAM and system RAM are separate. Your pod can run out of system RAM (for tokenization, data preprocessing, Python overhead) even if GPU memory is fine. Always set memory requests/limits in addition to GPU limits.
Slow first inference: Model weights are loaded from disk on first request. Pre-warm by sending a test request after the pod starts, or use an init container that pre-fetches the model weights into the PVC.
Tensor parallel across nodes: vLLM's tensor parallelism works within a node (shared NVLink/PCIe). For cross-node parallelism, you need pipeline parallelism (--pipeline-parallel-size) and high-bandwidth networking (InfiniBand or RoCE). Most Kubernetes setups don't have this — verify before planning a multi-node inference setup.
Cost Comparison: GPU Hosting Options
| Provider | GPU | VRAM | $/hr | Best For |
|---|---|---|---|---|
| RunPod | A100 80GB | 80 GB | ~$1.89 | Burst inference |
| Lambda Labs | A100 80GB | 80 GB | ~$1.99 | Training runs |
| Hetzner GPU | RTX 4000 Ada | 20 GB | ~$0.80 | Small models |
| AWS (p4d.24xl) | 8x A100 40GB | 320 GB | ~$32.77 | Large production |
| Self-hosted RTX 4090 | RTX 4090 | 24 GB | ~$0.05 (amortized) | Dev/small prod |
See RunPod vs Lambda Labs for a detailed comparison, and the GPU Hosting Cost Calculator to model your monthly spend at different request volumes.