Running LLMs on Kubernetes with GPUs: Complete Setup Guide (2026)

Running large language models on Kubernetes gives you autoscaling, multi-tenant scheduling, and a reproducible deployment pipeline — but GPU setup has enough sharp edges that most teams waste days on avoidable config issues. This guide is the reference I wish I'd had.

Before picking instance types, use the AI Model VRAM Calculator to determine exactly how much VRAM your model requires at your chosen quantization level. Then use the GPU Hosting Cost Calculator to compare providers.

Prerequisites

›Kubernetes cluster (1.29+)
›Nodes with NVIDIA GPUs (see sizing section below)
›NVIDIA drivers installed on the host OS (not in-container)
›containerd as the container runtime (dockershim was removed in Kubernetes 1.24, so Docker Engine is no longer supported directly)

Step 1: Install the NVIDIA Container Toolkit on GPU Nodes

The container runtime needs to know how to mount GPU devices. Run this on every GPU node:

# Ubuntu 22.04 / 24.04
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

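# Configure containerd to use the NVIDIA runtime.
# --set-as-default makes it the default runtime, so GPU pods need no runtimeClassName;
# omit it if you would rather create a RuntimeClass named "nvidia" and reference it per pod.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd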

Verify it worked:

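# ctr does not pull images automatically, so fetch the image first
sudo ctr image pull docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04
sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v2 \
  docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 gpu-test nvidia-smi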

Step 2: Deploy the NVIDIA Device Plugin

The device plugin is what exposes nvidia.com/gpu as a schedulable resource in Kubernetes:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

Or via Helm for production (recommended — gives you config options):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set failOnInitError=false \
  --set deviceListStrategy=envvar \
  --set deviceIDStrategy=uuid

Verify GPUs are visible to the scheduler:

kubectl get nodes -o json | jq '.items[].status.capacity | with_entries(select(.key | startswith("nvidia")))'

You should see "nvidia.com/gpu": "1" (or however many GPUs are on the node).
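
To total the schedulable GPUs across the whole cluster (handy when you size quotas in Step 8), a small helper along these lines may be useful. This is a sketch: total_gpus is a hypothetical name of my own, and it assumes kubectl is already pointed at the right cluster.

```shell
# total_gpus: sum allocatable nvidia.com/gpu across all nodes.
# Nodes without GPUs contribute an empty line, which awk treats as 0.
total_gpus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Usage: total_gpus
```

If the total is lower than you expect, check the device plugin pod logs on the node that is missing from the count.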

Step 3: Label GPU Nodes

Label nodes by GPU type so you can target specific hardware with node selectors:

kubectl label node gpu-node-01 nvidia.com/gpu-product=A100-SXM4-80GB
kubectl label node gpu-node-02 nvidia.com/gpu-product=RTX-4090

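The NVIDIA GPU Operator (optional, but worth it for managed fleets) labels nodes automatically through GPU Feature Discovery. Note that its label key is nvidia.com/gpu.product (with a dot, not a hyphen) and its values may differ from the manual ones above (for example NVIDIA-A100-SXM4-80GB), so adjust the nodeSelector in your manifests if you rely on the Operator's labels.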

Step 4: Size Your VRAM for the Model

VRAM is the hard constraint. A model that doesn't fit in VRAM will either fail to load or thrash to system RAM, making inference unusably slow.

Rough VRAM requirements by model size (FP16, no quantization):

Model Parameters	VRAM Required	Example GPU
7B	~14 GB	RTX 3090 (24 GB), A100 40 GB
13B	~26 GB	A100 40 GB, 2x RTX 3090
34B	~68 GB	A100 80 GB, 2x A100 40 GB
70B	~140 GB	2x A100 80 GB, 4x A100 40 GB
405B	~810 GB	16x A100 80 GB (multi-node; 10x is only 800 GB and does not fit)

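With INT4 quantization (AWQ/GPTQ), divide by roughly 4. With INT8, divide by 2. These figures cover the weights only, so always leave at least 10–15% headroom for KV cache and activation memory, and more for long contexts or many concurrent requests.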
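
The rule of thumb above is just parameters times bytes per weight, plus headroom. A minimal sketch of that arithmetic follows. The function name estimate_vram_gb and the 15% default headroom are my own choices for illustration, and the result is a rough estimate rather than a substitute for measuring.

```shell
# estimate_vram_gb PARAMS_BILLIONS BITS_PER_WEIGHT [HEADROOM_FRACTION]
# Weights-only estimate in GB: params (billions) * bits / 8, scaled by (1 + headroom).
estimate_vram_gb() {
  awk -v p="$1" -v bits="$2" -v h="${3:-0.15}" \
    'BEGIN { printf "%.2f\n", p * bits / 8 * (1 + h) }'
}

estimate_vram_gb 70 16 0   # FP16 weights only      -> 140.00
estimate_vram_gb 70 4      # INT4 with 15% headroom -> 40.25
estimate_vram_gb 7 16      # FP16 with 15% headroom -> 16.10
```

By this estimate a 70B model at INT4 fits on a single A100 80 GB with room to spare, while the FP16 version needs at least two.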

Use the AI Model VRAM Calculator for precise estimates with different quantization schemes and batch sizes.

Step 5: Deploy vLLM for High-Throughput Inference

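vLLM is a widely used engine for production LLM serving. It implements PagedAttention for efficient KV cache management; the vLLM project's own benchmarks report up to 24x higher throughput than naive Hugging Face Transformers inference, though the gain depends heavily on the model and workload.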

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu-product: A100-SXM4-80GB
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - meta-llama/Meta-Llama-3-70B-Instruct
            - --tensor-parallel-size
            - "2"
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.90"
            - --port
            - "8000"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "2"
              memory: "128Gi"
            requests:
              nvidia.com/gpu: "2"
              cpu: "8"
              memory: "64Gi"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
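          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            # Tensor-parallel workers communicate through shared memory; the default /dev/shm is small
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory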
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

Key parameters to tune:

›--tensor-parallel-size: number of GPUs to shard the model across (must match nvidia.com/gpu limit)
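›--gpu-memory-utilization: fraction of each GPU's VRAM that vLLM may use in total; weights and activations come out of this budget and whatever remains becomes KV cache (0.90 is a good default)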
›--max-model-len: max context window — longer contexts consume more KV cache VRAM
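
To get a feel for how --max-model-len translates into VRAM, you can estimate the KV cache per sequence as 2 (keys and values) x layers x KV heads x head dimension x bytes per element x tokens. The sketch below does that arithmetic. The helper name kv_cache_gib is hypothetical, and the model-shape numbers (80 layers, 8 KV heads, head dimension 128) are what I understand the published Llama 3 70B configuration to be, so check your own model's config.json before relying on them.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEM TOKENS
# KV cache size in GiB for the given number of cached tokens.
kv_cache_gib() {
  awk -v l="$1" -v kv="$2" -v d="$3" -v b="$4" -v t="$5" \
    'BEGIN { printf "%.2f\n", 2 * l * kv * d * b * t / 1073741824 }'
}

kv_cache_gib 80 8 128 2 8192              # one 8192-token sequence, FP16 -> 2.50
kv_cache_gib 80 8 128 2 $((8192 * 16))    # 16 such sequences at once     -> 40.00
```

The total is spread across the tensor-parallel GPUs, and it is why a deployment that loads fine can still run short of memory under concurrent long-context traffic.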

Step 6: Deploy Ollama for Smaller Models / Dev Use

Ollama is simpler to operate than vLLM and great for smaller models (7B–13B) or development environments where you want easy model switching:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "16Gi"
          env:
            - name: OLLAMA_MODELS
              value: /models
          volumeMounts:
            - name: ollama-models
              mountPath: /models
      volumes:
        - name: ollama-models
          persistentVolumeClaim:
            claimName: ollama-models-pvc

Pull a model after deployment:

kubectl exec -it deploy/ollama -n inference -- ollama pull llama3.2:latest
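
Once the model is pulled, a quick smoke test against Ollama's HTTP API confirms the GPU path works end to end. The helper below is a sketch, not part of Ollama: ollama_generate is a name I made up, and it assumes you have forwarded the port locally first (for example with kubectl port-forward -n inference deploy/ollama 11434:11434). It does no JSON escaping, so keep the prompt free of double quotes.

```shell
# ollama_generate MODEL PROMPT [BASE_URL]
# Sends one non-streaming request to Ollama's /api/generate endpoint.
ollama_generate() {
  model="$1"
  prompt="$2"
  base="${3:-http://localhost:11434}"
  curl -fsS "$base/api/generate" \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false}"
}

# Usage: ollama_generate llama3.2 "Say hello in five words"
```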

Step 7: Add a GPU Taint to Prevent Non-GPU Workloads

GPU nodes are expensive. Prevent non-GPU pods from being scheduled on them:

kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule

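Only pods with the matching toleration (shown in the Deployment manifests above) will be scheduled on GPU nodes. This is critical for cost control: without the taint, small API pods and any other workload that lacks a toleration can land on your $3/hr GPU nodes. The NVIDIA device plugin DaemonSet already tolerates nvidia.com/gpu, so it keeps running after you taint the node.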
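
With more than a couple of GPU nodes, tainting by label saves repeating the command per node. The sketch below assumes the nodes carry the nvidia.com/gpu-product label from Step 3; taint_gpu_nodes is a hypothetical helper name. It applies the taint to every labelled node, then prints each node's taint keys so you can confirm it took effect.

```shell
# taint_gpu_nodes [LABEL_KEY]
# Taint all nodes that carry LABEL_KEY, then list their taint keys.
taint_gpu_nodes() {
  label="${1:-nvidia.com/gpu-product}"
  kubectl taint nodes -l "$label" nvidia.com/gpu=present:NoSchedule --overwrite &&
    kubectl get nodes -l "$label" \
      -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Usage: taint_gpu_nodes
```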

Step 8: Handle GPU Scheduling for Multi-Tenant Clusters

If multiple teams share GPU nodes, use namespace-level ResourceQuota to control allocation:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    # Extended resources such as GPUs only accept the requests. prefix in a quota
    requests.nvidia.com/gpu: "4"

This prevents any one team from monopolizing all available GPUs.
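
To see how much of a team's allocation is left before a deploy, you can read the quota's status. The helper below is a sketch under a few assumptions: gpu_quota_remaining is a hypothetical name, the quota is called gpu-quota as in the manifest above, and it tracks requests.nvidia.com/gpu.

```shell
# gpu_quota_remaining NAMESPACE [QUOTA_NAME]
# Prints hard minus used for requests.nvidia.com/gpu in the given quota.
gpu_quota_remaining() {
  ns="$1"
  name="${2:-gpu-quota}"
  hard=$(kubectl get resourcequota "$name" -n "$ns" \
    -o jsonpath='{.status.hard.requests\.nvidia\.com/gpu}')
  used=$(kubectl get resourcequota "$name" -n "$ns" \
    -o jsonpath='{.status.used.requests\.nvidia\.com/gpu}')
  echo $(( ${hard:-0} - ${used:-0} ))
}

# Usage: gpu_quota_remaining team-ml
```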

Common Pitfalls

GPU not visible after device plugin install: Check that the NVIDIA driver is loaded on the host with nvidia-smi. The device plugin reads from the driver, not from containers.

OOMKilled despite having GPU VRAM: GPU VRAM and system RAM are separate. Your pod can run out of system RAM (for tokenization, data preprocessing, Python overhead) even if GPU memory is fine. Always set memory requests/limits in addition to GPU limits.

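Slow first inference: vLLM loads model weights at startup, so a new pod can spend minutes downloading and loading before it serves anything. Add a readiness probe on /health so the Service only routes traffic once the server is up, or use an init container that pre-fetches the model weights into the PVC. Ollama loads a model into VRAM on the first request that uses it, so pre-warm by sending a test request after the pod starts.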
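
A pre-warm step can be scripted. The sketch below assumes the vLLM OpenAI-compatible server from Step 5, which exposes /health and /v1/completions, reachable at a base URL you supply (for example through kubectl port-forward). warm_up and WARMUP_INTERVAL are names I made up for illustration.

```shell
# warm_up BASE_URL MODEL [MAX_TRIES]
# Polls /health until it answers, then sends a one-token completion.
# Returns 1 if the server never becomes healthy within MAX_TRIES polls.
warm_up() {
  base="$1"
  model="$2"
  tries="${3:-60}"
  i=0
  until curl -fsS "$base/health" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    sleep "${WARMUP_INTERVAL:-5}"
  done
  curl -fsS "$base/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$model\", \"prompt\": \"ping\", \"max_tokens\": 1}" >/dev/null
}

# Usage: warm_up http://localhost:8000 meta-llama/Meta-Llama-3-70B-Instruct
```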

Tensor parallel across nodes: tensor parallelism exchanges activations at every layer, so it works best within a node (shared NVLink/PCIe). To span nodes, the usual layout is tensor parallelism within each node plus pipeline parallelism across nodes (--pipeline-parallel-size), and that still needs high-bandwidth networking (InfiniBand or RoCE). Most Kubernetes setups don't have this, so verify before planning a multi-node inference setup.

Cost Comparison: GPU Hosting Options

Provider	GPU	VRAM	$/hr	Best For
RunPod	A100 80GB	80 GB	~$1.89	Burst inference
Lambda Labs	A100 80GB	80 GB	~$1.99	Training runs
Hetzner GPU	RTX 4000 Ada	20 GB	~$0.80	Small models
AWS (p4d.24xl)	8x A100 40GB	320 GB	~$32.77	Large production
Self-hosted RTX 4090	RTX 4090	24 GB	~$0.05 (amortized)	Dev/small prod

See RunPod vs Lambda Labs for a detailed comparison, and the GPU Hosting Cost Calculator to model your monthly spend at different request volumes.
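
For a quick back-of-the-envelope figure from the table above, monthly spend is just hourly rate x replicas x hours. The sketch below uses 730 hours per month (8760 / 12) as the always-on default; monthly_cost is a hypothetical helper, and the rates are the approximate ones from the table, not live prices.

```shell
# monthly_cost HOURLY_USD [REPLICAS] [HOURS_PER_MONTH]
monthly_cost() {
  awk -v rate="$1" -v n="${2:-1}" -v hrs="${3:-730}" \
    'BEGIN { printf "%.2f\n", rate * n * hrs }'
}

monthly_cost 1.89          # one A100 80GB at ~$1.89/hr, always on -> 1379.70
monthly_cost 1.89 2        # two replicas                          -> 2759.40
monthly_cost 32.77 1 200   # p4d.24xlarge for 200 hours            -> 6554.00
```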
