etcd Kubernetes Backup Guide: Snapshots, CronJobs, and Restore

etcd is the most critical component in a Kubernetes cluster — and the one most teams only think about after a disaster. Every object in your cluster (Deployments, Secrets, ConfigMaps, RBAC bindings, CRDs, custom resources) exists as a key-value entry in etcd. If etcd is lost and you have no backup, your cluster configuration is gone. The workloads might still be running, but you can't manage them, and you can't reconstruct the state without manual work.

This guide covers what etcd actually does, what failure looks like, how to back it up, and how to restore from a snapshot.

What etcd Does in Kubernetes

etcd is a distributed key-value store that implements the Raft consensus algorithm. Kubernetes uses it as the backing store for the API server — every kubectl apply, kubectl create, and state change writes to etcd. Every kubectl get reads from etcd (or the API server's watch cache, which is backed by etcd).

The data stored in etcd includes:

Data Type	Example
Workload state	Deployment specs, ReplicaSet desired replicas, Pod specs
Configuration	ConfigMaps, Secrets
Service discovery	Services, Endpoints, EndpointSlices
Access control	RBAC Roles, ClusterRoles, Bindings, ServiceAccounts
Networking	NetworkPolicies, Ingresses
Storage	PersistentVolumes, PersistentVolumeClaims, StorageClasses
Custom resources	Any CRD instances (ArgoCD Applications, Cert-manager Certificates, etc.)
Cluster metadata	Namespaces, ResourceQuotas, LimitRanges, Nodes

What etcd does not store: the actual container images, application data, persistent volume contents, or in-memory state of your running pods. Restoring etcd restores cluster configuration and desired state, not application data.
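To see how the objects in the table above look inside etcd, you can list keys under the /registry prefix, which is the API server's default etcd prefix. The sketch below is illustrative only: registry_key is a hypothetical helper, and the exact key layout differs for some resource types, so treat its output as a starting point rather than a guaranteed path.

```shell
#!/bin/sh
# Hypothetical helper: build the etcd key for a namespaced object.
# Assumes the API server's default --etcd-prefix of /registry.
registry_key() {
  # $1 = resource (plural, lowercase), $2 = namespace, $3 = object name
  printf '/registry/%s/%s/%s\n' "$1" "$2" "$3"
}

# Example (run on a control plane node, with the TLS flags used in this guide):
#   etcdctl get /registry --prefix --keys-only | head
#   etcdctl get "$(registry_key configmaps default app-config)" --keys-only
registry_key configmaps default app-config
```

The values behind these keys are stored in a binary encoding, so listing keys with --keys-only is usually more useful than reading values directly.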

What Happens When etcd Fails

A healthy Kubernetes cluster requires a quorum of etcd members. In a 3-member cluster, 2 must be healthy. In a 5-member cluster, 3 must be healthy.
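The quorum numbers above follow from the majority rule: a cluster of n members needs floor(n/2) + 1 of them healthy. A small sketch of that arithmetic (the function names are illustrative, not part of any etcd tooling):

```shell
#!/bin/sh
# Quorum for an n-member etcd cluster is floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster keeps quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-member cluster needs 3 healthy members and so tolerates only one failure, the same as a 3-member cluster, which is why odd member counts are the usual choice.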

When quorum is lost:

›The API server cannot persist writes, and reads that must reach etcd time out; only requests the watch cache can answer may still succeed
›No new pods can be scheduled
›Deployments cannot be updated
›HPA cannot scale workloads
›Service account tokens cannot be issued
›Ingress and certificate renewals fail

Running pods continue running because the kubelet caches their spec locally — but you cannot change anything until etcd recovers. This is why etcd failure is a Severity 1 incident even if all your pods are technically still up.

etcd Health Checks

# Check etcd cluster health (run on a control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check cluster member list
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list

# Check endpoint status (shows leader, raft index, db size)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table
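The three commands above repeat the same connection flags. etcdctl also reads those settings from ETCDCTL_-prefixed environment variables, so one way to shorten interactive sessions is to export them once. This is a convenience sketch using the kubeadm default paths from this guide; avoid passing the same setting as both a flag and a variable, because etcdctl may reject the conflict.

```shell
#!/bin/sh
# Export once per shell session; etcdctl uses ETCDCTL_* variables
# as defaults for the matching command-line flags.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# The health checks then shorten to:
#   etcdctl endpoint health
#   etcdctl member list
#   etcdctl endpoint status --write-out=table
```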

Backup Strategies

Manual Snapshot

The most direct approach — run etcdctl snapshot save and store the result somewhere durable.

SNAPSHOT=/tmp/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot (pass the exact file; a glob could match several snapshots)
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT}" --write-out=table

Take a manual snapshot before any cluster upgrade or major configuration change.
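A snapshot left in /tmp still has to be copied somewhere durable. One simple way to detect corruption along the way is to store a checksum next to the file and re-check it after the copy. The helpers below are a hypothetical sketch that assumes sha256sum is installed on the node; they complement etcdctl snapshot status rather than replace it.

```shell
#!/bin/sh
# Hypothetical helpers: record and verify a SHA-256 checksum beside a snapshot.

write_checksum() {
  # $1 = snapshot path; writes <snapshot>.sha256 in the same directory
  ( cd "$(dirname "$1")" && \
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

verify_checksum() {
  # Returns non-zero if the snapshot is missing or its contents changed
  ( cd "$(dirname "$1")" && \
    sha256sum -c "$(basename "$1").sha256" > /dev/null )
}

# Usage:
#   write_checksum /tmp/etcd-snapshot-20260529-060000.db
#   (copy both files to durable storage, then on the destination)
#   verify_checksum /backups/etcd-snapshot-20260529-060000.db
```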

Automated CronJob

Manual snapshots are not a backup strategy — they're a one-off. For production clusters, you need an automated CronJob that:

›Takes a snapshot from a local etcd endpoint
›Compresses it
›Ships it to object storage (S3, Hetzner Object Storage, GCS)
›Retains snapshots for a configurable period and prunes old ones

Use the etcd Backup CronJob Generator to generate a complete manifest. The generated CronJob runs as a pod with access to the etcd PKI certificates (via hostPath or a Secret), takes a snapshot, and uploads to your configured object storage bucket.

A minimal example of the core backup logic in the CronJob container:

#!/bin/bash
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="/tmp/etcd-snapshot-${TIMESTAMP}.db"
COMPRESSED_FILE="${SNAPSHOT_FILE}.gz"

# Take snapshot
etcdctl snapshot save "${SNAPSHOT_FILE}" \
  --endpoints="${ETCD_ENDPOINT}" \
  --cacert="${ETCD_CACERT}" \
  --cert="${ETCD_CERT}" \
  --key="${ETCD_KEY}"

# Verify before uploading
etcdctl snapshot status "${SNAPSHOT_FILE}"

# Compress
gzip "${SNAPSHOT_FILE}"

# Upload to S3-compatible storage
aws s3 cp "${COMPRESSED_FILE}" \
  "s3://${BACKUP_BUCKET}/etcd/${TIMESTAMP}.db.gz" \
  --storage-class STANDARD_IA

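# Prune snapshots older than 30 days (keys are named <YYYYMMDD-HHMMSS>.db.gz)
CUTOFF=$(date -d '30 days ago' +%Y%m%d)   # GNU date syntax
aws s3 ls "s3://${BACKUP_BUCKET}/etcd/" | \
  awk '{print $4}' | \
  while read -r key; do
    key_date="${key%%-*}"                 # leading YYYYMMDD part of the key
    # Skip anything that does not start with an 8-digit date
    if [[ "${key_date}" =~ ^[0-9]{8}$ ]] && [ "${key_date}" -lt "${CUTOFF}" ]; then
      aws s3 rm "s3://${BACKUP_BUCKET}/etcd/${key}"
    fi
  done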

echo "Backup complete: ${TIMESTAMP}.db.gz"

Schedule this CronJob to run every 6 hours. Daily backups leave you with a recovery point objective (RPO) of up to 24 hours; for most clusters, a 6-hour interval is a reasonable balance between storage cost and RPO.
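For orientation, here is a hand-written sketch of what such a CronJob could look like. It is not the generator's output. The image name, the /scripts/backup.sh path, and the kube-system namespace are placeholders, and the bucket name and object storage credentials are left out entirely. The function only prints the manifest, so you can review it before applying.

```shell
#!/bin/sh
# Print a CronJob manifest sketch to stdout. Image, script path and
# namespace are placeholders, not values taken from this guide.
render_cronjob() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"        # every 6 hours
  concurrencyPolicy: Forbid      # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true      # reach etcd on 127.0.0.1:2379
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: backup
              image: registry.example.com/etcd-backup:latest  # placeholder
              command: ["/bin/bash", "/scripts/backup.sh"]    # placeholder
              env:
                - name: ETCD_ENDPOINT
                  value: https://127.0.0.1:2379
                - name: ETCD_CACERT
                  value: /etc/kubernetes/pki/etcd/ca.crt
                - name: ETCD_CERT
                  value: /etc/kubernetes/pki/etcd/server.crt
                - name: ETCD_KEY
                  value: /etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: etcd-pki
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: etcd-pki
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: Directory
EOF
}

# Usage:
#   render_cronjob | kubectl apply --dry-run=client -f -
#   render_cronjob | kubectl apply -f -
```

Running the pod on the host network of a control plane node is one way to reach the local etcd endpoint; mounting the PKI directory read-only limits what the backup container can change on the host.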

Restore Procedure

Restoring from an etcd snapshot is a disruptive operation. All API server traffic stops during the restore. Plan for 5–15 minutes of downtime.

Step 1: Stop the API Server and etcd

On all control plane nodes, move the static pod manifests out of /etc/kubernetes/manifests/:

mkdir -p /tmp/k8s-manifests-backup
mv /etc/kubernetes/manifests/etcd.yaml /tmp/k8s-manifests-backup/
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/k8s-manifests-backup/

Wait for the pods to stop:

crictl ps | grep -E 'etcd|kube-apiserver'
# Should return nothing once both containers have stopped (typically 10-15 seconds)
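Instead of re-running the check by hand, a small polling loop can wait until both containers are gone. The helper names below are hypothetical, and the sketch assumes crictl is configured for the node's container runtime.

```shell
#!/bin/sh
# Hypothetical helper: re-run a command until it prints nothing,
# or give up after a timeout.
wait_for_no_output() {
  # $1 = timeout in seconds; remaining arguments = command to poll
  wfno_deadline=$(( $(date +%s) + $1 ))
  shift
  while [ -n "$("$@" 2>/dev/null)" ]; do
    if [ "$(date +%s)" -ge "$wfno_deadline" ]; then
      return 1          # command still produced output at the deadline
    fi
    sleep 1
  done
  return 0
}

# Lists running etcd / kube-apiserver containers; prints nothing once both stop.
list_control_plane_containers() {
  crictl ps 2>/dev/null | grep -E 'etcd|kube-apiserver' || true
}

# Usage on a control plane node:
#   wait_for_no_output 60 list_control_plane_containers \
#     || echo "etcd or kube-apiserver is still running"
```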

Step 2: Restore the Snapshot

# Download the snapshot from object storage
aws s3 cp s3://my-etcd-backups/etcd/20260529-060000.db.gz /tmp/
gunzip /tmp/20260529-060000.db.gz

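# Restore to a new data directory
# (etcd 3.5 deprecates etcdctl snapshot restore in favor of etcdutl snapshot restore, which takes the same flags)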
ETCDCTL_API=3 etcdctl snapshot restore /tmp/20260529-060000.db \
  --name etcd-cp1 \
  --initial-cluster "etcd-cp1=https://10.0.1.10:2380" \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://10.0.1.10:2380 \
  --data-dir /var/lib/etcd-restored

# Keep the old data directory as a fallback, then move the restored one into place
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd

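For a 3-node HA cluster, restore the same snapshot file on every control plane node before restarting any of them. Each node uses its own --name and --initial-advertise-peer-urls, while --initial-cluster must list all three members and, like --initial-cluster-token, be identical on every node.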
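To make the per-node differences concrete, the sketch below prints the restore command for a given member of a 3-node cluster. The names etcd-cp2 and etcd-cp3 and the addresses 10.0.1.11 and 10.0.1.12 are assumptions that extend the single-node example; substitute your own member names and peer URLs.

```shell
#!/bin/sh
# Hypothetical member names and peer URLs; replace with your control plane nodes.
MEMBERS="etcd-cp1=https://10.0.1.10:2380 etcd-cp2=https://10.0.1.11:2380 etcd-cp3=https://10.0.1.12:2380"

# Comma-joined member list: the value every node passes to --initial-cluster.
initial_cluster() {
  echo "$MEMBERS" | tr ' ' ','
}

# Print the restore command for one member ($1 = member name).
restore_command() {
  for rc_member in $MEMBERS; do
    if [ "${rc_member%%=*}" = "$1" ]; then
      printf 'etcdctl snapshot restore /tmp/20260529-060000.db --name %s --initial-cluster "%s" --initial-cluster-token etcd-cluster-restored --initial-advertise-peer-urls %s --data-dir /var/lib/etcd-restored\n' \
        "$1" "$(initial_cluster)" "${rc_member#*=}"
      return 0
    fi
  done
  echo "unknown member: $1" >&2
  return 1
}

# Show the command to run on the second node
restore_command etcd-cp2
```

Only --name and --initial-advertise-peer-urls change between nodes; printing the commands first makes it easy to compare them before running anything.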

Step 3: Restart Components

mv /tmp/k8s-manifests-backup/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/k8s-manifests-backup/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify etcd starts
crictl pods | grep etcd
kubectl get nodes  # Should return nodes after 30-60 seconds

Sizing etcd Correctly

etcd performance degrades as the database grows. The default backend quota is 2 GB; when the database reaches it, etcd raises a NOSPACE alarm and accepts only reads and deletes until you compact the history, defragment, and clear the alarm with etcdctl alarm disarm.
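To see how close a cluster is to that limit, compare the database size reported by etcdctl endpoint status with the quota. The helper below is a hypothetical sketch of the arithmetic only; it assumes you pass the size in bytes and treats 2 GB as 2 GiB (2147483648 bytes).

```shell
#!/bin/sh
# Percentage of the backend quota in use (integer, rounded down).
# $1 = current DB size in bytes, $2 = quota in bytes (defaults to 2 GiB).
quota_used_pct() {
  echo $(( $1 * 100 / ${2:-2147483648} ))
}

# Example: a 1.5 GiB database against the default quota
quota_used_pct 1610612736
```

Alerting at a threshold well below 100 leaves time to compact and defragment before writes are rejected.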

Estimate your etcd database size with the etcd Backup Size Estimator. As a rough guide:

Cluster Size	Typical etcd DB Size
50 nodes, 500 pods	50–200 MB
200 nodes, 2000 pods	200 MB – 1 GB
500+ nodes, 10,000+ pods	1–4 GB
Heavy CRD usage (many custom resources)	Add 500 MB – 2 GB

For large clusters, increase the quota and enable automatic compaction:

# In the etcd static pod manifest or kubeadm config
- --quota-backend-bytes=8589934592    # 8 GB
- --auto-compaction-retention=1       # Keep 1 hour of history
- --auto-compaction-mode=periodic
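The quota flag takes a plain byte count, which is easy to mistype. A small sketch of the conversion behind the 8 GB value above (gib_to_bytes is an illustrative name):

```shell
#!/bin/sh
# Convert GiB to bytes for --quota-backend-bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the flag for an 8 GiB quota
printf -- '--quota-backend-bytes=%s\n' "$(gib_to_bytes 8)"
```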

Also run periodic defragmentation to reclaim disk space after compaction. Defragmentation blocks the member while it runs, so defragment one member at a time:

ETCDCTL_API=3 etcdctl defrag \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
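In a multi-member cluster, a defragmenting member cannot serve requests, so it is safer to walk through the members in sequence than to defragment them all at once. A minimal sketch, assuming the same certificate paths are valid for every endpoint; the ETCDCTL and DEFRAG_PAUSE variables are conveniences introduced here, not etcdctl settings.

```shell
#!/bin/sh
# Defragment the given endpoints one at a time.
# ETCDCTL can be overridden (for example ETCDCTL=echo) to preview the commands.
defrag_each() {
  for de_endpoint in "$@"; do
    echo "defragmenting ${de_endpoint}" >&2
    ${ETCDCTL:-etcdctl} defrag \
      --endpoints="${de_endpoint}" \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key || return 1
    sleep "${DEFRAG_PAUSE:-5}"   # short pause before the next member
  done
}

# Usage (hypothetical endpoints):
#   defrag_each https://10.0.1.10:2379 https://10.0.1.11:2379 https://10.0.1.12:2379
```

Stopping at the first failure keeps a problem on one member from being repeated across the rest of the cluster.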

etcd Hardware Requirements

etcd is sensitive to disk latency because every write is synced to the write-ahead log before it is acknowledged. The etcd documentation recommends SSD-backed storage with a 99th-percentile WAL fsync latency under 10 ms. Spinning disks are not suitable for production etcd.

Resource	Minimum	Recommended
CPU	2 cores	4 cores
RAM	4 GB	8 GB
Disk	SSD, 20 GB	NVMe SSD, 50 GB
Disk latency	< 10 ms write	< 1 ms write
Network	1 Gbps	1 Gbps (low latency between CP nodes)

For self-managed clusters, co-locate etcd on the control plane nodes — but use a dedicated NVMe volume for the etcd data directory rather than the root disk. This isolates etcd I/O from OS and log writes.
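Once you have a measured write latency for the etcd volume, for example the 99th percentile of etcd's etcd_disk_wal_fsync_duration_seconds metric converted to milliseconds, it can be checked against the thresholds in the table above. The helper is a hypothetical sketch that only classifies a number you supply.

```shell
#!/bin/sh
# Classify a p99 write latency (milliseconds) against the table's thresholds:
# under 1 ms is the recommended tier, under 10 ms the minimum.
classify_latency_ms() {
  awk -v ms="$1" 'BEGIN {
    if (ms < 1)       print "recommended"
    else if (ms < 10) print "minimum"
    else              print "unsuitable"
  }'
}

# Example: 0.4 ms, 5 ms and 12 ms
for ms in 0.4 5 12; do
  echo "${ms} ms: $(classify_latency_ms "$ms")"
done
```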

Treat etcd backups as non-negotiable. The etcd Backup CronJob Generator removes the operational burden of setting this up correctly.
