Vladimir Chavkov

Velero: Complete Kubernetes Backup and Disaster Recovery Guide


Velero (formerly Heptio Ark) is an open-source tool for backing up, restoring, and migrating Kubernetes cluster resources and persistent volumes. Maintained by VMware and the CNCF community, Velero is the de facto standard for Kubernetes backup and disaster recovery. This comprehensive guide covers Velero architecture, deployment, and production best practices.

What is Velero?

Velero provides disaster recovery, data migration, and data protection for Kubernetes clusters.

Key Features

  1. Cluster Backup: Back up all Kubernetes resources (namespaces, deployments, services, etc.)
  2. Volume Snapshots: Native CSI snapshot integration and Restic/Kopia file-level backup
  3. Scheduled Backups: Cron-based backup schedules with retention policies
  4. Selective Backup: Filter by namespace, label, resource type, or include/exclude patterns
  5. Cross-Cluster Migration: Migrate workloads between clusters
  6. Disaster Recovery: Restore entire clusters or individual namespaces
  7. Plugin Architecture: Extensible with storage and snapshot provider plugins

Velero vs. Other Kubernetes Backup Solutions

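| Feature | Velero | Kasten K10 | Stash | Longhorn | Trilio |
|---|---|---|---|---|---|
| Cost | Free/OSS | Freemium | Free/OSS | Free/OSS | Commercial |
| K8s Resources | Yes | Yes | Partial | No | Yes |
| Volume Backup | CSI + Restic/Kopia | CSI + Kanister | Restic | Built-in | CSI |
| Scheduled Backups | Yes | Yes | Yes | Yes | Yes |
| Cross-Cluster | Yes | Yes | Yes | No | Yes |
| UI | CLI only | Web UI | CLI | Longhorn UI | Web UI |
| Multi-Cloud | Yes | Yes | Yes | No | Yes |
| App Consistency | Hooks | Blueprint | Hooks | Snapshot | Hooks |
| CNCF Project | Sandbox | No | No | Incubating | No |
| License | Apache 2.0 | Proprietary | Apache 2.0 | Apache 2.0 | Proprietary |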

Architecture

┌──────────────────────────────────────────────────────────────┐
│                      Kubernetes Cluster                      │
│                                                              │
│  ┌──────────────┐   ┌──────────────────────────────────┐     │
│  │ Velero CLI   │──>│    Velero Server (Deployment)    │     │
│  │ (kubectl     │   │                                  │     │
│  │  plugin)     │   │ ┌────────────┐  ┌─────────────┐  │     │
│  └──────────────┘   │ │ Backup     │  │ Restore     │  │     │
│                     │ │ Controller │  │ Controller  │  │     │
│  ┌──────────────┐   │ └────────────┘  └─────────────┘  │     │
│  │ CRDs         │   │ ┌────────────┐  ┌─────────────┐  │     │
│  │ - Backup     │   │ │ Schedule   │  │ Restic/     │  │     │
│  │ - Restore    │<──│ │ Controller │  │ Kopia Node  │  │     │
│  │ - Schedule   │   │ └────────────┘  │ Agent (DS)  │  │     │
│  │ - BSL        │   │                 └─────────────┘  │     │
│  │ - VSL        │   └──────────────────────────────────┘     │
│  └──────────────┘                                            │
└──────────────────────────────────────────────────────────────┘
         │                                  │
         ▼                                  ▼
┌──────────────────┐            ┌────────────────────────┐
│ Backup Storage   │            │ Volume Snapshot        │
│ Location (BSL)   │            │ Location (VSL)         │
│                  │            │                        │
│ - AWS S3         │            │ - AWS EBS Snapshots    │
│ - GCS            │            │ - GCE PD Snapshots     │
│ - Azure Blob     │            │ - Azure Disk Snapshots │
│ - MinIO          │            │ - CSI Snapshots        │
│ - S3-compatible  │            │                        │
└──────────────────┘            └────────────────────────┘

Core Concepts

| Concept | Description |
|---|---|
| Backup | Point-in-time snapshot of cluster resources and volumes |
| Restore | Recreate resources and volumes from a backup |
| Schedule | Cron-based recurring backup configuration |
| BSL | Backup Storage Location: where backup data is stored |
| VSL | Volume Snapshot Location: where volume snapshots are stored |
| Restic/Kopia | File-level volume backup for non-snapshot-capable storage |
| Hooks | Pre/post backup/restore commands for application consistency |

Installation

Velero CLI

Terminal window
# macOS
brew install velero
# Linux
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar -xzf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/
# Verify installation
velero version --client-only
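
The Linux download above hard-codes one platform. As a minimal sketch, the helper below builds the release URL for other platforms. It assumes the other release assets follow the same `velero-<version>-<os>-<arch>.tar.gz` naming as the Linux amd64 file shown above; `velero_release_url` and `velero_arch` are hypothetical helper names, not part of Velero.

```shell
# Build the download URL for a Velero CLI release tarball.
# Usage: velero_release_url VERSION OS ARCH
# Assumption: every platform uses the same file naming as the linux-amd64 asset.
velero_release_url() {
  printf 'https://github.com/vmware-tanzu/velero/releases/download/%s/velero-%s-%s-%s.tar.gz\n' \
    "$1" "$1" "$2" "$3"
}

# Map `uname -m` output to the architecture names used in release file names.
velero_arch() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo "$1" ;;
  esac
}
```

For example, `velero_release_url v1.13.0 linux "$(velero_arch "$(uname -m)")"` prints the URL to pass to `wget` on the current machine.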

AWS S3 Backend

Terminal window
# Create S3 bucket
aws s3api create-bucket \
  --bucket velero-backups-production \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1
# Create IAM policy
cat > velero-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::velero-backups-production/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::velero-backups-production"
    }
  ]
}
EOF
aws iam create-policy \
  --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json
# Create credentials file
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<YOUR_ACCESS_KEY>
aws_secret_access_key=<YOUR_SECRET_KEY>
EOF
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups-production \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --default-volumes-to-fs-backup
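
It is easy to run the install with the placeholder credentials still in the file. The sketch below is a pre-flight check you could run before `velero install`. `check_velero_credentials` is a hypothetical helper, and it only checks the file's shape (a `[default]` profile and two non-placeholder keys), not whether AWS accepts the keys.

```shell
# Return 0 if FILE looks like a filled-in AWS credentials file, 1 otherwise.
# Usage: check_velero_credentials FILE
check_velero_credentials() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
  grep -q '^\[default\]' "$1" || { echo "missing [default] profile" >&2; return 1; }
  for key in aws_access_key_id aws_secret_access_key; do
    # The key must be present with a non-empty value.
    grep -Eq "^${key}=.+" "$1" || { echo "missing $key" >&2; return 1; }
    # The value must not still be a <PLACEHOLDER> from the template.
    if grep -Eq "^${key}=<.*>" "$1"; then
      echo "$key still contains a placeholder" >&2
      return 1
    fi
  done
  return 0
}
```

A typical use would be `check_velero_credentials ./credentials-velero && velero install ...`.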

Helm Installation

values.yaml
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero-backups-production
      config:
        region: eu-west-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: eu-west-1
  defaultVolumesToFsBackup: true
credentials:
  useSecret: true
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=<ACCESS_KEY>
      aws_secret_access_key=<SECRET_KEY>
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.0
    volumeMounts:
      - mountPath: /target
        name: plugins
deployNodeAgent: true
schedules:
  daily-full:
    disabled: false
    schedule: "0 2 * * *"
    template:
      ttl: 720h
      includedNamespaces:
        - "*"
      excludedNamespaces:
        - kube-system
        - velero
      snapshotVolumes: true
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
Terminal window
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  -f values.yaml

MinIO Backend (On-Premises)

minio-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: velero
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args: ["server", "/data", "--console-address", ":9001"]
          env:
            - name: MINIO_ROOT_USER
              value: "minioadmin"
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: minio-pvc
Terminal window
# Install Velero with MinIO
# MinIO only provides object storage, so cloud volume snapshots are disabled
# and volumes are backed up through the node agent instead
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --use-volume-snapshots=false \
  --backup-location-config \
    region=minio,s3ForcePathStyle=true,s3Url=http://minio.velero.svc:9000

Backup Operations

Full Cluster Backup

Terminal window
# Back up entire cluster
velero backup create full-cluster-backup \
  --exclude-namespaces velero,kube-system
# Back up with volume snapshots
velero backup create full-with-volumes \
  --snapshot-volumes \
  --exclude-namespaces velero
# Back up with file-system backup (Restic/Kopia)
velero backup create full-fs-backup \
  --default-volumes-to-fs-backup
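
`velero backup create` returns as soon as the Backup object is submitted, so scripts usually need to wait for the result. The sketch below polls the Backup's `status.phase` with `kubectl`. It assumes Velero runs in the `velero` namespace, as in the install commands above; `wait_for_backup` is a hypothetical helper, and it only treats `Completed`, `PartiallyFailed`, `Failed`, and `FailedValidation` as final.

```shell
# Poll a Velero Backup until it reaches a terminal phase.
# Usage: wait_for_backup NAME [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Returns 0 on Completed, 1 on a failed phase, 2 on timeout.
wait_for_backup() {
  name="$1"; attempts="${2:-60}"; pause="${3:-10}"
  i=0
  phase=""
  while [ "$i" -lt "$attempts" ]; do
    phase="$(kubectl -n velero get backups.velero.io "$name" \
      -o jsonpath='{.status.phase}')"
    case "$phase" in
      Completed)
        echo "backup $name completed"
        return 0 ;;
      PartiallyFailed|Failed|FailedValidation)
        echo "backup $name ended in phase $phase" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timed out waiting for backup $name (last phase: ${phase:-unknown})" >&2
  return 2
}
```

For interactive use, `velero backup create NAME --wait` blocks until the backup finishes; the polling variant is mainly useful when the exit code needs to reflect the backup phase.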

Namespace Backup

Terminal window
# Back up specific namespaces
velero backup create production-backup \
  --include-namespaces production,staging
# Back up with label selector
velero backup create app-backup \
  --selector app=my-app
# Back up specific resource types
velero backup create configs-backup \
  --include-resources configmaps,secrets,deployments \
  --include-namespaces production
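
Backup names must be unique, so re-running any of the commands above with the same name fails while the earlier backup still exists. One common workaround, sketched here with a hypothetical `timestamped_backup_name` helper, is to append a UTC timestamp to a fixed prefix.

```shell
# Print PREFIX followed by a UTC timestamp, e.g. production-20240101-020000.
# Usage: timestamped_backup_name PREFIX
timestamped_backup_name() {
  echo "$1-$(date -u +%Y%m%d-%H%M%S)"
}
```

It could be used as `velero backup create "$(timestamped_backup_name production)" --include-namespaces production`.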

Backup with Hooks

deployment-with-hooks.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      # Velero reads hook annotations from the pod, so they go on the pod template
      annotations:
        backup.velero.io/backup-volumes: pgdata
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "pg_dump -U postgres mydb > /var/lib/postgresql/data/backup.sql"]'
        pre.hook.backup.velero.io/timeout: 120s
        post.hook.backup.velero.io/container: postgres
        post.hook.backup.velero.io/command: '["/bin/bash", "-c", "rm /var/lib/postgresql/data/backup.sql"]'
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: postgres-pvc

Scheduled Backups

Terminal window
# Daily backup at 2 AM with 30-day retention
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h \
  --exclude-namespaces velero,kube-system
# Hourly backup for critical namespaces with 48h retention
velero schedule create hourly-critical \
  --schedule="0 * * * *" \
  --ttl 48h \
  --include-namespaces production
# Weekly full backup with 90-day retention
velero schedule create weekly-full \
  --schedule="0 3 * * 0" \
  --ttl 2160h \
  --snapshot-volumes
# List schedules
velero schedule get
# Pause a schedule
velero schedule pause daily-backup
# Resume a schedule
velero schedule unpause daily-backup
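
The `--ttl` values above are written in hours (720h for 30 days, 2160h for 90 days). A small conversion helper avoids doing the arithmetic by hand; `ttl_from_days` is a hypothetical name used only for this sketch.

```shell
# Convert a retention period in whole days to an hour-based TTL string.
# Usage: ttl_from_days DAYS
ttl_from_days() {
  echo "$(( $1 * 24 ))h"
}
```

For example, `--ttl "$(ttl_from_days 30)"` expands to `--ttl 720h`, matching the daily schedule above.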

Restore Operations

Full Restore

Terminal window
# Restore from a named backup
velero restore create --from-backup full-cluster-backup
# Restore specific namespaces
velero restore create --from-backup full-cluster-backup \
  --include-namespaces production
# Restore with namespace mapping (rename)
velero restore create --from-backup production-backup \
  --namespace-mappings production:production-restored
# Restore only specific resources
velero restore create --from-backup full-cluster-backup \
  --include-resources deployments,services,configmaps
# Print the Restore object without sending it to the server
# (this previews the restore spec, not the resources that would be restored)
velero restore create --from-backup full-cluster-backup \
  -o yaml

Restore Hooks

restore-hooks.yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: production-backup
  hooks:
    resources:
      - name: postgres-restore
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres
        postHooks:
          - init:
              initContainers:
                # Caution: init containers run before the postgres container starts,
                # so this check fails unless a server is already listening on localhost
                - name: restore-check
                  image: postgres:16
                  command:
                    - /bin/bash
                    - -c
                    - "pg_isready -h localhost -U postgres"
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "psql -U postgres -c 'VACUUM ANALYZE;'"
              waitTimeout: 5m
              execTimeout: 2m
              onError: Continue

Cross-Cluster Migration

Terminal window
# Source cluster: create backup
velero backup create migration-backup \
  --include-namespaces app-namespace \
  --snapshot-volumes=false \
  --default-volumes-to-fs-backup
# Target cluster: configure same BSL
velero backup-location create source-backups \
  --provider aws \
  --bucket velero-backups-production \
  --config region=eu-west-1
# Target cluster: restore from source backup
velero restore create --from-backup migration-backup \
  --namespace-mappings app-namespace:app-namespace
# Verify migration
kubectl get all -n app-namespace
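
During a migration, the target cluster only needs to read the source cluster's backups. As a sketch, the wrapper below registers the shared bucket with `--access-mode ReadOnly`, so the target cluster cannot create or delete backups in it while the restore runs. `add_readonly_bsl` is a hypothetical helper, and the provider is assumed to be `aws` as in the commands above.

```shell
# Register an existing bucket as a read-only Backup Storage Location.
# Usage: add_readonly_bsl BSL_NAME BUCKET REGION
add_readonly_bsl() {
  velero backup-location create "$1" \
    --provider aws \
    --bucket "$2" \
    --config "region=$3" \
    --access-mode ReadOnly
}
```

For the example above this would be `add_readonly_bsl source-backups velero-backups-production eu-west-1`, run against the target cluster.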

Monitoring

Prometheus Metrics

servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: velero
  namespace: velero
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: velero
  endpoints:
    - port: monitoring
      interval: 30s

Key Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| velero_backup_success_total | Successful backup count | No increase over the schedule interval |
| velero_backup_failure_total | Failed backup count | > 0 |
| velero_backup_duration_seconds | Backup duration | > baseline * 2 |
| velero_backup_items_total | Items backed up | Significant decrease |
| velero_restore_success_total | Successful restore count | |
| velero_restore_failed_total | Failed restore count | > 0 |
| velero_backup_last_successful_timestamp | Last successful backup | Age > schedule interval |

Alerting Rules

velero-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Velero backup failed"
        - alert: VeleroBackupStale
          expr: time() - velero_backup_last_successful_timestamp{schedule!=""} > 86400 * 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "No successful backup in 48 hours"
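
The staleness rule compares the current time with the timestamp of the last successful backup. The same check can be sketched in shell, for example for a cron job on a cluster without Prometheus. `backup_is_stale` is a hypothetical helper; the default threshold of 172800 seconds mirrors the `86400 * 2` in the rule above.

```shell
# Succeed (exit 0) when the last successful backup is older than MAX_AGE seconds.
# Usage: backup_is_stale LAST_SUCCESS_EPOCH [MAX_AGE_SECONDS] [NOW_EPOCH]
backup_is_stale() {
  last="$1"
  max_age="${2:-172800}"       # 86400 * 2, as in the alert expression
  now="${3:-$(date +%s)}"      # current time unless overridden
  [ $(( now - last )) -gt "$max_age" ]
}
```

Like the PromQL expression, the comparison is strict: a backup that is exactly 48 hours old is not yet reported as stale.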

Troubleshooting

Terminal window
# Check backup status
velero backup describe full-cluster-backup --details
# Check backup logs
velero backup logs full-cluster-backup
# Check restore status
velero restore describe production-restore --details
# Check restore logs
velero restore logs production-restore
# Check Velero server logs
kubectl logs -n velero deployment/velero
# Check node agent logs (Restic/Kopia)
kubectl logs -n velero daemonset/node-agent
# Verify BSL connectivity
velero backup-location get
# Delete stuck backup
velero backup delete stuck-backup --confirm
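
`velero backup-location get` shows each location's phase in a table. For scripted checks, a sketch like the one below lists only the locations that are not `Available`. It assumes Velero is installed in the `velero` namespace; `unavailable_bsls` is a hypothetical helper name.

```shell
# Print the names of BackupStorageLocations whose phase is not "Available".
# Prints nothing when every location is healthy.
unavailable_bsls() {
  kubectl -n velero get backupstoragelocations.velero.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F '\t' '$2 != "Available" { print $1 }'
}
```

An empty result means all locations are reachable; any name printed is a good starting point for the server log checks above.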

Production Best Practices

Checklist


Transform Your Team’s Kubernetes Backup Skills

Implementing reliable backup and disaster recovery for Kubernetes requires understanding of storage systems, application consistency, and operational procedures. At chavkov.com, I deliver hands-on Velero and Kubernetes DR training that prepares your team for production operations.

Training Options

| Format | Duration | Focus |
|---|---|---|
| Velero Fundamentals | 1 day | Installation, backup, restore, schedules |
| Kubernetes DR | 2 days | Velero + cross-cluster migration + DR planning |
| Platform Operations | 3 days | Full backup strategy with monitoring and automation |

All trainings include hands-on labs with real Kubernetes clusters. Contact me to discuss your team’s needs.

