Velero: Complete Kubernetes Backup and Disaster Recovery Guide
Velero (formerly Heptio Ark) is an open-source tool for backing up, restoring, and migrating Kubernetes cluster resources and persistent volumes. Maintained by VMware and the CNCF community, Velero is the de facto standard for Kubernetes backup and disaster recovery. This comprehensive guide covers Velero architecture, deployment, and production best practices.
What is Velero?
Velero provides disaster recovery, data migration, and data protection for Kubernetes clusters:
Key Features
- Cluster Backup: Back up all Kubernetes resources (namespaces, deployments, services, etc.)
- Volume Snapshots: Native CSI snapshot integration and Restic/Kopia file-level backup
- Scheduled Backups: Cron-based backup schedules with retention policies
- Selective Backup: Filter by namespace, label, resource type, or include/exclude patterns
- Cross-Cluster Migration: Migrate workloads between clusters
- Disaster Recovery: Restore entire clusters or individual namespaces
- Plugin Architecture: Extensible with storage and snapshot provider plugins
Velero vs. Other Kubernetes Backup Solutions
| Feature | Velero | Kasten K10 | Stash | Longhorn | Trilio |
|---|---|---|---|---|---|
| Cost | Free/OSS | Freemium | Free/OSS | Free/OSS | Commercial |
| K8s Resources | Yes | Yes | Partial | No | Yes |
| Volume Backup | CSI + Restic/Kopia | CSI + Kanister | Restic | Built-in | CSI |
| Scheduled Backups | Yes | Yes | Yes | Yes | Yes |
| Cross-Cluster | Yes | Yes | Yes | No | Yes |
| UI | CLI only | Web UI | CLI | Longhorn UI | Web UI |
| Multi-Cloud | Yes | Yes | Yes | No | Yes |
| App Consistency | Hooks | Blueprint | Hooks | Snapshot | Hooks |
| CNCF Project | Sandbox | No | No | Incubating | No |
| License | Apache 2.0 | Proprietary | Apache 2.0 | Apache 2.0 | Proprietary |
Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                      Kubernetes Cluster                      │
│                                                              │
│  ┌──────────────┐   ┌──────────────────────────────────┐     │
│  │ Velero CLI   │──>│ Velero Server (Deployment)       │     │
│  │ (kubectl     │   │                                  │     │
│  │  plugin)     │   │  ┌────────────┐ ┌─────────────┐  │     │
│  └──────────────┘   │  │ Backup     │ │ Restore     │  │     │
│                     │  │ Controller │ │ Controller  │  │     │
│  ┌──────────────┐   │  └────────────┘ └─────────────┘  │     │
│  │ CRDs         │   │  ┌────────────┐ ┌─────────────┐  │     │
│  │ - Backup     │   │  │ Schedule   │ │ Restic/     │  │     │
│  │ - Restore    │<──│  │ Controller │ │ Kopia Node  │  │     │
│  │ - Schedule   │   │  └────────────┘ │ Agent (DS)  │  │     │
│  │ - BSL        │   │                 └─────────────┘  │     │
│  │ - VSL        │   └──────────────────────────────────┘     │
│  └──────────────┘                                            │
└──────────────────────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
┌──────────────────┐    ┌────────────────────────┐
│ Backup Storage   │    │ Volume Snapshot        │
│ Location (BSL)   │    │ Location (VSL)         │
│                  │    │                        │
│ - AWS S3         │    │ - AWS EBS Snapshots    │
│ - GCS            │    │ - GCE PD Snapshots     │
│ - Azure Blob     │    │ - Azure Disk Snapshots │
│ - MinIO          │    │ - CSI Snapshots        │
│ - S3-compatible  │    │                        │
└──────────────────┘    └────────────────────────┘
```
Core Concepts
| Concept | Description |
|---|---|
| Backup | Point-in-time snapshot of cluster resources and volumes |
| Restore | Recreate resources and volumes from a backup |
| Schedule | Cron-based recurring backup configuration |
| BSL | Backup Storage Location — where backup data is stored |
| VSL | Volume Snapshot Location — where volume snapshots are stored |
| Restic/Kopia | File-level volume backup for non-snapshot-capable storage |
| Hooks | Pre/post backup/restore commands for application consistency |
Installation
Velero CLI
```shell
# macOS
brew install velero

# Linux
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar -xzf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/

# Verify installation
velero version --client-only
```
AWS S3 Backend
```shell
# Create S3 bucket
aws s3api create-bucket \
  --bucket velero-backups-production \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1

# Create IAM policy
cat > velero-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::velero-backups-production/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::velero-backups-production"
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json

# Create credentials file
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<YOUR_ACCESS_KEY>
aws_secret_access_key=<YOUR_SECRET_KEY>
EOF

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups-production \
  --backup-location-config region=eu-west-1 \
  --snapshot-location-config region=eu-west-1 \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --default-volumes-to-fs-backup
```
Helm Installation
```yaml
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero-backups-production
      config:
        region: eu-west-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: eu-west-1
  defaultVolumesToFsBackup: true

credentials:
  useSecret: true
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=<ACCESS_KEY>
      aws_secret_access_key=<SECRET_KEY>

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.0
    volumeMounts:
      - mountPath: /target
        name: plugins

deployNodeAgent: true

schedules:
  daily-full:
    disabled: false
    schedule: "0 2 * * *"
    template:
      ttl: 720h
      includedNamespaces:
        - "*"
      excludedNamespaces:
        - kube-system
        - velero
      snapshotVolumes: true

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
```

```shell
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  -f values.yaml
```
MinIO Backend (On-Premises)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: velero
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args: ["server", "/data", "--console-address", ":9001"]
          env:
            - name: MINIO_ROOT_USER
              value: "minioadmin"
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: minio-pvc
```

```shell
# Install Velero with MinIO
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --backup-location-config \
    region=minio,s3ForcePathStyle=true,s3Url=http://minio.velero.svc:9000 \
  --snapshot-location-config region=minio
```
Backup Operations
Full Cluster Backup
```shell
# Back up entire cluster
velero backup create full-cluster-backup \
  --exclude-namespaces velero,kube-system

# Back up with volume snapshots
velero backup create full-with-volumes \
  --snapshot-volumes \
  --exclude-namespaces velero

# Back up with file-system backup (Restic/Kopia)
velero backup create full-fs-backup \
  --default-volumes-to-fs-backup
```
Namespace Backup
```shell
# Back up specific namespaces
velero backup create production-backup \
  --include-namespaces production,staging

# Back up with label selector
velero backup create app-backup \
  --selector app=my-app

# Back up specific resource types
velero backup create configs-backup \
  --include-resources configmaps,secrets,deployments \
  --include-namespaces production
```
Backup with Hooks
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      # Velero reads hook and volume annotations from the pod, so they
      # belong on the pod template rather than on the Deployment itself.
      annotations:
        backup.velero.io/backup-volumes: pgdata
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "pg_dump -U postgres mydb > /var/lib/postgresql/data/backup.sql"]'
        pre.hook.backup.velero.io/timeout: 120s
        post.hook.backup.velero.io/container: postgres
        post.hook.backup.velero.io/command: '["/bin/bash", "-c", "rm /var/lib/postgresql/data/backup.sql"]'
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: postgres-pvc
```
Scheduled Backups
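Velero's `--ttl` flag takes a duration written in hours, so a retention period thought of in days has to be converted first (30 days is 720h, 90 days is 2160h). A minimal sketch of that arithmetic follows; `ttl_from_days` is a hypothetical helper name, not part of Velero.

```shell
#!/bin/sh
# ttl_from_days is a hypothetical helper, not a Velero command.
# It converts a retention period in whole days into the hour-based
# duration string accepted by --ttl (for example 30 -> 720h).
ttl_from_days() {
  days="$1"
  case "$days" in
    ''|*[!0-9]*)
      # Reject empty or non-numeric input instead of printing a bad TTL.
      echo "ttl_from_days: expected a whole number of days" >&2
      return 1
      ;;
  esac
  echo "$((days * 24))h"
}

ttl_from_days 30   # prints 720h
ttl_from_days 90   # prints 2160h
```

One possible use is inline in a schedule command, for example `velero schedule create daily-backup --schedule="0 2 * * *" --ttl "$(ttl_from_days 30)"`.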
```shell
# Daily backup at 2 AM with 30-day retention
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h \
  --exclude-namespaces velero,kube-system

# Hourly backup for critical namespaces with 48h retention
velero schedule create hourly-critical \
  --schedule="0 * * * *" \
  --ttl 48h \
  --include-namespaces production

# Weekly full backup with 90-day retention
velero schedule create weekly-full \
  --schedule="0 3 * * 0" \
  --ttl 2160h \
  --snapshot-volumes

# List schedules
velero schedule get

# Pause a schedule
velero schedule pause daily-backup

# Resume a schedule
velero schedule unpause daily-backup
```
Restore Operations
Full Restore
```shell
# Restore from a named backup
velero restore create --from-backup full-cluster-backup

# Restore specific namespaces
velero restore create --from-backup full-cluster-backup \
  --include-namespaces production

# Restore with namespace mapping (rename)
velero restore create --from-backup production-backup \
  --namespace-mappings production:production-restored

# Restore only specific resources
velero restore create --from-backup full-cluster-backup \
  --include-resources deployments,services,configmaps

# Print the Restore object without submitting it to the cluster
# (velero restore create has no --dry-run preview; -o yaml only renders the object)
velero restore create --from-backup full-cluster-backup \
  -o yaml
```
Restore Hooks
```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore
  namespace: velero
spec:
  backupName: production-backup
  hooks:
    resources:
      - name: postgres-restore
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres
        postHooks:
          - init:
              initContainers:
                - name: restore-check
                  image: postgres:16
                  command:
                    - /bin/bash
                    - -c
                    - "pg_isready -h localhost -U postgres"
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - "psql -U postgres -c 'VACUUM ANALYZE;'"
              waitTimeout: 5m
              execTimeout: 2m
              onError: Continue
```
Cross-Cluster Migration
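During a migration the target cluster reads from the source cluster's bucket, so it is often safer to register that location as read-only on the target, which keeps the target from creating or deleting backups there. The sketch below only assembles the command and does not run it; `readonly_bsl_cmd` is a hypothetical helper name, while `--access-mode ReadOnly` is the Velero CLI option being illustrated.

```shell
#!/bin/sh
# readonly_bsl_cmd NAME BUCKET REGION
# Hypothetical helper: prints (does not run) a velero command that
# registers an existing bucket as a read-only backup storage location.
readonly_bsl_cmd() {
  printf 'velero backup-location create %s --provider aws --bucket %s --config region=%s --access-mode ReadOnly\n' \
    "$1" "$2" "$3"
}

readonly_bsl_cmd source-backups velero-backups-production eu-west-1
```

Printing the command first makes it easy to review before running it against the target cluster.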
```shell
# Source cluster: create backup
velero backup create migration-backup \
  --include-namespaces app-namespace \
  --snapshot-volumes=false \
  --default-volumes-to-fs-backup

# Target cluster: configure same BSL
velero backup-location create source-backups \
  --provider aws \
  --bucket velero-backups-production \
  --config region=eu-west-1

# Target cluster: restore from source backup
velero restore create --from-backup migration-backup \
  --namespace-mappings app-namespace:app-namespace

# Verify migration
kubectl get all -n app-namespace
```
Monitoring
Prometheus Metrics
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: velero
  namespace: velero
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: velero
  endpoints:
    - port: monitoring
      interval: 30s
```
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| velero_backup_success_total | Successful backup count | No increase within the schedule interval |
| velero_backup_failure_total | Failed backup count | Any increase (> 0) |
| velero_backup_duration_seconds | Backup duration | > baseline * 2 |
| velero_backup_items_total | Items backed up | Significant decrease |
| velero_restore_success_total | Successful restore count | n/a |
| velero_restore_failed_total | Failed restore count | Any increase (> 0) |
| velero_backup_last_successful_timestamp | Timestamp of last successful backup | Age > schedule interval |
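The threshold for the last-successful-backup metric amounts to comparing the age of the newest successful backup with the schedule interval. As a rough sketch of that comparison outside Prometheus, assuming Unix-epoch timestamps in seconds (`backup_is_stale` and its argument order are hypothetical):

```shell
#!/bin/sh
# backup_is_stale LAST_SUCCESS NOW MAX_AGE
# Hypothetical helper: succeeds (exit 0) when the last successful backup
# is more than MAX_AGE seconds old. LAST_SUCCESS and NOW are Unix
# timestamps in seconds.
backup_is_stale() {
  last_success="$1"
  now="$2"
  max_age="$3"
  [ $((now - last_success)) -gt "$max_age" ]
}

# Example: a daily schedule where one missed run is tolerated (48 hours).
now=$(date +%s)
last=$((now - 3 * 86400))   # pretend the last success was 3 days ago
if backup_is_stale "$last" "$now" $((2 * 86400)); then
  echo "backup is stale"
fi
```

Choosing a maximum age of twice the schedule interval tolerates a single missed run before alerting; a tighter value suits namespaces with a strict RPO.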
Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Velero backup failed"

        - alert: VeleroBackupStale
          expr: time() - velero_backup_last_successful_timestamp{schedule!=""} > 86400 * 2
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "No successful backup in 48 hours"
```
Troubleshooting
```shell
# Check backup status
velero backup describe full-cluster-backup --details

# Check backup logs
velero backup logs full-cluster-backup

# Check restore status
velero restore describe production-restore --details

# Check restore logs
velero restore logs production-restore

# Check Velero server logs
kubectl logs -n velero deployment/velero

# Check node agent logs (Restic/Kopia)
kubectl logs -n velero daemonset/node-agent

# Verify BSL connectivity
velero backup-location get

# Delete stuck backup
velero backup delete stuck-backup --confirm
```
Production Best Practices
Checklist
- Use scheduled backups with appropriate TTL retention
- Configure both BSL (object storage) and VSL (volume snapshots)
- Enable file-system backup (Restic/Kopia) as fallback for non-CSI volumes
- Use backup hooks for application-consistent backups (database dumps)
- Store backups in a different region/account for disaster recovery
- Monitor backup success/failure with Prometheus alerts
- Test restore procedures regularly (at least monthly)
- Document RTO and RPO for each namespace/application
- Encrypt backups at rest (S3 SSE, KMS)
- Use RBAC to restrict Velero access to cluster admins
- Exclude non-essential namespaces (kube-system, monitoring)
- Version control Velero Helm values and schedule configurations
- Set resource requests/limits on Velero server and node agent
- Use immutable backup storage (S3 Object Lock) for compliance
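The "test restore procedures regularly" item can be partly automated by restoring a backup into a scratch namespace instead of over the live one. The sketch below only builds the command; `restore_drill_cmd` and the `-drill` suffix are assumptions chosen for illustration.

```shell
#!/bin/sh
# restore_drill_cmd BACKUP NAMESPACE
# Hypothetical helper: prints (does not run) a velero command that
# restores NAMESPACE from BACKUP into NAMESPACE-drill, leaving the
# live namespace untouched.
restore_drill_cmd() {
  backup="$1"
  ns="$2"
  printf 'velero restore create %s-drill --from-backup %s --include-namespaces %s --namespace-mappings %s:%s-drill\n' \
    "$backup" "$backup" "$ns" "$ns" "$ns"
}

restore_drill_cmd production-backup production
```

After the drill, the scratch namespace can be inspected and then deleted; recording how long the restore took gives a measured input for the RTO documented per application.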