Ceph Distributed Storage: Complete Production Deployment Guide
Ceph is a highly scalable, distributed storage system that provides object, block, and file storage in a unified system. Designed for performance, reliability, and scalability, Ceph is used by organizations worldwide for petabyte-scale deployments. This comprehensive guide covers Ceph architecture, deployment, and production best practices.
What is Ceph?
Ceph is an open-source, software-defined storage platform.
Key Features
- Unified Storage: Object (RGW), Block (RBD), and File (CephFS) storage
- Scalability: Scale from gigabytes to exabytes
- No Single Point of Failure: Fully distributed architecture
- Self-Healing: Automatic data replication and recovery
- Performance: Parallel access to data across cluster
- CRUSH Algorithm: Intelligent data distribution
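The core CRUSH idea, computing placement from a hash instead of looking it up in a central table, can be sketched in a few lines. This is an illustration only (the function and the flat OSD list are invented for this example); real CRUSH walks a weighted hierarchy of hosts, racks, and rooms and handles reweighting and failure domains:

```python
import zlib

def place_object(obj_name, pg_num, osds, replicas=3):
    # Hash the object name to a placement group (PG)...
    pg = zlib.crc32(obj_name.encode()) % pg_num
    # ...then derive that PG's OSD set deterministically.
    start = pg % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
# Any client computes the same placement with no central lookup table.
print(place_object("rbd_data.abc123", 128, osds))
```

Because placement is pure computation, clients talk directly to the right OSDs, which is what removes the metadata bottleneck found in lookup-table designs.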
Ceph vs. Other Storage Solutions
| Feature | Ceph | GlusterFS | MinIO | Traditional SAN |
|---|---|---|---|---|
| Object Storage | ✅ RGW | ❌ No | ✅ Native | ❌ No |
| Block Storage | ✅ RBD | ❌ No | ❌ No | ✅ Yes |
| File Storage | ✅ CephFS | ✅ Yes | ❌ No | ✅ NFS/SMB |
| Scalability | Exabytes | Petabytes | Exabytes | Limited |
| Self-Healing | ✅ Yes | ✅ Yes | ⚠️ Erasure coding | ❌ Manual |
| Cost | Hardware only | Hardware only | Hardware only | High (HW+SW) |
Architecture
Ceph Components
```
┌──────────────────────────────────────────────┐
│                Client Layer                  │
│  RBD (Block)   RGW (Object)   CephFS (File)  │
└──────────────────────┬───────────────────────┘
                       │
        librados ◄───► CRUSH map
   (C, Python, Java…)  (data placement)
                       │
┌──────────────────────▼───────────────────────┐
│            RADOS (Storage Layer)             │
│                                              │
│  MON ◄─► MON ◄─► MON    cluster maps, quorum │
│  OSD  OSD  OSD  …       one daemon per disk  │
│  MGR ◄─► MGR            dashboard, metrics   │
│  MDS ◄─► MDS            CephFS metadata only │
└──────────────────────────────────────────────┘
```
Key Components
- MON (Monitor): Maintains the cluster maps; deploy an odd number (3, 5, 7) so a quorum survives failures
- OSD (Object Storage Daemon): Stores data, handles replication, recovery
- MGR (Manager): Cluster monitoring, dashboard, metrics
- MDS (Metadata Server): CephFS metadata management
- RGW (RADOS Gateway): S3/Swift-compatible object storage API
- RBD (RADOS Block Device): Block device interface
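Why the odd monitor count matters: MONs form a quorum, and the cluster stays up only while a strict majority of monitors agree. A quick calculation (an illustrative helper, not a Ceph API) shows that an even count adds hardware without adding failure tolerance:

```python
def mon_failures_tolerated(mons):
    # A quorum requires a strict majority of monitors.
    majority = mons // 2 + 1
    return mons - majority

for n in (3, 4, 5):
    print(f"{n} MONs tolerate {mon_failures_tolerated(n)} failure(s)")
# 3 and 4 MONs both tolerate exactly one failure, so a fourth
# monitor buys nothing -- hence the 3, 5, 7 recommendation.
```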
Installation and Deployment
System Requirements
Minimum Test Cluster:
- 3 nodes (can combine MON+OSD)
- 4 CPU cores per node
- 8 GB RAM per node (+ 2 GB per OSD)
- 10 GB for OS + dedicated disks for OSDs
- 1 Gbps network
Production Cluster:
- 3+ dedicated MON nodes
- 3+ OSD nodes (more recommended)
- 16+ CPU cores per OSD node
- 64+ GB RAM per OSD node
- Enterprise SSDs/NVMe for OSDs
- 10/25/100 Gbps network (separate public/cluster networks)
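When sizing hardware it helps to work backwards from usable capacity. A rough model (my own helper, not a Ceph tool; the 0.85 headroom mirrors the commonly used nearfull threshold) compares replicated and erasure-coded pools:

```python
def usable_tb(raw_tb, replicas=None, k=None, m=None, headroom=0.85):
    """Usable capacity after protection overhead and nearfull headroom.
    Pass either replicas=N, or k/m for an erasure-code profile."""
    if replicas is not None:
        usable = raw_tb / replicas      # 3x replication stores each byte 3 times
    else:
        usable = raw_tb * k / (k + m)   # EC k+m stores k data + m parity chunks
    return usable * headroom

print(usable_tb(600, replicas=3))   # 170.0 TB usable from 600 TB raw
print(usable_tb(600, k=4, m=2))     # 340.0 TB -- EC doubles usable space here
```

The trade-off, of course, is that erasure coding costs more CPU and degrades small-write performance, which is why replicated pools remain the default for RBD.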
Deployment with cephadm (Recommended)
```bash
# Install cephadm on the admin node
curl --silent --remote-name --location \
  https://github.com/ceph/ceph/raw/quincy/src/cephadm/cephadm
chmod +x cephadm
mkdir -p /etc/ceph

# Add the Ceph repository and install cephadm system-wide
./cephadm add-repo --release quincy
./cephadm install

# Bootstrap the first monitor
cephadm bootstrap \
  --mon-ip 10.0.1.11 \
  --cluster-network 10.0.2.0/24 \
  --initial-dashboard-user admin \
  --initial-dashboard-password 'SecurePassword123!' \
  --ssh-user root

# The bootstrap command prints the dashboard URL and credentials.
# Save the admin keyring and a minimal config
cephadm shell -- ceph config generate-minimal-conf > /etc/ceph/ceph.conf
cephadm shell -- ceph auth get client.admin > /etc/ceph/ceph.client.admin.keyring

# Install ceph-common for CLI access
cephadm install ceph-common

# Verify cluster status
ceph -s
ceph health detail
```
Add Nodes to Cluster
```bash
# Copy the cluster SSH key to the new nodes
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node3

# Add nodes to the cluster
ceph orch host add ceph-node2 10.0.1.12
ceph orch host add ceph-node3 10.0.1.13

# Label nodes for specific roles
ceph orch host label add ceph-node2 mon
ceph orch host label add ceph-node3 mon

# List hosts
ceph orch host ls

# Deploy additional monitors
ceph orch apply mon "ceph-node1,ceph-node2,ceph-node3"

# Deploy managers
ceph orch apply mgr --placement="3 ceph-node1 ceph-node2 ceph-node3"
```
Add OSDs
```bash
# List available devices
ceph orch device ls

# Add all available devices as OSDs
ceph orch apply osd --all-available-devices

# Add a specific device
ceph orch daemon add osd ceph-node2:/dev/sdb

# Add an OSD with a separate DB/WAL device (SSD/NVMe)
ceph orch daemon add osd ceph-node2:data_devices=/dev/sdc,db_devices=/dev/nvme0n1

# Advanced OSD specification
cat > osd-spec.yaml << 'EOF'
service_type: osd
service_id: default_drive_group
placement:
  host_pattern: 'ceph-node*'
data_devices:
  all: true
db_devices:
  paths:
    - /dev/nvme0n1
    - /dev/nvme1n1
wal_devices:
  paths:
    - /dev/nvme0n1
    - /dev/nvme1n1
EOF

ceph orch apply -i osd-spec.yaml

# View OSD status
ceph osd tree
ceph osd stat
```
Pool Configuration
Create Pools
```bash
# Calculate the PG number (recommended: 100-200 PGs per OSD)
# Formula: (target PGs per OSD) × (OSDs) / (replica size) = PG number
# Round to the nearest power of 2
```
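The formula above is easy to get wrong by hand. A small helper (illustrative only; with `pg_autoscale_mode` enabled, Ceph will adjust PG counts for you) applies it and rounds to a power of two:

```python
import math

def pg_count(osds, target_per_osd=100, replica_size=3):
    # (target PGs per OSD x OSDs) / replica size, then round
    # to the nearest power of two.
    raw = target_per_osd * osds / replica_size
    return 2 ** round(math.log2(raw))

print(pg_count(6))    # 200 raw -> 256
print(pg_count(12))   # 400 raw -> 512
```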
```bash
# Create a replicated pool
ceph osd pool create rbd-pool 128 128 replicated

# Create an erasure-coded pool
ceph osd erasure-code-profile set ec-profile \
  k=4 m=2 \
  crush-failure-domain=host
ceph osd pool create ec-pool 128 128 erasure ec-profile

# Set the pool application
ceph osd pool application enable rbd-pool rbd
ceph osd pool application enable ec-pool rgw

# Pool with a custom CRUSH rule
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd pool set rbd-pool crush_rule ssd-rule

# Configure pool parameters
ceph osd pool set rbd-pool size 3               # Replica count
ceph osd pool set rbd-pool min_size 2           # Min replicas for I/O
ceph osd pool set rbd-pool pg_autoscale_mode on

# Pool quotas
ceph osd pool set-quota rbd-pool max_bytes $((10 * 1024**4))  # 10 TiB
ceph osd pool set-quota rbd-pool max_objects 1000000

# List pools
ceph osd pool ls detail
```
RBD (Block Storage)
Create and Use RBD Images
```bash
# Create an RBD image
rbd create rbd-pool/disk1 --size 100G

# Create with specific image features
rbd create rbd-pool/disk2 \
  --size 500G \
  --image-feature layering,exclusive-lock,object-map,fast-diff

# List and inspect images
rbd ls rbd-pool
rbd info rbd-pool/disk1

# Resize an image
rbd resize rbd-pool/disk1 --size 200G

# Create a snapshot
rbd snap create rbd-pool/disk1@snap1

# List snapshots
rbd snap ls rbd-pool/disk1

# Clone from a snapshot (the parent snapshot must be protected first)
rbd snap protect rbd-pool/disk1@snap1
rbd clone rbd-pool/disk1@snap1 rbd-pool/disk1-clone

# Map the RBD device (creates /dev/rbd0)
rbd map rbd-pool/disk1

# Format and mount
mkfs.ext4 /dev/rbd0
mount /dev/rbd0 /mnt/ceph-disk

# Unmap
umount /mnt/ceph-disk
rbd unmap /dev/rbd0

# Delete an image
rbd rm rbd-pool/disk1
```
RBD with Kubernetes
```yaml
# storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: b9127830-b0cc-4e34-aa47-9d1a2e9949a8
  pool: rbd-pool
  imageFeatures: layering,exclusive-lock,object-map,fast-diff
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - discard
---
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd
```
CephFS (File Storage)
Create CephFS
```bash
# Create pools for CephFS
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 64

# Create the filesystem
ceph fs new cephfs cephfs_metadata cephfs_data

# Verify
ceph fs ls
ceph fs status cephfs

# Create MDS daemons
ceph orch apply mds cephfs --placement="3 ceph-node1 ceph-node2 ceph-node3"

# Mount CephFS (kernel client)
mount -t ceph 10.0.1.11:6789:/ /mnt/cephfs \
  -o name=admin,secret=AQBsomething==

# Mount with ceph-fuse (userspace client)
ceph-fuse /mnt/cephfs -n client.admin

# Persistent mount in /etc/fstab:
# 10.0.1.11:6789:/ /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/admin.secret,_netdev,noatime 0 2

# Create subdirectories with quotas
mkdir /mnt/cephfs/project1
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/project1  # 100 GB
setfattr -n ceph.quota.max_files -v 1000000 /mnt/cephfs/project1
```
CephFS Subvolumes
```bash
# Create a volume
ceph fs volume create myfs

# Create a subvolume group
ceph fs subvolumegroup create myfs group1

# Create a subvolume
ceph fs subvolume create myfs sub1 --group_name group1 --size 10737418240  # 10 GiB
```
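`--size` takes bytes, which makes values like `10737418240` hard to eyeball. A tiny hypothetical converter (not part of Ceph) makes the arithmetic explicit:

```python
def to_bytes(size):
    """Convert '10G'-style sizes to the byte counts Ceph expects (binary units)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    suffix = size[-1].upper()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)

print(to_bytes("10G"))   # 10737418240 -- the value used above
```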
```bash
# Get the subvolume path
ceph fs subvolume getpath myfs sub1 --group_name group1

# Create a snapshot
ceph fs subvolume snapshot create myfs sub1 snap1 --group_name group1

# Clone a subvolume from the snapshot
ceph fs subvolume snapshot clone myfs sub1 snap1 sub1-clone --group_name group1

# Delete the subvolume
ceph fs subvolume rm myfs sub1 --group_name group1
```
RGW (Object Storage)
Deploy RGW
```bash
# Deploy RGW daemons
ceph orch apply rgw myrgw \
  --placement="2 ceph-node1 ceph-node2" \
  --port=8080

# Verify
ceph orch ps --daemon-type rgw

# Create an RGW user (the output includes access_key and secret_key)
radosgw-admin user create \
  --uid=johndoe \
  --display-name="John Doe" \
  --email=john@example.com

# Grant admin privileges
radosgw-admin caps add \
  --uid=johndoe \
  --caps="users=*;buckets=*;metadata=*;usage=*;zone=*"

# Create a subuser for the Swift API
radosgw-admin subuser create \
  --uid=johndoe \
  --subuser=johndoe:swift \
  --access=full \
  --secret=secretkey123
```
Use RGW with S3
```python
import boto3

# Configure the S3 client to point at RGW
s3 = boto3.client(
    's3',
    endpoint_url='http://10.0.1.11:8080',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY'
)

# Create a bucket
s3.create_bucket(Bucket='my-bucket')

# Upload an object
s3.upload_file('local-file.txt', 'my-bucket', 'remote-file.txt')

# Download an object
s3.download_file('my-bucket', 'remote-file.txt', 'downloaded-file.txt')

# List objects
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])
```
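`list_objects_v2` returns at most 1,000 keys per call, so bucket summaries need to walk pages. The sketch below runs against a stubbed response dictionary shaped like boto3's (only `Contents`, `Key`, and `Size` are shown); a real caller would loop using `IsTruncated` and `NextContinuationToken`:

```python
def bucket_usage(response):
    # Sum object count and total bytes from one page of results.
    contents = response.get("Contents", [])
    return len(contents), sum(obj["Size"] for obj in contents)

stub = {"Contents": [{"Key": "a.txt", "Size": 1024},
                     {"Key": "b.bin", "Size": 2048}]}
print(bucket_usage(stub))   # (2, 3072)
```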
```python
# Delete an object
s3.delete_object(Bucket='my-bucket', Key='remote-file.txt')
```
Performance Tuning
OSD Tuning
```bash
# BlueStore cache (per OSD)
ceph config set osd bluestore_cache_size_hdd 4294967296   # 4 GiB for HDD
ceph config set osd bluestore_cache_size_ssd 8589934592   # 8 GiB for SSD

# Thread pools
ceph config set osd osd_op_num_threads_per_shard 2
ceph config set osd osd_op_num_shards 8

# Recovery tuning
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_max_backfills 1

# Scrubbing
ceph config set osd osd_scrub_begin_hour 1
ceph config set osd osd_scrub_end_hour 6
ceph config set osd osd_scrub_during_recovery false
```
Network Optimization
```bash
# Separate public and cluster networks:
#   Public network:  client traffic
#   Cluster network: replication and recovery

# Configure in ceph.conf
cat >> /etc/ceph/ceph.conf << 'EOF'
[global]
public_network = 10.0.1.0/24
cluster_network = 10.0.2.0/24

# Network tuning
ms_bind_port_min = 6800
ms_bind_port_max = 7300
EOF

# Apply to all OSDs
ceph config set osd public_network 10.0.1.0/24
ceph config set osd cluster_network 10.0.2.0/24
```
Client-Side Tuning
```bash
# RBD client cache
rbd config global set global rbd_cache true
rbd config global set global rbd_cache_size 67108864   # 64 MiB

# CephFS client cache
ceph config set client client_cache_size 1073741824   # 1 GiB
```
Monitoring and Maintenance
Ceph Dashboard
```bash
# The dashboard is enabled by default with cephadm.
# Access it at https://ceph-node1:8443

# Enable additional mgr modules
ceph mgr module enable prometheus
ceph mgr module enable diskprediction_local
ceph mgr module enable telemetry

# Dashboard user management
ceph dashboard ac-user-create admin password administrator
ceph dashboard ac-user-set-roles admin administrator
```
Monitoring Commands
```bash
# Cluster status
ceph -s
ceph health detail

# OSD status
ceph osd status
ceph osd tree
ceph osd df

# Pool usage
ceph df
ceph osd pool stats

# PG status
ceph pg stat
ceph pg dump

# Performance stats
ceph osd perf
ceph daemonperf osd.0

# MON status
ceph mon stat
ceph quorum_status -f json-pretty
```
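For scripting, most status commands also accept `-f json`. A minimal health probe (the stub below abbreviates the real output; the field names are assumed to match what `ceph -s -f json` reports on a typical cluster):

```python
import json

def is_healthy(status_json):
    # HEALTH_OK is the only all-clear; WARN and ERR both need attention.
    return json.loads(status_json)["health"]["status"] == "HEALTH_OK"

sample = '{"health": {"status": "HEALTH_WARN"}}'   # abbreviated stub
print(is_healthy(sample))   # False
```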
```bash
# Check slow requests on a specific OSD
ceph daemon osd.0 dump_historic_ops
```
Prometheus Integration
```bash
# The prometheus mgr module exposes metrics on port 9283.
# Example prometheus.yml scrape config:
cat >> prometheus.yml << 'EOF'
scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets: ['ceph-node1:9283', 'ceph-node2:9283']
EOF
```
Backup and Disaster Recovery
RBD Snapshots and Backups
```bash
# Create a dated snapshot
rbd snap create rbd-pool/disk1@backup-$(date +%Y%m%d)

# Export a snapshot as a full image
rbd export rbd-pool/disk1@backup-20260210 /backup/disk1-20260210.img

# Incremental backup (diff since the snapshot)
rbd export-diff rbd-pool/disk1@backup-20260210 /backup/disk1-diff-20260210.img

# Import a backup into a new image
rbd import /backup/disk1-20260210.img rbd-pool/disk1-restored
```
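Dated snapshots pile up unless something prunes them. A retention planner in the same `backup-YYYYMMDD` naming scheme (a sketch only: it decides what to delete; actually removing a snapshot would shell out to `rbd snap rm`):

```python
from datetime import date, timedelta

def prune_plan(snapshots, keep_days, today):
    # Keep snapshots newer than the retention window; return the rest.
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in snapshots:
        stamp = name.split("-")[1]          # 'backup-20260210' -> '20260210'
        taken = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        if taken < cutoff:
            expired.append(name)
    return expired

snaps = ["backup-20260201", "backup-20260208", "backup-20260210"]
print(prune_plan(snaps, keep_days=7, today=date(2026, 2, 10)))
# ['backup-20260201'] -- candidates for `rbd snap rm rbd-pool/disk1@...`
```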
```bash
# RBD mirroring for DR: deploy the rbd-mirror daemon, then enable
# snapshot-based mirroring per pool and image
ceph orch apply rbd-mirror --placement=1
rbd mirror pool enable rbd-pool image
rbd mirror image enable rbd-pool/disk1 snapshot
```
CephFS Snapshots
```bash
# Allow snapshots on the filesystem (enabled by default in recent releases)
ceph fs set cephfs allow_new_snaps true

# Create a snapshot by making a directory under .snap
mkdir /mnt/cephfs/.snap/backup-$(date +%Y%m%d)

# List snapshots
ls /mnt/cephfs/.snap/

# Remove a snapshot
rmdir /mnt/cephfs/.snap/backup-20260210
```
Production Checklist
Infrastructure
- Minimum 3 MON nodes (odd number)
- Dedicated 10+ Gbps network
- Separate public/cluster networks
- Enterprise SSDs for OSDs
- NVMe for BlueStore DB/WAL
- UPS and redundant power
Configuration
- Pools configured with proper PG count
- CRUSH rules for failure domains
- Pool quotas configured
- BlueStore tuning applied
- Network tuning configured
- Scrubbing scheduled
Monitoring
- Dashboard accessible
- Prometheus metrics enabled
- Grafana dashboards configured
- Alert rules defined
- Log aggregation setup
Security
- CephX authentication enabled
- User access controls configured
- Network firewall rules
- RGW SSL/TLS enabled
- Regular security updates
Operations
- Backup strategy defined
- DR plan documented
- Runbooks created
- On-call rotation established
- Upgrade procedure tested
Conclusion
Ceph provides enterprise-grade distributed storage with exceptional scalability and flexibility. Its unified storage approach eliminates the need for separate storage systems, while the self-healing capabilities ensure data durability and availability.
Success with Ceph requires proper hardware selection, careful capacity planning, and ongoing operational expertise. Organizations that invest in Ceph gain a powerful, open-source storage platform capable of scaling from terabytes to exabytes while maintaining high performance and reliability.
Master storage technologies including Ceph with our infrastructure training programs. Contact us for customized training designed for your team’s needs.