Version: 3.0

Hard Disk Replacement

Remove a Failed Hard Disk

Storage Status

Scenario:

When a hard disk fails, it is critical to remove it from the cluster and restore the health of the storage pool.

To proceed:

  • Check the storage status first.
  • As shown below, two OSDs (Object Storage Daemons) are down due to the failed hard disk: OSD numbers 4 and 5 on the node with the hostname cc1.
cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.93k objects, 785 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 65.7G 380G 6 40.0k 7 161k exists,up
1 cc1 181G 265G 8 32.7k 1 58.4k exists,up
2 cc1 162G 283G 0 4096 15 604k exists,up
3 cc1 133G 313G 0 1638 2 29.6k exists,up
4 cc1 91.8G 354G 14 97.5k 6 39.2k exists
5 cc1 130G 315G 8 39.1k 3 88.9k exists
6 cc2 96.0G 350G 9 50.3k 3 160k exists,up
7 cc2 165G 281G 0 0 1 89.6k exists,up
8 cc2 75.8G 370G 0 6553 1 25.6k exists,up
9 cc2 199G 247G 0 3276 3 172k exists,up
10 cc2 122G 324G 2 13.5k 9 510k exists,up
11 cc2 95.3G 351G 1 4096 6 126k exists,up
12 cc3 184G 262G 3 12.0k 1 25.6k exists,up
13 cc3 93.6G 353G 0 0 0 5734 exists,up
14 cc3 67.8G 378G 12 71.1k 13 364k exists,up
15 cc3 92.6G 354G 0 819 0 0 exists,up
16 cc3 142G 303G 0 819 2 24.0k exists,up
17 cc3 179G 267G 0 2457 5 99.2k exists,up
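
If you also have direct access to the underlying Ceph admin tools on the node, you can cross-check which OSDs are down and which physical device backs them. This is an optional sketch, not part of the storage CLI shown above; it assumes the ceph client and admin keyring are available on the host.

# Show only the part of the OSD tree that contains down OSDs
sudo ceph osd tree down

# Look up which physical device and host back a given OSD, e.g. osd.4
sudo ceph osd metadata 4 | grep -E '"devices"|"hostname"'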

Remove the Disk

  • Connect to the host cc1.
  • Run the CLI command remove_disk and find that /dev/sde is associated with OSD IDs 4 and 5 at index 3.
  • Then remove /dev/sde from the Ceph pool.
cc1:storage> remove_disk
index name size osd serial
--
1 /dev/sda 894.3G 0 1 S40FNA0M800607
2 /dev/sdc 894.3G 2 3 S40FNA0M800598
3 /dev/sde 894.3G 4 5 S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES
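
Before physically pulling the drive, it can help to confirm that the device you are about to remove matches the serial number reported by remove_disk (S40FNA0M800608 in this example). A minimal sketch, assuming smartmontools and standard udev symlinks are available on the host:

# Print the drive's identity, including its serial number
sudo smartctl -i /dev/sde | grep -i 'serial number'

# Or read it from the persistent by-id symlinks (works without smartctl)
ls -l /dev/disk/by-id/ | grep sde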

Check the Status Again

Let's check the status of our storage pool; Ceph should be recovering the data automatically.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.94k objects, 785 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 6075/438706 objects degraded (1.385%)
5463/438706 objects misplaced (1.245%)
738 active+clean
8 active+undersized+degraded+remapped+backfilling
7 active+remapped+backfilling

io:
client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
recovery: 127 MiB/s, 27 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 141G 305G 1 28.0k 1 42.4k exists,up
1 cc1 177G 268G 11 88.0k 3 26.3k exists,up
2 cc1 212G 233G 2 12.7k 0 0 exists,up
3 cc1 193G 253G 3 31.1k 7 634k exists,up
6 cc2 86.0G 360G 9 40.0k 2 27.1k exists,up
7 cc2 179G 267G 7 184k 2 119k exists,up
8 cc2 90.8G 355G 0 18.3k 19 1553k exists,up
9 cc2 201G 245G 8 35.1k 16 1450k exists,up
10 cc2 108G 337G 6 51.1k 11 755k exists,up
11 cc2 98.5G 348G 0 6553 2 41.6k exists,up
12 cc3 201G 245G 16 100k 3 230k exists,up
13 cc3 122G 323G 0 0 0 0 exists,up
14 cc3 88.0G 358G 15 76.0k 47 2970k exists,up
15 cc3 100G 346G 7 183k 14 1286k exists,up
16 cc3 127G 319G 5 28.0k 15 659k exists,up
17 cc3 132G 314G 23 225k 9 491k exists,up
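
While recovery is in progress, you can poll the cluster health from a shell instead of re-running status by hand. A small sketch assuming direct access to the underlying ceph client tools on the node; the storage > status output above carries the same information.

# Poll every 30 seconds until Ceph reports HEALTH_OK, printing health and recovery rate
while ! sudo ceph health | grep -q HEALTH_OK; do
    sudo ceph -s | grep -E 'health:|recovery:'
    sleep 30
done
echo 'Recovery complete, cluster is healthy'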

Results:

Wait for a while and check the status again.

We have successfully removed the failed hard disk, and the health status is back to HEALTH_OK.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 753 active+clean

io:
client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 148G 298G 0 0 0 0 exists,up
1 cc1 176G 269G 5 23.1k 1 0 exists,up
2 cc1 202G 243G 0 28.0k 1 0 exists,up
3 cc1 220G 225G 0 3276 0 0 exists,up
6 cc2 86.1G 360G 4 20.7k 0 0 exists,up
7 cc2 180G 266G 0 0 0 0 exists,up
8 cc2 89.2G 357G 7 49.5k 2 10.3k exists,up
9 cc2 201G 245G 0 819 0 0 exists,up
10 cc2 108G 337G 1 7372 0 5734 exists,up
11 cc2 99.1G 347G 0 12.7k 0 0 exists,up
12 cc3 199G 247G 1 5734 1 0 exists,up
13 cc3 112G 333G 4 22.3k 0 0 exists,up
14 cc3 86.3G 360G 1 18.3k 2 90 exists,up
15 cc3 98.7G 347G 0 16.0k 1 0 exists,up
16 cc3 128G 318G 1 4915 2 9027 exists,up
17 cc3 141G 305G 2 22.3k 0 0 exists,up

If the disk was originally encrypted, no further action is needed to protect the data that was stored on it.

If the disk was added in raw mode, it is recommended to wipe it with the shred command.

# With the default number of iterations (3)
sudo shred -vfz [disk]

# With a custom number of iterations
sudo shred -vfz -n [num_of_iteration] [disk]

For more detailed options, consult the man page (man shred).

Although an encrypted disk does not need to be shredded after removal, keep in mind that using encryption costs some CPU and I/O efficiency.
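
If you are unsure whether a removed disk was encrypted, and the encryption is LUKS-based (as with Ceph dmcrypt OSDs), you can check it for a LUKS header from any Linux machine before deciding whether to shred it. The device path /dev/sde below is only an example; adjust it to the disk you pulled.

# Exit status 0 means the device carries a LUKS header (i.e. it was encrypted)
sudo cryptsetup isLuks /dev/sde && echo 'LUKS-encrypted' || echo 'no LUKS header'

# Alternatively, inspect the detected signatures
lsblk -o NAME,FSTYPE,SIZE /dev/sde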


Add a New Disk

Scenario: We have a new hard disk to add to our storage pool.

Connect to the host where the new hard disk was installed.
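
Before running add_disk, it can be worth confirming that the operating system actually sees the newly installed drive. A quick check with standard Linux tools (device names and serials will vary):

# List physical block devices with their size, model and serial number
lsblk -d -o NAME,SIZE,MODEL,SERIAL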

Add the new disk to the node with the CLI command storage > add_disk.

cc1:storage> add_disk
index name size serial
--
1 /dev/sde 894.3G S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.

Wait for a moment; automatic recovery should start.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 277/438826 objects degraded (0.063%)
82685/438826 objects misplaced (18.842%)
660 active+clean
51 active+remapped+backfilling
39 active+remapped+backfill_wait
2 active+remapped
1 active+undersized+degraded+remapped+backfilling

io:
client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
recovery: 873 MiB/s, 2 keys/s, 165 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 149G 297G 0 6553 0 0 exists,up
1 cc1 175G 271G 0 4096 1 18.6k exists,up
2 cc1 202G 244G 0 66.3k 3 34.4k exists,up
3 cc1 221G 225G 4 16.0k 1 17.0k exists,up
4 cc1 6397M 440G 0 0 0 0 exists,up
5 cc1 3304M 443G 0 0 0 72 exists,up
6 cc2 86.3G 360G 3 15.4k 3 60.5k exists,up
7 cc2 180G 266G 0 585 1 8192 exists,up
8 cc2 89.2G 357G 0 23.1k 5 128k exists,up
9 cc2 201G 245G 11 70.3k 3 46.4k exists,up
10 cc2 109G 337G 1 17.1k 1 28.0k exists,up
11 cc2 99.1G 347G 0 0 0 0 exists,up
12 cc3 200G 245G 3 16.0k 4 193k exists,up
13 cc3 114G 332G 0 4915 0 0 exists,up
14 cc3 86.3G 360G 5 41.0k 12 303k exists,up
15 cc3 99.3G 347G 2 89.3k 5 129k exists,up
16 cc3 128G 317G 0 2457 1 16 exists,up
17 cc3 143G 303G 13 84.0k 1 7372 exists,up

Result: The OSD count has changed from 16 to 18.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 150.06k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 64.9G 381G 3 30.3k 0 0 exists,up
1 cc1 180G 266G 8 40.7k 0 0 exists,up
2 cc1 161G 285G 0 8192 0 0 exists,up
3 cc1 131G 314G 3 16.0k 0 0 exists,up
4 cc1 91.1G 355G 1 7372 1 0 exists,up
5 cc1 130G 316G 0 0 1 90 exists,up
6 cc2 96.3G 350G 5 23.1k 0 0 exists,up
7 cc2 165G 281G 0 7372 0 0 exists,up
8 cc2 76.1G 370G 0 0 1 0 exists,up
9 cc2 199G 247G 0 6553 0 0 exists,up
10 cc2 122G 324G 2 9011 0 0 exists,up
11 cc2 95.5G 351G 2 21.5k 0 0 exists,up
12 cc3 184G 262G 2 35.1k 1 0 exists,up
13 cc3 95.7G 350G 0 0 0 0 exists,up
14 cc3 66.4G 380G 9 44.0k 1 0 exists,up
15 cc3 92.9G 353G 1 9011 0 0 exists,up
16 cc3 143G 303G 0 6553 1 16 exists,up
17 cc3 179G 267G 7 52.0k 1 102 exists,up
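
As an optional cross-check with the native Ceph tooling (again assuming the ceph client and admin keyring are available on the node), you can confirm that the re-added OSDs, 4 and 5 in this example, are up, in, and holding data:

# Per-OSD utilisation in tree form; osd.4 and osd.5 should appear under host cc1
sudo ceph osd df tree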

Remove an OSD

Scenario: One of the nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, we need to take action from an active host in the cluster.

Connect to one of the live hosts.

Remove all failed OSDs from the list using the command storage > remove_osd.

Verify the OSDs are removed with the storage > status command.

s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
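
If the underlying Ceph tools are available, you can also cross-check the result from a shell; this is an optional sketch, not part of the storage CLI.

# After removing the failed OSDs, this view should no longer list them
sudo ceph osd tree down

# The overall health should eventually return to HEALTH_OK
sudo ceph health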