Version: 2.5

Hard Disk Replacement

Remove a Failed Hard Disk

Storage Status

Scenario:

When we find a failed hard disk, it is critical to remove it from the cluster and restore the health of the storage pool.

To proceed:

  • Check the storage status first.
  • As shown below, two OSDs (Object Storage Daemons) are down because of the failed hard disk: OSD 4 and OSD 5 on the node with the hostname sky141.
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.93k objects, 785 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 65.7G 380G 6 40.0k 7 161k exists,up
1 sky141 181G 265G 8 32.7k 1 58.4k exists,up
2 sky141 162G 283G 0 4096 15 604k exists,up
3 sky141 133G 313G 0 1638 2 29.6k exists,up
4 sky141 91.8G 354G 14 97.5k 6 39.2k exists
5 sky141 130G 315G 8 39.1k 3 88.9k exists
6 sky142 96.0G 350G 9 50.3k 3 160k exists,up
7 sky142 165G 281G 0 0 1 89.6k exists,up
8 sky142 75.8G 370G 0 6553 1 25.6k exists,up
9 sky142 199G 247G 0 3276 3 172k exists,up
10 sky142 122G 324G 2 13.5k 9 510k exists,up
11 sky142 95.3G 351G 1 4096 6 126k exists,up
12 sky143 184G 262G 3 12.0k 1 25.6k exists,up
13 sky143 93.6G 353G 0 0 0 5734 exists,up
14 sky143 67.8G 378G 12 71.1k 13 364k exists,up
15 sky143 92.6G 354G 0 819 0 0 exists,up
16 sky143 142G 303G 0 819 2 24.0k exists,up
17 sky143 179G 267G 0 2457 5 99.2k exists,up
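
If you have shell access to a node with the native Ceph CLI (an assumption; the storage > status output above comes from a standard Ceph deployment), the down OSDs can also be cross-checked outside the storage shell:

# List only the OSDs that are currently down
ceph osd tree down

# Show the detailed messages behind HEALTH_WARN
ceph health detail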

Remove the Disk

  • Connect to the host sky141.
  • Run the CLI command remove_disk; the listing shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3.
  • Then remove /dev/sde from the Ceph pool.
sky141:storage> remove_disk
index name size osd serial
--
1 /dev/sda 894.3G 0 1 S40FNA0M800607
2 /dev/sdc 894.3G 2 3 S40FNA0M800598
3 /dev/sde 894.3G 4 5 S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES

Check the Status Again

Let's check the status of our storage pool; Ceph should be recovering the data automatically.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.94k objects, 785 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 6075/438706 objects degraded (1.385%)
5463/438706 objects misplaced (1.245%)
738 active+clean
8 active+undersized+degraded+remapped+backfilling
7 active+remapped+backfilling

io:
client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
recovery: 127 MiB/s, 27 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 141G 305G 1 28.0k 1 42.4k exists,up
1 sky141 177G 268G 11 88.0k 3 26.3k exists,up
2 sky141 212G 233G 2 12.7k 0 0 exists,up
3 sky141 193G 253G 3 31.1k 7 634k exists,up
6 sky142 86.0G 360G 9 40.0k 2 27.1k exists,up
7 sky142 179G 267G 7 184k 2 119k exists,up
8 sky142 90.8G 355G 0 18.3k 19 1553k exists,up
9 sky142 201G 245G 8 35.1k 16 1450k exists,up
10 sky142 108G 337G 6 51.1k 11 755k exists,up
11 sky142 98.5G 348G 0 6553 2 41.6k exists,up
12 sky143 201G 245G 16 100k 3 230k exists,up
13 sky143 122G 323G 0 0 0 0 exists,up
14 sky143 88.0G 358G 15 76.0k 47 2970k exists,up
15 sky143 100G 346G 7 183k 14 1286k exists,up
16 sky143 127G 319G 5 28.0k 15 659k exists,up
17 sky143 132G 314G 23 225k 9 491k exists,up
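
Recovery progress can be followed by re-running status from time to time. Alternatively, assuming the native Ceph CLI is available on the host, the cluster can be watched continuously; this is a sketch, not part of the storage shell:

# Refresh the cluster summary every 30 seconds
watch -n 30 'ceph -s'

# Or stream cluster log events, including recovery progress, as they happen
ceph -w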

Results:

Wait for a while and check the status again.

We have successfully removed the failed hard disk, and the health status is back to HEALTH_OK.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 753 active+clean

io:
client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 148G 298G 0 0 0 0 exists,up
1 sky141 176G 269G 5 23.1k 1 0 exists,up
2 sky141 202G 243G 0 28.0k 1 0 exists,up
3 sky141 220G 225G 0 3276 0 0 exists,up
6 sky142 86.1G 360G 4 20.7k 0 0 exists,up
7 sky142 180G 266G 0 0 0 0 exists,up
8 sky142 89.2G 357G 7 49.5k 2 10.3k exists,up
9 sky142 201G 245G 0 819 0 0 exists,up
10 sky142 108G 337G 1 7372 0 5734 exists,up
11 sky142 99.1G 347G 0 12.7k 0 0 exists,up
12 sky143 199G 247G 1 5734 1 0 exists,up
13 sky143 112G 333G 4 22.3k 0 0 exists,up
14 sky143 86.3G 360G 1 18.3k 2 90 exists,up
15 sky143 98.7G 347G 0 16.0k 1 0 exists,up
16 sky143 128G 318G 1 4915 2 9027 exists,up
17 sky143 141G 305G 2 22.3k 0 0 exists,up

If the disk was originally encrypted, no further action is needed to address data security concerns.

If the disk was added in raw mode, it is recommended to wipe it with the shred command.

# With the default number of overwrite passes (3)
sudo shred -vfz [disk]

# With a custom number of overwrite passes
sudo shred -vfz -n [num_of_iteration] [disk]

For more detailed options, please consult the shred man page.
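
If you are unsure whether a disk was added encrypted or in raw mode, one hedged way to check from the host shell (assuming the disk is still attached as /dev/sde; depending on the setup, the encryption layer may sit on a partition or logical volume rather than the raw device) is:

# Look for crypto_LUKS in FSTYPE, or an open "crypt" device, anywhere in the device tree
lsblk -o NAME,TYPE,FSTYPE /dev/sde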

Although an encrypted disk does not need to be shredded after removal, keep in mind that using encryption costs some degree of CPU and I/O efficiency.
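
As a rough, illustrative way to gauge the CPU-side cost of encryption on a host (this measures in-memory cipher throughput only, not end-to-end disk I/O), the standard cryptsetup tool includes a benchmark:

# Report how fast this host can encrypt and decrypt with common ciphers such as aes-xts
sudo cryptsetup benchmark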


Add a New Disk

Scenario: We have a new hard disk to add to our storage pool.

Connect to the host where the new hard disk is installed.
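
Optionally, confirm that the operating system detects the new device before adding it. This is a hedged example assuming shell access on the host; /dev/sde, lsblk, and smartctl are not part of the storage CLI:

# Confirm the new disk is visible and matches the expected size and serial number
lsblk -o NAME,SIZE,SERIAL,MODEL /dev/sde

# Optional SMART health check, if smartmontools is installed
sudo smartctl -H /dev/sde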

Add the new disk to the node with the CLI command storage > add_disk.

sky141:storage> add_disk
index name size serial
--
1 /dev/sde 894.3G S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.

Wait for a moment; automatic recovery should start.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 277/438826 objects degraded (0.063%)
82685/438826 objects misplaced (18.842%)
660 active+clean
51 active+remapped+backfilling
39 active+remapped+backfill_wait
2 active+remapped
1 active+undersized+degraded+remapped+backfilling

io:
client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
recovery: 873 MiB/s, 2 keys/s, 165 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 149G 297G 0 6553 0 0 exists,up
1 sky141 175G 271G 0 4096 1 18.6k exists,up
2 sky141 202G 244G 0 66.3k 3 34.4k exists,up
3 sky141 221G 225G 4 16.0k 1 17.0k exists,up
4 sky141 6397M 440G 0 0 0 0 exists,up
5 sky141 3304M 443G 0 0 0 72 exists,up
6 sky142 86.3G 360G 3 15.4k 3 60.5k exists,up
7 sky142 180G 266G 0 585 1 8192 exists,up
8 sky142 89.2G 357G 0 23.1k 5 128k exists,up
9 sky142 201G 245G 11 70.3k 3 46.4k exists,up
10 sky142 109G 337G 1 17.1k 1 28.0k exists,up
11 sky142 99.1G 347G 0 0 0 0 exists,up
12 sky143 200G 245G 3 16.0k 4 193k exists,up
13 sky143 114G 332G 0 4915 0 0 exists,up
14 sky143 86.3G 360G 5 41.0k 12 303k exists,up
15 sky143 99.3G 347G 2 89.3k 5 129k exists,up
16 sky143 128G 317G 0 2457 1 16 exists,up
17 sky143 143G 303G 13 84.0k 1 7372 exists,up

Result: The number of OSDs has changed from 16 back to 18.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 150.06k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 64.9G 381G 3 30.3k 0 0 exists,up
1 sky141 180G 266G 8 40.7k 0 0 exists,up
2 sky141 161G 285G 0 8192 0 0 exists,up
3 sky141 131G 314G 3 16.0k 0 0 exists,up
4 sky141 91.1G 355G 1 7372 1 0 exists,up
5 sky141 130G 316G 0 0 1 90 exists,up
6 sky142 96.3G 350G 5 23.1k 0 0 exists,up
7 sky142 165G 281G 0 7372 0 0 exists,up
8 sky142 76.1G 370G 0 0 1 0 exists,up
9 sky142 199G 247G 0 6553 0 0 exists,up
10 sky142 122G 324G 2 9011 0 0 exists,up
11 sky142 95.5G 351G 2 21.5k 0 0 exists,up
12 sky143 184G 262G 2 35.1k 1 0 exists,up
13 sky143 95.7G 350G 0 0 0 0 exists,up
14 sky143 66.4G 380G 9 44.0k 1 0 exists,up
15 sky143 92.9G 353G 1 9011 0 0 exists,up
16 sky143 143G 303G 0 6553 1 16 exists,up
17 sky143 179G 267G 7 52.0k 1 102 exists,up

Remove an OSD

Scenario: One of the nodes has failed to power up for no apparent reason, causing the OSDs hosted by that node to go offline. To restore the storage pool from its HEALTH_WARN status as quickly as possible, take action from any active host in the cluster.

Connect to one of the live hosts.

Remove each failed OSD from the list by using the command storage > remove_osd.

Verify the OSDs are removed with the storage > status command.

s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
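
After each interactive removal, the result can be verified with storage > status as described above or, assuming the native Ceph CLI is available, directly from the shell:

# The removed OSDs should no longer appear under the failed host
ceph osd tree

# The overall health should return to HEALTH_OK once recovery finishes
ceph -s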