Hard disk replacement
Remove a failed hard disk
Storage status
Story: If you discover a failed hard disk, it is essential to remove it from your cluster and restore the health of your storage pool. To proceed:
- Check the storage status before starting:
- As shown below, two OSDs (Object Storage Daemons) are down due to the failed hard disk—OSD numbers 4 and 5 on the node with the hostname sky141.
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded
services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.93k objects, 785 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean
io:
client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 65.7G 380G 6 40.0k 7 161k exists,up
1 sky141 181G 265G 8 32.7k 1 58.4k exists,up
2 sky141 162G 283G 0 4096 15 604k exists,up
3 sky141 133G 313G 0 1638 2 29.6k exists,up
4 sky141 91.8G 354G 14 97.5k 6 39.2k exists
5 sky141 130G 315G 8 39.1k 3 88.9k exists
6 sky142 96.0G 350G 9 50.3k 3 160k exists,up
7 sky142 165G 281G 0 0 1 89.6k exists,up
8 sky142 75.8G 370G 0 6553 1 25.6k exists,up
9 sky142 199G 247G 0 3276 3 172k exists,up
10 sky142 122G 324G 2 13.5k 9 510k exists,up
11 sky142 95.3G 351G 1 4096 6 126k exists,up
12 sky143 184G 262G 3 12.0k 1 25.6k exists,up
13 sky143 93.6G 353G 0 0 0 5734 exists,up
14 sky143 67.8G 378G 12 71.1k 13 364k exists,up
15 sky143 92.6G 354G 0 819 0 0 exists,up
16 sky143 142G 303G 0 819 2 24.0k exists,up
17 sky143 179G 267G 0 2457 5 99.2k exists,up
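For reference, the same information can be gathered with standard Ceph tooling, assuming the node exposes a regular Ceph admin shell (a hedged sketch, not part of the appliance CLI):

ceph health detail    # lists the down OSDs and the degraded PGs behind HEALTH_WARN
ceph osd tree         # osd.4 and osd.5 should show as "down" under host sky141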
Remove disk
- Connect to the host sky141.
- Run the CLI command remove_disk.
- The listing shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3. Remove /dev/sde from the Ceph pool (see the sketch and transcript below), then remove the hard disk from the node.
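Under the hood, removing a failed disk like this corresponds roughly to the following plain-Ceph steps (a sketch only, assuming standard ceph/ceph-volume tooling on a non-containerized deployment; the appliance's remove_disk command is what actually performs the removal):

ceph osd out 4                               # stop placing data on the failed OSDs
ceph osd out 5
systemctl stop ceph-osd@4 ceph-osd@5         # run on sky141, the host carrying the OSDs
ceph osd purge 4 --yes-i-really-mean-it      # remove the OSDs from the CRUSH map, auth keys and OSD map
ceph osd purge 5 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sde --destroy       # wipe LVM/OSD metadata before physically pulling the disk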
sky141:storage> remove_disk
index name size osd serial
--
1 /dev/sda 894.3G 0 1 S40FNA0M800607
2 /dev/sdc 894.3G 2 3 S40FNA0M800598
3 /dev/sde 894.3G 4 5 S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES
Let's check the status of our storage pool; Ceph is recovering the data automatically.
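While it recovers, the progress can also be followed with ordinary Ceph commands (a sketch, assuming shell access to a Ceph admin node):

ceph -s               # shows recovery throughput under the "io:" section
ceph pg stat          # summarizes how many PGs are still degraded or backfilling
watch -n 10 ceph -s   # refresh the status automatically every 10 seconds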
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized
services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.94k objects, 785 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 6075/438706 objects degraded (1.385%)
5463/438706 objects misplaced (1.245%)
738 active+clean
8 active+undersized+degraded+remapped+backfilling
7 active+remapped+backfilling
io:
client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
recovery: 127 MiB/s, 27 objects/s
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 141G 305G 1 28.0k 1 42.4k exists,up
1 sky141 177G 268G 11 88.0k 3 26.3k exists,up
2 sky141 212G 233G 2 12.7k 0 0 exists,up
3 sky141 193G 253G 3 31.1k 7 634k exists,up
6 sky142 86.0G 360G 9 40.0k 2 27.1k exists,up
7 sky142 179G 267G 7 184k 2 119k exists,up
8 sky142 90.8G 355G 0 18.3k 19 1553k exists,up
9 sky142 201G 245G 8 35.1k 16 1450k exists,up
10 sky142 108G 337G 6 51.1k 11 755k exists,up
11 sky142 98.5G 348G 0 6553 2 41.6k exists,up
12 sky143 201G 245G 16 100k 3 230k exists,up
13 sky143 122G 323G 0 0 0 0 exists,up
14 sky143 88.0G 358G 15 76.0k 47 2970k exists,up
15 sky143 100G 346G 7 183k 14 1286k exists,up
16 sky143 127G 319G 5 28.0k 15 659k exists,up
17 sky143 132G 314G 23 225k 9 491k exists,up
Results:
Wait for a while and check the status again. We have successfully removed the failed hard disk, and the health status is back to HEALTH_OK.
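A quick way to confirm this outside the appliance CLI, assuming standard Ceph tooling is available, is:

ceph health           # should now print HEALTH_OK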
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK
services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 753 active+clean
io:
client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 148G 298G 0 0 0 0 exists,up
1 sky141 176G 269G 5 23.1k 1 0 exists,up
2 sky141 202G 243G 0 28.0k 1 0 exists,up
3 sky141 220G 225G 0 3276 0 0 exists,up
6 sky142 86.1G 360G 4 20.7k 0 0 exists,up
7 sky142 180G 266G 0 0 0 0 exists,up
8 sky142 89.2G 357G 7 49.5k 2 10.3k exists,up
9 sky142 201G 245G 0 819 0 0 exists,up
10 sky142 108G 337G 1 7372 0 5734 exists,up
11 sky142 99.1G 347G 0 12.7k 0 0 exists,up
12 sky143 199G 247G 1 5734 1 0 exists,up
13 sky143 112G 333G 4 22.3k 0 0 exists,up
14 sky143 86.3G 360G 1 18.3k 2 90 exists,up
15 sky143 98.7G 347G 0 16.0k 1 0 exists,up
16 sky143 128G 318G 1 4915 2 9027 exists,up
17 sky143 141G 305G 2 22.3k 0 0 exists,up
Add a new disk
Story: We bought a new hard disk to add to our storage pool.
- Connect to the host where the new hard disk is installed.
- Add the new disk to the node with the CLI command add_disk, as shown below.
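For comparison, on a plain Ceph deployment the new disk would be turned into an OSD with something like the following (a sketch only; the exact command depends on how the cluster was deployed, and the appliance's add_disk command handles this for you):

ceph-volume lvm create --data /dev/sde       # non-containerized clusters: create and start a new OSD on the disk
# ceph orch daemon add osd sky141:/dev/sde   # cephadm-managed clusters would use the orchestrator instead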
sky141:storage> add_disk
index name size serial
--
1 /dev/sde 894.3G S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.
Wait for a moment; automatic recovery starts.
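The new OSDs and the rebalancing onto them can also be checked with standard Ceph commands (a sketch, assuming a regular Ceph admin shell):

ceph osd tree        # the re-added OSDs should appear under host sky141 as "up"
ceph osd df          # per-OSD usage; data gradually rebalances onto the new OSDs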
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded
services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 277/438826 objects degraded (0.063%)
82685/438826 objects misplaced (18.842%)
660 active+clean
51 active+remapped+backfilling
39 active+remapped+backfill_wait
2 active+remapped
1 active+undersized+degraded+remapped+backfilling
io:
client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
recovery: 873 MiB/s, 2 keys/s, 165 objects/s
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 149G 297G 0 6553 0 0 exists,up
1 sky141 175G 271G 0 4096 1 18.6k exists,up
2 sky141 202G 244G 0 66.3k 3 34.4k exists,up
3 sky141 221G 225G 4 16.0k 1 17.0k exists,up
4 sky141 6397M 440G 0 0 0 0 exists,up
5 sky141 3304M 443G 0 0 0 72 exists,up
6 sky142 86.3G 360G 3 15.4k 3 60.5k exists,up
7 sky142 180G 266G 0 585 1 8192 exists,up
8 sky142 89.2G 357G 0 23.1k 5 128k exists,up
9 sky142 201G 245G 11 70.3k 3 46.4k exists,up
10 sky142 109G 337G 1 17.1k 1 28.0k exists,up
11 sky142 99.1G 347G 0 0 0 0 exists,up
12 sky143 200G 245G 3 16.0k 4 193k exists,up
13 sky143 114G 332G 0 4915 0 0 exists,up
14 sky143 86.3G 360G 5 41.0k 12 303k exists,up
15 sky143 99.3G 347G 2 89.3k 5 129k exists,up
16 sky143 128G 317G 0 2457 1 16 exists,up
17 sky143 143G 303G 13 84.0k 1 7372 exists,up
Result: The OSD count has changed from 16 back to 18.
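Outside the appliance CLI, the OSD count can be confirmed with a single standard Ceph command (a sketch, assuming ceph tooling is available):

ceph osd stat        # e.g. "18 osds: 18 up, 18 in"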
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK
services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
rgw: 3 daemons active (3 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 150.06k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean
io:
client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 64.9G 381G 3 30.3k 0 0 exists,up
1 sky141 180G 266G 8 40.7k 0 0 exists,up
2 sky141 161G 285G 0 8192 0 0 exists,up
3 sky141 131G 314G 3 16.0k 0 0 exists,up
4 sky141 91.1G 355G 1 7372 1 0 exists,up
5 sky141 130G 316G 0 0 1 90 exists,up
6 sky142 96.3G 350G 5 23.1k 0 0 exists,up
7 sky142 165G 281G 0 7372 0 0 exists,up
8 sky142 76.1G 370G 0 0 1 0 exists,up
9 sky142 199G 247G 0 6553 0 0 exists,up
10 sky142 122G 324G 2 9011 0 0 exists,up
11 sky142 95.5G 351G 2 21.5k 0 0 exists,up
12 sky143 184G 262G 2 35.1k 1 0 exists,up
13 sky143 95.7G 350G 0 0 0 0 exists,up
14 sky143 66.4G 380G 9 44.0k 1 0 exists,up
15 sky143 92.9G 353G 1 9011 0 0 exists,up
16 sky143 143G 303G 0 6553 1 16 exists,up
17 sky143 179G 267G 7 52.0k 1 102 exists,up
Remove an OSD
Story: One of your nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, you need to take action from any active host in your cluster.
- Connect to one of your live hosts.
- Start removing the OSDs by using the storage > remove_osd command to remove all failed OSDs from the list; a sketch of the equivalent plain-Ceph commands follows the transcript below.
- Verify the OSDs are removed with the storage > status command.
s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
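For reference, on a plain Ceph deployment the equivalent cleanup of a dead OSD would look roughly like this (a sketch only, assuming standard ceph tooling; the OSD ID is taken from the transcript above):

ceph osd out 31
ceph osd purge 31 --yes-i-really-mean-it     # removes the OSD from the CRUSH map, auth keys and OSD map
# On releases without "purge", the long form would be:
# ceph osd crush remove osd.31 && ceph auth del osd.31 && ceph osd rm 31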