Hard disk replacement
Remove a failed hard disk
Storage status
Story: If you discover a failed hard disk, it is essential to remove it from your cluster and restore the health of your storage pool. To proceed:
- Check the storage status before starting:
- As shown below, two OSDs (Object Storage Daemons) are down due to the failed hard disk: OSD numbers 4 and 5 on the node with the hostname sky141.
sky141:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
            2 osds down
            Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

  services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   25 pools, 753 pgs
    objects: 149.93k objects, 785 GiB
    usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs:     753 active+clean

  io:
    client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  sky141  65.7G  380G   6       40.0k    7       161k     exists,up
 1  sky141  181G   265G   8       32.7k    1       58.4k    exists,up
 2  sky141  162G   283G   0       4096     15      604k     exists,up
 3  sky141  133G   313G   0       1638     2       29.6k    exists,up
 4  sky141  91.8G  354G   14      97.5k    6       39.2k    exists
 5  sky141  130G   315G   8       39.1k    3       88.9k    exists
 6  sky142  96.0G  350G   9       50.3k    3       160k     exists,up
 7  sky142  165G   281G   0       0        1       89.6k    exists,up
 8  sky142  75.8G  370G   0       6553     1       25.6k    exists,up
 9  sky142  199G   247G   0       3276     3       172k     exists,up
10  sky142  122G   324G   2       13.5k    9       510k     exists,up
11  sky142  95.3G  351G   1       4096     6       126k     exists,up
12  sky143  184G   262G   3       12.0k    1       25.6k    exists,up
13  sky143  93.6G  353G   0       0        0       5734     exists,up
14  sky143  67.8G  378G   12      71.1k    13      364k     exists,up
15  sky143  92.6G  354G   0       819      0       0        exists,up
16  sky143  142G   303G   0       819      2       24.0k    exists,up
17  sky143  179G   267G   0       2457     5       99.2k    exists,up
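For reference, if you also have direct access to the underlying Ceph CLI on one of the nodes, the same health check can be done with standard Ceph commands. A minimal sketch (assuming the ceph binary and an admin keyring are available on the host):

# Cluster-wide health summary, including the "2 osds down" warning
ceph -s
ceph health detail

# List only the down OSDs in the CRUSH tree to locate the affected host
ceph osd tree down

# Per-OSD utilisation and placement group counts
ceph osd df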
Remove disk
Connect to the host sky141 and run the CLI command remove_disk. It shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3, so we remove /dev/sde from the Ceph pool and then pull the failed hard disk out of the node.
sky141:storage> remove_disk
index  name      size    osd  serial
--
1      /dev/sda  894.3G  0 1  S40FNA0M800607
2      /dev/sdc  894.3G  2 3  S40FNA0M800598
3      /dev/sde  894.3G  4 5  S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES
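For background, the force removal performed above roughly corresponds to the following standard Ceph operations, which the storage CLI wraps for you. This is only an illustrative sketch (using osd.4 and osd.5 from the listing above) and should not normally be run by hand on this appliance:

# Mark the failed OSDs out so Ceph re-replicates their data elsewhere
ceph osd out 4 5

# Remove the OSDs from the CRUSH map, auth database and OSD map
ceph osd purge 4 --yes-i-really-mean-it
ceph osd purge 5 --yes-i-really-mean-it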
Let's check the status of our storage pool; Ceph is recovering the data automatically:
sky141:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
            Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

  services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   25 pools, 753 pgs
    objects: 149.94k objects, 785 GiB
    usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs:     6075/438706 objects degraded (1.385%)
             5463/438706 objects misplaced (1.245%)
             738 active+clean
             8   active+undersized+degraded+remapped+backfilling
             7   active+remapped+backfilling

  io:
    client:   4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
    recovery: 127 MiB/s, 27 objects/s

ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  sky141  141G   305G   1       28.0k    1       42.4k    exists,up
 1  sky141  177G   268G   11      88.0k    3       26.3k    exists,up
 2  sky141  212G   233G   2       12.7k    0       0        exists,up
 3  sky141  193G   253G   3       31.1k    7       634k     exists,up
 6  sky142  86.0G  360G   9       40.0k    2       27.1k    exists,up
 7  sky142  179G   267G   7       184k     2       119k     exists,up
 8  sky142  90.8G  355G   0       18.3k    19      1553k    exists,up
 9  sky142  201G   245G   8       35.1k    16      1450k    exists,up
10  sky142  108G   337G   6       51.1k    11      755k     exists,up
11  sky142  98.5G  348G   0       6553     2       41.6k    exists,up
12  sky143  201G   245G   16      100k     3       230k     exists,up
13  sky143  122G   323G   0       0        0       0        exists,up
14  sky143  88.0G  358G   15      76.0k    47      2970k    exists,up
15  sky143  100G   346G   7       183k     14      1286k    exists,up
16  sky143  127G   319G   5       28.0k    15      659k     exists,up
17  sky143  132G   314G   23      225k     9       491k     exists,up
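If you prefer to follow the recovery from the Ceph side instead of re-running status by hand, two standard commands can be used (again assuming direct ceph CLI access):

# Stream cluster events, including recovery and backfill progress
ceph -w

# One-line summary of placement group states
ceph pg stat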
Results
Wait for a while and check the status again. We have successfully removed the failed hard disk and the health status is OK.
sky141:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs:     753 active+clean

  io:
    client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  sky141  148G   298G   0       0        0       0        exists,up
 1  sky141  176G   269G   5       23.1k    1       0        exists,up
 2  sky141  202G   243G   0       28.0k    1       0        exists,up
 3  sky141  220G   225G   0       3276     0       0        exists,up
 6  sky142  86.1G  360G   4       20.7k    0       0        exists,up
 7  sky142  180G   266G   0       0        0       0        exists,up
 8  sky142  89.2G  357G   7       49.5k    2       10.3k    exists,up
 9  sky142  201G   245G   0       819      0       0        exists,up
10  sky142  108G   337G   1       7372     0       5734     exists,up
11  sky142  99.1G  347G   0       12.7k    0       0        exists,up
12  sky143  199G   247G   1       5734     1       0        exists,up
13  sky143  112G   333G   4       22.3k    0       0        exists,up
14  sky143  86.3G  360G   1       18.3k    2       90       exists,up
15  sky143  98.7G  347G   0       16.0k    1       0        exists,up
16  sky143  128G   318G   1       4915     2       9027     exists,up
17  sky143  141G   305G   2       22.3k    0       0        exists,up
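As an optional cross-check from the Ceph side, the end state can also be confirmed directly (assuming ceph CLI access):

ceph health    # expected to print HEALTH_OK
ceph osd stat  # expected to report 16 osds: 16 up, 16 in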
Add a new disk
Story: We bought a new hard disk to add to our storage pool.
Connect to the host where the new hard disk was installed (sky141 in this example).
Add the new disk to the node with the CLI command add_disk:
sky141:storage> add_disk
index  name      size    serial
--
1      /dev/sde  894.3G  S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.
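If you want to double-check that the new disk was actually turned into OSDs before looking at the pool status, the standard Ceph views can be consulted (assuming direct ceph CLI access on the node):

# The two new OSDs should reappear under host sky141
ceph osd tree

# Per-OSD capacity; freshly added OSDs start almost empty
ceph osd df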
Wait for a moment; auto recovery starts:
sky141:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
            Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

  services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs:     277/438826 objects degraded (0.063%)
             82685/438826 objects misplaced (18.842%)
             660 active+clean
             51  active+remapped+backfilling
             39  active+remapped+backfill_wait
             2   active+remapped
             1   active+undersized+degraded+remapped+backfilling

  io:
    client:   727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
    recovery: 873 MiB/s, 2 keys/s, 165 objects/s

ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  sky141  149G   297G   0       6553     0       0        exists,up
 1  sky141  175G   271G   0       4096     1       18.6k    exists,up
 2  sky141  202G   244G   0       66.3k    3       34.4k    exists,up
 3  sky141  221G   225G   4       16.0k    1       17.0k    exists,up
 4  sky141  6397M  440G   0       0        0       0        exists,up
 5  sky141  3304M  443G   0       0        0       72       exists,up
 6  sky142  86.3G  360G   3       15.4k    3       60.5k    exists,up
 7  sky142  180G   266G   0       585      1       8192     exists,up
 8  sky142  89.2G  357G   0       23.1k    5       128k     exists,up
 9  sky142  201G   245G   11      70.3k    3       46.4k    exists,up
10  sky142  109G   337G   1       17.1k    1       28.0k    exists,up
11  sky142  99.1G  347G   0       0        0       0        exists,up
12  sky143  200G   245G   3       16.0k    4       193k     exists,up
13  sky143  114G   332G   0       4915     0       0        exists,up
14  sky143  86.3G  360G   5       41.0k    12      303k     exists,up
15  sky143  99.3G  347G   2       89.3k    5       129k     exists,up
16  sky143  128G   317G   0       2457     1       16       exists,up
17  sky143  143G   303G   13      84.0k    1       7372     exists,up
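Backfilling onto the new OSDs competes with client I/O. If it needs to be throttled, the usual Ceph knob is the per-OSD backfill limit; this is optional, release-dependent, and the defaults are normally fine:

# Limit concurrent backfill operations per OSD (value shown is only an example)
ceph config set osd osd_max_backfills 1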
Result: the OSD count has changed from 16 to 18.
sky141:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   25 pools, 753 pgs
    objects: 150.06k objects, 786 GiB
    usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs:     753 active+clean

  io:
    client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  sky141  64.9G  381G   3       30.3k    0       0        exists,up
 1  sky141  180G   266G   8       40.7k    0       0        exists,up
 2  sky141  161G   285G   0       8192     0       0        exists,up
 3  sky141  131G   314G   3       16.0k    0       0        exists,up
 4  sky141  91.1G  355G   1       7372     1       0        exists,up
 5  sky141  130G   316G   0       0        1       90       exists,up
 6  sky142  96.3G  350G   5       23.1k    0       0        exists,up
 7  sky142  165G   281G   0       7372     0       0        exists,up
 8  sky142  76.1G  370G   0       0        1       0        exists,up
 9  sky142  199G   247G   0       6553     0       0        exists,up
10  sky142  122G   324G   2       9011     0       0        exists,up
11  sky142  95.5G  351G   2       21.5k    0       0        exists,up
12  sky143  184G   262G   2       35.1k    1       0        exists,up
13  sky143  95.7G  350G   0       0        0       0        exists,up
14  sky143  66.4G  380G   9       44.0k    1       0        exists,up
15  sky143  92.9G  353G   1       9011     0       0        exists,up
16  sky143  143G   303G   0       6553     1       16       exists,up
17  sky143  179G   267G   7       52.0k    1       102      exists,up
Remove an OSD
Story: One of your nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, you need to take action from any active host in your cluster.
- Connect to one of your live hosts.
- Start removing the OSDs by using the storage > remove_osd command to remove all failed OSDs from the list.
- Verify the OSDs are removed with the storage > status command.
s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
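If you also have direct ceph CLI access, the removal can be cross-checked there, and once every OSD of the dead node is gone its now-empty host bucket can optionally be dropped from the CRUSH map (the hostname below is a placeholder):

# The removed OSDs should no longer appear
ceph osd tree
ceph osd stat

# Optional: remove the empty host bucket of the failed node
ceph osd crush remove <failed-hostname>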