Version: 2.4

Hard disk replacement

Remove a failed hard disk

Storage status

Story: If you discover a failed hard disk, you must remove it from your cluster and restore the health of your storage pool. To proceed:

  • Check the storage status before starting.
  • As shown below, two OSDs (Object Storage Daemons) are down because of the failed hard disk: OSD numbers 4 and 5 on the node with the hostname sky141.
    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    2 osds down
    Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.93k objects, 785 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 753 active+clean

    io:
    client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 65.7G 380G 6 40.0k 7 161k exists,up
    1 sky141 181G 265G 8 32.7k 1 58.4k exists,up
    2 sky141 162G 283G 0 4096 15 604k exists,up
    3 sky141 133G 313G 0 1638 2 29.6k exists,up
    4 sky141 91.8G 354G 14 97.5k 6 39.2k exists
    5 sky141 130G 315G 8 39.1k 3 88.9k exists
    6 sky142 96.0G 350G 9 50.3k 3 160k exists,up
    7 sky142 165G 281G 0 0 1 89.6k exists,up
    8 sky142 75.8G 370G 0 6553 1 25.6k exists,up
    9 sky142 199G 247G 0 3276 3 172k exists,up
    10 sky142 122G 324G 2 13.5k 9 510k exists,up
    11 sky142 95.3G 351G 1 4096 6 126k exists,up
    12 sky143 184G 262G 3 12.0k 1 25.6k exists,up
    13 sky143 93.6G 353G 0 0 0 5734 exists,up
    14 sky143 67.8G 378G 12 71.1k 13 364k exists,up
    15 sky143 92.6G 354G 0 819 0 0 exists,up
    16 sky143 142G 303G 0 819 2 24.0k exists,up
    17 sky143 179G 267G 0 2457 5 99.2k exists,up
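The degraded-redundancy figure in the health line is simply the ratio of degraded object replicas to total object replicas. A quick sanity check of the numbers reported above:

```python
# Sanity-check the degraded-redundancy percentage from the HEALTH_WARN line:
# 1611 degraded out of 44204 total object replicas (values from the status above).
degraded, total = 1611, 44204
pct = degraded / total * 100
print(f"{pct:.3f}% of object replicas degraded")  # → 3.644%, matching the health line
```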

Remove disk

  • Connect to the host sky141.

  • Run the remove_disk CLI command; it shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3.

  • Remove /dev/sde from the Ceph pool.

  • Remove the hard disk from the node:

    sky141:storage> remove_disk
    index name size osd serial
    --
    1 /dev/sda 894.3G 0 1 S40FNA0M800607
    2 /dev/sdc 894.3G 2 3 S40FNA0M800598
    3 /dev/sde 894.3G 4 5 S40FNA0M800608
    --
    Enter the index of disk to be removed: 3
    Disk removal mode (safe/force): force
    force mode immediately destroys disk data without taking into accounts of
    storage status so USE IT AT YOUR OWN RISK.
    Enter 'YES' to confirm: YES
  • Check the status of the storage pool; Ceph is recovering the data automatically:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.94k objects, 785 GiB
    usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs: 6075/438706 objects degraded (1.385%)
    5463/438706 objects misplaced (1.245%)
    738 active+clean
    8 active+undersized+degraded+remapped+backfilling
    7 active+remapped+backfilling

    io:
    client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
    recovery: 127 MiB/s, 27 objects/s

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 141G 305G 1 28.0k 1 42.4k exists,up
    1 sky141 177G 268G 11 88.0k 3 26.3k exists,up
    2 sky141 212G 233G 2 12.7k 0 0 exists,up
    3 sky141 193G 253G 3 31.1k 7 634k exists,up
    6 sky142 86.0G 360G 9 40.0k 2 27.1k exists,up
    7 sky142 179G 267G 7 184k 2 119k exists,up
    8 sky142 90.8G 355G 0 18.3k 19 1553k exists,up
    9 sky142 201G 245G 8 35.1k 16 1450k exists,up
    10 sky142 108G 337G 6 51.1k 11 755k exists,up
    11 sky142 98.5G 348G 0 6553 2 41.6k exists,up
    12 sky143 201G 245G 16 100k 3 230k exists,up
    13 sky143 122G 323G 0 0 0 0 exists,up
    14 sky143 88.0G 358G 15 76.0k 47 2970k exists,up
    15 sky143 100G 346G 7 183k 14 1286k exists,up
    16 sky143 127G 319G 5 28.0k 15 659k exists,up
    17 sky143 132G 314G 23 225k 9 491k exists,up
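The io: recovery line, together with the degraded-object count, gives a rough idea of how long the automatic recovery will take. Recovery rates fluctuate during backfill, so treat this as a ballpark only:

```python
# Rough ETA for recovery, using figures from the status output above.
# Rates fluctuate during backfill, so this is only an order-of-magnitude estimate.
degraded_objects = 6075   # from the pgs section
rate = 27                 # objects/s, from the io: recovery line
eta = degraded_objects / rate
print(f"~{eta:.0f} s (~{eta / 60:.1f} min) until degraded objects are recovered")
```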

Results:

  • Wait for a while and check the status again.

  • We have successfully removed the failed hard disk, and the health status is HEALTH_OK:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs: 753 active+clean

    io:
    client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 148G 298G 0 0 0 0 exists,up
    1 sky141 176G 269G 5 23.1k 1 0 exists,up
    2 sky141 202G 243G 0 28.0k 1 0 exists,up
    3 sky141 220G 225G 0 3276 0 0 exists,up
    6 sky142 86.1G 360G 4 20.7k 0 0 exists,up
    7 sky142 180G 266G 0 0 0 0 exists,up
    8 sky142 89.2G 357G 7 49.5k 2 10.3k exists,up
    9 sky142 201G 245G 0 819 0 0 exists,up
    10 sky142 108G 337G 1 7372 0 5734 exists,up
    11 sky142 99.1G 347G 0 12.7k 0 0 exists,up
    12 sky143 199G 247G 1 5734 1 0 exists,up
    13 sky143 112G 333G 4 22.3k 0 0 exists,up
    14 sky143 86.3G 360G 1 18.3k 2 90 exists,up
    15 sky143 98.7G 347G 0 16.0k 1 0 exists,up
    16 sky143 128G 318G 1 4915 2 9027 exists,up
    17 sky143 141G 305G 2 22.3k 0 0 exists,up
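After recovery, it can be useful to confirm that usage is reasonably balanced across hosts. A hypothetical helper (not part of the storage CLI) that tallies the USED column per host, assuming the G-suffixed table layout shown above:

```python
# Tally per-host usage from a `status` OSD table (hypothetical helper; assumes
# the "<id> <host> <used>G <avail>G ..." row layout shown in the outputs above).
from collections import defaultdict

def used_per_host(table: str) -> dict:
    totals = defaultdict(float)
    for line in table.strip().splitlines():
        _osd_id, host, used, *_rest = line.split()
        totals[host] += float(used.rstrip("G"))  # GiB
    return dict(totals)

# First few rows of the HEALTH_OK table above:
rows = """\
0 sky141 148G 298G
1 sky141 176G 269G
6 sky142 86.1G 360G
12 sky143 199G 247G"""
print(used_per_host(rows))  # → {'sky141': 324.0, 'sky142': 86.1, 'sky143': 199.0}
```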

Add a new disk

Story: We have bought a new hard disk to add to our storage pool.

  • Connect to the host where the new hard disk was installed.

  • Add the new disk to the node with the add_disk CLI command:

    sky141:storage> add_disk
    index name size serial
    --
    1 /dev/sde 894.3G S40FNA0M800608
    --
    Found 1 available disks
    Enter the index to add this disk into the pool: 1
    Enter 'YES' to confirm: YES
    Add disk /dev/sde successfully.
  • Wait for a moment; automatic recovery starts:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 277/438826 objects degraded (0.063%)
    82685/438826 objects misplaced (18.842%)
    660 active+clean
    51 active+remapped+backfilling
    39 active+remapped+backfill_wait
    2 active+remapped
    1 active+undersized+degraded+remapped+backfilling

    io:
    client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
    recovery: 873 MiB/s, 2 keys/s, 165 objects/s

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 149G 297G 0 6553 0 0 exists,up
    1 sky141 175G 271G 0 4096 1 18.6k exists,up
    2 sky141 202G 244G 0 66.3k 3 34.4k exists,up
    3 sky141 221G 225G 4 16.0k 1 17.0k exists,up
    4 sky141 6397M 440G 0 0 0 0 exists,up
    5 sky141 3304M 443G 0 0 0 72 exists,up
    6 sky142 86.3G 360G 3 15.4k 3 60.5k exists,up
    7 sky142 180G 266G 0 585 1 8192 exists,up
    8 sky142 89.2G 357G 0 23.1k 5 128k exists,up
    9 sky142 201G 245G 11 70.3k 3 46.4k exists,up
    10 sky142 109G 337G 1 17.1k 1 28.0k exists,up
    11 sky142 99.1G 347G 0 0 0 0 exists,up
    12 sky143 200G 245G 3 16.0k 4 193k exists,up
    13 sky143 114G 332G 0 4915 0 0 exists,up
    14 sky143 86.3G 360G 5 41.0k 12 303k exists,up
    15 sky143 99.3G 347G 2 89.3k 5 129k exists,up
    16 sky143 128G 317G 0 2457 1 16 exists,up
    17 sky143 143G 303G 13 84.0k 1 7372 exists,up
  • Result: the number of OSDs has changed from 16 to 18:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 150.06k objects, 786 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 753 active+clean

    io:
    client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 64.9G 381G 3 30.3k 0 0 exists,up
    1 sky141 180G 266G 8 40.7k 0 0 exists,up
    2 sky141 161G 285G 0 8192 0 0 exists,up
    3 sky141 131G 314G 3 16.0k 0 0 exists,up
    4 sky141 91.1G 355G 1 7372 1 0 exists,up
    5 sky141 130G 316G 0 0 1 90 exists,up
    6 sky142 96.3G 350G 5 23.1k 0 0 exists,up
    7 sky142 165G 281G 0 7372 0 0 exists,up
    8 sky142 76.1G 370G 0 0 1 0 exists,up
    9 sky142 199G 247G 0 6553 0 0 exists,up
    10 sky142 122G 324G 2 9011 0 0 exists,up
    11 sky142 95.5G 351G 2 21.5k 0 0 exists,up
    12 sky143 184G 262G 2 35.1k 1 0 exists,up
    13 sky143 95.7G 350G 0 0 0 0 exists,up
    14 sky143 66.4G 380G 9 44.0k 1 0 exists,up
    15 sky143 92.9G 353G 1 9011 0 0 exists,up
    16 sky143 143G 303G 0 6553 1 16 exists,up
    17 sky143 179G 267G 7 52.0k 1 102 exists,up
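Note that when a disk is added, most of the transient HEALTH_WARN churn is misplaced data (replicas being rebalanced onto the new OSDs) rather than degraded data. The percentage in the intermediate status is again just a ratio:

```python
# Misplaced-object ratio during rebalancing after add_disk
# (figures from the intermediate status output above).
misplaced, total = 82685, 438826
print(f"{misplaced / total * 100:.3f}% of object replicas misplaced")  # → 18.842%
```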

Remove an OSD

Story: One of your nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, you need to take action from any active host in your cluster.

  • Connect to one of your live hosts.
  • Remove each failed OSD from the list with the storage > remove_osd command.
  • Verify that the OSDs are removed with the storage > status command:
    s2:storage> remove_osd
    Enter osd id to be removed:
    1: down (1.81920)
    2: down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
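When the remove_osd menu is long, it helps to spot the down entries first, since those are the ones to remove. A small hypothetical filter over the menu text shown above (not part of the storage CLI):

```python
# Pick out the "down" entries from a remove_osd menu listing
# (hypothetical parser; assumes the "N: <label>" lines shown above).
menu = """\
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)"""

down_indices = [line.split(":")[0] for line in menu.splitlines() if "down" in line]
print(down_indices)  # → ['1', '2']
```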