Version: 2.4

Hard disk replacement

Remove a failed hard disk

Storage status

Story: If you discover a failed hard disk, you must remove it from your cluster and restore the health of your storage pool. To proceed:

  • Check the storage status before starting.
  • As shown below, two OSDs (Object Storage Daemons) are down because of the failed hard disk: OSD numbers 4 and 5 on the node with the hostname sky141.
    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    2 osds down
    Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.93k objects, 785 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 753 active+clean

    io:
    client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 65.7G 380G 6 40.0k 7 161k exists,up
    1 sky141 181G 265G 8 32.7k 1 58.4k exists,up
    2 sky141 162G 283G 0 4096 15 604k exists,up
    3 sky141 133G 313G 0 1638 2 29.6k exists,up
    4 sky141 91.8G 354G 14 97.5k 6 39.2k exists
    5 sky141 130G 315G 8 39.1k 3 88.9k exists
    6 sky142 96.0G 350G 9 50.3k 3 160k exists,up
    7 sky142 165G 281G 0 0 1 89.6k exists,up
    8 sky142 75.8G 370G 0 6553 1 25.6k exists,up
    9 sky142 199G 247G 0 3276 3 172k exists,up
    10 sky142 122G 324G 2 13.5k 9 510k exists,up
    11 sky142 95.3G 351G 1 4096 6 126k exists,up
    12 sky143 184G 262G 3 12.0k 1 25.6k exists,up
    13 sky143 93.6G 353G 0 0 0 5734 exists,up
    14 sky143 67.8G 378G 12 71.1k 13 364k exists,up
    15 sky143 92.6G 354G 0 819 0 0 exists,up
    16 sky143 142G 303G 0 819 2 24.0k exists,up
    17 sky143 179G 267G 0 2457 5 99.2k exists,up
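The degraded-redundancy figure in the health line is simply the ratio of degraded object replicas to total object replicas. A quick sanity check of the numbers reported above:

```python
# Sanity-check the degraded-redundancy percentage from the HEALTH_WARN line:
# 1611 degraded out of 44204 total object replicas (values from the status above).
degraded, total = 1611, 44204
pct = degraded / total * 100
print(f"{pct:.3f}% of object replicas degraded")  # → 3.644%, matching the health line
```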

Remove disk

  • Connect to the host sky141.

  • Run the remove_disk CLI command; it shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3.

  • Remove /dev/sde from the Ceph pool.

  • Remove the hard disk from the node:

    sky141:storage> remove_disk
    index name size osd serial
    --
    1 /dev/sda 894.3G 0 1 S40FNA0M800607
    2 /dev/sdc 894.3G 2 3 S40FNA0M800598
    3 /dev/sde 894.3G 4 5 S40FNA0M800608
    --
    Enter the index of disk to be removed: 3
    Disk removal mode (safe/force): force
    force mode immediately destroys disk data without taking into accounts of
    storage status so USE IT AT YOUR OWN RISK.
    Enter 'YES' to confirm: YES
  • Check the status of the storage pool; Ceph is recovering the data automatically:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.94k objects, 785 GiB
    usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs: 6075/438706 objects degraded (1.385%)
    5463/438706 objects misplaced (1.245%)
    738 active+clean
    8 active+undersized+degraded+remapped+backfilling
    7 active+remapped+backfilling

    io:
    client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
    recovery: 127 MiB/s, 27 objects/s

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 141G 305G 1 28.0k 1 42.4k exists,up
    1 sky141 177G 268G 11 88.0k 3 26.3k exists,up
    2 sky141 212G 233G 2 12.7k 0 0 exists,up
    3 sky141 193G 253G 3 31.1k 7 634k exists,up
    6 sky142 86.0G 360G 9 40.0k 2 27.1k exists,up
    7 sky142 179G 267G 7 184k 2 119k exists,up
    8 sky142 90.8G 355G 0 18.3k 19 1553k exists,up
    9 sky142 201G 245G 8 35.1k 16 1450k exists,up
    10 sky142 108G 337G 6 51.1k 11 755k exists,up
    11 sky142 98.5G 348G 0 6553 2 41.6k exists,up
    12 sky143 201G 245G 16 100k 3 230k exists,up
    13 sky143 122G 323G 0 0 0 0 exists,up
    14 sky143 88.0G 358G 15 76.0k 47 2970k exists,up
    15 sky143 100G 346G 7 183k 14 1286k exists,up
    16 sky143 127G 319G 5 28.0k 15 659k exists,up
    17 sky143 132G 314G 23 225k 9 491k exists,up
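The io: recovery line, together with the degraded-object count, gives a rough idea of how long the automatic recovery will take. Recovery rates fluctuate during backfill, so treat this as a ballpark only:

```python
# Rough ETA for recovery, using figures from the status output above.
# Rates fluctuate during backfill, so this is only an order-of-magnitude estimate.
degraded_objects = 6075   # from the pgs section
rate = 27                 # objects/s, from the io: recovery line
eta = degraded_objects / rate
print(f"~{eta:.0f} s (~{eta / 60:.1f} min) until degraded objects are recovered")
```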

Results:

  • Wait for a while and check the status again.

  • We have successfully removed the failed hard disk, and the health status is HEALTH_OK:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
    pgs: 753 active+clean

    io:
    client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 148G 298G 0 0 0 0 exists,up
    1 sky141 176G 269G 5 23.1k 1 0 exists,up
    2 sky141 202G 243G 0 28.0k 1 0 exists,up
    3 sky141 220G 225G 0 3276 0 0 exists,up
    6 sky142 86.1G 360G 4 20.7k 0 0 exists,up
    7 sky142 180G 266G 0 0 0 0 exists,up
    8 sky142 89.2G 357G 7 49.5k 2 10.3k exists,up
    9 sky142 201G 245G 0 819 0 0 exists,up
    10 sky142 108G 337G 1 7372 0 5734 exists,up
    11 sky142 99.1G 347G 0 12.7k 0 0 exists,up
    12 sky143 199G 247G 1 5734 1 0 exists,up
    13 sky143 112G 333G 4 22.3k 0 0 exists,up
    14 sky143 86.3G 360G 1 18.3k 2 90 exists,up
    15 sky143 98.7G 347G 0 16.0k 1 0 exists,up
    16 sky143 128G 318G 1 4915 2 9027 exists,up
    17 sky143 141G 305G 2 22.3k 0 0 exists,up
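After recovery, it can be useful to confirm that usage is reasonably balanced across hosts. A hypothetical helper (not part of the storage CLI) that tallies the USED column per host, assuming the G-suffixed table layout shown above:

```python
# Tally per-host usage from a `status` OSD table (hypothetical helper; assumes
# the "<id> <host> <used>G <avail>G ..." row layout shown in the outputs above).
from collections import defaultdict

def used_per_host(table: str) -> dict:
    totals = defaultdict(float)
    for line in table.strip().splitlines():
        _osd_id, host, used, *_rest = line.split()
        totals[host] += float(used.rstrip("G"))  # GiB
    return dict(totals)

# First few rows of the HEALTH_OK table above:
rows = """\
0 sky141 148G 298G
1 sky141 176G 269G
6 sky142 86.1G 360G
12 sky143 199G 247G"""
print(used_per_host(rows))  # → {'sky141': 324.0, 'sky142': 86.1, 'sky143': 199.0}
```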

Add a new disk

Story: We have bought a new hard disk to add to our storage pool.

  • Connect to the host where the new hard disk was installed.

  • Add the new disk to the node with the add_disk CLI command:

    sky141:storage> add_disk
    index name size serial
    --
    1 /dev/sde 894.3G S40FNA0M800608
    --
    Found 1 available disks
    Enter the index to add this disk into the pool: 1
    Enter 'YES' to confirm: YES
    Add disk /dev/sde successfully.
  • Wait for a moment; automatic recovery starts:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 149.99k objects, 786 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 277/438826 objects degraded (0.063%)
    82685/438826 objects misplaced (18.842%)
    660 active+clean
    51 active+remapped+backfilling
    39 active+remapped+backfill_wait
    2 active+remapped
    1 active+undersized+degraded+remapped+backfilling

    io:
    client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
    recovery: 873 MiB/s, 2 keys/s, 165 objects/s

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 149G 297G 0 6553 0 0 exists,up
    1 sky141 175G 271G 0 4096 1 18.6k exists,up
    2 sky141 202G 244G 0 66.3k 3 34.4k exists,up
    3 sky141 221G 225G 4 16.0k 1 17.0k exists,up
    4 sky141 6397M 440G 0 0 0 0 exists,up
    5 sky141 3304M 443G 0 0 0 72 exists,up
    6 sky142 86.3G 360G 3 15.4k 3 60.5k exists,up
    7 sky142 180G 266G 0 585 1 8192 exists,up
    8 sky142 89.2G 357G 0 23.1k 5 128k exists,up
    9 sky142 201G 245G 11 70.3k 3 46.4k exists,up
    10 sky142 109G 337G 1 17.1k 1 28.0k exists,up
    11 sky142 99.1G 347G 0 0 0 0 exists,up
    12 sky143 200G 245G 3 16.0k 4 193k exists,up
    13 sky143 114G 332G 0 4915 0 0 exists,up
    14 sky143 86.3G 360G 5 41.0k 12 303k exists,up
    15 sky143 99.3G 347G 2 89.3k 5 129k exists,up
    16 sky143 128G 317G 0 2457 1 16 exists,up
    17 sky143 143G 303G 13 84.0k 1 7372 exists,up
  • Result: the number of OSDs has changed from 16 to 18:

    sky141:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
    mgr: sky141(active, since 8d), standbys: sky142, sky143
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
    rgw: 3 daemons active (3 hosts, 1 zones)

    data:
    volumes: 1/1 healthy
    pools: 25 pools, 753 pgs
    objects: 150.06k objects, 786 GiB
    usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
    pgs: 753 active+clean

    io:
    client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

    ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
    0 sky141 64.9G 381G 3 30.3k 0 0 exists,up
    1 sky141 180G 266G 8 40.7k 0 0 exists,up
    2 sky141 161G 285G 0 8192 0 0 exists,up
    3 sky141 131G 314G 3 16.0k 0 0 exists,up
    4 sky141 91.1G 355G 1 7372 1 0 exists,up
    5 sky141 130G 316G 0 0 1 90 exists,up
    6 sky142 96.3G 350G 5 23.1k 0 0 exists,up
    7 sky142 165G 281G 0 7372 0 0 exists,up
    8 sky142 76.1G 370G 0 0 1 0 exists,up
    9 sky142 199G 247G 0 6553 0 0 exists,up
    10 sky142 122G 324G 2 9011 0 0 exists,up
    11 sky142 95.5G 351G 2 21.5k 0 0 exists,up
    12 sky143 184G 262G 2 35.1k 1 0 exists,up
    13 sky143 95.7G 350G 0 0 0 0 exists,up
    14 sky143 66.4G 380G 9 44.0k 1 0 exists,up
    15 sky143 92.9G 353G 1 9011 0 0 exists,up
    16 sky143 143G 303G 0 6553 1 16 exists,up
    17 sky143 179G 267G 7 52.0k 1 102 exists,up
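Note that when a disk is added, most of the transient HEALTH_WARN churn is misplaced data (replicas being rebalanced onto the new OSDs) rather than degraded data. The percentage in the intermediate status is again just a ratio:

```python
# Misplaced-object ratio during rebalancing after add_disk
# (figures from the intermediate status output above).
misplaced, total = 82685, 438826
print(f"{misplaced / total * 100:.3f}% of object replicas misplaced")  # → 18.842%
```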

Remove an OSD

Story: One of your nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, you need to take action from any active host in your cluster.

  • Connect to one of your live hosts.
  • Remove each failed OSD from the list with the storage > remove_osd command.
  • Verify that the OSDs are removed with the storage > status command:
    s2:storage> remove_osd
    Enter osd id to be removed:
    1: down (1.81920)
    2: down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
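When the remove_osd menu is long, it helps to spot the down entries first, since those are the ones to remove. A small hypothetical filter over the menu text shown above (not part of the storage CLI):

```python
# Pick out the "down" entries from a remove_osd menu listing
# (hypothetical parser; assumes the "N: <label>" lines shown above).
menu = """\
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)"""

down_indices = [line.split(":")[0] for line in menu.splitlines() if "down" in line]
print(down_indices)  # → ['1', '2']
```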