Version: 2.5

Hard Disk Replacement

Remove a Failed Hard Disk

Storage Status

Scenario:

When we find a failed hard disk, it is critical to remove it from the cluster and restore the health of the storage pool.

To proceed:

  • Check the storage status first.
  • As shown below, two OSDs (Object Storage Daemons) are down because of the failed hard disk: OSD 4 and OSD 5 on the node with the hostname sky141.
sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.93k objects, 785 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 65.7G 380G 6 40.0k 7 161k exists,up
1 sky141 181G 265G 8 32.7k 1 58.4k exists,up
2 sky141 162G 283G 0 4096 15 604k exists,up
3 sky141 133G 313G 0 1638 2 29.6k exists,up
4 sky141 91.8G 354G 14 97.5k 6 39.2k exists
5 sky141 130G 315G 8 39.1k 3 88.9k exists
6 sky142 96.0G 350G 9 50.3k 3 160k exists,up
7 sky142 165G 281G 0 0 1 89.6k exists,up
8 sky142 75.8G 370G 0 6553 1 25.6k exists,up
9 sky142 199G 247G 0 3276 3 172k exists,up
10 sky142 122G 324G 2 13.5k 9 510k exists,up
11 sky142 95.3G 351G 1 4096 6 126k exists,up
12 sky143 184G 262G 3 12.0k 1 25.6k exists,up
13 sky143 93.6G 353G 0 0 0 5734 exists,up
14 sky143 67.8G 378G 12 71.1k 13 364k exists,up
15 sky143 92.6G 354G 0 819 0 0 exists,up
16 sky143 142G 303G 0 819 2 24.0k exists,up
17 sky143 179G 267G 0 2457 5 99.2k exists,up
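
If you have shell access to a node with the native Ceph CLI (an assumption; the storage > status output above comes from a standard Ceph deployment), the down OSDs can also be cross-checked outside the storage shell:

# List only the OSDs that are currently down
ceph osd tree down

# Show the detailed messages behind HEALTH_WARN
ceph health detail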

Remove the Disk

  • Connect to the host sky141.
  • Run the CLI command remove_disk; the listing shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3.
  • Then remove /dev/sde from the Ceph pool.
sky141:storage> remove_disk
index name size osd serial
--
1 /dev/sda 894.3G 0 1 S40FNA0M800607
2 /dev/sdc 894.3G 2 3 S40FNA0M800598
3 /dev/sde 894.3G 4 5 S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES

Check the Status Again

Let's check the status of our storage pool; Ceph should be recovering the data automatically.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.94k objects, 785 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 6075/438706 objects degraded (1.385%)
5463/438706 objects misplaced (1.245%)
738 active+clean
8 active+undersized+degraded+remapped+backfilling
7 active+remapped+backfilling

io:
client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
recovery: 127 MiB/s, 27 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 141G 305G 1 28.0k 1 42.4k exists,up
1 sky141 177G 268G 11 88.0k 3 26.3k exists,up
2 sky141 212G 233G 2 12.7k 0 0 exists,up
3 sky141 193G 253G 3 31.1k 7 634k exists,up
6 sky142 86.0G 360G 9 40.0k 2 27.1k exists,up
7 sky142 179G 267G 7 184k 2 119k exists,up
8 sky142 90.8G 355G 0 18.3k 19 1553k exists,up
9 sky142 201G 245G 8 35.1k 16 1450k exists,up
10 sky142 108G 337G 6 51.1k 11 755k exists,up
11 sky142 98.5G 348G 0 6553 2 41.6k exists,up
12 sky143 201G 245G 16 100k 3 230k exists,up
13 sky143 122G 323G 0 0 0 0 exists,up
14 sky143 88.0G 358G 15 76.0k 47 2970k exists,up
15 sky143 100G 346G 7 183k 14 1286k exists,up
16 sky143 127G 319G 5 28.0k 15 659k exists,up
17 sky143 132G 314G 23 225k 9 491k exists,up
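
Recovery progress can be followed by re-running status from time to time. Alternatively, assuming the native Ceph CLI is available on the host, the cluster can be watched continuously; this is a sketch, not part of the storage shell:

# Refresh the cluster summary every 30 seconds
watch -n 30 'ceph -s'

# Or stream cluster log events, including recovery progress, as they happen
ceph -w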

Results:

Wait for a while and check the status again.

We have successfully removed the failed hard disk, and the health status is back to HEALTH_OK.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 753 active+clean

io:
client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 148G 298G 0 0 0 0 exists,up
1 sky141 176G 269G 5 23.1k 1 0 exists,up
2 sky141 202G 243G 0 28.0k 1 0 exists,up
3 sky141 220G 225G 0 3276 0 0 exists,up
6 sky142 86.1G 360G 4 20.7k 0 0 exists,up
7 sky142 180G 266G 0 0 0 0 exists,up
8 sky142 89.2G 357G 7 49.5k 2 10.3k exists,up
9 sky142 201G 245G 0 819 0 0 exists,up
10 sky142 108G 337G 1 7372 0 5734 exists,up
11 sky142 99.1G 347G 0 12.7k 0 0 exists,up
12 sky143 199G 247G 1 5734 1 0 exists,up
13 sky143 112G 333G 4 22.3k 0 0 exists,up
14 sky143 86.3G 360G 1 18.3k 2 90 exists,up
15 sky143 98.7G 347G 0 16.0k 1 0 exists,up
16 sky143 128G 318G 1 4915 2 9027 exists,up
17 sky143 141G 305G 2 22.3k 0 0 exists,up

If the disk was originally encrypted, no further action is needed to address data security concerns.

If the disk was added in raw mode, it is recommended to wipe it with the shred command.

# With the default number of overwrite passes (3)
sudo shred -vfz [disk]

# With a custom number of overwrite passes
sudo shred -vfz -n [num_of_iteration] [disk]

For more detailed options, please consult the shred man page.
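
If you are unsure whether a disk was added encrypted or in raw mode, one hedged way to check from the host shell (assuming the disk is still attached as /dev/sde; depending on the setup, the encryption layer may sit on a partition or logical volume rather than the raw device) is:

# Look for crypto_LUKS in FSTYPE, or an open "crypt" device, anywhere in the device tree
lsblk -o NAME,TYPE,FSTYPE /dev/sde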

Although an encrypted disk does not need to be shredded after removal, keep in mind that using encryption costs some degree of CPU and I/O efficiency.
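
As a rough, illustrative way to gauge the CPU-side cost of encryption on a host (this measures in-memory cipher throughput only, not end-to-end disk I/O), the standard cryptsetup tool includes a benchmark:

# Report how fast this host can encrypt and decrypt with common ciphers such as aes-xts
sudo cryptsetup benchmark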


Add a New Disk

Scenario: We have a new hard disk to add to our storage pool.

Connect to the host where the new hard disk is installed.
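
Optionally, confirm that the operating system detects the new device before adding it. This is a hedged example assuming shell access on the host; /dev/sde, lsblk, and smartctl are not part of the storage CLI:

# Confirm the new disk is visible and matches the expected size and serial number
lsblk -o NAME,SIZE,SERIAL,MODEL /dev/sde

# Optional SMART health check, if smartmontools is installed
sudo smartctl -H /dev/sde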

Add the new disk to the node with the CLI command storage > add_disk.

sky141:storage> add_disk
index name size serial
--
1 /dev/sde 894.3G S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.

Wait for a moment; automatic recovery should start.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 277/438826 objects degraded (0.063%)
82685/438826 objects misplaced (18.842%)
660 active+clean
51 active+remapped+backfilling
39 active+remapped+backfill_wait
2 active+remapped
1 active+undersized+degraded+remapped+backfilling

io:
client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
recovery: 873 MiB/s, 2 keys/s, 165 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 149G 297G 0 6553 0 0 exists,up
1 sky141 175G 271G 0 4096 1 18.6k exists,up
2 sky141 202G 244G 0 66.3k 3 34.4k exists,up
3 sky141 221G 225G 4 16.0k 1 17.0k exists,up
4 sky141 6397M 440G 0 0 0 0 exists,up
5 sky141 3304M 443G 0 0 0 72 exists,up
6 sky142 86.3G 360G 3 15.4k 3 60.5k exists,up
7 sky142 180G 266G 0 585 1 8192 exists,up
8 sky142 89.2G 357G 0 23.1k 5 128k exists,up
9 sky142 201G 245G 11 70.3k 3 46.4k exists,up
10 sky142 109G 337G 1 17.1k 1 28.0k exists,up
11 sky142 99.1G 347G 0 0 0 0 exists,up
12 sky143 200G 245G 3 16.0k 4 193k exists,up
13 sky143 114G 332G 0 4915 0 0 exists,up
14 sky143 86.3G 360G 5 41.0k 12 303k exists,up
15 sky143 99.3G 347G 2 89.3k 5 129k exists,up
16 sky143 128G 317G 0 2457 1 16 exists,up
17 sky143 143G 303G 13 84.0k 1 7372 exists,up

Result: The number of OSDs has changed from 16 back to 18.

sky141:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
mgr: sky141(active, since 8d), standbys: sky142, sky143
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 150.06k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 sky141 64.9G 381G 3 30.3k 0 0 exists,up
1 sky141 180G 266G 8 40.7k 0 0 exists,up
2 sky141 161G 285G 0 8192 0 0 exists,up
3 sky141 131G 314G 3 16.0k 0 0 exists,up
4 sky141 91.1G 355G 1 7372 1 0 exists,up
5 sky141 130G 316G 0 0 1 90 exists,up
6 sky142 96.3G 350G 5 23.1k 0 0 exists,up
7 sky142 165G 281G 0 7372 0 0 exists,up
8 sky142 76.1G 370G 0 0 1 0 exists,up
9 sky142 199G 247G 0 6553 0 0 exists,up
10 sky142 122G 324G 2 9011 0 0 exists,up
11 sky142 95.5G 351G 2 21.5k 0 0 exists,up
12 sky143 184G 262G 2 35.1k 1 0 exists,up
13 sky143 95.7G 350G 0 0 0 0 exists,up
14 sky143 66.4G 380G 9 44.0k 1 0 exists,up
15 sky143 92.9G 353G 1 9011 0 0 exists,up
16 sky143 143G 303G 0 6553 1 16 exists,up
17 sky143 179G 267G 7 52.0k 1 102 exists,up

Remove an OSD

Scenario: One of the nodes has failed to power up for no apparent reason, causing the OSDs hosted by that node to go offline. To restore the storage pool from its HEALTH_WARN status as quickly as possible, take action from any active host in the cluster.

Connect to one of the live hosts.

Remove each failed OSD from the list by using the command storage > remove_osd.

Verify the OSDs are removed with the storage > status command.

s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
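
After each interactive removal, the result can be verified with storage > status as described above or, assuming the native Ceph CLI is available, directly from the shell:

# The removed OSDs should no longer appear under the failed host
ceph osd tree

# The overall health should return to HEALTH_OK once recovery finishes
ceph -s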