Version: 3.0

Hard Disk Replacement

Remove a Failed Hard Disk

Storage Status

Scenario:

When a hard disk fails, it is critical to remove it from the cluster and restore the health of the storage pool.

To proceed:

  • Check the storage status first.
  • As shown below, two OSDs (Object Storage Daemons) are down due to the failed hard disk: OSD numbers 4 and 5 on the node with the hostname cc1.
cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.93k objects, 785 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 65.7G 380G 6 40.0k 7 161k exists,up
1 cc1 181G 265G 8 32.7k 1 58.4k exists,up
2 cc1 162G 283G 0 4096 15 604k exists,up
3 cc1 133G 313G 0 1638 2 29.6k exists,up
4 cc1 91.8G 354G 14 97.5k 6 39.2k exists
5 cc1 130G 315G 8 39.1k 3 88.9k exists
6 cc2 96.0G 350G 9 50.3k 3 160k exists,up
7 cc2 165G 281G 0 0 1 89.6k exists,up
8 cc2 75.8G 370G 0 6553 1 25.6k exists,up
9 cc2 199G 247G 0 3276 3 172k exists,up
10 cc2 122G 324G 2 13.5k 9 510k exists,up
11 cc2 95.3G 351G 1 4096 6 126k exists,up
12 cc3 184G 262G 3 12.0k 1 25.6k exists,up
13 cc3 93.6G 353G 0 0 0 5734 exists,up
14 cc3 67.8G 378G 12 71.1k 13 364k exists,up
15 cc3 92.6G 354G 0 819 0 0 exists,up
16 cc3 142G 303G 0 819 2 24.0k exists,up
17 cc3 179G 267G 0 2457 5 99.2k exists,up
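
If you also have direct access to the underlying Ceph admin tools on the node, you can cross-check which OSDs are down and which physical device backs them. This is an optional sketch, not part of the storage CLI shown above; it assumes the ceph client and admin keyring are available on the host.

# Show only the part of the OSD tree that contains down OSDs
sudo ceph osd tree down

# Look up which physical device and host back a given OSD, e.g. osd.4
sudo ceph osd metadata 4 | grep -E '"devices"|"hostname"'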

Remove the Disk

  • Connect to the host cc1.
  • Run the CLI command remove_disk and find that /dev/sde is associated with OSD IDs 4 and 5 at index 3.
  • Then remove /dev/sde from the Ceph pool.
cc1:storage> remove_disk
index name size osd serial
--
1 /dev/sda 894.3G 0 1 S40FNA0M800607
2 /dev/sdc 894.3G 2 3 S40FNA0M800598
3 /dev/sde 894.3G 4 5 S40FNA0M800608
--
Enter the index of disk to be removed: 3
Disk removal mode (safe/force): force
force mode immediately destroys disk data without taking into accounts of
storage status so USE IT AT YOUR OWN RISK.
Enter 'YES' to confirm: YES
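
Before physically pulling the drive, it can help to confirm that the device you are about to remove matches the serial number reported by remove_disk (S40FNA0M800608 in this example). A minimal sketch, assuming smartmontools and standard udev symlinks are available on the host:

# Print the drive's identity, including its serial number
sudo smartctl -i /dev/sde | grep -i 'serial number'

# Or read it from the persistent by-id symlinks (works without smartctl)
ls -l /dev/disk/by-id/ | grep sde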

Check the Status Again

Let's check the status of our storage pool; Ceph should be recovering the data automatically.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.94k objects, 785 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 6075/438706 objects degraded (1.385%)
5463/438706 objects misplaced (1.245%)
738 active+clean
8 active+undersized+degraded+remapped+backfilling
7 active+remapped+backfilling

io:
client: 4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
recovery: 127 MiB/s, 27 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 141G 305G 1 28.0k 1 42.4k exists,up
1 cc1 177G 268G 11 88.0k 3 26.3k exists,up
2 cc1 212G 233G 2 12.7k 0 0 exists,up
3 cc1 193G 253G 3 31.1k 7 634k exists,up
6 cc2 86.0G 360G 9 40.0k 2 27.1k exists,up
7 cc2 179G 267G 7 184k 2 119k exists,up
8 cc2 90.8G 355G 0 18.3k 19 1553k exists,up
9 cc2 201G 245G 8 35.1k 16 1450k exists,up
10 cc2 108G 337G 6 51.1k 11 755k exists,up
11 cc2 98.5G 348G 0 6553 2 41.6k exists,up
12 cc3 201G 245G 16 100k 3 230k exists,up
13 cc3 122G 323G 0 0 0 0 exists,up
14 cc3 88.0G 358G 15 76.0k 47 2970k exists,up
15 cc3 100G 346G 7 183k 14 1286k exists,up
16 cc3 127G 319G 5 28.0k 15 659k exists,up
17 cc3 132G 314G 23 225k 9 491k exists,up
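
While recovery is in progress, you can poll the cluster health from a shell instead of re-running status by hand. A small sketch assuming direct access to the underlying ceph client tools on the node; the storage > status output above carries the same information.

# Poll every 30 seconds until Ceph reports HEALTH_OK, printing health and recovery rate
while ! sudo ceph health | grep -q HEALTH_OK; do
    sudo ceph -s | grep -E 'health:|recovery:'
    sleep 30
done
echo 'Recovery complete, cluster is healthy'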

Results:

Wait for a while and check the status again.

We have successfully removed the failed hard disk, and the health status is back to HEALTH_OK.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 4.8 TiB / 7.0 TiB avail
pgs: 753 active+clean

io:
client: 25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 148G 298G 0 0 0 0 exists,up
1 cc1 176G 269G 5 23.1k 1 0 exists,up
2 cc1 202G 243G 0 28.0k 1 0 exists,up
3 cc1 220G 225G 0 3276 0 0 exists,up
6 cc2 86.1G 360G 4 20.7k 0 0 exists,up
7 cc2 180G 266G 0 0 0 0 exists,up
8 cc2 89.2G 357G 7 49.5k 2 10.3k exists,up
9 cc2 201G 245G 0 819 0 0 exists,up
10 cc2 108G 337G 1 7372 0 5734 exists,up
11 cc2 99.1G 347G 0 12.7k 0 0 exists,up
12 cc3 199G 247G 1 5734 1 0 exists,up
13 cc3 112G 333G 4 22.3k 0 0 exists,up
14 cc3 86.3G 360G 1 18.3k 2 90 exists,up
15 cc3 98.7G 347G 0 16.0k 1 0 exists,up
16 cc3 128G 318G 1 4915 2 9027 exists,up
17 cc3 141G 305G 2 22.3k 0 0 exists,up

If the disk was originally encrypted, no further action is needed to protect the data that was stored on it.

If the disk was added in raw mode, it is recommended to wipe it with the shred command.

# With the default number of iterations (3)
sudo shred -vfz [disk]

# With a custom number of iterations
sudo shred -vfz -n [num_of_iteration] [disk]

For more detailed options, consult the man page (man shred).

Although an encrypted disk does not need to be shredded after removal, keep in mind that using encryption costs some CPU and I/O efficiency.
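
If you are unsure whether a removed disk was encrypted, and the encryption is LUKS-based (as with Ceph dmcrypt OSDs), you can check it for a LUKS header from any Linux machine before deciding whether to shred it. The device path /dev/sde below is only an example; adjust it to the disk you pulled.

# Exit status 0 means the device carries a LUKS header (i.e. it was encrypted)
sudo cryptsetup isLuks /dev/sde && echo 'LUKS-encrypted' || echo 'no LUKS header'

# Alternatively, inspect the detected signatures
lsblk -o NAME,FSTYPE,SIZE /dev/sde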


Add a New Disk

Scenario: We have a new hard disk to add to our storage pool.

Connect to the host where the new hard disk was installed.
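
Before running add_disk, it can be worth confirming that the operating system actually sees the newly installed drive. A quick check with standard Linux tools (device names and serials will vary):

# List physical block devices with their size, model and serial number
lsblk -d -o NAME,SIZE,MODEL,SERIAL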

Add the new disk to the node with the CLI command storage > add_disk.

cc1:storage> add_disk
index name size serial
--
1 /dev/sde 894.3G S40FNA0M800608
--
Found 1 available disks
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sde successfully.

Wait for a moment; automatic recovery should start.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 149.99k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 277/438826 objects degraded (0.063%)
82685/438826 objects misplaced (18.842%)
660 active+clean
51 active+remapped+backfilling
39 active+remapped+backfill_wait
2 active+remapped
1 active+undersized+degraded+remapped+backfilling

io:
client: 727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
recovery: 873 MiB/s, 2 keys/s, 165 objects/s

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 149G 297G 0 6553 0 0 exists,up
1 cc1 175G 271G 0 4096 1 18.6k exists,up
2 cc1 202G 244G 0 66.3k 3 34.4k exists,up
3 cc1 221G 225G 4 16.0k 1 17.0k exists,up
4 cc1 6397M 440G 0 0 0 0 exists,up
5 cc1 3304M 443G 0 0 0 72 exists,up
6 cc2 86.3G 360G 3 15.4k 3 60.5k exists,up
7 cc2 180G 266G 0 585 1 8192 exists,up
8 cc2 89.2G 357G 0 23.1k 5 128k exists,up
9 cc2 201G 245G 11 70.3k 3 46.4k exists,up
10 cc2 109G 337G 1 17.1k 1 28.0k exists,up
11 cc2 99.1G 347G 0 0 0 0 exists,up
12 cc3 200G 245G 3 16.0k 4 193k exists,up
13 cc3 114G 332G 0 4915 0 0 exists,up
14 cc3 86.3G 360G 5 41.0k 12 303k exists,up
15 cc3 99.3G 347G 2 89.3k 5 129k exists,up
16 cc3 128G 317G 0 2457 1 16 exists,up
17 cc3 143G 303G 13 84.0k 1 7372 exists,up

Result: The OSD count has changed from 16 to 18.

cc1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum cc1,cc2,cc3 (age 8d)
mgr: cc1(active, since 8d), standbys: cc2, cc3
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 25 pools, 753 pgs
objects: 150.06k objects, 786 GiB
usage: 2.2 TiB used, 5.6 TiB / 7.9 TiB avail
pgs: 753 active+clean

io:
client: 1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 cc1 64.9G 381G 3 30.3k 0 0 exists,up
1 cc1 180G 266G 8 40.7k 0 0 exists,up
2 cc1 161G 285G 0 8192 0 0 exists,up
3 cc1 131G 314G 3 16.0k 0 0 exists,up
4 cc1 91.1G 355G 1 7372 1 0 exists,up
5 cc1 130G 316G 0 0 1 90 exists,up
6 cc2 96.3G 350G 5 23.1k 0 0 exists,up
7 cc2 165G 281G 0 7372 0 0 exists,up
8 cc2 76.1G 370G 0 0 1 0 exists,up
9 cc2 199G 247G 0 6553 0 0 exists,up
10 cc2 122G 324G 2 9011 0 0 exists,up
11 cc2 95.5G 351G 2 21.5k 0 0 exists,up
12 cc3 184G 262G 2 35.1k 1 0 exists,up
13 cc3 95.7G 350G 0 0 0 0 exists,up
14 cc3 66.4G 380G 9 44.0k 1 0 exists,up
15 cc3 92.9G 353G 1 9011 0 0 exists,up
16 cc3 143G 303G 0 6553 1 16 exists,up
17 cc3 179G 267G 7 52.0k 1 102 exists,up
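
As an optional cross-check with the native Ceph tooling (again assuming the ceph client and admin keyring are available on the node), you can confirm that the re-added OSDs, 4 and 5 in this example, are up, in, and holding data:

# Per-OSD utilisation in tree form; osd.4 and osd.5 should appear under host cc1
sudo ceph osd df tree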

Remove an OSD

Scenario: One of the nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, we need to take action from an active host in the cluster.

Connect to one of the live hosts.

Remove all failed OSDs from the list using the command storage > remove_osd.

Verify the OSDs are removed with the storage > status command.

s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
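
If the underlying Ceph tools are available, you can also cross-check the result from a shell; this is an optional sketch, not part of the storage CLI.

# After removing the failed OSDs, this view should no longer list them
sudo ceph osd tree down

# The overall health should eventually return to HEALTH_OK
sudo ceph health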