Version: 2.0

Remove a failed hard disk

Story: If you find that a hard disk has failed, you need to remove it from your cluster and restore the health of your storage pool.

Check the storage status before you start

As shown below, two OSDs (56 and 58) on host s3 are down because of a failed hard disk.

s1:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
            2 osds down
            Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

  services:
    mon: 3 daemons, quorum s1,s2,s3
    mgr: s1(active), standbys: s2, s3
    mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
    osd: 60 osds: 58 up, 60 in
    rgw: 3 daemons active

  data:
    pools:   22 pools, 5488 pgs
    objects: 21.59k objects, 82.8GiB
    usage:   232GiB used, 92.3TiB / 92.6TiB avail
    pgs:     1611/44204 objects degraded (3.644%)
             4873 active+clean
             509  active+undersized
             106  active+undersized+degraded

  io:
    client:   0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
    recovery: 5B/s, 0objects/s
    cache:    0op/s promote

+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0  |  s1  | 1597M |  445G |    0   |     0   |    0   |  4096   | exists,up |
| 1  |  s1  | 1543M |  445G |    0   |   819   |    0   |  1638   | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
| 54 |  s3  | 4562M | 1858G |    2   |  64.0k  |    8   |  32.8k  | exists,up |
| 55 |  s2  | 3784M | 1859G |    6   |   119k  |  429   |  1717k  | exists,up |
| 56 |  s3  | 3552M | 1859G |    0   |     0   |    0   |     0   |   exists  |
| 57 |  s2  | 5285M | 1857G |    3   |  49.6k  |   12   |  76.8k  | exists,up |
| 58 |  s3  | 4921M | 1858G |    0   |     0   |    0   |     0   |   exists  |
| 59 |  s2  | 3865M | 1859G |    1   |  17.6k  |    2   |  9011   | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
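
The down OSDs can also be picked out of the table mechanically. The following is a minimal sketch, assuming the table text has already been captured into a string (how you capture it is up to you). The function name find_down_osds is hypothetical and not part of the CLI. It treats any row whose state column lacks "up" as down, which matches rows 56 and 58 above.

```python
# Sample rows copied from the status table above.
TABLE = """\
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 54 |  s3  | 4562M | 1858G |    2   |  64.0k  |    8   |  32.8k  | exists,up |
| 55 |  s2  | 3784M | 1859G |    6   |   119k  |  429   |  1717k  | exists,up |
| 56 |  s3  | 3552M | 1859G |    0   |     0   |    0   |     0   |   exists  |
| 57 |  s2  | 5285M | 1857G |    3   |  49.6k  |   12   |  76.8k  | exists,up |
| 58 |  s3  | 4921M | 1858G |    0   |     0   |    0   |     0   |   exists  |
| 59 |  s2  | 3865M | 1859G |    1   |  17.6k  |    2   |  9011   | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
"""


def find_down_osds(table_text):
    """Return (osd_id, host) pairs for rows whose state does not include 'up'."""
    down = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip borders, the header row, and any '~ ~ ~' elision line.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        states = cells[8].split(",")
        if "up" not in states:
            down.append((int(cells[0]), cells[1]))
    return down


print(find_down_osds(TABLE))  # [(56, 's3'), (58, 's3')]
```

Both down OSDs are on s3, which is why the next step connects to that host.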

Remove disk

Connect to the host s3.

$ ssh [email protected]
Warning: Permanently added '192.168.1.x' (ECDSA) to the list of known hosts.
Password:

Run remove_disk in the storage> CLI. The listing below shows that /dev/sdj, at index 10, holds storage ids 56 and 58,
so that is the disk we remove from the Ceph pool.

Remove the hard disk from the node

s3> storage
s3:storage> remove_disk
  index          name      size   storage ids
--
      1      /dev/sda    894.3G         21 23
      2      /dev/sdb    894.3G         25 27
      3      /dev/sdc      3.7T         29 31
      4      /dev/sdd      3.7T         33 35
      5      /dev/sde      3.7T         36 38
      6      /dev/sdf      3.7T         41 42
      7      /dev/sdg      3.7T         44 46
      8      /dev/sdh      3.7T         48 50
      9      /dev/sdi      3.7T         52 54
     10      /dev/sdj      3.7T         56 58
--
Enter the index of disk to be removed: 10
Enter 'YES' to confirm: YES
Remove disk /dev/sdj successfully.
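
Matching the down OSD ids to a disk index can be sketched as follows. This is only an illustration, assuming the remove_disk listing has been copied into a string. The helper name disk_for_osds is hypothetical. It returns the first disk whose storage ids include all of the requested ids, or None if no single disk holds them all.

```python
# Listing copied from the remove_disk prompt above.
LISTING = """\
  index          name      size   storage ids
--
      1      /dev/sda    894.3G         21 23
      2      /dev/sdb    894.3G         25 27
      3      /dev/sdc      3.7T         29 31
      4      /dev/sdd      3.7T         33 35
      5      /dev/sde      3.7T         36 38
      6      /dev/sdf      3.7T         41 42
      7      /dev/sdg      3.7T         44 46
      8      /dev/sdh      3.7T         48 50
      9      /dev/sdi      3.7T         52 54
     10      /dev/sdj      3.7T         56 58
--
"""


def disk_for_osds(listing, osd_ids):
    """Return (index, device) of the disk that holds every id in osd_ids."""
    wanted = set(osd_ids)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows look like: index, device name, size, then one or more ids.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        ids = {int(f) for f in fields[3:]}
        if wanted <= ids:
            return int(fields[0]), fields[1]
    return None


print(disk_for_osds(LISTING, [56, 58]))  # (10, '/dev/sdj')
```

If the down OSDs sat on different disks, the function would return None and each id would need to be looked up separately.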

Check the status of the storage pool again. As the output shows, Ceph is recovering the data automatically.

 s3:storage> status
   cluster:
     id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
     health: HEALTH_WARN
             276/44228 objects misplaced (0.624%)
             Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded
 
   services:
     mon: 3 daemons, quorum s1,s2,s3
     mgr: s1(active), standbys: s3, s2
     mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
     osd: 58 osds: 58 up, 58 in; 7 remapped pgs
     rgw: 3 daemons active
 
   data:
     pools:   22 pools, 5488 pgs
     objects: 21.60k objects, 82.9GiB
     usage:   227GiB used, 88.7TiB / 88.9TiB avail
     pgs:     908/44228 objects degraded (2.053%)
              276/44228 objects misplaced (0.624%)
              5409 active+clean
              61   active+recovery_wait+degraded
              7    active+recovering+degraded
              4    active+recovering
              4    active+remapped+backfill_wait
              3    active+undersized+remapped+backfill_wait
 
   io:
     client:   273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
     recovery: 135MiB/s, 0keys/s, 37objects/s
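
To follow recovery progress between two status checks, the degraded ratio can be recomputed from the object counts. A small sketch, assuming the health line is available as text; the function name degraded_ratio is hypothetical. The two sample lines are taken from the status output before and after the disk removal.

```python
import re


def degraded_ratio(status_text):
    """Return degraded/total objects as a fraction, or 0.0 if none are reported."""
    match = re.search(r"(\d+)/(\d+) objects degraded", status_text)
    if match is None:
        return 0.0
    return int(match.group(1)) / int(match.group(2))


before = "Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded"
after = "Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded"

print(f"{degraded_ratio(before):.3%} -> {degraded_ratio(after):.3%}")  # 3.644% -> 2.053%
```

A falling ratio, together with a non-zero recovery rate in the io section, indicates that recovery is making progress.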

Results

Wait a while, then check storage> status again.
The failed hard disk has been removed successfully and the health status is HEALTH_OK. The cluster now reports 58 OSDs, all up and in.

s1:storage> status
  cluster:
    id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum s1,s2,s3
    mgr: s1(active), standbys: s2, s3
    mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
    osd: 58 osds: 58 up, 58 in
    rgw: 3 daemons active

  data:
    pools:   22 pools, 5488 pgs
    objects: 21.59k objects, 82.8GiB
    usage:   229GiB used, 88.7TiB / 88.9TiB avail
    pgs:     5488 active+clean

  io:
    client:   132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0  |  s1  | 1594M |  445G |    0   |     0   |    0   |     0   | exists,up |
| 1  |  s1  | 1536M |  445G |    0   |     0   |    0   |     0   | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
| 54 |  s3  | 4665M | 1858G |    3   |  29.6k  |    0   |     0   | exists,up |
| 55 |  s2  | 3769M | 1859G |    0   |     0   |    0   |     0   | exists,up |
| 57 |  s2  | 5366M | 1857G |    0   |   819   |    0   |     0   | exists,up |
| 59 |  s2  | 3851M | 1859G |    0   |     0   |    0   |     0   | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
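
The "wait and check again" step can be expressed as a polling loop. This is a sketch under stated assumptions: get_status is a hypothetical callable that returns the text of storage> status (this guide does not describe a way to run the CLI from a script), and the interval and timeout defaults are arbitrary.

```python
import time


def wait_for_health_ok(get_status, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_status() until its output contains 'health: HEALTH_OK'.

    Returns True once the cluster is healthy, False if the timeout elapses.
    """
    waited = 0.0
    while True:
        if "health: HEALTH_OK" in get_status():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Demo with canned outputs instead of a real cluster.
samples = iter([
    "    health: HEALTH_WARN",
    "    health: HEALTH_WARN",
    "    health: HEALTH_OK",
])
print(wait_for_health_ok(lambda: next(samples), interval=0, sleep=lambda s: None))  # True
```

Passing sleep as a parameter keeps the loop testable without real delays.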
