Version: 2.0

Remove a failed hard disk

Story: If you find that a hard disk has failed, you need to remove it from your cluster and restore your storage pool to a healthy state.

Check the storage status before you start

  • As shown below, 2 OSDs (ids 56 and 58) are down because of a failed hard disk on the node with hostname s3.
s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 60 osds: 58 up, 60 in
rgw: 3 daemons active

data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 232GiB used, 92.3TiB / 92.6TiB avail
pgs: 1611/44204 objects degraded (3.644%)
4873 active+clean
509 active+undersized
106 active+undersized+degraded

io:
client: 0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
recovery: 5B/s, 0objects/s
cache: 0op/s promote

+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1597M | 445G | 0 | 0 | 0 | 4096 | exists,up |
| 1 | s1 | 1543M | 445G | 0 | 819 | 0 | 1638 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4562M | 1858G | 2 | 64.0k | 8 | 32.8k | exists,up |
| 55 | s2 | 3784M | 1859G | 6 | 119k | 429 | 1717k | exists,up |
| 56 | s3 | 3552M | 1859G | 0 | 0 | 0 | 0 | exists |
| 57 | s2 | 5285M | 1857G | 3 | 49.6k | 12 | 76.8k | exists,up |
| 58 | s3 | 4921M | 1858G | 0 | 0 | 0 | 0 | exists |
| 59 | s2 | 3865M | 1859G | 1 | 17.6k | 2 | 9011 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
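On a cluster with many OSDs, scanning the table by eye is error-prone. The following is a minimal sketch, assuming the status table has been saved as plain text in the nine-column layout shown above; the function name `down_osds` and the `TABLE_SAMPLE` string are illustrative and not part of the product CLI.

```python
def down_osds(table_text):
    """Return (osd_id, host) pairs for rows whose state does not include 'up'."""
    down = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip borders, the header row, and the "~ ~ ~" elision marker.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        states = cells[8].split(",")
        if "up" not in states:
            down.append((int(cells[0]), cells[1]))
    return down


# A few rows copied from the status output above.
TABLE_SAMPLE = """\
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
| 54 | s3 | 4562M | 1858G | 2 | 64.0k | 8 | 32.8k | exists,up |
| 56 | s3 | 3552M | 1859G | 0 | 0 | 0 | 0 | exists |
| 57 | s2 | 5285M | 1857G | 3 | 49.6k | 12 | 76.8k | exists,up |
| 58 | s3 | 4921M | 1858G | 0 | 0 | 0 | 0 | exists |
"""

print(down_osds(TABLE_SAMPLE))  # [(56, 's3'), (58, 's3')]
```

Both down OSDs report host s3, which is why the next step connects to that node.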

Remove disk

  • Connect to the host s3.
$ ssh [email protected]
Warning: Permanently added '192.168.1.x' (ECDSA) to the list of known hosts.
Password:
  • Run remove_disk in the storage> CLI. The listing shows that /dev/sdj, at index 10, is associated with storage ids 56 and 58.

  • Enter index 10 and confirm with YES to remove /dev/sdj from the Ceph pool.

    s3> storage
    s3:storage> remove_disk
    index name size storage ids
    --
    1 /dev/sda 894.3G 21 23
    2 /dev/sdb 894.3G 25 27
    3 /dev/sdc 3.7T 29 31
    4 /dev/sdd 3.7T 33 35
    5 /dev/sde 3.7T 36 38
    6 /dev/sdf 3.7T 41 42
    7 /dev/sdg 3.7T 44 46
    8 /dev/sdh 3.7T 48 50
    9 /dev/sdi 3.7T 52 54
    10 /dev/sdj 3.7T 56 58
    --
    Enter the index of disk to be removed: 10
    Enter 'YES' to confirm: YES
    Remove disk /dev/sdj successfully.
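Before typing an index it can help to double-check which disk carries the down OSD ids. The sketch below assumes the remove_disk listing has been copied as plain text in the format shown above; `find_disk` and `LISTING_SAMPLE` are hypothetical names used only for illustration.

```python
def find_disk(listing, osd_ids):
    """Return (index, device) of the disk that carries all given storage ids."""
    wanted = set(osd_ids)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows are: index, device name, size, then the storage ids.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        ids = {int(f) for f in fields[3:] if f.isdigit()}
        if wanted <= ids:
            return int(fields[0]), fields[1]
    return None


# The last two rows of the listing above, with its header and separators.
LISTING_SAMPLE = """\
index name size storage ids
--
9 /dev/sdi 3.7T 52 54
10 /dev/sdj 3.7T 56 58
--
"""

print(find_disk(LISTING_SAMPLE, [56, 58]))  # (10, '/dev/sdj')
```

The function returns None when no single disk carries every requested id, which would suggest the down OSDs sit on more than one disk.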
  • Check the status of the storage pool again. As shown below, Ceph recovers the data automatically.

    s3:storage> status
    cluster:
    id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
    health: HEALTH_WARN
    276/44228 objects misplaced (0.624%)
    Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded

    services:
    mon: 3 daemons, quorum s1,s2,s3
    mgr: s1(active), standbys: s3, s2
    mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
    osd: 58 osds: 58 up, 58 in; 7 remapped pgs
    rgw: 3 daemons active

    data:
    pools: 22 pools, 5488 pgs
    objects: 21.60k objects, 82.9GiB
    usage: 227GiB used, 88.7TiB / 88.9TiB avail
    pgs: 908/44228 objects degraded (2.053%)
    276/44228 objects misplaced (0.624%)
    5409 active+clean
    61 active+recovery_wait+degraded
    7 active+recovering+degraded
    4 active+recovering
    4 active+remapped+backfill_wait
    3 active+undersized+remapped+backfill_wait

    io:
    client: 273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
    recovery: 135MiB/s, 0keys/s, 37objects/s
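To gauge how far recovery has progressed, the per-state PG counts can be totalled and compared with the number in active+clean. This is a rough sketch that assumes the status text has the line layout shown above; `pg_states` and `PGS_SAMPLE` are illustrative names, not part of the CLI.

```python
import re

# Matches lines such as "5409 active+clean" or "pgs: 5488 active+clean".
PG_LINE = re.compile(r"\s*(?:pgs:\s+)?(\d+)\s+([a-z_]+(?:\+[a-z_]+)*)\s*")


def pg_states(status_text):
    """Map each PG state string to the number of PGs reported in it."""
    counts = {}
    for line in status_text.splitlines():
        m = PG_LINE.fullmatch(line)
        if m:
            state = m.group(2)
            counts[state] = counts.get(state, 0) + int(m.group(1))
    return counts


# The pgs section copied from the status output above.
PGS_SAMPLE = """\
pgs: 908/44228 objects degraded (2.053%)
276/44228 objects misplaced (0.624%)
5409 active+clean
61 active+recovery_wait+degraded
7 active+recovering+degraded
4 active+recovering
4 active+remapped+backfill_wait
3 active+undersized+remapped+backfill_wait
"""

states = pg_states(PGS_SAMPLE)
total = sum(states.values())
clean = states.get("active+clean", 0)
print(total, clean, total - clean)  # 5488 5409 79
```

The total of 5488 agrees with the "22 pools, 5488 pgs" line, and 79 PGs were still recovering or waiting for backfill at the time of this snapshot.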

Results

  • Wait a while, then run status in the storage> CLI again.
  • The failed hard disk has been removed successfully and the health status is HEALTH_OK.
s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 58 osds: 58 up, 58 in
rgw: 3 daemons active

data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 229GiB used, 88.7TiB / 88.9TiB avail
pgs: 5488 active+clean

io:
client: 132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1594M | 445G | 0 | 0 | 0 | 0 | exists,up |
| 1 | s1 | 1536M | 445G | 0 | 0 | 0 | 0 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4665M | 1858G | 3 | 29.6k | 0 | 0 | exists,up |
| 55 | s2 | 3769M | 1859G | 0 | 0 | 0 | 0 | exists,up |
| 57 | s2 | 5366M | 1857G | 0 | 819 | 0 | 0 | exists,up |
| 59 | s2 | 3851M | 1859G | 0 | 0 | 0 | 0 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
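Waiting for recovery can also be scripted as a simple polling loop. The sketch below is only an outline: `fetch_status` is a hypothetical callable that returns the status text, and how to obtain that text non-interactively from the appliance CLI is not covered in this guide, so it is left to the caller.

```python
import time


def wait_for_health_ok(fetch_status, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll fetch_status() until its output reports HEALTH_OK.

    fetch_status is a hypothetical callable returning the status text.
    Returns True on HEALTH_OK, False if the timeout elapses first.
    """
    waited = 0.0
    while True:
        if "health: HEALTH_OK" in fetch_status():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Demonstration with canned outputs and a no-op sleep instead of a real cluster.
canned = iter(["health: HEALTH_WARN", "health: HEALTH_WARN", "health: HEALTH_OK"])
print(wait_for_health_ok(lambda: next(canned), sleep=lambda seconds: None))  # True
```

Passing the sleep function as a parameter keeps the loop testable without real delays; the interval and timeout defaults are arbitrary and should be tuned to the size of the cluster.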