Version: 2.0

Hard disk replacement

Remove a failed hard disk

Story: A hard disk has failed. You need to remove it from the cluster and restore the storage pool to a healthy state.

Check the storage status before you start.
In the output below, 2 OSDs (56 and 58) on node s3 are down because of a failed hard disk.

    s1:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                2 osds down
                Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s2, s3
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 58 up, 60 in
        rgw: 3 daemons active

      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.8GiB
        usage:   232GiB used, 92.3TiB / 92.6TiB avail
        pgs:     1611/44204 objects degraded (3.644%)
                 4873 active+clean
                 509  active+undersized
                 106  active+undersized+degraded

      io:
        client:   0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
        recovery: 5B/s, 0objects/s
        cache:    0op/s promote

    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1597M |  445G |    0   |     0   |    0   |  4096   | exists,up |
    | 1  |  s1  | 1543M |  445G |    0   |   819   |    0   |  1638   | exists,up |
    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
    | 54 |  s3  | 4562M | 1858G |    2   |  64.0k  |    8   |  32.8k  | exists,up |
    | 55 |  s2  | 3784M | 1859G |    6   |   119k  |  429   |  1717k  | exists,up |
    | 56 |  s3  | 3552M | 1859G |    0   |     0   |    0   |     0   |   exists  |
    | 57 |  s2  | 5285M | 1857G |    3   |  49.6k  |   12   |  76.8k  | exists,up |
    | 58 |  s3  | 4921M | 1858G |    0   |     0   |    0   |     0   |   exists  |
    | 59 |  s2  | 3865M | 1859G |    1   |  17.6k  |    2   |  9011   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
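On a cluster with 60 OSDs, scanning the table by eye for rows whose state is only "exists" is error-prone. The sketch below is one possible way to pick those rows out of a saved copy of the status output. It is not part of the product CLI: the function name osds_not_up is hypothetical, and it assumes the table keeps the nine-column layout shown above.

```python
def osds_not_up(status_text):
    """Return (osd_id, host) for every table row whose state lacks 'up'."""
    not_up = []
    for line in status_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have 9 cells and a numeric id; borders and headers do not.
        if len(cells) != 9 or not cells[0].isdigit():
            continue
        states = cells[8].split(",")
        if "up" not in states:
            not_up.append((int(cells[0]), cells[1]))
    return not_up


# A few rows copied from the status output above.
SAMPLE = """\
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 55 |  s2  | 3784M | 1859G |    6   |   119k  |  429   |  1717k  | exists,up |
| 56 |  s3  | 3552M | 1859G |    0   |     0   |    0   |     0   |   exists  |
| 57 |  s2  | 5285M | 1857G |    3   |  49.6k  |   12   |  76.8k  | exists,up |
| 58 |  s3  | 4921M | 1858G |    0   |     0   |    0   |     0   |   exists  |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
"""

print(osds_not_up(SAMPLE))  # [(56, 's3'), (58, 's3')]
```

Both reported OSDs sit on s3, which tells you which host to connect to for the removal.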

Remove disk

Connect to host s3.
Run the CLI command remove_disk. Its listing shows that /dev/sdj, at index 10, is associated with storage ids 56 and 58.
Enter index 10 to remove /dev/sdj from the Ceph pool.

Remove the hard disk from the node:

    s3:storage> remove_disk
      index          name      size   storage ids
    --
          1      /dev/sda    894.3G         21 23
          2      /dev/sdb    894.3G         25 27
          3      /dev/sdc      3.7T         29 31
          4      /dev/sdd      3.7T         33 35
          5      /dev/sde      3.7T         36 38
          6      /dev/sdf      3.7T         41 42
          7      /dev/sdg      3.7T         44 46
          8      /dev/sdh      3.7T         48 50
          9      /dev/sdi      3.7T         52 54
         10      /dev/sdj      3.7T         56 58
    --
    Enter the index of disk to be removed: 10
    Enter 'YES' to confirm: YES
    Remove disk /dev/sdj successfully.
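The key step here is matching the down OSD ids to the disk that carries them. As an illustration only, the lookup could be sketched as follows against a saved copy of the listing. The helper find_disk is a hypothetical name, not a CLI feature, and it assumes each disk row has the form index, device name, size, followed by the storage ids.

```python
def find_disk(listing, osd_ids):
    """Return (index, device) of the disk that holds all given storage ids."""
    wanted = set(osd_ids)
    for line in listing.splitlines():
        fields = line.split()
        # Disk rows: numeric index, device name, size, then numeric ids.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        if not all(f.isdigit() for f in fields[3:]):
            continue
        ids = {int(f) for f in fields[3:]}
        if wanted <= ids:
            return int(fields[0]), fields[1]
    return None


# The last rows of the remove_disk listing shown above.
LISTING = """\
  index          name      size   storage ids
--
      8      /dev/sdh      3.7T         48 50
      9      /dev/sdi      3.7T         52 54
     10      /dev/sdj      3.7T         56 58
--
Enter the index of disk to be removed:
"""

print(find_disk(LISTING, [56, 58]))  # (10, '/dev/sdj')
```

The returned index is the value to type at the "Enter the index of disk to be removed" prompt.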

Check the status of the storage pool.

As the output shows, Ceph is recovering the data automatically.

    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                276/44228 objects misplaced (0.624%)
                Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded

      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 58 osds: 58 up, 58 in; 7 remapped pgs
        rgw: 3 daemons active

      data:
        pools:   22 pools, 5488 pgs
        objects: 21.60k objects, 82.9GiB
        usage:   227GiB used, 88.7TiB / 88.9TiB avail
        pgs:     908/44228 objects degraded (2.053%)
                 276/44228 objects misplaced (0.624%)
                 5409 active+clean
                 61   active+recovery_wait+degraded
                 7    active+recovering+degraded
                 4    active+recovering
                 4    active+remapped+backfill_wait
                 3    active+undersized+remapped+backfill_wait

      io:
        client:   273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
        recovery: 135MiB/s, 0keys/s, 37objects/s
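One way to confirm that recovery is progressing is to compare the degraded-object count between two status snapshots: it was 1611/44204 before the disk was removed and 908/44228 in the output above. The following sketch extracts that figure from saved status text. The function name degraded_ratio is hypothetical, and the code assumes the "N/M objects degraded" wording shown in this guide.

```python
import re


def degraded_ratio(status_text):
    """Return (degraded, total, percent) from status text, or None if absent."""
    m = re.search(r"(\d+)/(\d+) objects degraded", status_text)
    if m is None:
        return None  # no degraded line, e.g. when the cluster is healthy
    degraded, total = int(m.group(1)), int(m.group(2))
    return degraded, total, 100.0 * degraded / total


# Health lines copied from the two status outputs in this section.
before = "Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded"
after = "Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded"

for label, text in (("before", before), ("after", after)):
    degraded, total, percent = degraded_ratio(text)
    print(f"{label}: {degraded}/{total} = {percent:.3f}%")
```

A falling count across snapshots indicates that recovery is making progress. When the line disappears entirely, the function returns None.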

Results:

Wait a while and check the status again.

The failed hard disk has been removed successfully and the health status is OK.

    s1:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s2, s3
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 58 osds: 58 up, 58 in
        rgw: 3 daemons active

      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.8GiB
        usage:   229GiB used, 88.7TiB / 88.9TiB avail
        pgs:     5488 active+clean

      io:
        client:   132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr

    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1594M |  445G |    0   |     0   |    0   |     0   | exists,up |
    | 1  |  s1  | 1536M |  445G |    0   |     0   |    0   |     0   | exists,up |
    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
    | 54 |  s3  | 4665M | 1858G |    3   |  29.6k  |    0   |     0   | exists,up |
    | 55 |  s2  | 3769M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 57 |  s2  | 5366M | 1857G |    0   |   819   |    0   |     0   | exists,up |
    | 59 |  s2  | 3851M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
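"Wait a while and check again" amounts to polling the status until the health line reads HEALTH_OK. A minimal sketch of that loop is given below. How the status text is obtained is deliberately left open: fetch_status is a hypothetical callable that you would supply yourself, and the 30-second interval and 120-poll limit are arbitrary assumptions, not product recommendations.

```python
import re
import time


def parse_health(status_text):
    """Return the HEALTH_* token from status text, or None if not found."""
    m = re.search(r"health:\s*(HEALTH_\w+)", status_text)
    return m.group(1) if m else None


def wait_until_healthy(fetch_status, interval_s=30.0, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it reports HEALTH_OK; False if polls run out.

    fetch_status is a hypothetical callable returning the text of `status`.
    """
    for attempt in range(max_polls):
        if parse_health(fetch_status()) == "HEALTH_OK":
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)
    return False


# Demonstration with canned outputs instead of a real cluster.
fake_outputs = iter([
    "health: HEALTH_WARN",
    "health: HEALTH_WARN",
    "health: HEALTH_OK",
])
print(wait_until_healthy(lambda: next(fake_outputs), sleep=lambda s: None))  # True
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays.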

Add/Replace a New Disk

Story: We have bought a new hard disk to add to our storage pool.

Connect to the host in which the new hard disk is installed.

Add the new disk to the node with the CLI command add_disk.

    s3:storage> add_disk
      index          name      size
    --
          1      /dev/sdj      3.7T
    --
    Found 1 available disk
    Enter the index to add this disk into the pool: 1
    Enter 'YES' to confirm: YES
    Add disk /dev/sdj successfully.

Wait a moment. Automatic recovery starts:

    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                90/44225 objects misplaced (0.204%)
                Degraded data redundancy: 1631/44225 objects degraded (3.688%), 114 pgs degraded

      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 60 up, 60 in; 10 remapped pgs
        rgw: 3 daemons active

      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.9GiB
        usage:   232GiB used, 92.3TiB / 92.6TiB avail
        pgs:     1631/44225 objects degraded (3.688%)
                 90/44225 objects misplaced (0.204%)
                 5362 active+clean
                 113  active+recovery_wait+degraded
                 9    active+remapped+backfill_wait
                 2    active+recovering
                 1    active+remapped+backfilling
                 1    active+recovering+degraded

      io:
        client:   0B/s rd, 78.1KiB/s wr, 15op/s rd, 15op/s wr
        recovery: 40.4MiB/s, 12objects/s

Result: the number of OSDs has increased from 58 to 60.

    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 60 up, 60 in
        rgw: 3 daemons active

      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.9GiB
        usage:   231GiB used, 92.3TiB / 92.6TiB avail
        pgs:     5488 active+clean

    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1587M |  445G |    0   |     0   |    0   |     0   | exists,up |
    | 1  |  s1  | 1535M |  445G |    0   |     0   |    0   |     0   | exists,up |
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    | 55 |  s2  | 3769M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 56 |  s3  | 3525M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 57 |  s2  | 5262M | 1857G |    0   |     0   |    0   |     0   | exists,up |
    | 58 |  s3  | 4895M | 1858G |    0   |     0   |    0   |     0   | exists,up |
    | 59 |  s2  | 3851M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
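The result can be checked against the "osd:" line of the services section, which reports how many OSDs exist, how many are up, and how many are in. The sketch below shows one way to read those three numbers from saved status text and compare them. The function name osd_counts is hypothetical, and the line format is assumed to match the outputs in this guide.

```python
import re


def osd_counts(status_text):
    """Return {'total', 'up', 'in'} parsed from the 'osd:' status line."""
    m = re.search(r"osd:\s*(\d+) osds:\s*(\d+) up,\s*(\d+) in", status_text)
    if m is None:
        return None
    total, up, in_ = (int(g) for g in m.groups())
    return {"total": total, "up": up, "in": in_}


# 'osd:' lines copied from the status outputs before and after add_disk.
before = osd_counts("osd: 58 osds: 58 up, 58 in")
after = osd_counts("osd: 60 osds: 60 up, 60 in; 10 remapped pgs")

print(after["total"] - before["total"])      # 2
print(after["up"] == after["total"])         # True: every OSD is up
```

The difference of 2 matches the two storage ids (56 and 58) that /dev/sdj carried before it was removed.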

Remove an osd

Story: One of your nodes has failed to power up for no apparent reason, and the OSDs hosted by that node have gone offline. The storage pool must be recovered from the HEALTH_WARN status as soon as possible. You can do this from any live host in your cluster.

Connect to one of your live hosts.
Run the CLI command remove_osd and remove every failed OSD from the list.

After all failed OSDs have been removed, check your storage health with the CLI command storage> status.

    s2:storage> remove_osd
    Enter osd id to be removed:
    1:  down (1.81920)
    2:  down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
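With dozens of entries in the list, it helps to note the indices of the failed OSDs before answering the prompt. The sketch below picks out entries labelled "down" from a saved copy of the listing. It rests on an assumption: that "down" entries are the failed OSDs, as the listing above suggests. The function name down_indices is hypothetical.

```python
import re


def down_indices(listing):
    """Return the list indices of entries labelled 'down' in the listing."""
    indices = []
    for line in listing.splitlines():
        # Matches lines such as "1:  down (1.81920)" but not "3: osd.31 (hdd)".
        m = re.match(r"\s*(\d+):\s+down\b", line)
        if m:
            indices.append(int(m.group(1)))
    return indices


# The first entries of the remove_osd listing shown above.
LISTING = """\
Enter osd id to be removed:
1:  down (1.81920)
2:  down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
"""

print(down_indices(LISTING))  # [1, 2]
```

Because remove_osd handles one entry per run in the transcript above, the listing may be renumbered after each removal, so it is safer to recompute the indices from fresh output each time.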

prepare_disk

Story: You have a failed hard disk and want to remove it from your storage pool, but you cannot unplug it from the node. prepare_disk removes the hard disk from the storage pool and deletes its partition table without any physical removal, so the disk is permanently removed from the storage pool even though it remains in the server.

Connect to the host.

Remove the disk with the CLI command prepare_disk.

    s3:storage> prepare_disk
      index          name      size   storage ids
    --
          1      /dev/sda    894.3G         21 23
          2      /dev/sdb    894.3G         25 27
          3      /dev/sdc      3.7T         29 31
          4      /dev/sdd      3.7T         33 35
          5      /dev/sde      3.7T         36 38
          6      /dev/sdf      3.7T         41 42
          7      /dev/sdg      3.7T         44 46
          8      /dev/sdh      3.7T         48 50
          9      /dev/sdi      3.7T         52 54
         10      /dev/sdj      3.7T         56 58
    --
    Enter the index of disk to be removed: 10
    Enter 'YES' to confirm: YES
    Remove disk /dev/sdj successfully.