Version: 2.5

Remove a Failed Hard Disk

Scenario: when a hard disk is failing, you need to remove it from the cluster and restore the health of the storage pool.

Connect to the Console

ssh user@192.168.1.x

Warning: Permanently added '192.168.1.x' (ECDSA) to the list of known hosts.
Password:
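
After logging in, enter storage mode; every command in this guide runs from the storage prompt:

s1> storage
s1:storage>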

Check Storage Status

First, check the status of the storage cluster.

As shown below, two OSDs are down due to a failed hard disk: OSD 56 and OSD 58 on node s3. Note that both are still counted as in (58 up, 60 in), so their data is degraded rather than lost.

The status command is available under storage mode.

s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 60 osds: 58 up, 60 in
rgw: 3 daemons active

data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 232GiB used, 92.3TiB / 92.6TiB avail
pgs: 1611/44204 objects degraded (3.644%)
4873 active+clean
509 active+undersized
106 active+undersized+degraded

io:
client: 0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
recovery: 5B/s, 0objects/s
cache: 0op/s promote

+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1597M | 445G | 0 | 0 | 0 | 4096 | exists,up |
| 1 | s1 | 1543M | 445G | 0 | 819 | 0 | 1638 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4562M | 1858G | 2 | 64.0k | 8 | 32.8k | exists,up |
| 55 | s2 | 3784M | 1859G | 6 | 119k | 429 | 1717k | exists,up |
| 56 | s3 | 3552M | 1859G | 0 | 0 | 0 | 0 | exists |
| 57 | s2 | 5285M | 1857G | 3 | 49.6k | 12 | 76.8k | exists,up |
| 58 | s3 | 4921M | 1858G | 0 | 0 | 0 | 0 | exists |
| 59 | s2 | 3865M | 1859G | 1 | 17.6k | 2 | 9011 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
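
In the OSD table, the failed OSDs report exists instead of exists,up. If you also have direct shell access to the underlying Ceph cluster, the same information can be gathered with the upstream Ceph CLI (a sketch; OSD ID 56 is taken from the output above):

ceph -s               # overall cluster health, the same data as status
ceph osd tree down    # list only the OSDs that are currently down
ceph osd find 56      # locate osd.56: its host and CRUSH position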

Remove a Disk

The remove_disk command is available under storage mode.

There are two modes for removing a disk: safe mode and force mode.

  • In safe mode, Ceph first migrates all placement groups (PGs) off the disk, so the disk is removed gracefully and the existing data stays protected. A rough upstream equivalent is sketched after this list.
  • In force mode, no data migration is performed; the designated drive is removed by force. This can very easily corrupt the Ceph cluster.
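
For reference, safe mode behaves roughly like the following sequence of upstream Ceph commands (a sketch, assuming direct shell access to a node; this is not necessarily what the appliance runs verbatim):

ceph osd out 56 58                         # stop placing data on the OSDs; PGs migrate off
ceph osd safe-to-destroy 56                # reports OK only once no data would be lost
ceph osd purge 56 --yes-i-really-mean-it   # remove the OSD from CRUSH, auth, and the osdmap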

After entering the command, the listing shows that /dev/sdj (index 10) hosts OSD IDs 56 and 58.

We need to remove /dev/sdj from the Ceph pool.

s3> storage
s3:storage> remove_disk
index name size storage ids
--
1 /dev/sda 894.3G 21 23
2 /dev/sdb 894.3G 25 27
3 /dev/sdc 3.7T 29 31
4 /dev/sdd 3.7T 33 35
5 /dev/sde 3.7T 36 38
6 /dev/sdf 3.7T 41 42
7 /dev/sdg 3.7T 44 46
8 /dev/sdh 3.7T 48 50
9 /dev/sdi 3.7T 52 54
10 /dev/sdj 3.7T 56 58
--
Enter the index of disk to be removed: 10
Disk removal mode:
1: safe
2: force
Enter index: 1
safe mode takes longer by attempting to migrate data on disk(s).
Enter 'YES' to confirm: YES
Remove disk /dev/sdj successfully.

After removing the problematic disk, check the status of the storage pool again. Ceph recovers the degraded data automatically.

s3:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
276/44228 objects misplaced (0.624%)
Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded

services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s3, s2
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 58 osds: 58 up, 58 in; 7 remapped pgs
rgw: 3 daemons active

data:
pools: 22 pools, 5488 pgs
objects: 21.60k objects, 82.9GiB
usage: 227GiB used, 88.7TiB / 88.9TiB avail
pgs: 908/44228 objects degraded (2.053%)
276/44228 objects misplaced (0.624%)
5409 active+clean
61 active+recovery_wait+degraded
7 active+recovering+degraded
4 active+recovering
4 active+remapped+backfill_wait
3 active+undersized+remapped+backfill_wait

io:
client: 273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
recovery: 135MiB/s, 0keys/s, 37objects/s
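
Recovery progress shows up in the io: section (recovery: 135MiB/s above). To follow it continuously with the upstream Ceph CLI (a sketch, assuming direct shell access):

ceph -w                        # stream cluster events until recovery completes
ceph pg dump_stuck degraded    # list PGs still degraded, useful if recovery stalls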

Result

After a while, check the status again with the status command under storage mode.

If the failed hard disk was removed successfully, the health status returns to HEALTH_OK.

s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK

services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 58 osds: 58 up, 58 in
rgw: 3 daemons active

data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 229GiB used, 88.7TiB / 88.9TiB avail
pgs: 5488 active+clean

io:
client: 132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1594M | 445G | 0 | 0 | 0 | 0 | exists,up |
| 1 | s1 | 1536M | 445G | 0 | 0 | 0 | 0 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4665M | 1858G | 3 | 29.6k | 0 | 0 | exists,up |
| 55 | s2 | 3769M | 1859G | 0 | 0 | 0 | 0 | exists,up |
| 57 | s2 | 5366M | 1857G | 0 | 819 | 0 | 0 | exists,up |
| 59 | s2 | 3851M | 1859G | 0 | 0 | 0 | 0 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+

For more information, check status.
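
To wait for recovery from a script instead of polling by hand, a minimal loop against the upstream Ceph CLI looks like this (a sketch, assuming direct shell access; the 30-second interval is arbitrary):

while ! ceph health | grep -q HEALTH_OK; do
    sleep 30    # recheck every 30 seconds
done
echo "cluster healthy"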

Remove All Existing OSDs on a Node

Use the remove_exist command under storage mode in the CLI.

cc1:storage> remove_exist
index name size osd serial
--
1 /dev/sda 745.2G 1 0 BTWA632602ZU800HGN
2 /dev/sdb 745.2G 3 2 BTWA632601X9800HGN
3 /dev/sdc 745.2G 5 4 BTWA632601RW800HGN
4 /dev/sdd 745.2G 7 6 BTWA6326038Z800HGN
5 /dev/sde 745.2G 9 8 BTWA632605GP800HGN
6 /dev/sdg 745.2G 11 10 BTWA632601U9800HGN
7 /dev/sdh 745.2G 13 12 BTWA632604RJ800HGN
8 /dev/sdi 745.2G 15 14 BTWA632602Q3800HGN
9 /dev/sdj 745.2G 16 17 BTWA63260373800HGN
10 /dev/sdk 745.2G 19 18 BTWA632605EV800HGN
11 /dev/sdl 745.2G 21 20 BTWA63250476800HGN
12 /dev/sdm 745.2G 22 23 BTWA6326047A800HGN
13 /dev/sdn 745.2G 24 25 BTWA632602Q0800HGN
14 /dev/sdo 744.6G 27 26 BTWA632605EV800HGN
--
Disk removal mode:
1: safe
2: force
Enter index: 1
safe mode takes longer by attempting to migrate data on disk(s).
Enter 'YES' to confirm: YES
--
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 ssd 0.36349 1.00000 372 GiB 4.9 GiB 4.9 GiB 2 KiB 21 MiB 367 GiB 1.32 1.00 61 up
TOTAL 372 GiB 4.9 GiB 4.9 GiB 2.7 KiB 21 MiB 367 GiB 1.32
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
OSD(s) 0 have 40 pgs currently mapped to them.
OSD(s) 0 have 36 pgs currently mapped to them.
OSD(s) 0 have 26 pgs currently mapped to them.
OSD(s) 0 have 14 pgs currently mapped to them.
marked down osd.0.
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
1 ssd 0.36349 1.00000 372 GiB 11 GiB 9.9 GiB 2 KiB 1.3 GiB 361 GiB 3.01 1.00 64 up
TOTAL 372 GiB 11 GiB 9.9 GiB 2.7 KiB 1.3 GiB 361 GiB 3.01
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
OSD(s) 1 have 33 pgs currently mapped to them.
OSD(s) 1 have 3 pgs currently mapped to them.
OSD(s) 1 have 2 pgs currently mapped to them.
marked down osd.1.
Removed disk /dev/sda.
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 ssd 0.36349 1.00000 372 GiB 15 GiB 15 GiB 4 KiB 94 MiB 358 GiB 3.95 1.00 76 up
TOTAL 372 GiB 15 GiB 15 GiB 4.8 KiB 94 MiB 358 GiB 3.95
MIN/MAX VAR: 1.00/1.00 STDDEV: 0
OSD(s) 2 have 53 pgs currently mapped to them.
OSD(s) 2 have 46 pgs currently mapped to them.
OSD(s) 2 have 36 pgs currently mapped to them.
OSD(s) 2 have 24 pgs currently mapped to them.
OSD(s) 2 have 8 pgs currently mapped to them.
OSD(s) 2 have 4 pgs currently mapped to them.
Restored /dev/sdb: osd.2 as pgs could not be moved likely due to too few disks or little space in the failure domain.
Failed to remove disk /dev/sdb with safe mode for storage cannot become healthy without it.
--
Processed 1 disk out of 14.
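
In the run above, safe mode removed /dev/sda but refused /dev/sdb: with the remaining disks, the PGs on osd.2 had nowhere to migrate without violating the failure domain, so the cluster could not return to health without that disk. Upstream Ceph exposes the same safety gate, and draining every OSD on one node looks roughly like this (a sketch, assuming direct shell access; the CRUSH node name cc1 matches the prompt above):

for id in $(ceph osd ls-tree cc1); do    # every OSD under CRUSH node cc1
    ceph osd out "$id"                   # begin migrating PGs off each OSD
done
# wait for migration to finish, then purge each OSD only once it is safe:
for id in $(ceph osd ls-tree cc1); do
    ceph osd safe-to-destroy "$id" && \
        ceph osd purge "$id" --yes-i-really-mean-it
done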