Hard disk replacement
Remove a failed hard disk
Story: You have found that a hard disk has failed. You need to remove it from your cluster and restore the health of your storage pool. To do so:
- Check the storage status before you start.
- As shown below, two OSDs (osd.56 and osd.58) on node s3 are down due to a failed hard disk.
s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
2 osds down
Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded
services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 60 osds: 58 up, 60 in
rgw: 3 daemons active
data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 232GiB used, 92.3TiB / 92.6TiB avail
pgs: 1611/44204 objects degraded (3.644%)
4873 active+clean
509 active+undersized
106 active+undersized+degraded
io:
client: 0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
recovery: 5B/s, 0objects/s
cache: 0op/s promote
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1597M | 445G | 0 | 0 | 0 | 4096 | exists,up |
| 1 | s1 | 1543M | 445G | 0 | 819 | 0 | 1638 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4562M | 1858G | 2 | 64.0k | 8 | 32.8k | exists,up |
| 55 | s2 | 3784M | 1859G | 6 | 119k | 429 | 1717k | exists,up |
| 56 | s3 | 3552M | 1859G | 0 | 0 | 0 | 0 | exists |
| 57 | s2 | 5285M | 1857G | 3 | 49.6k | 12 | 76.8k | exists,up |
| 58 | s3 | 4921M | 1858G | 0 | 0 | 0 | 0 | exists |
| 59 | s2 | 3865M | 1859G | 1 | 17.6k | 2 | 9011 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
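The degraded percentage and the down-OSD count in the status output above can be derived directly from the raw counters. A quick sanity check with standard awk, using the numbers from this capture:

```shell
# Degraded ratio: degraded object replicas / total, as shown in the status line.
awk 'BEGIN { printf "%.3f%%\n", 100 * 1611 / 44204 }'   # 3.644%

# Down OSD count: total OSDs minus the "up" count from the osd summary line.
line="osd: 60 osds: 58 up, 60 in"
echo "$line" | awk '{ print $2 - $4, "osd(s) down" }'    # 2 osd(s) down
```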
Remove the disk
- Connect to the host s3.
- Run the CLI command remove_disk. It shows that /dev/sdj is associated with storage ids 56 and 58 at index 10.
- Remove /dev/sdj from the Ceph pool.
- Remove the hard disk from the node.
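For reference, on a vanilla Ceph cluster (without this appliance CLI) retiring the two OSDs backed by the failed disk corresponds roughly to the stock commands below. This is a sketch only: the commands are printed rather than executed, the exact sequence can vary between Ceph releases, and the osd ids 56 and 58 come from the table in the session below.

```shell
# Print (do not run) the stock-Ceph commands that roughly match remove_disk.
vanilla_remove_cmds() {
  for id in "$@"; do
    echo "ceph osd out osd.$id"                       # stop placing data on the OSD
    echo "systemctl stop ceph-osd@$id"                # stop the OSD daemon
    echo "ceph osd purge $id --yes-i-really-mean-it"  # drop it from CRUSH, auth, and the osd map
  done
}
vanilla_remove_cmds 56 58
```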
s3:storage> remove_disk
index name size storage ids
--
1 /dev/sda 894.3G 21 23
2 /dev/sdb 894.3G 25 27
3 /dev/sdc 3.7T 29 31
4 /dev/sdd 3.7T 33 35
5 /dev/sde 3.7T 36 38
6 /dev/sdf 3.7T 41 42
7 /dev/sdg 3.7T 44 46
8 /dev/sdh 3.7T 48 50
9 /dev/sdi 3.7T 52 54
10 /dev/sdj 3.7T 56 58
--
Enter the index of disk to be removed: 10
Enter 'YES' to confirm: YES
Remove disk /dev/sdj successfully.
- Check the status of your storage pool again.
- As you can see, Ceph recovers the data automatically.
s3:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
276/44228 objects misplaced (0.624%)
Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded
services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s3, s2
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 58 osds: 58 up, 58 in; 7 remapped pgs
rgw: 3 daemons active
data:
pools: 22 pools, 5488 pgs
objects: 21.60k objects, 82.9GiB
usage: 227GiB used, 88.7TiB / 88.9TiB avail
pgs: 908/44228 objects degraded (2.053%)
276/44228 objects misplaced (0.624%)
5409 active+clean
61 active+recovery_wait+degraded
7 active+recovering+degraded
4 active+recovering
4 active+remapped+backfill_wait
3 active+undersized+remapped+backfill_wait
io:
client: 273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
recovery: 135MiB/s, 0keys/s, 37objects/s
Results:
- Wait for a while, then check the status again.
- The failed hard disk has been removed successfully and the cluster health is back to HEALTH_OK.
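Instead of re-running status by hand, you can poll until the cluster reports HEALTH_OK. A small sketch: the health-reporting command is passed in as a parameter, since on a plain Ceph node it would typically be `ceph health` (an assumption about your environment, not part of the appliance CLI).

```shell
# Block until the given command prints HEALTH_OK, polling every 30 seconds.
wait_for_health_ok() {
  while [ "$($1)" != "HEALTH_OK" ]; do
    sleep 30
  done
}

# Demo with a stub; on a real node you might use: wait_for_health_ok "ceph health"
wait_for_health_ok "echo HEALTH_OK"
```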
s1:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK
services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s2, s3
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 58 osds: 58 up, 58 in
rgw: 3 daemons active
data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.8GiB
usage: 229GiB used, 88.7TiB / 88.9TiB avail
pgs: 5488 active+clean
io:
client: 132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1594M | 445G | 0 | 0 | 0 | 0 | exists,up |
| 1 | s1 | 1536M | 445G | 0 | 0 | 0 | 0 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 54 | s3 | 4665M | 1858G | 3 | 29.6k | 0 | 0 | exists,up |
| 55 | s2 | 3769M | 1859G | 0 | 0 | 0 | 0 | exists,up |
| 57 | s2 | 5366M | 1857G | 0 | 819 | 0 | 0 | exists,up |
| 59 | s2 | 3851M | 1859G | 0 | 0 | 0 | 0 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
Add/Replace a New Disk
Story: We have bought a new hard disk to add to our storage pool.
- Connect to the host where the new hard disk was installed.
- Add the new disk to the node with the CLI command add_disk.
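Before running add_disk, it can be worth confirming that the kernel sees the new drive at all. A sketch that filters whole disks out of `lsblk -dn -o NAME,SIZE,TYPE` output; the rows below are a captured sample, so on a live node you would pipe the real command instead.

```shell
# Sample `lsblk -dn -o NAME,SIZE,TYPE` output (whole devices, no header).
sample='sda 894.3G disk
sdb 894.3G disk
sdj 3.7T disk'

# Keep only rows of type "disk" and print them as device paths.
echo "$sample" | awk '$3 == "disk" { print "/dev/" $1, $2 }'
```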
s3:storage> add_disk
index name size
--
1 /dev/sdj 3.7T
--
Found 1 available disk
Enter the index to add this disk into the pool: 1
Enter 'YES' to confirm: YES
Add disk /dev/sdj successfully.
- Wait for a moment; automatic recovery starts.
s3:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_WARN
90/44225 objects misplaced (0.204%)
Degraded data redundancy: 1631/44225 objects degraded (3.688%), 114 pgs degraded
services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s3, s2
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 60 osds: 60 up, 60 in; 10 remapped pgs
rgw: 3 daemons active
data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.9GiB
usage: 232GiB used, 92.3TiB / 92.6TiB avail
pgs: 1631/44225 objects degraded (3.688%)
90/44225 objects misplaced (0.204%)
5362 active+clean
113 active+recovery_wait+degraded
9 active+remapped+backfill_wait
2 active+recovering
1 active+remapped+backfilling
1 active+recovering+degraded
io:
client: 0B/s rd, 78.1KiB/s wr, 15op/s rd, 15op/s wr
recovery: 40.4MiB/s, 12objects/s
- Result: the OSD count has changed from 58 back to 60.
s3:storage> status
cluster:
id: c6e64c49-09cf-463b-9d1c-b6645b4b3b85
health: HEALTH_OK
services:
mon: 3 daemons, quorum s1,s2,s3
mgr: s1(active), standbys: s3, s2
mds: cephfs-1/1/1 up {0=s3=up:active}, 2 up:standby
osd: 60 osds: 60 up, 60 in
rgw: 3 daemons active
data:
pools: 22 pools, 5488 pgs
objects: 21.59k objects, 82.9GiB
usage: 231GiB used, 92.3TiB / 92.6TiB avail
pgs: 5488 active+clean
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | s1 | 1587M | 445G | 0 | 0 | 0 | 0 | exists,up |
| 1 | s1 | 1535M | 445G | 0 | 0 | 0 | 0 | exists,up |
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
| 55 | s2 | 3769M | 1859G | 0 | 0 | 0 | 0 | exists,up |
| 56 | s3 | 3525M | 1859G | 0 | 0 | 0 | 0 | exists,up |
| 57 | s2 | 5262M | 1857G | 0 | 0 | 0 | 0 | exists,up |
| 58 | s3 | 4895M | 1858G | 0 | 0 | 0 | 0 | exists,up |
| 59 | s2 | 3851M | 1859G | 0 | 0 | 0 | 0 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
Remove an OSD
Story: One of your nodes has failed to power up for no apparent reason, and the OSDs hosted by that node have gone offline. You have to recover the storage pool from the HEALTH_WARN state as soon as possible. You can do this from any host in your cluster.
- Connect to one of your live hosts.
- Remove the OSDs with the CLI command remove_osd, removing every failed OSD from the list.
- After all failed OSDs have been removed, check your storage health with the CLI command status.
s2:storage> remove_osd
Enter osd id to be removed:
1: down (1.81920)
2: down (1.81920)
3: osd.31 (hdd)
4: osd.35 (hdd)
5: osd.36 (hdd)
6: osd.38 (hdd)
7: osd.41 (hdd)
8: osd.42 (hdd)
9: osd.44 (hdd)
10: osd.46 (hdd)
11: osd.48 (hdd)
12: osd.50 (hdd)
13: osd.52 (hdd)
14: osd.54 (hdd)
15: osd.64 (hdd)
16: osd.65 (hdd)
17: osd.66 (hdd)
18: osd.67 (hdd)
19: osd.68 (hdd)
20: osd.69 (hdd)
21: osd.70 (hdd)
22: osd.71 (hdd)
23: osd.72 (hdd)
24: osd.73 (hdd)
25: osd.74 (hdd)
26: osd.75 (hdd)
27: osd.76 (hdd)
28: osd.77 (hdd)
29: osd.78 (hdd)
30: osd.79 (hdd)
31: osd.21 (ssd)
32: osd.23 (ssd)
33: osd.25 (ssd)
34: osd.27 (ssd)
35: osd.60 (ssd)
36: osd.61 (ssd)
37: osd.62 (ssd)
38: osd.63 (ssd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.31 successfully.
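When many OSDs are down, it helps to list them before starting. On a plain Ceph node the down OSDs can be filtered out of `ceph osd tree`; the sketch below runs against captured sample rows, and the column layout assumed here (ID, CLASS, WEIGHT, NAME, STATUS, ...) may differ between Ceph releases, so check your own output first.

```shell
# Sample OSD rows as printed by `ceph osd tree` (leaf entries only).
sample='56  hdd  1.81920  osd.56  down  1.00000  1.00000
58  hdd  1.81920  osd.58  down  1.00000  1.00000
59  hdd  1.81920  osd.59  up    1.00000  1.00000'

# Print the name of every OSD whose status column reads "down".
echo "$sample" | awk '$5 == "down" { print $4 }'
```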
prepare_disk
Story: You have a failed hard disk and want to remove it from your storage pool, but you cannot unplug the disk from the node. Without any physical removal, prepare_disk removes the hard disk from the storage pool and deletes its partition table, so the disk is permanently removed from the pool even though it remains in the server.
- Connect to the host.
- Remove the disk with the CLI command prepare_disk.
s3:storage> prepare_disk
index name size storage ids
--
1 /dev/sda 894.3G 21 23
2 /dev/sdb 894.3G 25 27
3 /dev/sdc 3.7T 29 31
4 /dev/sdd 3.7T 33 35
5 /dev/sde 3.7T 36 38
6 /dev/sdf 3.7T 41 42
7 /dev/sdg 3.7T 44 46
8 /dev/sdh 3.7T 48 50
9 /dev/sdi 3.7T 52 54
10 /dev/sdj 3.7T 56 58
--
Enter the index of disk to be removed: 10
Enter 'YES' to confirm: YES
Remove disk /dev/sdj successfully.
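For context, clearing a partition table by hand on a stock Linux node usually involves `wipefs` and `sgdisk`. The sketch below only prints the commands instead of running them, because they are destructive; it is an assumption about what prepare_disk does internally, not a confirmed implementation.

```shell
# Print (do not run) typical commands for clearing a disk's partition table.
wipe_disk_cmds() {
  echo "wipefs --all $1"      # erase filesystem, RAID, and partition-table signatures
  echo "sgdisk --zap-all $1"  # zap GPT and protective-MBR data structures
}
wipe_disk_cmds /dev/sdj
```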