Version: 2.4

Hard disk replacement

Remove a failed hard disk#

Storage status#

Story: You have found that a hard disk has failed. You need to remove it from your cluster and restore your storage pool's health. To do so,

  • check the storage status before you start
  • as shown below, we have 2 OSDs down due to a failed hard disk: osd.4 and osd.5 on node sky141 (a scripted version of this check follows the output)
    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                2 osds down
                Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.93k objects, 785 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     753 active+clean

      io:
        client:   5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141  65.7G   380G       6    40.0k       7     161k  exists,up
     1  sky141   181G   265G       8    32.7k       1    58.4k  exists,up
     2  sky141   162G   283G       0     4096      15     604k  exists,up
     3  sky141   133G   313G       0     1638       2    29.6k  exists,up
     4  sky141  91.8G   354G      14    97.5k       6    39.2k  exists
     5  sky141   130G   315G       8    39.1k       3    88.9k  exists
     6  sky142  96.0G   350G       9    50.3k       3     160k  exists,up
     7  sky142   165G   281G       0        0       1    89.6k  exists,up
     8  sky142  75.8G   370G       0     6553       1    25.6k  exists,up
     9  sky142   199G   247G       0     3276       3     172k  exists,up
    10  sky142   122G   324G       2    13.5k       9     510k  exists,up
    11  sky142  95.3G   351G       1     4096       6     126k  exists,up
    12  sky143   184G   262G       3    12.0k       1    25.6k  exists,up
    13  sky143  93.6G   353G       0        0       0     5734  exists,up
    14  sky143  67.8G   378G      12    71.1k      13     364k  exists,up
    15  sky143  92.6G   354G       0      819       0        0  exists,up
    16  sky143   142G   303G       0      819       2    24.0k  exists,up
    17  sky143   179G   267G       0     2457       5    99.2k  exists,up
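
If you want to repeat this check from a script, the sketch below is one way to do it. It is a minimal illustration in Python, assuming the underlying ceph CLI is reachable on a cluster node; the helper names are ours and the script is not part of the storage> shell shown above.

    #!/usr/bin/env python3
    """Illustrative sketch: list OSDs that are currently down, grouped by host."""
    import json
    import subprocess

    def osd_tree():
        # "ceph osd tree -f json" returns a flat node list covering hosts and OSDs.
        out = subprocess.check_output(["ceph", "osd", "tree", "-f", "json"])
        return json.loads(out)

    def down_osds_by_host(tree):
        nodes = {n["id"]: n for n in tree.get("nodes", [])}
        result = {}
        for node in nodes.values():
            if node.get("type") != "host":
                continue
            # A host's "children" are the OSD ids placed under it in the CRUSH tree.
            downs = [nodes[c]["name"]
                     for c in node.get("children", [])
                     if nodes.get(c, {}).get("type") == "osd"
                     and nodes[c].get("status") == "down"]
            if downs:
                result[node["name"]] = downs
        return result

    if __name__ == "__main__":
        for host, osds in down_osds_by_host(osd_tree()).items():
            # For the cluster above we would expect: "sky141: osd.4, osd.5"
            print(f"{host}: {', '.join(osds)}")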

Remove disk#

  • connect to the host sky141 (the node with the failed disk)

  • use the CLI remove_disk; it shows that /dev/sde (index 3) is associated with osd.4 and osd.5

  • then remove /dev/sde from the Ceph pool

  • remove the hard disk from the node

    sky141:storage> remove_disk
     index          name      size     osd              serial
    --
         1      /dev/sda    894.3G     0 1      S40FNA0M800607
         2      /dev/sdc    894.3G     2 3      S40FNA0M800598
         3      /dev/sde    894.3G     4 5      S40FNA0M800608
    --
    Enter the index of disk to be removed: 3
    Disk removal mode (safe/force): force
    force mode immediately destroys disk data without taking into accounts of
    storage status so USE IT AT YOUR OWN RISK.
    Enter 'YES' to confirm: YES
  • let's check the status of our storage pool; Ceph is recovering the data automatically (a small recovery-monitoring sketch follows the output below)

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.94k objects, 785 GiB
        usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
        pgs:     6075/438706 objects degraded (1.385%)
                 5463/438706 objects misplaced (1.245%)
                 738 active+clean
                 8   active+undersized+degraded+remapped+backfilling
                 7   active+remapped+backfilling

      io:
        client:   4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
        recovery: 127 MiB/s, 27 objects/s

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   141G   305G       1    28.0k       1    42.4k  exists,up
     1  sky141   177G   268G      11    88.0k       3    26.3k  exists,up
     2  sky141   212G   233G       2    12.7k       0        0  exists,up
     3  sky141   193G   253G       3    31.1k       7     634k  exists,up
     6  sky142  86.0G   360G       9    40.0k       2    27.1k  exists,up
     7  sky142   179G   267G       7     184k       2     119k  exists,up
     8  sky142  90.8G   355G       0    18.3k      19    1553k  exists,up
     9  sky142   201G   245G       8    35.1k      16    1450k  exists,up
    10  sky142   108G   337G       6    51.1k      11     755k  exists,up
    11  sky142  98.5G   348G       0     6553       2    41.6k  exists,up
    12  sky143   201G   245G      16     100k       3     230k  exists,up
    13  sky143   122G   323G       0        0       0        0  exists,up
    14  sky143  88.0G   358G      15    76.0k      47    2970k  exists,up
    15  sky143   100G   346G       7     183k      14    1286k  exists,up
    16  sky143   127G   319G       5    28.0k      15     659k  exists,up
    17  sky143   132G   314G      23     225k       9     491k  exists,up
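
While Ceph backfills, you can watch the degraded and misplaced ratios shrink. The sketch below is a minimal polling loop, assuming the ceph CLI is available on the node; the pgmap fields are read defensively because they only appear while recovery is in progress, and this is not part of the storage> shell.

    #!/usr/bin/env python3
    """Illustrative sketch: poll cluster status until recovery finishes."""
    import json
    import subprocess
    import time

    def cluster_status():
        out = subprocess.check_output(["ceph", "status", "-f", "json"])
        return json.loads(out)

    def watch_recovery(poll_seconds=30):
        while True:
            s = cluster_status()
            pgmap = s.get("pgmap", {})
            # Ratios are fractions (e.g. 0.01385); missing keys mean "none degraded".
            degraded = pgmap.get("degraded_ratio", 0.0) * 100
            misplaced = pgmap.get("misplaced_ratio", 0.0) * 100
            health = s.get("health", {}).get("status", "UNKNOWN")
            print(f"{health}: {degraded:.3f}% degraded, {misplaced:.3f}% misplaced")
            if health == "HEALTH_OK":
                break
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        watch_recovery()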

Results#

  • wait for a while and check the status again

  • We have successfully removed the failed hard disk and the health status is HEALTH_OK, as shown below (a scripted health gate follows the output)

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.99k objects, 786 GiB
        usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
        pgs:     753 active+clean

      io:
        client:   25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   148G   298G       0        0       0        0  exists,up
     1  sky141   176G   269G       5    23.1k       1        0  exists,up
     2  sky141   202G   243G       0    28.0k       1        0  exists,up
     3  sky141   220G   225G       0     3276       0        0  exists,up
     6  sky142  86.1G   360G       4    20.7k       0        0  exists,up
     7  sky142   180G   266G       0        0       0        0  exists,up
     8  sky142  89.2G   357G       7    49.5k       2    10.3k  exists,up
     9  sky142   201G   245G       0      819       0        0  exists,up
    10  sky142   108G   337G       1     7372       0     5734  exists,up
    11  sky142  99.1G   347G       0    12.7k       0        0  exists,up
    12  sky143   199G   247G       1     5734       1        0  exists,up
    13  sky143   112G   333G       4    22.3k       0        0  exists,up
    14  sky143  86.3G   360G       1    18.3k       2       90  exists,up
    15  sky143  98.7G   347G       0    16.0k       1        0  exists,up
    16  sky143   128G   318G       1     4915       2     9027  exists,up
    17  sky143   141G   305G       2    22.3k       0        0  exists,up
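
If you drive this procedure from a maintenance script, a simple health gate like the sketch below can confirm the cluster is back to HEALTH_OK before the next step runs. It assumes the ceph CLI is available; the script name and structure are only illustrative.

    #!/usr/bin/env python3
    """Illustrative sketch: exit non-zero unless the cluster reports HEALTH_OK."""
    import json
    import subprocess
    import sys

    def health_status():
        out = subprocess.check_output(["ceph", "status", "-f", "json"])
        return json.loads(out).get("health", {}).get("status", "UNKNOWN")

    if __name__ == "__main__":
        status = health_status()
        print(f"cluster health: {status}")
        # A wrapper script can stop early on any status other than HEALTH_OK.
        sys.exit(0 if status == "HEALTH_OK" else 1)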

Add a new disk#

Story: We have bought a new hard disk to add to our storage pool.

  • connect to the host where the new hard disk is installed

  • add the new disk to the node with the CLI add_disk, as shown below

    sky141:storage> add_disk
     index          name      size              serial
    --
         1      /dev/sde    894.3G     S40FNA0M800608
    --
    Found 5 available disks
    Enter the index to add this disk into the pool: 2
    Enter 'YES' to confirm: YES
    Add disk /dev/sde successfully.
  • wait for a moment; auto recovery starts, as shown below

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.99k objects, 786 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     277/438826 objects degraded (0.063%)
                 82685/438826 objects misplaced (18.842%)
                 660 active+clean
                 51  active+remapped+backfilling
                 39  active+remapped+backfill_wait
                 2   active+remapped
                 1   active+undersized+degraded+remapped+backfilling

      io:
        client:   727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
        recovery: 873 MiB/s, 2 keys/s, 165 objects/s

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   149G   297G       0     6553       0        0  exists,up
     1  sky141   175G   271G       0     4096       1    18.6k  exists,up
     2  sky141   202G   244G       0    66.3k       3    34.4k  exists,up
     3  sky141   221G   225G       4    16.0k       1    17.0k  exists,up
     4  sky141  6397M   440G       0        0       0        0  exists,up
     5  sky141  3304M   443G       0        0       0       72  exists,up
     6  sky142  86.3G   360G       3    15.4k       3    60.5k  exists,up
     7  sky142   180G   266G       0      585       1     8192  exists,up
     8  sky142  89.2G   357G       0    23.1k       5     128k  exists,up
     9  sky142   201G   245G      11    70.3k       3    46.4k  exists,up
    10  sky142   109G   337G       1    17.1k       1    28.0k  exists,up
    11  sky142  99.1G   347G       0        0       0        0  exists,up
    12  sky143   200G   245G       3    16.0k       4     193k  exists,up
    13  sky143   114G   332G       0     4915       0        0  exists,up
    14  sky143  86.3G   360G       5    41.0k      12     303k  exists,up
    15  sky143  99.3G   347G       2    89.3k       5     129k  exists,up
    16  sky143   128G   317G       0     2457       1       16  exists,up
    17  sky143   143G   303G      13    84.0k       1     7372  exists,up
  • Result: the number of OSDs has changed from 16 back to 18, as shown below (a scripted count check follows the output)

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 150.06k objects, 786 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     753 active+clean

      io:
        client:   1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141  64.9G   381G       3    30.3k       0        0  exists,up
     1  sky141   180G   266G       8    40.7k       0        0  exists,up
     2  sky141   161G   285G       0     8192       0        0  exists,up
     3  sky141   131G   314G       3    16.0k       0        0  exists,up
     4  sky141  91.1G   355G       1     7372       1        0  exists,up
     5  sky141   130G   316G       0        0       1       90  exists,up
     6  sky142  96.3G   350G       5    23.1k       0        0  exists,up
     7  sky142   165G   281G       0     7372       0        0  exists,up
     8  sky142  76.1G   370G       0        0       1        0  exists,up
     9  sky142   199G   247G       0     6553       0        0  exists,up
    10  sky142   122G   324G       2     9011       0        0  exists,up
    11  sky142  95.5G   351G       2    21.5k       0        0  exists,up
    12  sky143   184G   262G       2    35.1k       1        0  exists,up
    13  sky143  95.7G   350G       0        0       0        0  exists,up
    14  sky143  66.4G   380G       9    44.0k       1        0  exists,up
    15  sky143  92.9G   353G       1     9011       0        0  exists,up
    16  sky143   143G   303G       0     6553       1       16  exists,up
    17  sky143   179G   267G       7    52.0k       1      102  exists,up
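
To confirm the new disk's OSDs joined the pool without reading the full table, a count check like the sketch below is enough. It assumes the ceph CLI is available; the exact JSON keys of ceph osd stat can vary slightly between Ceph releases, so treat the field names as an assumption.

    #!/usr/bin/env python3
    """Illustrative sketch: report total / up / in OSD counts."""
    import json
    import subprocess

    def osd_counts():
        # "ceph osd stat -f json" reports the OSD map summary.
        out = subprocess.check_output(["ceph", "osd", "stat", "-f", "json"])
        stat = json.loads(out)
        return stat.get("num_osds"), stat.get("num_up_osds"), stat.get("num_in_osds")

    if __name__ == "__main__":
        total, up, inside = osd_counts()
        # For the cluster above we expect 18 total, 18 up, 18 in after adding the disk.
        print(f"osds: {total} total, {up} up, {inside} in")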

Remove an osd#

Story: One of your nodes has failed to power up for no apparent reason, and the OSDs hosted by that node have gone offline. You need to recover the storage pool from the HEALTH_WARN status as soon as possible; you can do this from any host in your cluster.

  • connect to one of your live hosts
  • start removing OSDs with the CLI remove_osd and remove all failed OSDs from the list, as in the example below (a helper sketch for listing a failed node's OSDs follows it)
  • after all failed OSDs are removed, check your storage health with the CLI storage> status
    s2:storage> remove_osd
    Enter osd id to be removed:
    1:  down (1.81920)
    2:  down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
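
Before walking through the remove_osd dialog, it can help to list exactly which OSDs belong to the failed node. The sketch below is a minimal helper, assuming the ceph CLI is reachable from a surviving host; the function name and usage are hypothetical, not part of the product CLI.

    #!/usr/bin/env python3
    """Illustrative sketch: list the OSDs hosted on a given (failed) node."""
    import json
    import subprocess
    import sys

    def osds_on_host(hostname):
        out = subprocess.check_output(["ceph", "osd", "tree", "-f", "json"])
        tree = json.loads(out)
        nodes = {n["id"]: n for n in tree.get("nodes", [])}
        host = next((n for n in nodes.values()
                     if n.get("type") == "host" and n.get("name") == hostname), None)
        if host is None:
            return []
        # Return each OSD on the host together with its up/down status.
        return [(nodes[c]["name"], nodes[c].get("status", "unknown"))
                for c in host.get("children", [])
                if nodes.get(c, {}).get("type") == "osd"]

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            sys.exit("usage: osds_on_host.py <failed-hostname>")
        for name, status in osds_on_host(sys.argv[1]):
            print(f"{name}: {status}")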