
Hard disk replacement

Remove a failed hard disk

Storage status

Story: If you discover a failed hard disk, it is essential to remove it from your cluster and restore the health of your storage pool. To proceed:

  • Check the storage status before starting:
  • As shown below, two OSDs (Object Storage Daemons) are down due to the failed hard disk—OSD numbers 4 and 5 on the node with the hostname sky141.
    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                2 osds down
                Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 16 up (since 2d), 18 in (since 2d)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.93k objects, 785 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     753 active+clean

      io:
        client:   5.3 MiB/s rd, 308 KiB/s wr, 171 op/s rd, 56 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141  65.7G   380G       6    40.0k       7     161k  exists,up
     1  sky141   181G   265G       8    32.7k       1    58.4k  exists,up
     2  sky141   162G   283G       0     4096      15     604k  exists,up
     3  sky141   133G   313G       0     1638       2    29.6k  exists,up
     4  sky141  91.8G   354G      14    97.5k       6    39.2k  exists
     5  sky141   130G   315G       8    39.1k       3    88.9k  exists
     6  sky142  96.0G   350G       9    50.3k       3     160k  exists,up
     7  sky142   165G   281G       0        0       1    89.6k  exists,up
     8  sky142  75.8G   370G       0     6553       1    25.6k  exists,up
     9  sky142   199G   247G       0     3276       3     172k  exists,up
    10  sky142   122G   324G       2    13.5k       9     510k  exists,up
    11  sky142  95.3G   351G       1     4096       6     126k  exists,up
    12  sky143   184G   262G       3    12.0k       1    25.6k  exists,up
    13  sky143  93.6G   353G       0        0       0     5734  exists,up
    14  sky143  67.8G   378G      12    71.1k      13     364k  exists,up
    15  sky143  92.6G   354G       0      819       0        0  exists,up
    16  sky143   142G   303G       0      819       2    24.0k  exists,up
    17  sky143   179G   267G       0     2457       5    99.2k  exists,up
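
If you manage the cluster with the plain Ceph tooling rather than the storage shell, the same check can be run from any monitor or admin node. This is a minimal sketch, assuming the ceph CLI and an admin keyring are available there; the storage shell's status command wraps equivalent calls.

    # Overall cluster health and a one-line OSD map summary
    ceph -s
    ceph osd stat

    # Show the CRUSH tree and pick out the OSDs that are currently down
    ceph osd tree | grep -E 'host|down'

    # Explain the causes behind HEALTH_WARN in detail
    ceph health detail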

Remove disk

  • Connect to the host sky141.

  • Run the CLI command remove_disk. It shows that /dev/sde is associated with OSD IDs 4 and 5 at index 3.

  • Then remove /dev/sde from the Ceph pool (an equivalent procedure using the native Ceph tools is sketched after this list).

  • Remove the hard disk from the node.

    sky141:storage> remove_disk
      index          name      size     osd              serial
    --
          1      /dev/sda    894.3G     0 1      S40FNA0M800607
          2      /dev/sdc    894.3G     2 3      S40FNA0M800598
          3      /dev/sde    894.3G     4 5      S40FNA0M800608
    --
    Enter the index of disk to be removed: 3
    Disk removal mode (safe/force): force
    force mode immediately destroys disk data without taking into accounts of
    storage status so USE IT AT YOUR OWN RISK.
    Enter 'YES' to confirm: YES
  • Check the status of the storage pool; Ceph is recovering the data automatically.

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                Degraded data redundancy: 6075/438706 objects degraded (1.385%), 8 pgs degraded, 8 pgs undersized

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 16 osds: 16 up (since 10m), 16 in (since 10m); 15 remapped pgs
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.94k objects, 785 GiB
        usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
        pgs:     6075/438706 objects degraded (1.385%)
                 5463/438706 objects misplaced (1.245%)
                 738 active+clean
                 8   active+undersized+degraded+remapped+backfilling
                 7   active+remapped+backfilling

      io:
        client:   4.4 MiB/s rd, 705 KiB/s wr, 87 op/s rd, 83 op/s wr
        recovery: 127 MiB/s, 27 objects/s

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   141G   305G       1    28.0k       1    42.4k  exists,up
     1  sky141   177G   268G      11    88.0k       3    26.3k  exists,up
     2  sky141   212G   233G       2    12.7k       0        0  exists,up
     3  sky141   193G   253G       3    31.1k       7     634k  exists,up
     6  sky142  86.0G   360G       9    40.0k       2    27.1k  exists,up
     7  sky142   179G   267G       7     184k       2     119k  exists,up
     8  sky142  90.8G   355G       0    18.3k      19    1553k  exists,up
     9  sky142   201G   245G       8    35.1k      16    1450k  exists,up
    10  sky142   108G   337G       6    51.1k      11     755k  exists,up
    11  sky142  98.5G   348G       0     6553       2    41.6k  exists,up
    12  sky143   201G   245G      16     100k       3     230k  exists,up
    13  sky143   122G   323G       0        0       0        0  exists,up
    14  sky143  88.0G   358G      15    76.0k      47    2970k  exists,up
    15  sky143   100G   346G       7     183k      14    1286k  exists,up
    16  sky143   127G   319G       5    28.0k      15     659k  exists,up
    17  sky143   132G   314G      23     225k       9     491k  exists,up
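
For reference, a roughly equivalent procedure with the native Ceph tools marks the failed OSDs out, purges them, and wipes the underlying device. This is only a hedged sketch, assuming a systemd-managed (non-cephadm) deployment, the OSD IDs 4 and 5, and the device /dev/sde from the example above; the remove_disk command performs these steps for you.

    # Mark the failed OSDs out so Ceph starts re-creating their data elsewhere
    ceph osd out 4 5

    # Stop the OSD daemons on the affected host (run on sky141)
    systemctl stop ceph-osd@4 ceph-osd@5

    # Remove each OSD from the CRUSH map and delete its auth key and OSD entry
    ceph osd purge 4 --yes-i-really-mean-it
    ceph osd purge 5 --yes-i-really-mean-it

    # Wipe the LVM metadata and data on the failed device before pulling it
    ceph-volume lvm zap /dev/sde --destroy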

Results

  • Wait for a while and check the status again; the sketch after this list shows one way to watch for recovery to complete.

  • The failed hard disk has been removed successfully, and the cluster health is back to HEALTH_OK.

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 16 osds: 16 up (since 21m), 16 in (since 21m)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.99k objects, 786 GiB
        usage:   2.2 TiB used, 4.8 TiB / 7.0 TiB avail
        pgs:     753 active+clean

      io:
        client:   25 KiB/s rd, 304 KiB/s wr, 19 op/s rd, 43 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   148G   298G       0        0       0        0  exists,up
     1  sky141   176G   269G       5    23.1k       1        0  exists,up
     2  sky141   202G   243G       0    28.0k       1        0  exists,up
     3  sky141   220G   225G       0     3276       0        0  exists,up
     6  sky142  86.1G   360G       4    20.7k       0        0  exists,up
     7  sky142   180G   266G       0        0       0        0  exists,up
     8  sky142  89.2G   357G       7    49.5k       2    10.3k  exists,up
     9  sky142   201G   245G       0      819       0        0  exists,up
    10  sky142   108G   337G       1     7372       0     5734  exists,up
    11  sky142  99.1G   347G       0    12.7k       0        0  exists,up
    12  sky143   199G   247G       1     5734       1        0  exists,up
    13  sky143   112G   333G       4    22.3k       0        0  exists,up
    14  sky143  86.3G   360G       1    18.3k       2       90  exists,up
    15  sky143  98.7G   347G       0    16.0k       1        0  exists,up
    16  sky143   128G   318G       1     4915       2     9027  exists,up
    17  sky143   141G   305G       2    22.3k       0        0  exists,up
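
If you would rather not re-run status by hand, a small shell loop against the native ceph CLI can wait for the cluster to report HEALTH_OK again. A minimal sketch, assuming the ceph CLI and an admin keyring are available on the node:

    # Poll cluster health every 30 seconds until recovery has finished
    until ceph health | grep -q HEALTH_OK; do
        ceph -s | grep -E 'health:|degraded|recovery:'   # current health and recovery progress
        sleep 30
    done
    echo "Cluster is back to HEALTH_OK"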

Add a new disk

Story: You have bought a new hard disk to add to your storage pool.

  • Connect to the host where the new hard disk has been installed.

  • Add the new disk to the node with the CLI command add_disk; for clusters managed with the plain Ceph tools, a sketch of the equivalent steps follows this list.

    sky141:storage> add_disk
      index          name      size              serial
    --
          1      /dev/sde    894.3G     S40FNA0M800608
    --
    Found 1 available disks
    Enter the index to add this disk into the pool: 1
    Enter 'YES' to confirm: YES
    Add disk /dev/sde successfully.
  • Wait for a moment; automatic recovery starts.

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                Degraded data redundancy: 277/438826 objects degraded (0.063%), 1 pg degraded

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 18 up (since 31s), 18 in (since 31s); 89 remapped pgs
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 149.99k objects, 786 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     277/438826 objects degraded (0.063%)
                 82685/438826 objects misplaced (18.842%)
                 660 active+clean
                 51  active+remapped+backfilling
                 39  active+remapped+backfill_wait
                 2   active+remapped
                 1   active+undersized+degraded+remapped+backfilling

      io:
        client:   727 KiB/s rd, 438 KiB/s wr, 31 op/s rd, 55 op/s wr
        recovery: 873 MiB/s, 2 keys/s, 165 objects/s

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141   149G   297G       0     6553       0        0  exists,up
     1  sky141   175G   271G       0     4096       1    18.6k  exists,up
     2  sky141   202G   244G       0    66.3k       3    34.4k  exists,up
     3  sky141   221G   225G       4    16.0k       1    17.0k  exists,up
     4  sky141  6397M   440G       0        0       0        0  exists,up
     5  sky141  3304M   443G       0        0       0       72  exists,up
     6  sky142  86.3G   360G       3    15.4k       3    60.5k  exists,up
     7  sky142   180G   266G       0      585       1     8192  exists,up
     8  sky142  89.2G   357G       0    23.1k       5     128k  exists,up
     9  sky142   201G   245G      11    70.3k       3    46.4k  exists,up
    10  sky142   109G   337G       1    17.1k       1    28.0k  exists,up
    11  sky142  99.1G   347G       0        0       0        0  exists,up
    12  sky143   200G   245G       3    16.0k       4     193k  exists,up
    13  sky143   114G   332G       0     4915       0        0  exists,up
    14  sky143  86.3G   360G       5    41.0k      12     303k  exists,up
    15  sky143  99.3G   347G       2    89.3k       5     129k  exists,up
    16  sky143   128G   317G       0     2457       1       16  exists,up
    17  sky143   143G   303G      13    84.0k       1     7372  exists,up
  • Result: the number of OSDs has changed from 16 to 18.

    sky141:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum sky141,sky142,sky143 (age 8d)
        mgr: sky141(active, since 8d), standbys: sky142, sky143
        mds: 1/1 daemons up, 1 standby, 1 hot standby
        osd: 18 osds: 18 up (since 76m), 18 in (since 76m)
        rgw: 3 daemons active (3 hosts, 1 zones)

      data:
        volumes: 1/1 healthy
        pools:   25 pools, 753 pgs
        objects: 150.06k objects, 786 GiB
        usage:   2.2 TiB used, 5.6 TiB / 7.9 TiB avail
        pgs:     753 active+clean

      io:
        client:   1.3 KiB/s rd, 447 KiB/s wr, 26 op/s rd, 59 op/s wr

    ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
     0  sky141  64.9G   381G       3    30.3k       0        0  exists,up
     1  sky141   180G   266G       8    40.7k       0        0  exists,up
     2  sky141   161G   285G       0     8192       0        0  exists,up
     3  sky141   131G   314G       3    16.0k       0        0  exists,up
     4  sky141  91.1G   355G       1     7372       1        0  exists,up
     5  sky141   130G   316G       0        0       1       90  exists,up
     6  sky142  96.3G   350G       5    23.1k       0        0  exists,up
     7  sky142   165G   281G       0     7372       0        0  exists,up
     8  sky142  76.1G   370G       0        0       1        0  exists,up
     9  sky142   199G   247G       0     6553       0        0  exists,up
    10  sky142   122G   324G       2     9011       0        0  exists,up
    11  sky142  95.5G   351G       2    21.5k       0        0  exists,up
    12  sky143   184G   262G       2    35.1k       1        0  exists,up
    13  sky143  95.7G   350G       0        0       0        0  exists,up
    14  sky143  66.4G   380G       9    44.0k       1        0  exists,up
    15  sky143  92.9G   353G       1     9011       0        0  exists,up
    16  sky143   143G   303G       0     6553       1       16  exists,up
    17  sky143   179G   267G       7    52.0k       1      102  exists,up
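
As a point of comparison, adding the replacement disk with the native Ceph tools depends on how the cluster was deployed. This is a hedged sketch of the two common cases, assuming the new device is /dev/sde on host sky141; the add_disk command above handles this for you.

    # cephadm/orchestrator deployments: list free devices, then create an OSD on the new one
    ceph orch device ls
    ceph orch daemon add osd sky141:/dev/sde

    # Manual (ceph-volume) deployments: run on sky141 to prepare and activate an OSD
    ceph-volume lvm create --data /dev/sde

Note that the storage shell provisions two OSDs per device (OSD 4 and OSD 5 both live on /dev/sde), which the single ceph-volume call above does not reproduce.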

Remove an OSD

Story: One of your nodes has failed to power up for no apparent reason, causing the OSDs hosted by the failed node to go offline. To restore the storage pool from a HEALTH_WARN status as quickly as possible, take action from any active host in your cluster.

  • Connect to one of your live hosts.
  • Start removing the OSDs: use the storage > remove_osd command to remove every failed OSD from the list. A sketch of the native-Ceph alternative follows the example below.
  • Verify the OSDs are removed with the storage > status command.
    s2:storage> remove_osd
    Enter osd id to be removed:
    1:  down (1.81920)
    2:  down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
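
If the failed node is gone for good and you work with the native Ceph tools instead of the storage shell, the down OSDs can be purged from any admin node in a similar way. A minimal sketch, with hypothetical OSD IDs that you would replace with the down entries from your own cluster:

    # Identify the OSDs that are down (the ones hosted by the dead node)
    ceph osd tree | grep down

    # Purge each down OSD: removes it from the CRUSH map and deletes its auth key and OSD entry
    for id in 30 32; do    # hypothetical IDs, substitute your cluster's down OSDs
        ceph osd purge "$id" --yes-i-really-mean-it
    done

    # Once its OSDs are gone, the now-empty host bucket can be removed from the CRUSH map
    ceph osd crush remove <failed-hostname>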