Hard disk replacement

Remove a failed hard disk

Story: If you have found that a hard disk has failed, you need to remove it from your cluster and restore the health of your storage pool. To do so,

  • Check the storage status before you start (a native Ceph equivalent of this check is sketched after the output below).
  • As shown below, 2 OSDs are down due to a failed hard disk: osd.56 and osd.58 on node s3.
    s1:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                2 osds down
                Degraded data redundancy: 1611/44204 objects degraded (3.644%), 106 pgs degraded
    
      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s2, s3
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 58 up, 60 in
        rgw: 3 daemons active
    
      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.8GiB
        usage:   232GiB used, 92.3TiB / 92.6TiB avail
        pgs:     1611/44204 objects degraded (3.644%)
                 4873 active+clean
                 509  active+undersized
                 106  active+undersized+degraded
    
      io:
        client:   0B/s rd, 1.88MiB/s wr, 2.74kop/s rd, 64op/s wr
        recovery: 5B/s, 0objects/s
        cache:    0op/s promote
    
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1597M |  445G |    0   |     0   |    0   |  4096   | exists,up |
    | 1  |  s1  | 1543M |  445G |    0   |   819   |    0   |  1638   | exists,up |
    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
    | 54 |  s3  | 4562M | 1858G |    2   |  64.0k  |    8   |  32.8k  | exists,up |
    | 55 |  s2  | 3784M | 1859G |    6   |   119k  |  429   |  1717k  | exists,up |
    | 56 |  s3  | 3552M | 1859G |    0   |     0   |    0   |     0   |   exists  |
    | 57 |  s2  | 5285M | 1857G |    3   |  49.6k  |   12   |  76.8k  | exists,up |
    | 58 |  s3  | 4921M | 1858G |    0   |     0   |    0   |     0   |   exists  |
    | 59 |  s2  | 3865M | 1859G |    1   |  17.6k  |    2   |  9011   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
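
For reference, if you have shell access to a storage node, the same health check can be made with native Ceph tooling. This is a minimal sketch and assumes the CubeOS storage shell is a front end to a standard Ceph cluster; it is not part of the documented CLI.

    # Overall cluster summary; "N osds down" under HEALTH_WARN points at a failed disk
    ceph -s
    # Detailed health messages, including which OSDs are down
    ceph health detail
    # OSD tree grouped by host; failed OSDs show STATUS "down" under their node
    ceph osd tree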

Remove disk

  • Connect to the host s3.
  • Run the CLI command remove_disk; it shows that /dev/sdj (index 10) is associated with OSD IDs 56 and 58.
  • Remove /dev/sdj from the Ceph pool (the native Ceph steps behind this are sketched after the recovery output below).
  • Remove the hard disk from the node.
    s3:storage> remove_disk
      index          name      size   storage ids
    --
          1      /dev/sda    894.3G         21 23
          2      /dev/sdb    894.3G         25 27
          3      /dev/sdc      3.7T         29 31
          4      /dev/sdd      3.7T         33 35
          5      /dev/sde      3.7T         36 38
          6      /dev/sdf      3.7T         41 42
          7      /dev/sdg      3.7T         44 46
          8      /dev/sdh      3.7T         48 50
          9      /dev/sdi      3.7T         52 54
         10      /dev/sdj      3.7T         56 58
    --
    Enter the index of disk to be removed: 10
    Enter 'YES' to confirm: YES
    Remove disk /dev/sdj successfully.
    
  • Check the status of the storage pool.
  • As shown below, Ceph recovers the data automatically.
    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                276/44228 objects misplaced (0.624%)
                Degraded data redundancy: 908/44228 objects degraded (2.053%), 68 pgs degraded
    
      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 58 osds: 58 up, 58 in; 7 remapped pgs
        rgw: 3 daemons active
    
      data:
        pools:   22 pools, 5488 pgs
        objects: 21.60k objects, 82.9GiB
        usage:   227GiB used, 88.7TiB / 88.9TiB avail
        pgs:     908/44228 objects degraded (2.053%)
                 276/44228 objects misplaced (0.624%)
                 5409 active+clean
                 61   active+recovery_wait+degraded
                 7    active+recovering+degraded
                 4    active+recovering
                 4    active+remapped+backfill_wait
                 3    active+undersized+remapped+backfill_wait
    
      io:
        client:   273KiB/s rd, 34.7KiB/s wr, 318op/s rd, 4op/s wr
        recovery: 135MiB/s, 0keys/s, 37objects/s
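
Under the hood, removing a disk means taking its OSDs out of service and purging them from the cluster so that their data is re-replicated elsewhere. The following is a minimal sketch with native Ceph commands, assuming osd.56 and osd.58 on /dev/sdj as in the example above; remove_disk performs the appropriate steps for you, so this is for reference only.

    # Mark both OSDs on the failed disk out so Ceph rebalances their data
    ceph osd out 56 58
    # Stop the OSD daemons on the node that hosts the disk (run on s3)
    systemctl stop ceph-osd@56 ceph-osd@58
    # Purge each OSD: removes it from the CRUSH map, its auth key, and the OSD map
    ceph osd purge 56 --yes-i-really-mean-it
    ceph osd purge 58 --yes-i-really-mean-it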
    

Results:

  • Wait for a while and check the status again (see the monitoring sketch after the output below).
  • We have successfully removed the failed hard disk and the health status is back to HEALTH_OK.
    s1:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK
    
      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s2, s3
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 58 osds: 58 up, 58 in
        rgw: 3 daemons active
    
      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.8GiB
        usage:   229GiB used, 88.7TiB / 88.9TiB avail
        pgs:     5488 active+clean
    
      io:
        client:   132KiB/s rd, 5.44KiB/s wr, 159op/s rd, 0op/s wr
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1594M |  445G |    0   |     0   |    0   |     0   | exists,up |
    | 1  |  s1  | 1536M |  445G |    0   |     0   |    0   |     0   | exists,up |
    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
    | 54 |  s3  | 4665M | 1858G |    3   |  29.6k  |    0   |     0   | exists,up |
    | 55 |  s2  | 3769M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 57 |  s2  | 5366M | 1857G |    0   |   819   |    0   |     0   | exists,up |
    | 59 |  s2  | 3851M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
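
Recovery time depends on how much data must be re-replicated. If you prefer to follow the progress from a shell rather than re-running status by hand, the sketch below uses standard tools; it assumes shell access to a node and is shown for reference only.

    # Refresh the cluster summary every 10 seconds until health returns to HEALTH_OK
    watch -n 10 ceph -s
    # Or stream cluster events (recovery progress, health changes) continuously
    ceph -w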
    

Add/Replace a New Disk

Story: We have bought a new hard disk to add to our storage pool.

  • Connect to the host where the new hard disk was installed.
  • Add the new disk to the node with the CLI command add_disk (a native Ceph sketch of this step follows at the end of this section).
    s3:storage> add_disk
      index          name      size
    --
          1      /dev/sdj      3.7T
    --
    Found 1 available disk
    Enter the index to add this disk into the pool: 1
    Enter 'YES' to confirm: YES
    Add disk /dev/sdj successfully.
    
  • Wait for a moment; automatic recovery starts.
    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_WARN
                90/44225 objects misplaced (0.204%)
                Degraded data redundancy: 1631/44225 objects degraded (3.688%), 114 pgs degraded
    
      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 60 up, 60 in; 10 remapped pgs
        rgw: 3 daemons active
    
      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.9GiB
        usage:   232GiB used, 92.3TiB / 92.6TiB avail
        pgs:     1631/44225 objects degraded (3.688%)
                 90/44225 objects misplaced (0.204%)
                 5362 active+clean
                 113  active+recovery_wait+degraded
                 9    active+remapped+backfill_wait
                 2    active+recovering
                 1    active+remapped+backfilling
                 1    active+recovering+degraded
    
      io:
        client:   0B/s rd, 78.1KiB/s wr, 15op/s rd, 15op/s wr
        recovery: 40.4MiB/s, 12objects/s
    
  • Result: the number of OSDs has changed from 58 back to 60.
    s3:storage> status
      cluster:
        id:     c6e64c49-09cf-463b-9d1c-b6645b4b3b85
        health: HEALTH_OK
    
      services:
        mon: 3 daemons, quorum s1,s2,s3
        mgr: s1(active), standbys: s3, s2
        mds: cephfs-1/1/1 up  {0=s3=up:active}, 2 up:standby
        osd: 60 osds: 60 up, 60 in
        rgw: 3 daemons active
    
      data:
        pools:   22 pools, 5488 pgs
        objects: 21.59k objects, 82.9GiB
        usage:   231GiB used, 92.3TiB / 92.6TiB avail
        pgs:     5488 active+clean
    
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
    | 0  |  s1  | 1587M |  445G |    0   |     0   |    0   |     0   | exists,up |
    | 1  |  s1  | 1535M |  445G |    0   |     0   |    0   |     0   | exists,up |
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    | 55 |  s2  | 3769M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 56 |  s3  | 3525M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    | 57 |  s2  | 5262M | 1857G |    0   |     0   |    0   |     0   | exists,up |
    | 58 |  s3  | 4895M | 1858G |    0   |     0   |    0   |     0   | exists,up |
    | 59 |  s2  | 3851M | 1859G |    0   |     0   |    0   |     0   | exists,up |
    +----+------+-------+-------+--------+---------+--------+---------+-----------+
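
For reference, the native Ceph workflow for a brand-new disk is to wipe it and create an OSD on it with ceph-volume. The sketch below assumes the new device is /dev/sdj, as above, and is only an approximation: the disk listings show that CubeOS provisions two OSDs per data disk, which add_disk handles for you.

    # Wipe any old partition table, signatures, and LVM metadata from the new device
    ceph-volume lvm zap /dev/sdj --destroy
    # Create a single OSD backed by the whole device (CubeOS itself creates two per disk)
    ceph-volume lvm create --data /dev/sdj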
    

Remove an osd

Story: One of your nodes has failed to power up for no apparent reason, and the OSDs hosted by the failed node have gone offline. You need to recover the storage pool from its HEALTH_WARN status as soon as possible. You can do this from any host in your cluster.

  • Connect to one of your live hosts.
  • Remove the failed OSDs with the CLI command remove_osd, removing every failed OSD shown in the list (a native Ceph equivalent is sketched after the output below).
  • After all failed OSDs have been removed, check your storage health with the status command in the storage CLI.
    s2:storage> remove_osd
    Enter osd id to be removed:
    1:  down (1.81920)
    2:  down (1.81920)
    3: osd.31 (hdd)
    4: osd.35 (hdd)
    5: osd.36 (hdd)
    6: osd.38 (hdd)
    7: osd.41 (hdd)
    8: osd.42 (hdd)
    9: osd.44 (hdd)
    10: osd.46 (hdd)
    11: osd.48 (hdd)
    12: osd.50 (hdd)
    13: osd.52 (hdd)
    14: osd.54 (hdd)
    15: osd.64 (hdd)
    16: osd.65 (hdd)
    17: osd.66 (hdd)
    18: osd.67 (hdd)
    19: osd.68 (hdd)
    20: osd.69 (hdd)
    21: osd.70 (hdd)
    22: osd.71 (hdd)
    23: osd.72 (hdd)
    24: osd.73 (hdd)
    25: osd.74 (hdd)
    26: osd.75 (hdd)
    27: osd.76 (hdd)
    28: osd.77 (hdd)
    29: osd.78 (hdd)
    30: osd.79 (hdd)
    31: osd.21 (ssd)
    32: osd.23 (ssd)
    33: osd.25 (ssd)
    34: osd.27 (ssd)
    35: osd.60 (ssd)
    36: osd.61 (ssd)
    37: osd.62 (ssd)
    38: osd.63 (ssd)
    Enter index: 1
    Enter 'YES' to confirm: YES
    Remove osd.31 successfully.
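
For reference, purging an OSD whose host is dead can also be done with native Ceph commands from any surviving node; remove_osd wraps this for each OSD you select from the list. The sketch below uses osd.31 purely as a placeholder ID.

    # Purge a dead OSD: removes it from the CRUSH map, deletes its auth key,
    # and removes it from the OSD map in one step (Luminous and later)
    ceph osd purge 31 --yes-i-really-mean-it
    # Equivalent older-style sequence, shown for completeness
    ceph osd crush remove osd.31
    ceph auth del osd.31
    ceph osd rm 31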
    

prepare_disk

Story: You have a failed hard disk and wish to remove it from your storage pool, but you cannot unplug the disk from the node. prepare_disk removes the disk from the storage pool and deletes its partition table without any physical removal, so the disk is permanently removed from the pool even though it stays in the server.

  • Connect to the host.
  • Remove the disk with the CLI command prepare_disk (the wipe step is sketched after the output below).
    s3:storage> prepare_disk
      index          name      size   storage ids
    --
          1      /dev/sda    894.3G         21 23
          2      /dev/sdb    894.3G         25 27
          3      /dev/sdc      3.7T         29 31
          4      /dev/sdd      3.7T         33 35
          5      /dev/sde      3.7T         36 38
          6      /dev/sdf      3.7T         41 42
          7      /dev/sdg      3.7T         44 46
          8      /dev/sdh      3.7T         48 50
          9      /dev/sdi      3.7T         52 54
         10      /dev/sdj      3.7T         56 58
    --
    Enter the index of disk to be removed: 10
    Enter 'YES' to confirm: YES
    Remove disk /dev/sdj successfully.
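
For reference, the "delete the partition table" part corresponds to wiping the device so it cannot be re-activated as an OSD. The sketch below uses standard tools and assumes the device is /dev/sdj; prepare_disk does this for you, and these commands are destructive.

    # Remove filesystem, LVM, and RAID signatures from the device
    wipefs --all /dev/sdj
    # Destroy the GPT and MBR partition tables
    sgdisk --zap-all /dev/sdj
    # ceph-volume can do the same and also tear down any LVM volumes it created
    ceph-volume lvm zap /dev/sdj --destroy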
    