Version: 2.5

Replacing a Failed Controller Node

In our CubeCOS cluster, one of the controller nodes had a hardware failure and could no longer function. Since controller nodes are critical for managing the cloud, we needed to replace it quickly to keep the system running smoothly.

We removed the failed node from the cluster and prepared a new server with the same hostname and base configuration. After reinstalling the OS, we followed a step-by-step process to restore the CubeCOS services and rejoin the node back into the cluster.

This guide applies to the following roles:

control
control-converged
edge-core
moderator

It explains how to:

Rebuild the node
Reconfigure OpenStack services
Rejoin the node to the HA cluster
Verify the system is working as expected

Remove osd from storage pool

Connect to any healthy node in the cluster.
Run the remove_osd command to remove each failed OSD.
Repeat the process until all missing OSDs have been removed.
Use status to verify that all offline OSDs are gone.

  cc2:storage> remove_osd
  Enter osd id to be removed:
  1: osd.0 (hdd)
  2: osd.1 (hdd)
  3: osd.2 (hdd)
  4: osd.3 (hdd)
  5: osd.4 (hdd)
  6: osd.5 (hdd)
  7: osd.6 (hdd)
  8: osd.7 (hdd)
  Enter index: 1
  Enter 'YES' to confirm: YES
  Remove osd.0 successfully.

Replace a new node

Choose `Advanced` option

First Time Setup Options:
1: Wizard
2: Advanced
Enter index: 2

Control node must set the flag to rejoin

Welcome to the Cube Appliance
Enter "help" for a list of available commands
unconfigured> first
unconfigured:first> control_rejoin
Set or clear control rejoin flag?
1: set
2: clear
Enter index: 1
Control rejoin markers set

Pull snapshot from media

unconfigured> snapshot
unconfigured:snapshot> pull
Select a media:
1: usb
2: nfs
Enter index: 1
Insert a USB drive into the USB port on the appliance.
Enter 'YES' to confirm: YES
1: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 1
Copying...
Automatically generated on 2024-12-08 13:04:32
Copy complete. It is safe to remove the USB drive.

Apply the setting

unconfigured:snapshot> apply
1: CUBE_2.4.3_20241208-054308.133092_unconfigured.snapshot
2: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 2
Automatically generated on 2024-12-08 13:04:32
Date/Time is important for applying changes to an unconfigured box.
Please confirm the current time is good.

  * Local Time: 12/08/2024 00:46:58 EST

Enter 'YES' to confirm: YES
Policy snapshot file applied

Re-log as admin

unconfigured:snapshot> exit
cc1 Login: admin
Password:
Welcome to the Cube Appliance
Enter "help" for a list of available commands
Notice: your license will expire in 29 days.
        Please contact system administrator to renew the license.
cc1>

Switch to manual boot

cc1:boot_mode> manual
Switch to manual bootstrap mode:
1: one-time manual
2: always manual
Enter index: 1
Enter 'YES' to confirm: YES
Exit current CLI and log in again as admin to see changes in effect

Exit Admin CLI and relog

cc1:boot_mode> exit
# su admin
Welcome to the Cube Appliance
Enter "help" for a list of available commands
License (type: trial) is valid for 102 days
cc1>

Cluster Sync

cc1> boot cluster_sync
(1/9) processing: ceph
(2/9) processing: keystone
(3/9) processing: neutron
(4/9) processing: nova
(5/9) processing: cinder
(6/9) processing: manila
(7/9) processing: octavia
(8/9) processing: haproxy
(9/9) processing: telegraf
cluster_sync successfully

Check and Repair services

cc1> cluster check_repair
          Service  Status  Report
      ClusterLink      ok  [ link(v) clock(v) dns(v) ]
       ClusterSys      ok  [ bootstrap(v) license(v) ]
  ClusterSettings      ok  [ etcd(v) ]
        HaCluster  FIXING  [ hacluster(3) ]
                       ok  [ hacluster(f) ]
         MsgQueue      ok  [ rabbitmq(v) ]
           IaasDb      ok  [ mysql(v) ]
        VirtualIp      ok  [ vip(v) haproxy_ha(v) ]
          Storage      ok  [ ceph(v) ceph_mon(v) ceph_mgr(v) ceph_mds(v) ceph_osd(v) ceph_rgw(v) rbd_target(v) ]
       ApiService      ok  [ haproxy(v) httpd(v) lmi(v) memcache(v) ]
     SingleSignOn      ok  [ keycloak(v) ]
          Compute  FIXING  [ nova(8) ]
                       ok  [ nova(f) ]
        Baremetal      ok  [ ironic(v) ]
          Network  FIXING  [ neutron(3) ]
                       ok  [ neutron(f) ]
            Image      ok  [ glance(v) ]
        BlockStor      ok  [ cinder(v) ]
         FileStor      ok  [ manila(v) ]
       ObjectStor      ok  [ swift(v) ]
    Orchestration      ok  [ heat(v) ]
            LBaaS      ok  [ octavia(v) ]
           DNSaaS      ok  [ designate(v) ]
           K8SaaS      ok  [ k3s(v) rancher(v) ]
       InstanceHa      ok  [ masakari(v) ]
    BusinessLogic      ok  [ senlin(v) watcher(v) ]
         DataPipe      ok  [ zookeeper(v) kafka(v) ]
          Metrics      ok  [ monasca(v) telegraf(v) grafana(v) ]
     LogAnalytics      ok  [ filebeat(v) auditbeat(v) logstash(v) es(v) kibana(v) ]
    Notifications      ok  [ influxdb(v) kapacitor(v) ]

switch to root shell and delete the file

rm -f /etc/appliance/state/boot_mode

Remove osd from storage pool​

Replace a new node​

Choose Advanced option​

Control node must set the flag to rejoin​

Pull snapshot from media​

Apply the setting​

Re-log as admin​

Switch to manual boot​

Exit Admin CLI and relog​

Cluster Sync​

Check and Repair services​

switch to root shell and delete the file​