Skip to main content
Version: 2.5

Replacing a Failed Controller Node

In our CubeCOS cluster, one of the controller nodes had a hardware failure and could no longer function. Since controller nodes are critical for managing the cloud, we needed to replace it quickly to keep the system running smoothly.

We removed the failed node from the cluster and prepared a new server with the same hostname and base configuration. After reinstalling the OS, we followed a step-by-step process to restore the CubeCOS services and rejoin the node back into the cluster.

This guide applies to the following roles:

  • control
  • control-converged
  • edge-core
  • moderator

It explains how to:

  • Rebuild the node
  • Reconfigure OpenStack services
  • Rejoin the node to the HA cluster
  • Verify the system is working as expected

Remove osd from storage pool

  1. Connect to any healthy node in the cluster.
  2. Run the remove_osd command to remove each failed OSD.
  3. Repeat the process until all missing OSDs have been removed.
  4. Use status to verify that all offline OSDs are gone.
  cc2:storage> remove_osd
Enter osd id to be removed:
1: osd.0 (hdd)
2: osd.1 (hdd)
3: osd.2 (hdd)
4: osd.3 (hdd)
5: osd.4 (hdd)
6: osd.5 (hdd)
7: osd.6 (hdd)
8: osd.7 (hdd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.0 successfully.

Replace a new node

Choose Advanced option

First Time Setup Options:
1: Wizard
2: Advanced
Enter index: 2

Control node must set the flag to rejoin

Welcome to the Cube Appliance
Enter "help" for a list of available commands
unconfigured> first
unconfigured:first> control_rejoin
Set or clear control rejoin flag?
1: set
2: clear
Enter index: 1
Control rejoin markers set

Pull snapshot from media

unconfigured> snapshot
unconfigured:snapshot> pull
Select a media:
1: usb
2: nfs
Enter index: 1
Insert a USB drive into the USB port on the appliance.
Enter 'YES' to confirm: YES
1: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 1
Copying...
Automatically generated on 2024-12-08 13:04:32
Copy complete. It is safe to remove the USB drive.

Apply the setting

unconfigured:snapshot> apply
1: CUBE_2.4.3_20241208-054308.133092_unconfigured.snapshot
2: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 2
Automatically generated on 2024-12-08 13:04:32
Date/Time is important for applying changes to an unconfigured box.
Please confirm the current time is good.

* Local Time: 12/08/2024 00:46:58 EST

Enter 'YES' to confirm: YES
Policy snapshot file applied

Re-log as admin

unconfigured:snapshot> exit
cc1 Login: admin
Password:
Welcome to the Cube Appliance
Enter "help" for a list of available commands
Notice: your license will expire in 29 days.
Please contact system administrator to renew the license.
cc1>

Switch to manual boot

cc1:boot_mode> manual
Switch to manual bootstrap mode:
1: one-time manual
2: always manual
Enter index: 1
Enter 'YES' to confirm: YES
Exit current CLI and log in again as admin to see changes in effect

Exit Admin CLI and relog

cc1:boot_mode> exit
# su admin
Welcome to the Cube Appliance
Enter "help" for a list of available commands
License (type: trial) is valid for 102 days
cc1>

Cluster Sync

cc1> boot cluster_sync
(1/9) processing: ceph
(2/9) processing: keystone
(3/9) processing: neutron
(4/9) processing: nova
(5/9) processing: cinder
(6/9) processing: manila
(7/9) processing: octavia
(8/9) processing: haproxy
(9/9) processing: telegraf
cluster_sync successfully

Check and Repair services

cc1> cluster check_repair
Service Status Report
ClusterLink ok [ link(v) clock(v) dns(v) ]
ClusterSys ok [ bootstrap(v) license(v) ]
ClusterSettings ok [ etcd(v) ]
HaCluster FIXING [ hacluster(3) ]
ok [ hacluster(f) ]
MsgQueue ok [ rabbitmq(v) ]
IaasDb ok [ mysql(v) ]
VirtualIp ok [ vip(v) haproxy_ha(v) ]
Storage ok [ ceph(v) ceph_mon(v) ceph_mgr(v) ceph_mds(v) ceph_osd(v) ceph_rgw(v) rbd_target(v) ]
ApiService ok [ haproxy(v) httpd(v) lmi(v) memcache(v) ]
SingleSignOn ok [ keycloak(v) ]
Compute FIXING [ nova(8) ]
ok [ nova(f) ]
Baremetal ok [ ironic(v) ]
Network FIXING [ neutron(3) ]
ok [ neutron(f) ]
Image ok [ glance(v) ]
BlockStor ok [ cinder(v) ]
FileStor ok [ manila(v) ]
ObjectStor ok [ swift(v) ]
Orchestration ok [ heat(v) ]
LBaaS ok [ octavia(v) ]
DNSaaS ok [ designate(v) ]
K8SaaS ok [ k3s(v) rancher(v) ]
InstanceHa ok [ masakari(v) ]
BusinessLogic ok [ senlin(v) watcher(v) ]
DataPipe ok [ zookeeper(v) kafka(v) ]
Metrics ok [ monasca(v) telegraf(v) grafana(v) ]
LogAnalytics ok [ filebeat(v) auditbeat(v) logstash(v) es(v) kibana(v) ]
Notifications ok [ influxdb(v) kapacitor(v) ]

switch to root shell and delete the file

rm -f /etc/appliance/state/boot_mode