Replacing a Failed Controller Node
In our CubeCOS cluster, one of the controller nodes had a hardware failure and could no longer function. Since controller nodes are critical for managing the cloud, we needed to replace it quickly to keep the system running smoothly.
We removed the failed node from the cluster and prepared a new server with the same hostname and base configuration. After reinstalling the OS, we followed a step-by-step process to restore the CubeCOS services and rejoin the node back into the cluster.
This guide applies to the following roles:
- control
- control-converged
- edge-core
- moderator
It explains how to:
- Rebuild the node
- Reconfigure OpenStack services
- Rejoin the node to the HA cluster
- Verify the system is working as expected
Remove osd from storage pool
- Connect to any healthy node in the cluster.
- Run the remove_osd command to remove each failed OSD.
- Repeat the process until all missing OSDs have been removed.
- Use status to verify that all offline OSDs are gone.
cc2:storage> remove_osd
Enter osd id to be removed:
1: osd.0 (hdd)
2: osd.1 (hdd)
3: osd.2 (hdd)
4: osd.3 (hdd)
5: osd.4 (hdd)
6: osd.5 (hdd)
7: osd.6 (hdd)
8: osd.7 (hdd)
Enter index: 1
Enter 'YES' to confirm: YES
Remove osd.0 successfully.
Replace a new node
Choose Advanced
option
First Time Setup Options:
1: Wizard
2: Advanced
Enter index: 2
Control node must set the flag to rejoin
Welcome to the Cube Appliance
Enter "help" for a list of available commands
unconfigured> first
unconfigured:first> control_rejoin
Set or clear control rejoin flag?
1: set
2: clear
Enter index: 1
Control rejoin markers set
Pull snapshot from media
unconfigured> snapshot
unconfigured:snapshot> pull
Select a media:
1: usb
2: nfs
Enter index: 1
Insert a USB drive into the USB port on the appliance.
Enter 'YES' to confirm: YES
1: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 1
Copying...
Automatically generated on 2024-12-08 13:04:32
Copy complete. It is safe to remove the USB drive.
Apply the setting
unconfigured:snapshot> apply
1: CUBE_2.4.3_20241208-054308.133092_unconfigured.snapshot
2: CUBE_2.4.3_20241208-130432.765318_cc1.snapshot
Enter index: 2
Automatically generated on 2024-12-08 13:04:32
Date/Time is important for applying changes to an unconfigured box.
Please confirm the current time is good.
* Local Time: 12/08/2024 00:46:58 EST
Enter 'YES' to confirm: YES
Policy snapshot file applied
Re-log as admin
unconfigured:snapshot> exit
cc1 Login: admin
Password:
Welcome to the Cube Appliance
Enter "help" for a list of available commands
Notice: your license will expire in 29 days.
Please contact system administrator to renew the license.
cc1>
Switch to manual boot
cc1:boot_mode> manual
Switch to manual bootstrap mode:
1: one-time manual
2: always manual
Enter index: 1
Enter 'YES' to confirm: YES
Exit current CLI and log in again as admin to see changes in effect
Exit Admin CLI and relog
cc1:boot_mode> exit
# su admin
Welcome to the Cube Appliance
Enter "help" for a list of available commands
License (type: trial) is valid for 102 days
cc1>
Cluster Sync
cc1> boot cluster_sync
(1/9) processing: ceph
(2/9) processing: keystone
(3/9) processing: neutron
(4/9) processing: nova
(5/9) processing: cinder
(6/9) processing: manila
(7/9) processing: octavia
(8/9) processing: haproxy
(9/9) processing: telegraf
cluster_sync successfully
Check and Repair services
cc1> cluster check_repair
Service Status Report
ClusterLink ok [ link(v) clock(v) dns(v) ]
ClusterSys ok [ bootstrap(v) license(v) ]
ClusterSettings ok [ etcd(v) ]
HaCluster FIXING [ hacluster(3) ]
ok [ hacluster(f) ]
MsgQueue ok [ rabbitmq(v) ]
IaasDb ok [ mysql(v) ]
VirtualIp ok [ vip(v) haproxy_ha(v) ]
Storage ok [ ceph(v) ceph_mon(v) ceph_mgr(v) ceph_mds(v) ceph_osd(v) ceph_rgw(v) rbd_target(v) ]
ApiService ok [ haproxy(v) httpd(v) lmi(v) memcache(v) ]
SingleSignOn ok [ keycloak(v) ]
Compute FIXING [ nova(8) ]
ok [ nova(f) ]
Baremetal ok [ ironic(v) ]
Network FIXING [ neutron(3) ]
ok [ neutron(f) ]
Image ok [ glance(v) ]
BlockStor ok [ cinder(v) ]
FileStor ok [ manila(v) ]
ObjectStor ok [ swift(v) ]
Orchestration ok [ heat(v) ]
LBaaS ok [ octavia(v) ]
DNSaaS ok [ designate(v) ]
K8SaaS ok [ k3s(v) rancher(v) ]
InstanceHa ok [ masakari(v) ]
BusinessLogic ok [ senlin(v) watcher(v) ]
DataPipe ok [ zookeeper(v) kafka(v) ]
Metrics ok [ monasca(v) telegraf(v) grafana(v) ]
LogAnalytics ok [ filebeat(v) auditbeat(v) logstash(v) es(v) kibana(v) ]
Notifications ok [ influxdb(v) kapacitor(v) ]
switch to root shell and delete the file
rm -f /etc/appliance/state/boot_mode