Recover from quorum loss
When a Raft cluster loses too many members to maintain quorum (fewer than a majority of the configured voting members are reachable), the controllers that remain become read-only: the cluster cannot elect a leader or commit changes. This typically happens when an HA controller goes down hard before it can be cleanly removed from membership, for example through hardware failure, a terminated VM, or a network partition that does not heal.
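Raft needs a strict majority of voting members to commit a change, so the number of failures a cluster can absorb follows directly from its size. The sketch below is only an arithmetic illustration; the function names `quorum_size` and `tolerated_failures` are hypothetical helpers, not part of kziti or ziti.

```shell
#!/bin/sh
# Illustrative arithmetic only; these helpers are not kziti commands.

# Smallest number of voters that forms a majority of n: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Voters that can be lost while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  printf '%s voters: quorum=%s, tolerates %s failure(s)\n' \
    "$n" "$(quorum_size "$n")" "$(tolerated_failures "$n")"
done
```

Note that a two-member cluster tolerates no failures, and a four-member cluster tolerates no more than a three-member one.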
kziti deploy ha recover is the rescue path. It spins up a short-lived ephemeral peer that impersonates the dead member to restore quorum, runs the remove-member step, then cleans up. Edge state — users, services, routers, policies — is preserved.
This procedure is for recovering from a node that is permanently gone. If the missing node is only temporarily unreachable (network blip, host restart, planned maintenance), wait for it to return rather than running recovery. Removing a node that later rejoins creates split-brain risk.
Prerequisites
- You have a working kziti deployment.
- At least one controller in the cluster is still reachable; you will run the recovery command on that host.
- You know the node name of the dead member (the value of --node-name from when it was originally added).
Confirm quorum is lost
On a surviving controller host, list cluster members:
docker compose -f /opt/kziti/docker-compose.yml exec ziti-controller \
ziti agent cluster list
Quorum is lost if the dead member appears as voter: true and the cluster is unable to elect a leader. Pure read access (listing identities, services) typically still works; mutating commands fail with timeouts or quorum errors.
Run recovery
On a surviving controller host:
kziti deploy ha recover --node ziti-c-2
--node is the dead member's node name. The command performs the following steps:
- Reads cluster state to confirm ziti-c-2 is still a member.
- Starts an ephemeral peer container with the same node identity, joining as a non-voter.
- The cluster regains quorum once the ephemeral peer is connected.
- remove-member ziti-c-2 runs against the cluster leader.
- The ephemeral peer is torn down.
This typically completes in under a minute.
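Because recovery permanently removes a member, it may be worth guarding the command against a mistyped node name. The following is a minimal sketch of such a guard, not something kziti provides: the function name `recover_dead_node` and the `KZITI` override variable are hypothetical, and the only invocation taken from this guide is `kziti deploy ha recover --node`.

```shell
#!/bin/sh
# Hypothetical wrapper; only "kziti deploy ha recover --node" comes from this guide.

recover_dead_node() {
  target_node="$1"
  if [ -z "$target_node" ]; then
    echo "usage: recover_dead_node <node-name>" >&2
    return 2
  fi

  # Require the operator to retype the node name before anything is changed.
  printf 'Permanently remove %s from the cluster? Retype the node name to confirm: ' \
    "$target_node" >&2
  typed_name=""
  read -r typed_name
  if [ "$typed_name" != "$target_node" ]; then
    echo "Confirmation did not match; nothing was changed." >&2
    return 1
  fi

  # KZITI is an assumed override for the binary path; it defaults to "kziti".
  "${KZITI:-kziti}" deploy ha recover --node "$target_node"
}
```

With the function sourced on a surviving controller host, `recover_dead_node ziti-c-2` would prompt for the name and run recovery only if the typed value matches.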
Verify
docker compose -f /opt/kziti/docker-compose.yml exec ziti-controller \
ziti agent cluster list
The dead node should be gone from the member list. The cluster should report a healthy leader and the remaining controllers should all be voters.
Try a mutating operation to confirm write quorum is restored (read-only commands still work even when quorum is lost, so they do not prove recovery):
kziti network create quorum-test "Quorum Test" && \
kziti network delete quorum-test --yes
If both commands complete without error, the cluster is accepting writes and is back to normal operation.
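Since mutating commands on a cluster without quorum tend to hang until they time out, the write check can be wrapped so that it fails fast. This is a sketch under assumptions: it relies on `timeout` from GNU coreutils being installed, the 30-second bound is an arbitrary choice, and `probe_write_quorum`, `PROBE_TIMEOUT`, and `KZITI` are hypothetical names. The two `kziti network` commands are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper; assumes GNU coreutils "timeout" is available.

probe_write_quorum() {
  probe_timeout="${PROBE_TIMEOUT:-30}"   # seconds allowed per command (arbitrary default)
  kziti_bin="${KZITI:-kziti}"            # assumed override for the binary path

  # Create and then delete a throwaway network; both are writes that need quorum.
  if timeout "$probe_timeout" "$kziti_bin" network create quorum-test "Quorum Test" &&
     timeout "$probe_timeout" "$kziti_bin" network delete quorum-test --yes; then
    echo "write quorum OK"
  else
    # If create succeeded but delete failed, the quorum-test network may be left behind.
    echo "write probe failed or timed out" >&2
    return 1
  fi
}
```

Calling `probe_write_quorum` returns a non-zero status if either command fails or exceeds the time limit, which makes it usable in a script or health check.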
After recovery
Quorum is restored, but you are now running with one fewer controller than your target. To restore the original HA topology, add a new HA controller on a fresh host (or on the recovered host if you have repaired it).