Version: 1.19.0 (latest)

Recover from quorum loss

When a Raft cluster loses too many members to maintain quorum (fewer than a majority of the configured voting members are reachable), the controllers that remain become read-only and the cluster cannot commit changes. This typically happens when an HA controller goes down hard before it can be cleanly removed from membership: a hardware failure, a terminated VM, or a network partition that does not heal.
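The majority rule can be sketched as a small piece of shell arithmetic. This is only an illustration of the standard Raft majority calculation; the function names quorum_size and has_quorum are hypothetical helpers and are not part of kziti or ziti.

```shell
# Number of voters that must be reachable for a Raft cluster to keep quorum:
# a strict majority of the configured voting members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) if the reachable voters still form a majority.
# $1 = configured voters, $2 = voters currently reachable
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

quorum_size 3   # prints 2
has_quorum 3 1 || echo "quorum lost: only 1 of 3 voters reachable"
```

Note that a cluster with an even number of voters loses quorum when exactly half are down: has_quorum 4 2 fails because 3 voters are required.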

kziti deploy ha recover is the rescue path. It spins up a short-lived ephemeral peer that impersonates the dead member to restore quorum, runs the remove-member step, then cleans up. Edge state — users, services, routers, policies — is preserved.

Warning

This procedure is for recovering from a node that is permanently gone. If the missing node is only temporarily unreachable (network blip, host restart, planned maintenance), wait for it to return rather than running recovery. Removing a node that later rejoins creates split-brain risk.

Prerequisites

You have a working kziti deployment.
At least one controller in the cluster is still reachable; you will run the recovery command on that host.
You know the node name of the dead member (the value of --node-name from when it was originally added).

Confirm quorum is lost

On a surviving controller host, list cluster members:

docker compose -f /opt/kziti/docker-compose.yml exec ziti-controller \
  ziti agent cluster list

Quorum is lost if the dead member appears as voter: true and the cluster is unable to elect a leader. Pure read access (listing identities, services) typically still works; mutating commands fail with timeouts or quorum errors.

Run recovery

On a surviving controller host:

kziti deploy ha recover --node ziti-c-2

--node is the dead member's node name. The command performs the following steps:

Reads cluster state to confirm ziti-c-2 is still a member.
Starts an ephemeral peer container with the same node identity, which takes the dead member's place as a voter.
Waits for the cluster to regain quorum once the ephemeral peer is connected.
Runs remove-member ziti-c-2 against the cluster leader.
The ephemeral peer is torn down.

This typically completes in under a minute.

Verify

docker compose -f /opt/kziti/docker-compose.yml exec ziti-controller \
  ziti agent cluster list

The dead node should be gone from the member list. The cluster should report a healthy leader and the remaining controllers should all be voters.
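If you want to script this check, a minimal sketch follows. It assumes only that the list output contains each member's node name as a separate word and does not rely on the exact output format. The helper name member_listed is hypothetical, not part of kziti or ziti.

```shell
# Succeeds (exit 0) if the node name in $1 appears as a whole word in the
# member list read from stdin. -F treats the name literally, -w avoids
# matching ziti-c-2 inside ziti-c-20.
member_listed() {
  grep -qwF -- "$1"
}

# Usage on a surviving controller host (-T disables TTY allocation so the
# output can be piped):
#   docker compose -f /opt/kziti/docker-compose.yml exec -T ziti-controller \
#     ziti agent cluster list | member_listed ziti-c-2 \
#     && echo "ziti-c-2 is still a member" \
#     || echo "ziti-c-2 has been removed"
```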

Try a mutating operation to confirm write quorum is restored (read-only commands still work even when quorum is lost, so they do not prove recovery):

kziti network create quorum-test "Quorum Test" && \
  kziti network delete quorum-test --yes

If both commands complete without error, the cluster is accepting writes and is back to normal operation.
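The chained command above does not say which half failed. The sketch below wraps the same two commands so that a failed create (writes are still rejected) is reported separately from a failed delete (the test network was left behind). The function name check_write_quorum, its messages, and its exit codes are illustrative assumptions, not part of kziti.

```shell
# Runs the create/delete pair from this page and reports which step failed.
check_write_quorum() {
  if ! kziti network create quorum-test "Quorum Test"; then
    echo "create failed: the cluster is not accepting writes" >&2
    return 1
  fi
  if ! kziti network delete quorum-test --yes; then
    echo "created quorum-test but could not delete it; remove it manually" >&2
    return 2
  fi
  echo "cluster is accepting writes"
}
```

Call check_write_quorum after recovery. A return code of 2 means the write succeeded but the quorum-test network still needs to be cleaned up.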

After recovery

Quorum is restored, but you are now running with one fewer controller than your target. To restore the original HA topology, add a new HA controller on a fresh host (or on the original host once it has been repaired).
