Version: 1.19.0 (latest)

Understanding High Availability

When Enterprise Architects evaluate a new platform for production readiness, the instinct is to ask, "Is this platform highly available?" That question is appropriate for monolithic systems but it is misplaced for Kasm Workspaces.

Kasm is not a monolith. It is a deliberately decomposed set of service roles, each with a different function, different state characteristics, and different failure behaviors. High availability in Kasm is not a setting to enable or a tier to select. It is an outcome produced by independently analyzing each role.

The right questions to ask are:

  • What is the failure mode of each role?
  • What is the impact of that failure?
  • What redundancy mechanism addresses it?

Resilience Is a Role-by-Role Decision, Not a Platform-Wide Switch

Legacy VDI platforms typically run management, session brokering, and delivery functions inside the same process or tier. When that tier fails, everything fails together. Recovery requires the entire system to come back up, and the blast radius of any single failure touches every user simultaneously.

Kasm separates these concerns into five distinct roles: Database, Web App, Agent, Connection Proxy, and Dedicated Proxy. Because these roles are separated, a failure in one does not automatically cascade into all others:

  • An Agent failure interrupts only the sessions running on that Agent. Users on healthy Agents continue working.
  • A Connection Proxy (CPX) failure interrupts only the sessions running through that node. Users on healthy CPX servers continue working.
  • A Web App failure blocks new session creation but does not terminate sessions already streaming through an Agent.
  • A Database failure is the exception: it is the one role whose failure makes the entire platform unavailable, which is precisely why it demands the most serious redundancy investment.
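The failure impacts listed above can be written down as a small lookup, which is sometimes handy in runbooks or incident tooling. This is only an illustrative sketch: the `Role` enum, `BLAST_RADIUS`, and `can_start_new_sessions` are hypothetical names, not part of Kasm, and "failed" here means the whole tier for that role is down rather than a single redundant instance.

```python
from enum import Enum


class Role(Enum):
    DATABASE = "Database"
    WEB_APP = "Web App"
    AGENT = "Agent"
    CONNECTION_PROXY = "Connection Proxy"


# Blast radius of each role's failure, paraphrased from the list above.
BLAST_RADIUS = {
    Role.AGENT: "sessions on that Agent only",
    Role.CONNECTION_PROXY: "sessions through that proxy node only",
    Role.WEB_APP: "new session creation; existing sessions keep streaming",
    Role.DATABASE: "entire platform",
}


def can_start_new_sessions(failed: set[Role]) -> bool:
    """New sessions need both the Web App tier and the Database to be up.

    A failed Agent or Connection Proxy node does not block new sessions,
    because healthy nodes of the same role can still take them.
    """
    return Role.DATABASE not in failed and Role.WEB_APP not in failed
```

For example, `can_start_new_sessions({Role.AGENT})` is `True`, while `can_start_new_sessions({Role.WEB_APP})` is `False`.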

Role-by-Role: Failure Impact and Redundancy

The Database Role: The Most Critical HA Decision

The database holds all persistent state for the entire deployment. When it is unavailable, the Web App API cannot serve any requests, no session can be created, and no administrative operation can complete.

Redundancy Approach | Mechanism | Suitable For
AWS RDS Multi-AZ | Synchronous standby; automatic failover | Cloud deployments on AWS
AWS Aurora PostgreSQL | Multi-primary compatible clustering | High-throughput cloud deployments
Cloud-provider equivalents | Varies by provider | Azure, GCP, OCI managed PostgreSQL
Self-managed replication with Patroni | PostgreSQL streaming replication; automated failover | On-premises and private cloud
Single containerized instance | None | Small teams, development, and Proof-of-Concept

Warning: Kasm ingests logs from all platform components into the database by default. For deployments with hundreds of concurrent sessions, this log volume can reach tens of gigabytes under default retention settings. Organizations running enterprise-scale deployments should forward logs to an external SIEM and reduce Kasm's internal log retention to zero. This both reduces storage requirements and removes I/O load that competes with the operational queries HA depends on.


The Web App Role: Statelessness as the HA Enabler

The Web App role is the easiest tier to make redundant because it was designed to carry no session state of its own. Every piece of state lives in the database. Any Web App instance can serve any request from any user, and the load balancer can route traffic to any healthy instance without coordination between instances.

The recommended minimum for production Web App HA is N+1: one more Web App server than needed to serve peak load. A deployment expecting 200 concurrent sessions requires approximately 2 Web App servers at peak. N+1 means deploying 3, providing both failure tolerance and maintenance flexibility.
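The N+1 arithmetic can be sketched in a few lines. The per-server figure of 100 concurrent sessions is an assumption inferred from the example above (200 sessions needing roughly 2 servers); it is not a published Kasm sizing number, and the function names are illustrative.

```python
import math


def web_app_servers_needed(peak_sessions: int, sessions_per_server: int) -> int:
    """Servers required to carry peak load (the N in N+1), never fewer than 1."""
    if peak_sessions < 0 or sessions_per_server <= 0:
        raise ValueError("peak_sessions must be >= 0 and sessions_per_server > 0")
    return max(1, math.ceil(peak_sessions / sessions_per_server))


def n_plus_one(peak_sessions: int, sessions_per_server: int) -> int:
    """Servers to deploy so that losing one still leaves enough for peak load."""
    return web_app_servers_needed(peak_sessions, sessions_per_server) + 1


# Assumed capacity of 100 sessions per Web App server (hypothetical figure).
print(n_plus_one(200, 100))  # 3
```

Measure the real per-server capacity in your own environment before relying on a figure like this.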

PatternConvergence Speed on FailureBest For
DNS load balancingSlow: 90 to 300 seconds for TTL expiryOutermost redundancy tier
Network load balancerFast: seconds via active health checksProduction HA where fast failover is required

The health check endpoint at /api/__healthcheck is the correct target for both DNS health monitors and load balancer health probes. It validates that the Web App process can successfully reach the database, making it a meaningful end-to-end health signal rather than a shallow process check.
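A health probe against this endpoint might look like the following minimal sketch, using only the Python standard library. It assumes that an HTTP 200 response means "healthy" and that the instance presents a TLS certificate the client trusts; the function name, the `opener` parameter, and the example hostname are illustrative, not part of Kasm.

```python
import urllib.error
import urllib.request
from typing import Callable

HEALTHCHECK_PATH = "/api/__healthcheck"  # endpoint named in the text above


def is_healthy(base_url: str, timeout: float = 3.0,
               opener: Callable = urllib.request.urlopen) -> bool:
    """Return True when the health check endpoint answers with HTTP 200.

    Assumption: any other status, a timeout, or a connection error is
    treated as unhealthy. `opener` is injectable so the probe can be tested
    without a live server.
    """
    url = base_url.rstrip("/") + HEALTHCHECK_PATH
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; timeouts are OSError.
        return False
```

A call such as `is_healthy("https://kasm.example.com")` (hypothetical hostname) could then feed a monitoring script or a custom load balancer check.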


The Agent Role: Horizontal Distribution as Fault Isolation

Agent nodes are where workspace sessions execute. An Agent failure interrupts only the sessions running on that node. Several properties make this failure mode less catastrophic than it might appear:

Ephemerality: Containers are destroyed at session end. A failed Agent takes no persistent user data with it. Users reconnect and receive a fresh container on a healthy Agent.

Automatic rerouting: The Manager's health monitoring continuously tracks Agent check-ins. When an Agent stops responding, the Manager stops routing new sessions to it and schedules them to remaining healthy Agents automatically.
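The check-in based rerouting described above can be illustrated with a simplified model. This is not Kasm's scheduler: the 60-second silence threshold, the `free_slots` field, and the "most free slots wins" placement rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    last_check_in: float  # seconds, on the same clock as `now`
    free_slots: int       # hypothetical measure of remaining capacity


def healthy_agents(agents: list[Agent], now: float,
                   max_silence: float = 60.0) -> list[Agent]:
    """Agents whose most recent check-in is within the allowed silence window."""
    return [a for a in agents if now - a.last_check_in <= max_silence]


def pick_agent(agents: list[Agent], now: float,
               max_silence: float = 60.0) -> Optional[Agent]:
    """Choose the healthy Agent with the most free slots, or None if none fit."""
    candidates = [a for a in healthy_agents(agents, now, max_silence)
                  if a.free_slots > 0]
    return max(candidates, key=lambda a: a.free_slots, default=None)
```

An Agent that stops checking in simply drops out of `healthy_agents`, so new sessions land on the remaining nodes without any explicit failover step.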

AutoScaling: In cloud environments, the Manager can provision new Agent virtual machines when existing Agents reach capacity or when a node fails.

Warning: Configure the Upstream Auth Address explicitly before enabling AutoScaling. New Agent virtual machines must be able to reach the Web App API to register. If the Upstream Auth Address is misconfigured, autoscaled Agents silently fail to join the pool, and capacity does not increase when needed. This is the most common AutoScaling failure mode in enterprise deployments.


The Connection Proxy Role: Selective Redundancy

The Connection Proxy is required only when Kasm serves as a web-native gateway to existing RDP, VNC, or SSH endpoints. Containerized workspace sessions always use the Agent role.

For organizations where access to existing Windows, Linux, and macOS machines is a primary delivery mode, Connection Proxy redundancy follows the same stateless horizontal scaling model as the Web App tier: multiple instances behind a load balancer, with health checks routing traffic away from failed instances.


Failure Domain Isolation Through Zone Design

The Zone model is Kasm's primary mechanism for geographic failure domain isolation, keeping a failure in one region from cascading into another.

The Search Alternate Zones setting determines whether Kasm will route session requests to Agents in a different zone when the assigned zone has no available capacity.

Zone Boundary Type | Search Alternate Zones | Rationale
Geographic preference only | Enable | Users benefit from fallback; no security boundary is crossed
Security enclave (classification, tenant separation) | Disable | Cross-zone fallback would violate the policy the zone boundary enforces

Tip: The default is Enabled, which encodes an implicit assumption that zones are geographic, not regulatory. Organizations deploying zones for compliance or tenant isolation reasons should consciously evaluate whether this default aligns with their security policy before going to production.
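The effect of Search Alternate Zones on session placement can be modeled roughly as follows. This is an illustrative sketch only: the `Zone` class, the `free_slots` field, and modeling the setting as a per-zone attribute are assumptions, not a description of Kasm's internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Zone:
    name: str
    free_slots: int                      # hypothetical capacity measure
    search_alternate_zones: bool = True  # default is Enabled, per the text


def place_session(assigned: Zone, others: list[Zone]) -> Optional[str]:
    """Return the name of the zone that hosts the session, or None."""
    if assigned.free_slots > 0:
        return assigned.name
    if not assigned.search_alternate_zones:
        # Security enclave: refuse the session rather than cross the boundary.
        return None
    for zone in others:
        if zone.free_slots > 0:
            return zone.name
    return None
```

With the setting disabled, a full zone returns `None` even when other zones have capacity, which is the behavior a security enclave needs.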


HA Decision Summary by Role

Role | Redundancy Mechanism | Minimum for Production HA | Key Failure Mode Without HA
Database | Managed service with auto-failover or streaming replication with Patroni | Automatic failover required | Entire platform unavailable
Web App | N+1 behind DNS or network load balancer | 2+ recommended | New session creation blocked; admin UI down
Agent | Multiple per zone; AutoScaling in cloud | 2+ Agents per zone | Capacity reduction; sessions on failed Agent interrupted
Connection Proxy | Multiple instances behind load balancer | Required only for RDP, VNC, SSH workspaces | Legacy protocol sessions terminated
Dedicated Proxy | Multiple instances per zone behind load balancer | Required for multi-zone regional delivery | Sessions in affected zone interrupted

Why Ephemeral Sessions Fundamentally Simplify HA

In legacy VDI, a failed host takes persistent state with it. Virtual machines accumulate data, configuration, and user work over weeks or months. HA requires shared storage, live migration, or backup and restore workflows that trade data-loss risk against infrastructure cost.

In Kasm, a failed Agent takes nothing meaningful with it. Containers are ephemeral: they are created for a session and destroyed at session end. A failed Agent interrupts the session but cannot take persistent user data with it, because there is no persistent user data to take. The HA problem for Agents reduces to: provision enough Agents that losing one does not exhaust capacity, and provision them quickly enough through AutoScaling that sustained failures do not keep capacity depleted.
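The capacity condition in the paragraph above (losing one Agent must not exhaust capacity) reduces to a simple check. The sketch below assumes each Agent's capacity can be expressed as a number of concurrent sessions, which is a simplification; the function name is illustrative.

```python
def survives_largest_agent_loss(agent_capacities: list[int],
                                peak_sessions: int) -> bool:
    """True if peak load still fits after the largest single Agent fails.

    Checking the largest Agent covers the worst single-node failure.
    """
    if not agent_capacities:
        return False
    remaining = sum(agent_capacities) - max(agent_capacities)
    return remaining >= peak_sessions
```

Three Agents of 50 sessions each cover a peak of 100 after one failure, whereas one Agent of 100 plus one of 20 does not, even though the total capacity is higher than the peak.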

The ephemeral state that eliminates persistence-based attacks also eliminates persistence-based HA complexity. Both benefits are consequences of the same architectural decision.


This article is part of the Kasm Workspaces Reference Architecture Explanation Series.