Event site availability issues

April 21, 2026 at 8:34 PM UTC

Resolved after 40m of service disruption

Resolved - Event sites returning to full availability. Post-incident review underway. (Apr 21, 2026)

Monitoring - Mitigation applied. A small number of short timeouts continue to be observed as origin capacity re-balances. (Apr 21, 2026)

Investigating - Event sites are intermittently returning gateway timeouts for visitors. Registration, channels, and Zoom integration remain unaffected on healthy routes. (Apr 21, 2026)

Post-incident report

Published (Apr 22, 2026)

At 8:34 PM UTC on April 21, 2026, we were notified of a critical outage affecting event sites hosted in our Google Cloud us-central1 (Iowa) region. Our on-call team immediately engaged senior engineering and our infrastructure partners to restore access to event sites.

The primary outage window lasted 40 minutes. Shorter recurrences were observed in the following hours as origin capacity re-balanced, and were resolved without further intervention. Registration, channels, and Zoom integration continued to function on healthy routes throughout.

This was classified as a tier-3 incident. Mitigations were deployed on affected sites to minimize service disruption during the recovery window.

As this is one of the longest and broadest service disruptions in Jumbo’s history, we are reviewing our backup and failover planning to ensure faster recovery from future incidents of this nature.

Incident detail from our infrastructure partners

The following root-cause analysis was provided by our infrastructure partners at Google Cloud following their review of the affected hosting node.

The outage was traced to a single hosting node that entered a high-load state, with web-server worker processes caught in a repeated out-of-memory restart loop that saturated the node’s CPU. That resource pressure starved the system threads responsible for moving network traffic, and the node’s transmit queue stalled. The operating system’s network watchdog detected the stall and automatically reset the node’s virtual network adapter.

During that reset, the adapter was placed into a protective “no-interrupt” state. A known race condition in the network driver shipped with the node’s operating system (Linux kernel 6.8) meant the driver failed to bring the adapter back out of that state. With interrupts disabled, incoming traffic filled the adapter’s receive buffers until every queue was full, at which point the node dropped all inbound connections to the sites it was hosting.
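The failure sequence described above can be illustrated with a toy model. This is a minimal sketch, not the actual driver code: the class `ToyAdapter`, the function `simulate`, the single receive ring, and the ring size of 4 are all hypothetical simplifications chosen for illustration. It shows only the general mechanism: if interrupts are not re-enabled after a reset, nothing drains the receive ring, so it fills and every later packet is dropped.

```python
from collections import deque


class ToyAdapter:
    """Toy model of one receive ring on a network adapter (illustrative only)."""

    def __init__(self, ring_size):
        self.ring = deque()
        self.ring_size = ring_size
        self.interrupts_enabled = True
        self.delivered = 0
        self.dropped = 0

    def receive(self, packet):
        # A full ring has nowhere to put the packet, so it is dropped.
        if len(self.ring) >= self.ring_size:
            self.dropped += 1
            return
        self.ring.append(packet)
        # With interrupts enabled, the host is notified and drains the ring.
        if self.interrupts_enabled:
            self.drain()

    def drain(self):
        while self.ring:
            self.ring.popleft()
            self.delivered += 1

    def reset(self, reenable_interrupts):
        # The reset places the adapter in a "no-interrupt" state.
        self.interrupts_enabled = False
        # A healthy reset path re-enables interrupts afterwards; the
        # reported defect corresponds to this step being skipped.
        if reenable_interrupts:
            self.interrupts_enabled = True
            self.drain()


def simulate(reenable_interrupts, ring_size=4, packets=10):
    """Return (delivered, dropped) for packets arriving after a reset."""
    adapter = ToyAdapter(ring_size)
    adapter.reset(reenable_interrupts)
    for n in range(packets):
        adapter.receive(n)
    return adapter.delivered, adapter.dropped
```

In this model, `simulate(True)` delivers every packet, while `simulate(False)` delivers none: the first four packets sit in the ring and the remaining six are dropped. A real adapter has multiple queues and far larger buffers, but the outcome is the same once every queue is full.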

The node returned to service when an unrelated internal service restarted and forced the network stack to reinitialize, which cleared the stuck state and allowed traffic to resume.

To prevent recurrence, our partners identified the following measures:

Tuning the worker-process memory limits that allowed the node to enter the runaway-load state, removing the condition that triggers the fault.
Deploying an updated, out-of-tree version of the network driver (v1.4.2 or later) that forces the adapter back into service after a reset, as a near-term safeguard.
Upgrading the cluster to a newer platform version (GKE 1.35 or later) whose operating system kernel resolves the underlying driver defect permanently.
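As an illustration of the first measure, the sketch below checks whether a set of worker memory limits fits within a node's memory once a reserve is set aside for the operating system. It is an assumption about what such tuning might involve, not the partners' actual procedure: the function name `limits_fit_node` and all figures in the usage note are hypothetical.

```python
def limits_fit_node(node_memory_mib, reserved_mib, worker_count, worker_limit_mib):
    """Return True if all workers at their limit still leave the reserve free.

    All arguments are in MiB except worker_count. The reserve stands in for
    memory kept back for the operating system and networking (hypothetical).
    """
    if worker_count < 0 or worker_limit_mib < 0:
        raise ValueError("worker_count and worker_limit_mib must be non-negative")
    # Worst case: every worker grows to its configured limit at the same time.
    worst_case_mib = worker_count * worker_limit_mib
    return worst_case_mib <= node_memory_mib - reserved_mib
```

For example, with a hypothetical 16384 MiB node and a 2048 MiB reserve, 8 workers limited to 1536 MiB each fit (12288 MiB in the worst case), whereas 8 workers limited to 2048 MiB each do not (16384 MiB).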