Resolved

Incident Report: May 19, 2026 - Upstream Provider GCP Issue

May 19, 2026 at 10:10pm UTC

Affected services

Command Bridge CAD Interface

Resolved
May 20, 2026 at 2:00pm UTC

This report reflects what is known at the time of publication and may be updated pending Google Cloud's internal review.

One of our upstream service providers experienced a platform-wide service disruption after Google Cloud incorrectly placed their account in a suspended status. This resulted in a temporary loss of service for all GCP-hosted infrastructure, which supports their dashboard, API, and parts of their network. As cached network routes expired, the outage extended beyond GCP to affect all provider workloads.

Impact

On May 19, 2026 between 22:20 UTC and approximately 06:14 UTC on May 20 (~8 hours), our provider experienced a platform-wide outage after Google Cloud suspended services on their production account. This took their API, control plane and databases offline, along with compute infrastructure hosted on Google Cloud.

While workloads on Metal and AWS burst-cloud environments remained up, edge proxies rely on a Google Cloud-hosted control plane API to populate their routing tables, which caused the outage to cascade beyond Google Cloud. As the route caches expired, these other workloads became unreachable and requests returned 404 errors, because the network control plane could no longer resolve routes to active instances. At peak impact, all workloads across all regions were unreachable.

This provider hosts our primary CAD Interface infrastructure. As a result, requests to the interface began failing as routes were lost.

We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage, and detail below what happened, how we recovered, and the changes we are making to prevent this from happening again.

Incident Timeline
* May 19, 22:10 UTC - Automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
* May 19, 22:11 UTC - [Provider] Dashboard returning 503 errors. Users unable to log in.
* May 19, 22:19 UTC - [Provider] Root cause identified: Google Cloud Platform has suspended provider's production account.
* May 19, 22:22 UTC - [Provider] P0 ticket filed with Google Cloud. GCP account manager engaged directly.
* May 19, 22:29 UTC - [Provider] Incident declared by upstream provider.
* May 19, 22:29 UTC - [Provider] GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
* May 19, 22:35 UTC - [Provider] Cached network routes began expiring; workloads on Metal and AWS began returning 404 errors as the networking could no longer resolve routes.
* May 19, 22:42 UTC - Backup interface booted on local servers and DNS rerouted to secondary systems.
* May 19, 22:45 UTC - Backup interface system fully online. Command Bridge reporting 100% interface function on secondary systems.
* May 19, 23:09 UTC - [Provider] First persistent disk comes back online at provider.
* May 19, 23:54 UTC - [Provider] All persistent disks restored to ready state. Network still down.
* May 20, 00:39 UTC - [Provider] Disks confirmed ready. Recovery blocked on Google Cloud networking restoration.
* May 20, 01:30 UTC - [Provider] Compute instances began recovering.
* May 20, 01:38 UTC - [Provider] Edge traffic being served again. Networking restored.
* May 20, 02:04 UTC - [Provider] Compute hosts being brought back online incrementally.
* May 20, 06:14 UTC - [Provider] Incident moved to monitoring.
* May 20, 07:58 UTC - [Provider] Incident is resolved.
* May 20, 15:00 UTC - Moved back to primary interface infrastructure once provider was confirmed to be stable.
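
The key intervals implied by this timeline can be checked with a short calculation. This is only an illustrative sketch: the timestamps are copied from the entries in this report, while the helper names (`utc`, `minutes_between`) are our own and not part of any tooling referenced here.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from this report.
detected = utc(19, "22:10")             # monitoring paged on-call
suspended = utc(19, "22:20")            # start of provider outage window
backup_online = utc(19, "22:45")        # backup interface fully online
provider_monitoring = utc(20, "06:14")  # provider moved incident to monitoring
back_on_primary = utc(20, "15:00")      # reverted to primary infrastructure

time_to_failover = minutes_between(detected, backup_online)
provider_outage = minutes_between(suspended, provider_monitoring)
time_on_secondary = minutes_between(backup_online, back_on_primary)

print(f"Detection to backup online: {time_to_failover} min")
print(f"Provider outage window: {provider_outage // 60} h {provider_outage % 60} min")
print(f"Time on secondary systems: {time_on_secondary // 60} h {time_on_secondary % 60} min")
```

The provider outage window works out to just under eight hours, which matches the "~8 hours" figure in the Impact section.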

What Happened?
At 22:20 UTC on May 19, an automated action at Google Cloud incorrectly placed our upstream provider's production account into a suspended status. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.

This suspended status disabled GCP-hosted infrastructure, which supports the Dashboard, API, and parts of the Network infrastructure, along with additional burst-compute infrastructure hosted on Google Cloud.

Edge proxies maintain a cache of routing tables from the network control plane, which is hosted within Google Cloud. Once the cache expired, the edge could no longer resolve routes to active instances, and workloads across all regions began returning 404 errors. This caused the outage to cascade beyond Google Cloud into the Metal and AWS regions, even though the workloads there remained online.
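
The failure mode described above can be illustrated with a minimal sketch of a TTL-based route cache. This is not the provider's implementation: the class and function names, the hostname, the backend name, and the one-hour TTL are all hypothetical and chosen only to show why healthy workloads become unreachable once cached routes expire and the control plane cannot be reached.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control plane client when it cannot be reached."""


class EdgeRouteCache:
    """Caches host -> backend routes fetched from a control plane, with a fixed TTL."""

    def __init__(
        self,
        fetch_route: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float],
    ) -> None:
        self._fetch_route = fetch_route
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (backend, expires_at)

    def resolve(self, host: str) -> Tuple[int, Optional[str]]:
        """Return (HTTP status, backend). Serves from cache until the entry expires."""
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return 200, entry[0]
        try:
            backend = self._fetch_route(host)
        except ControlPlaneUnavailable:
            # The workload may still be running, but the edge cannot find it.
            self._entries.pop(host, None)
            return 404, None
        self._entries[host] = (backend, now + self._ttl)
        return 200, backend


def simulate() -> List[int]:
    """Replay the sequence: healthy, control plane lost, cache expiry."""
    state = {"now": 0.0, "control_plane_up": True}

    def fetch_route(host: str) -> str:
        if not state["control_plane_up"]:
            raise ControlPlaneUnavailable(host)
        return "metal-instance-1"  # hypothetical backend name

    cache = EdgeRouteCache(fetch_route, ttl_seconds=3600, clock=lambda: state["now"])
    statuses = [cache.resolve("cad.example")[0]]       # route fetched and cached
    state["control_plane_up"] = False                  # control plane goes offline
    state["now"] = 1800.0
    statuses.append(cache.resolve("cad.example")[0])   # still served from cache
    state["now"] = 3601.0
    statuses.append(cache.resolve("cad.example")[0])   # cache expired, no route
    return statuses


print(simulate())
```

In this sketch the backend never changes or stops; only the edge's ability to look it up is lost, which mirrors how workloads on Metal and AWS stayed online yet returned 404 errors.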

Our provider's infrastructure is designed for high availability. Databases run across multiple availability zones, and the network uses redundant connections between AWS, GCP, and Metal. However, restoring account access did not restore these individual services. Persistent disks, compute instances, and networking each required separate recovery, which extended the outage by several hours. Disks were restored to a ready state by 23:54 UTC, but core networking and edge routing did not fully restore until approximately 01:30 UTC on May 20. (We are awaiting confirmation of whether this delay and the associated errors originated on Google’s side.)

By approximately 04:00 UTC on May 20, endpoints were confirmed operational. Once we had confirmation that the provider was stable, we incrementally reverted to the primary interface infrastructure at 15:00 UTC on May 20.

Preventative Measures

Our provider’s network control plane is designed for resilience. It is a multi-AZ, multi-zone control plane that can tolerate the loss of multiple machines and components while still functioning with zero user impact. This has been tested in both staging and live traffic.

The network is a mesh ring, built from high availability fiber interconnects between Metal <> GCP <> AWS. However, this ring still had a hard dependency: workload discoverability was tied to the network control plane API, which was hosted on machines running in Google Cloud. This meant that although the mesh continued to operate for an hour, it could not re-populate the routing tables once the route cache expired.

Our provider is working immediately to remove this dependency and make this a true mesh, so that if any of the interconnects go out, there is always a path between the clouds.

As part of this work, the high availability database shards will be extended across AWS and Metal. In the future, should all instances in a particular cloud disappear instantly, database quorum will keep everything running and immediately fail over any workloads that are no longer running.
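
To show why spreading replicas across clouds matters, here is a minimal sketch that assumes a simple majority-quorum model. The provider's actual database and its quorum protocol are not specified in this report, so the function name and replica layouts below are purely illustrative.

```python
from typing import List


def has_quorum_after_loss(replica_clouds: List[str], failed_cloud: str) -> bool:
    """Return True if a strict majority of replicas survives losing one entire cloud.

    Assumes majority quorum: more than half of all configured replicas must be up.
    """
    total = len(replica_clouds)
    surviving = sum(1 for cloud in replica_clouds if cloud != failed_cloud)
    return surviving > total // 2


# Hypothetical layouts: one entry per database replica, naming the cloud that hosts it.
single_cloud = ["gcp", "gcp", "gcp"]
spread = ["gcp", "aws", "metal"]

print(has_quorum_after_loss(single_cloud, "gcp"))  # every replica lost
print(has_quorum_after_loss(spread, "gcp"))        # two of three replicas remain
```

Under this assumption, a layout with every replica in one cloud cannot survive that cloud disappearing, while a layout with one replica per cloud keeps a majority when any single cloud is lost.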

Finally, they are planning to remove Google Cloud services from the data plane’s hot path and keep them only for secondary/failover use. This is in parallel with implementing a new architecture for the data and control planes. These architecture upgrades will ensure that core services, especially user-facing components, are not dependent on any one vendor or platform.

We own our vendor choices, and we ultimately own this one. We have assessed the proposed changes and, in comparison to other systems, still believe this provider is the best usable option for Command Bridge. We will be implementing more resilient, automated failover to our secondary systems so that a manual failover is not needed in the future.
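
As a rough illustration of what automated failover could look like, the sketch below switches to secondary systems after several consecutive failed health checks and returns to primary only after a run of consecutive successes. It is a sketch under stated assumptions, not our planned implementation: the class name and both thresholds are hypothetical, and a real system would also need to perform the DNS change itself.

```python
class FailoverController:
    """Decides which system should serve traffic based on primary health checks."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 5) -> None:
        self.failure_threshold = failure_threshold    # consecutive failures before failing over
        self.recovery_threshold = recovery_threshold  # consecutive successes before reverting
        self.active = "primary"
        self._failures = 0
        self._successes = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check result and return the system that should be active."""
        if primary_healthy:
            self._failures = 0
            self._successes += 1
        else:
            self._successes = 0
            self._failures += 1

        if self.active == "primary" and self._failures >= self.failure_threshold:
            self.active = "secondary"
        elif self.active == "secondary" and self._successes >= self.recovery_threshold:
            self.active = "primary"
        return self.active


controller = FailoverController(failure_threshold=3, recovery_threshold=2)
history = [controller.observe(ok) for ok in (False, False, False, True, True)]
print(history)
```

Requiring a run of successes before reverting mirrors the manual decision in this incident to stay on secondary systems until the provider was confirmed stable, and avoids flapping between systems during a partial recovery.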

Updated
May 19, 2026 at 10:45pm UTC

May 19, 22:35 UTC - [Provider] Cached network routes began expiring; workloads on Metal and AWS began returning 404 errors as the networking could no longer resolve routes.
May 19, 22:42 UTC - Backup interface booted on local servers and DNS rerouted to secondary systems
May 19, 22:45 UTC - Backup interface system fully online. Command Bridge reporting 100% interface function on secondary systems

Created
May 19, 2026 at 10:10pm UTC

May 19, 22:10 UTC - Automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue