Google Cloud Incident Report Today

Multiple Products

Incident Report

Summary

On Friday, 18 July 2025 07:50 US/Pacific, several Google Cloud Platform (GCP) and Google Workspace (GWS) products experienced elevated latencies and failure rates in the us-east1 region for a duration of up to 1 hour and 57 minutes.

GCP Impact Duration: 18 July 2025 07:50 - 09:47 US/Pacific (1 hour 57 minutes)
GWS Impact Duration: 18 July 2025 07:50 - 08:40 US/Pacific (50 minutes)

We sincerely apologize for this incident, which does not reflect the level of quality and reliability we strive to offer. We are taking immediate steps to improve the platform’s performance and availability.

Root Cause

The service interruption was triggered by a procedural error during a planned hardware replacement in our datacenter. The active network switch serving our control plane was physically disconnected by mistake, instead of the redundant unit scheduled for removal. The redundant unit had already been de-configured as part of the procedure, and the combination of these two events partitioned the network control plane. Our network is designed to withstand this type of control plane failure by failing open and continuing to operate.

However, an operational topology change while the network control plane was in a fail-open state caused our network fabric's topology information to become stale. This led to lost data packets and service disruption until services were moved away from the fabric and control plane connectivity was restored.

Remediation and Prevention

Google engineers were alerted to the outage by our monitoring system on 18 July 2025 07:06 US/Pacific and immediately started an investigation. The following timeline details the remediation and restoration efforts:

07:39 US/Pacific: The underlying root cause (device disconnect) was identified, and onsite technicians were engaged to reconnect the control plane device and restore control plane connectivity. At that moment, the network's fail-open mechanisms worked as expected and no impact was observed.
07:50 US/Pacific: A topology change led to traffic being routed suboptimally because the network was in a fail-open state. This slowed traffic on a subset of links and caused lost data packets and response delays for customer traffic. Engineers decided to move traffic away from the affected fabric, which mitigated the impact for the majority of services.
08:40 US/Pacific: Engineers mitigated the Workspace impact by shifting traffic away from the affected region.
09:47 US/Pacific: Onsite technicians reconnected the device, control plane connectivity was fully restored, and all services returned to a stable state.

Google is committed to preventing a repeat of the issue in the future, and is completing the following actions:

Pause non-critical workflows until safety controls are implemented (complete).
Strengthen safety controls for hardware upgrade workflows by end of Q3 2025.
Design and implement a mechanism to prevent control plane partitioning in case of dual failure of upstream routers by end of Q4 2025.

Detailed Description of Impact

GCP Impact:

Multiple products in us-east1 were affected by the loss of network connectivity, with the most significant impacts seen in us-east1-b. Other regions were not affected.

The outage caused a range of issues for customers with zonal resources in the region, including lost data packets across VPC networks, increased failure rates and response delays, service unavailable (503) errors, and slow or stuck operations, up to and including complete loss of network connectivity. While regional products were briefly impacted, they recovered quickly by failing over to unaffected zones.

A small fraction (0.1%) of Persistent Disks in us-east1-b were unavailable for the duration of the outage. These disks became available again once the outage was mitigated, with no customer data loss.

GWS Impact:

A small subset of Workspace users, primarily in the Southeast US, experienced varying degrees of unavailability and increased delays across multiple products, including Gmail, Google Meet, Google Drive, Google Chat, Google Calendar, Google Groups, Google Docs/Editors, and Google Voice.


Multiple Products

Incident Report

Summary

Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers.
We deeply apologize for the impact this outage has had. Google Cloud customers and their users trust their businesses to Google, and we will do better. We apologize for the impact this has had not only on our customers’ businesses and their users but also on their trust in our systems. We are committed to making improvements to help avoid outages like this moving forward.

What happened?

Google and Google Cloud APIs are served through our Google API management and control planes. Distributed regionally, these management and control planes are responsible for ensuring that each incoming API request is authorized and passes the policy and other appropriate checks (such as quota) needed to reach its endpoint. The core binary that is part of this policy check system is known as Service Control. Service Control is a regional service with a regional datastore from which it reads quota and policy information. This datastore metadata is replicated globally almost instantly to manage quota policies for Google Cloud and our customers.

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout because it required a policy change to trigger it. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it had neither appropriate error handling nor feature flag protection. Without the appropriate error handling, a null pointer caused the binary to crash. Feature flags are used to gradually enable a feature region by region per project, starting with internal projects, so that we can catch issues. If this change had been flag protected, the issue would have been caught in staging.
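The two missing safeguards described above can be illustrated with a minimal Python sketch. It is not Service Control's code: `QuotaPolicy`, `check_quota`, and the `flag_enabled` parameter are hypothetical names introduced here, and a blank policy field is modeled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; a blank field is modeled as None.
    project: str
    limit: Optional[int]


def check_quota(policy: QuotaPolicy, usage: int, flag_enabled: bool) -> bool:
    """Return True if the request may proceed."""
    if not flag_enabled:
        # The new code path is disabled by default, so behavior is
        # unchanged until the flag is turned on for a project or region.
        return True
    if policy.limit is None:
        # Blank field: handle it explicitly and fail open instead of
        # dereferencing a missing value and crashing the process.
        return True
    return usage < policy.limit
```

In this sketch a policy with a blank limit lets the request through rather than raising, and the new check only runs at all where the flag has been enabled.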

On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control then exercised quota checks regionally on the policies in each regional datastore. This pulled in the blank fields for this policy change and exercised the code path that hit the null pointer, causing the binaries to go into a crash loop. This occurred globally, since every regional deployment read the same policy data.

Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first.

Within some of our larger regions, such as us-central1, the restart of Service Control tasks created a herd effect on the underlying infrastructure they depend on (i.e. the Spanner table), overloading that infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional data storage systems to reduce the load. At that point, Service Control and API serving were fully recovered across all regions. Corresponding Google and Google Cloud products started recovering, with some taking longer depending upon their architecture.
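For readers unfamiliar with the term, randomized exponential backoff can be sketched as follows. This is a generic illustration of the "full jitter" variant, not the scheme Google uses; the function name `backoff_delays` and the `base` and `cap` values are assumptions chosen for the example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=60.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles with each attempt, up to a fixed cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Drawing uniformly from [0, ceiling] spreads restarting tasks
        # out in time so they do not all hit the datastore at once.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A restarting task would sleep for the next delay before retrying its read. Because each task draws its own random delays, retries from many tasks are staggered rather than synchronized.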

What is our immediate path forward?

Immediately upon recovery, we froze all changes to the Service Control stack and manual policy pushes until we can completely remediate the system.

How did we communicate?

We posted our first incident report to Cloud Service Health about 1h after the start of the crashes, because the Cloud Service Health infrastructure was itself down as a result of this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure. We will address this going forward.

What’s our approach moving forward?

Beyond freezing the system as mentioned above, we will prioritize and safely complete the following:

We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.
We will audit all systems that consume globally replicated data. Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), replicated data needs to be propagated incrementally with sufficient time to validate and detect issues.
We will enforce all changes to critical binaries to be feature flag protected and disabled by default.
We will improve our static analysis and testing practices to correctly handle errors and if need be fail open.
We will audit and ensure our systems employ randomized exponential backoff.
We will improve our external communications, both automated and human, so our customers get the information they need as soon as possible to react to issues, manage their systems and help their customers.
We will ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.
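The idea of propagating a change incrementally, with validation between steps, can be sketched as below. This is only an illustration of the principle, not Google's rollout tooling; `staged_rollout`, `apply_change`, and `is_healthy` are hypothetical names, and a real system would also wait for a soak period before each health check.

```python
from typing import Callable, Dict, List


def staged_rollout(
    regions: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Apply a change one region at a time, halting on the first failure."""
    applied: List[str] = []
    for region in regions:
        apply_change(region)
        applied.append(region)
        if not is_healthy(region):
            # Stop here so the change never reaches the remaining regions.
            break
    skipped = regions[len(applied):]
    return {"applied": applied, "skipped": skipped}
```

For example, if the health check fails in the second of three regions, the third region never receives the change, which limits the blast radius compared with replicating it everywhere within seconds.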
