Source video: Why Does the Computer Industry Keep Turning Small Errors into Big Disasters? [让编程再次伟大 #40]

Google Cloud Outage: How Small Errors Caused a Massive Internet Crash

Summary

Quick Abstract

Explore the recent Google Cloud outage of June 2025, a cascading failure exposing critical vulnerabilities in the cloud infrastructure. This summary dissects the root causes, from a simple code bug to systemic deployment and operational flaws. Learn how seemingly minor software engineering errors snowballed into a global Internet service disruption, impacting Cloudflare and highlighting a concerning lack of rigor within the computer industry.

Quick Takeaways:

  • A null pointer exception in Google Cloud's Service Control system triggered the initial collapse.

  • Skipped staging tests allowed the bug to slip into production, impacting 76 API products.

  • A globally synchronized rollout, replicated instantly through the distributed policy database, amplified the impact worldwide.

  • The absence of randomized exponential backoff meant queued retries hammered the system the moment it came back, prolonging the outage.

  • Cloudflare's Workers KV, which used Google Cloud as its backing store, failed along with it, setting off a catastrophic chain reaction across Cloudflare's own services.

The Google Cloud Collapse of 2025: A Cascade of Errors

The Google Cloud collapse on June 12, 2025, which resulted in a large-scale disruption of global Internet services, exemplifies how a series of seemingly minor errors can compound to create a major disaster. This event serves as a stark reminder of the importance of rigor and responsibility in software engineering, particularly in critical infrastructure.

The Root Cause: A Null Pointer Exception

The initial trigger for the collapse was a code bug within Google Cloud's Service Control system. Service Control is responsible for managing API requests, handling tasks such as identity verification, blacklisting, whitelisting, and rate limiting. An update to Service Control's policy-handling logic, rolled out on May 29, 2025, introduced a critical flaw: the code failed to handle empty values, so a new policy containing an empty value triggered a NullPointerException.

Error 1: Inadequate Error Handling

The first mistake was the lack of proper error handling in the upgraded Service Control code. This fundamental oversight allowed an empty value to trigger a NullPointerException, bringing down the entire system. This demonstrates a failure to adhere to basic software development principles.
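
Google has not published the offending code, so the following Java sketch is only an illustration of this class of bug; the policy schema, field names, and default are all hypothetical. It contrasts a quota check that dereferences an optional field without guarding against missing data with a defensive version that falls back to a safe default instead of crashing.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of a quota/rate-limit policy record.
// Field names are illustrative only; this is not Google's actual schema.
record QuotaPolicy(String apiId, Map<String, Long> limits) {}

class PolicyChecker {

    // Buggy version: assumes the limits field is always populated.
    // A policy with a null limits map (or a missing "qps" entry) throws
    // NullPointerException here, the class of failure described above.
    static long allowedQpsUnsafe(QuotaPolicy policy) {
        return policy.limits().get("qps");
    }

    // Defensive version: treat missing data as an explicit, recoverable case
    // and fall back to a conservative default instead of crashing the service.
    static long allowedQpsSafe(QuotaPolicy policy, long defaultQps) {
        return Optional.ofNullable(policy.limits())
                .map(limits -> limits.get("qps"))
                .orElse(defaultQps);
    }

    public static void main(String[] args) {
        QuotaPolicy broken = new QuotaPolicy("example.api", null);
        System.out.println(allowedQpsSafe(broken, 100)); // prints 100, no crash
        // allowedQpsUnsafe(broken) would throw NullPointerException
    }
}
```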

Error 2: Skipping Staging Environment Testing

The Service Control team bypassed thorough testing in the staging environment, most likely because it would have interfered with other development teams: 76 Google Cloud API products depend on Service Control, and exercising the new feature in shared staging could have disrupted their daily work. Rather than negotiating a test window or accepting a delay, the team took the risk of deploying the insufficiently tested code straight to production, carrying the NullPointerException into the live environment.

Error 3: Real-Time Synchronized Release Strategy

The deployment strategy chosen by the Service Control team made things worse. Instead of a gradual rollout approach such as a canary release, blue-green deployment, or region-by-region rollout, they pushed the change to all 42 Google Cloud regions simultaneously. Once the policy containing the empty value was written into one of the distributed databases, it replicated worldwide almost instantly, so Service Control instances in every region hit the NullPointerException at nearly the same moment and collapsed together.
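
For contrast, a staged rollout applies a change wave by wave and checks health before continuing. The Java sketch below is a deliberately simplified illustration, not Google's deployment tooling; the Region interface, the wave structure, and the health check are assumptions made for the example.

```java
import java.util.List;

/**
 * Minimal sketch of a wave-based (canary-first) rollout, as an alternative to
 * pushing a change to every region at once. All names here are hypothetical.
 */
public class StagedRollout {

    interface Region {
        String name();
        void applyChange(String changeId);
        boolean healthy(); // e.g. error-rate or crash-loop checks after the change
    }

    /** Toy in-memory region, used only to make the sketch runnable. */
    record FakeRegion(String name, boolean healthy) implements Region {
        public void applyChange(String changeId) {
            System.out.println("Applied " + changeId + " to " + name);
        }
    }

    // Apply the change one wave at a time; stop at the first unhealthy wave so a
    // bad change never reaches the remaining regions.
    static void rollout(String changeId, List<List<Region>> waves) {
        for (List<Region> wave : waves) {
            wave.forEach(r -> r.applyChange(changeId));
            if (!wave.stream().allMatch(Region::healthy)) {
                System.out.println("Halting rollout of " + changeId + ": unhealthy wave");
                return; // blast radius is limited to the regions touched so far
            }
        }
        System.out.println("Rollout of " + changeId + " completed");
    }

    public static void main(String[] args) {
        List<Region> wave1 = List.of(new FakeRegion("canary-region", true));
        List<Region> wave2 = List.of(new FakeRegion("region-a", true),
                                     new FakeRegion("region-b", false));
        rollout("policy-change-123", List.of(wave1, wave2)); // halts after wave 2
    }
}
```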

Error 4: Lack of Randomized Exponential Backoff

Even after the operations team identified and patched the issue within 40 minutes, the system faced another setback: Service Control lacked a randomized exponential backoff mechanism. When service was restored, every task that had queued up during the outage retried at once, overwhelming the system, particularly in the large us-central1 region, which crashed again and prolonged the outage.
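
Randomized ("full jitter") exponential backoff spreads retries out in time so that clients which failed together do not all return together. The Java sketch below shows the general technique only; the Task interface and the delay parameters are illustrative assumptions, not the Service Control implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Minimal sketch of retry with randomized exponential backoff ("full jitter"). */
public class JitteredRetry {

    interface Task {
        boolean attempt(); // one try against the backend; true means success
    }

    // Each delay is drawn uniformly from [0, cap), where cap doubles per retry
    // up to maxDelayMs, so a burst of failed clients does not retry in lockstep.
    static boolean runWithBackoff(Task task, int maxRetries,
                                  long baseDelayMs, long maxDelayMs) throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (task.attempt()) {
                return true;
            }
            long cap = Math.min(maxDelayMs, baseDelayMs << Math.min(attempt, 20));
            long sleepMs = ThreadLocalRandom.current().nextLong(Math.max(1, cap));
            Thread.sleep(sleepMs);
        }
        return false; // give up after maxRetries; caller decides what to do next
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy task that fails the first three times, then succeeds.
        int[] calls = {0};
        boolean ok = runWithBackoff(() -> ++calls[0] > 3, 5, 100, 5_000);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```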

The Impact and Cloudflare's Dependence

The Google Cloud outage lasted 6 hours and 41 minutes and affected 76 products across 42 regions worldwide. The incident also highlighted the risk of relying on a single cloud provider: Cloudflare, a major Internet infrastructure provider, suffered a cascading failure because its Workers KV service, a core dependency of many Cloudflare products, used Google Cloud as its backing store. That single point of dependence raised concerns about the industry's commitment to redundancy and resilience.

Lessons Learned: The Aviation Industry Model

The incident draws parallels to the aviation industry, where meticulous attention to detail and rigorous processes are paramount. The aviation industry learns from every incident, implementing changes to prevent recurrence. The computer industry needs to adopt a similar approach, addressing fundamental software engineering errors and embracing a stronger sense of social responsibility. This includes:

  • Prioritizing thorough testing and error handling.

  • Adopting safer deployment strategies.

  • Implementing robust backoff mechanisms.

  • Promoting a culture of rigor and accountability.

  • Considering social responsibility.

Only then can the computer industry prevent similar catastrophic failures and ensure the reliability of the critical infrastructure that underpins modern society.
