The Google Cloud Collapse of 2025: A Cascade of Errors
The Google Cloud collapse on June 12, 2025, which resulted in a large-scale disruption of global Internet services, exemplifies how a series of seemingly minor errors can compound to create a major disaster. This event serves as a stark reminder of the importance of rigor and responsibility in software engineering, particularly in critical infrastructure.
The Root Cause: A Null Pointer Exception
The initial trigger for the collapse was a code bug within Google Cloud's Service Control system. Service Control is responsible for managing API requests, handling tasks such as identity verification, blacklisting and whitelisting, and rate limiting. An upgrade to Service Control's policy-checking logic on May 29, 2025, introduced a critical flaw: the new code did not handle empty values, so when a policy containing an empty value was later inserted, it triggered a NullPointerException.
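To make the failure mode concrete, here is a minimal Java sketch of how an unguarded read of an empty field can crash a request-handling path. The class and field names are hypothetical illustrations, not Google's actual Service Control code.

```java
// Illustrative sketch only: class and field names are hypothetical,
// not Google's actual Service Control code.
final class QuotaPolicy {
    final String resource; // may be null if the source record had a blank field
    QuotaPolicy(String resource) { this.resource = resource; }
}

public class UnguardedEvaluation {
    // The unguarded code path: dereferences a field that may be null.
    static boolean applies(QuotaPolicy policy, String requested) {
        return policy.resource.equals(requested); // NullPointerException when resource is null
    }

    public static void main(String[] args) {
        QuotaPolicy blankPolicy = new QuotaPolicy(null); // a policy row with an empty value
        System.out.println(applies(blankPolicy, "example.googleapis.com")); // crashes here
    }
}
```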
Error 1: Inadequate Error Handling
The first mistake was the lack of proper error handling in the upgraded Service Control code. This fundamental oversight allowed an empty value to trigger a NullPointerException, bringing down the entire system. This demonstrates a failure to adhere to basic software development principles.
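A minimal sketch of what defensive handling might look like, assuming the same illustrative "resource" field as above: a policy with an empty field is treated as not applicable and logged, rather than being allowed to crash the request path.

```java
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical defensive version of the same check: a missing (null) policy
// field is treated as "does not apply" instead of crashing the request path.
public class GuardedEvaluation {
    private static final Logger LOG = Logger.getLogger(GuardedEvaluation.class.getName());

    static boolean applies(String policyResource, String requested) {
        return Optional.ofNullable(policyResource)
                .map(resource -> resource.equals(requested))
                .orElseGet(() -> {
                    // Surface the malformed record to operators rather than throwing.
                    LOG.warning("Skipping policy with an empty resource field");
                    return false;
                });
    }

    public static void main(String[] args) {
        System.out.println(applies(null, "example.googleapis.com")); // prints false, no crash
    }
}
```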
Error 2: Skipping Staging Environment Testing
The Service Control team bypassed thorough testing in the staging environment, likely because of the potential impact on other development teams: Google Cloud has 76 API products that rely on Service Control, and exercising the new feature in staging could have disrupted their daily work. Rather than negotiating that disruption or accepting a delay, the team deployed the untested code directly to production, carrying the NullPointerException into the live environment.
Error 3: Real-Time Synchronized Release Strategy
The deployment strategy employed by the Service Control team further exacerbated the problem. Instead of using a gradual rollout approach such as a canary release, blue-green deployment, or geography-based rollout, they opted for a real-time synchronized release across all 42 global regions of Google Cloud. This meant that when the policy containing the empty value was written into one of the distributed databases, it was immediately replicated everywhere, and the NullPointerException fired across the entire system at once, triggering a simultaneous collapse.
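For contrast, the sketch below shows the shape of a staged, canary-style rollout: deploy to a small wave of regions, let the change bake, verify health, and only then continue. The deploy() and isHealthy() hooks are hypothetical placeholders, not real Google Cloud tooling.

```java
import java.util.List;

// A minimal sketch of a staged (canary-style) rollout. The deploy() and
// isHealthy() hooks are hypothetical placeholders, not real GCP tooling.
public class StagedRollout {
    interface Region {
        void deploy(String version);
        boolean isHealthy();
    }

    // Roll out wave by wave; a failed health check halts the rollout so
    // later regions never receive the bad version.
    static void rollOut(String version, List<List<Region>> waves) throws InterruptedException {
        for (List<Region> wave : waves) {
            for (Region region : wave) {
                region.deploy(version);
            }
            Thread.sleep(10 * 60 * 1000L); // bake time before judging the wave
            for (Region region : wave) {
                if (!region.isHealthy()) {
                    System.err.println("Halting rollout: unhealthy region in current wave");
                    return;
                }
            }
        }
    }
}
```

With a rollout like this, the policy containing the empty value would have crashed only the first wave, leaving the remaining regions untouched.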
Error 4: Lack of Randomized Exponential Backoff
Even after the operations team identified and patched the issue within 40 minutes, the system faced another setback. Service Control lacked a randomized exponential backoff mechanism, so once the system was restored, all the tasks that had queued up during the outage attempted to execute simultaneously. This thundering herd overwhelmed the system, particularly in larger regions such as us-central1, causing it to crash again and prolonging the outage.
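A minimal sketch of randomized ("full jitter") exponential backoff is shown below; the delay constants are illustrative assumptions, not values from Google's systems. The idea is that each retrying client picks a random delay inside an exponentially growing window, so a recovering service is not hit by all queued work at the same instant.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// A minimal sketch of retry with randomized ("full jitter") exponential
// backoff; the delay constants are illustrative, not Google's values.
public class RetryWithJitter {
    static <T> T callWithBackoff(Callable<T> task, int maxAttempts) throws Exception {
        final long baseDelayMs = 100;
        final long maxDelayMs = 60_000;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, surface the failure
                }
                // Exponentially growing window, capped, then a uniformly random
                // delay inside it so recovering clients do not retry in lockstep.
                long window = Math.min(maxDelayMs, baseDelayMs << Math.min(attempt, 20));
                long sleepMs = ThreadLocalRandom.current().nextLong(window + 1);
                Thread.sleep(sleepMs);
            }
        }
    }
}
```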
The Impact and Cloudflare's Dependence
The Google Cloud outage lasted 6 hours and 41 minutes, affecting 76 products across 42 regions globally. The incident also highlighted the vulnerability of relying on a single cloud provider. Cloudflare, a major Internet infrastructure provider, suffered a cascading failure because Workers KV, a core component of its platform, depended on Google Cloud. This exposed a critical dependency and raised concerns about the industry's commitment to redundancy and resilience.
Lessons Learned: The Aviation Industry Model
The incident draws parallels to the aviation industry, where meticulous attention to detail and rigorous processes are paramount. The aviation industry learns from every incident, implementing changes to prevent recurrence. The computer industry needs to adopt a similar approach, addressing fundamental software engineering errors and embracing a stronger sense of social responsibility. This includes:
- Prioritizing thorough testing and error handling.
- Adopting safer deployment strategies.
- Implementing robust backoff mechanisms.
- Promoting a culture of rigor and accountability.
- Considering social responsibility.
Only then can the computer industry prevent similar catastrophic failures and ensure the reliability of the critical infrastructure that underpins modern society.