Last week, a major internet outage impacted services like Snapchat, Spotify, and Discord after Google Cloud Platform deployed faulty code into production. The incident, which also affected Cloudflare's Workers KV service, resulted in near-100% error rates for over two hours, causing a ripple effect that brought down numerous websites and services. It highlights the significant influence large cloud providers have on internet infrastructure.
The Impact and Aftermath
The Google Cloud outage wasn't limited to external services; it also affected Google's own services, including Gmail, Google Calendar, and Drive. Outages of this scale can cost companies millions of dollars. Cloud providers typically offer a Service Level Agreement (SLA), guaranteeing a certain level of uptime (often 99.99% or higher).
When an outage violates the SLA, affected companies may be entitled to SLA credits, essentially a refund for the service disruption. While this might cost Google financially, the damage to their reputation as a reliable cloud provider is far more significant, especially considering their third-place position in market share behind Azure and AWS.
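For a sense of scale, an uptime percentage translates directly into a monthly downtime budget. The sketch below runs that arithmetic for a few common targets; the numbers are generic illustrations, not Google's actual SLA terms.

```go
package main

import "fmt"

func main() {
	// Minutes in a 30-day month.
	const minutesPerMonth = 30 * 24 * 60

	// Common uptime targets and the downtime each allows per month.
	for _, sla := range []float64{99.9, 99.99, 99.999} {
		allowed := minutesPerMonth * (100 - sla) / 100
		fmt.Printf("%.3f%% uptime -> about %.1f minutes of downtime per month\n", sla, allowed)
	}
}
```

At 99.99%, the budget works out to roughly 4.3 minutes per month, so a multi-hour outage blows through it many times over.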
The Root Cause: A Dormant Bug
The question arises: how did this disaster happen from a programming standpoint?
While some speculated that AI-generated code (specifically Gemini) was responsible, the cause appears to stem from human-written code that controls a critical part of Google Cloud.
- When customers send API requests to Google Cloud, they are routed to an API management service.
- This service verifies authorization and accesses a data store for quota and policy information.
- The service is deployed across Google's 42 data center regions (see the sketch after this list).
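To make that flow concrete, here is a minimal sketch of what such a check might look like. The names (`PolicyStore`, `checkRequest`, the quota fields) are hypothetical stand-ins for illustration, not Google's actual service code.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy holds the quota and policy metadata the API management
// service reads from its regional data store.
type Policy struct {
	ProjectID    string
	RequestQuota int
}

// PolicyStore is a stand-in for the replicated data store that every
// region reads quota and policy information from.
type PolicyStore struct {
	policies map[string]*Policy
}

// checkRequest mirrors what an API management layer does for each
// incoming call: verify authorization, then look up the caller's
// quota and policy before letting the request through.
func (s *PolicyStore) checkRequest(projectID string, authorized bool) error {
	if !authorized {
		return errors.New("unauthorized")
	}
	policy, ok := s.policies[projectID]
	if !ok {
		return errors.New("no policy found for project")
	}
	if policy.RequestQuota <= 0 {
		return errors.New("quota exceeded")
	}
	policy.RequestQuota--
	return nil
}

func main() {
	store := &PolicyStore{policies: map[string]*Policy{
		"demo-project": {ProjectID: "demo-project", RequestQuota: 2},
	}}
	for i := 1; i <= 3; i++ {
		if err := store.checkRequest("demo-project", true); err != nil {
			fmt.Printf("request %d rejected: %v\n", i, err)
		} else {
			fmt.Printf("request %d allowed\n", i)
		}
	}
}
```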
On May 29th, 2025, a new quota policy check was added as a feature, but it contained a bug. The code path was never properly exercised during staging because the kind of policy change needed to trigger it never occurred there.
The Trigger and the Crash
The buggy code path lacked proper error handling, so it hit a null pointer that crashed the binary. The bug lay dormant until June 12th, when a policy change was introduced and replicated globally. This triggered the flawed code path, sending the API management binary into a crash loop.
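A minimal way to picture the failure mode, assuming the new check read an optional policy field without guarding against it being unset (the `QuotaOverride` type and function names below are illustrative, not the real code):

```go
package main

import "fmt"

// QuotaOverride stands in for the new, optional policy field. It is a
// pointer precisely because it can legitimately be absent, which is the
// case the flawed code path failed to handle.
type QuotaOverride struct {
	Limit int
}

type Policy struct {
	ProjectID string
	Override  *QuotaOverride // nil when the policy carries no override
}

// buggyCheck mirrors the missing error handling: it assumes Override is
// always set and dereferences it unconditionally.
func buggyCheck(p *Policy) int {
	return p.Override.Limit // panics when Override is nil
}

// safeCheck shows what proper handling looks like: treat a missing
// field as "no override" instead of crashing.
func safeCheck(p *Policy) (int, bool) {
	if p.Override == nil {
		return 0, false
	}
	return p.Override.Limit, true
}

func main() {
	// A policy replicated with the new field left blank.
	p := &Policy{ProjectID: "demo-project"}

	if limit, ok := safeCheck(p); ok {
		fmt.Println("override limit:", limit)
	} else {
		fmt.Println("no override set, falling back to defaults")
	}

	// Uncommenting the line below reproduces the failure mode: an
	// unhandled nil dereference that would crash-loop a serving binary.
	// fmt.Println(buggyCheck(p))
}
```

Because the same bad policy data was replicated to every region at once, each regional binary hit the same unguarded dereference and restarted into the same crash.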
The Rollback and Recovery
Google developers had a rollback mechanism in place (a "big red button"), but it took approximately 40 minutes to initiate the rollback and four hours for the system to fully stabilize.
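The write-up doesn't detail how the "big red button" works internally, but mechanisms like it are commonly built as a kill switch that disables a risky code path without shipping a new binary. The sketch below shows that general pattern; it is an assumption about the approach, not Google's actual tooling.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// killSwitch is a process-wide flag that operators can flip (for
// example via a config push) to bypass a risky code path without
// rolling out a new binary.
var killSwitch atomic.Bool

func handleRequest(id int) {
	if killSwitch.Load() {
		// "Big red button" pressed: skip the new quota check entirely
		// and fall back to the old, known-good behaviour.
		fmt.Printf("request %d: new check disabled, using legacy path\n", id)
		return
	}
	fmt.Printf("request %d: running new quota policy check\n", id)
}

func main() {
	handleRequest(1)

	// An operator flips the switch once the crash loop is spotted.
	killSwitch.Store(true)
	handleRequest(2)
}
```

Even with such a switch in place, the timeline shows the hard part is operational: spotting the failure, deciding to flip the switch, and waiting for every region to settle back down.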
Building Better Products with PostHog
PostHog is presented as an all-in-one platform for building better products, featuring Max, an AI-powered product analyst and assistant. Max allows users to:
- Research answers using natural language.
- Generate data visualizations.
- Get tasks done within the PostHog UI.
- Access documentation.
PostHog combines these AI features with existing tools like analytics, feature flags, and session replays.