[Video: That time Google Cloud Platform bricked the Internet…]

Google Cloud Outage: How Bad Code Broke the Internet

Summary

Quick Abstract

Uncover the details behind the recent internet outage that crippled Snapchat, Spotify, Discord, and even Google's own services! This summary dives into the software engineering perspective behind Google Cloud's blunder, exploring how a seemingly innocuous code change triggered a cascading failure across the internet, and what it means for the future of cloud infrastructure.

Quick Takeaways:

  • A faulty quota policy check in Google Cloud's API management service caused widespread crashes.

  • The bug lay dormant until a policy change on June 12th triggered a global crash loop.

  • Recovery took approximately four hours, highlighting the fragility of internet infrastructure.

  • The incident damages Google's reputation and competitive position, and the Service Level Agreement (SLA) violations may entitle affected customers to credits.

  • PostHog is mentioned as a solution for building better products, featuring their AI-powered product analyst, Max.

Explore the technical breakdown and learn how a single code push brought down major online platforms. Was it truly AI gone rogue?

Last week, a major internet outage impacted services like Snapchat, Spotify, and Discord after Google Cloud Platform deployed faulty code into production. The incident, which also affected Cloudflare's Workers KV service, resulted in near-100% error rates for over two hours, causing a ripple effect that brought down numerous websites and services. It highlights the outsized influence a handful of large cloud providers have over internet infrastructure.

The Impact and Aftermath

The Google Cloud outage wasn't limited to external services; it also affected Google's own services, including Gmail, Google Calendar, and Drive. Outages of this scale can cost companies millions of dollars. Cloud providers typically offer a Service Level Agreement (SLA), guaranteeing a certain level of uptime (often 99.99% or higher).

When an outage violates the SLA, affected companies may be entitled to SLA credits, essentially a refund for the service disruption. While this might cost Google financially, the damage to their reputation as a reliable cloud provider is far more significant, especially considering their third-place position in market share behind Azure and AWS.
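To put those uptime figures in perspective, a 99.99% monthly SLA leaves only minutes of allowed downtime, which is why a multi-hour outage blows well past the guarantee. A quick sketch of the arithmetic (the SLA percentages and 30-day month are illustrative assumptions, not terms from any specific contract):

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns the downtime a given SLA percentage allows per period.
func downtimeBudget(slaPercent float64, period time.Duration) time.Duration {
	allowedFraction := 1 - slaPercent/100
	return time.Duration(float64(period) * allowedFraction)
}

func main() {
	month := 30 * 24 * time.Hour
	fmt.Println("99.9%  ->", downtimeBudget(99.9, month))  // ~43 minutes per month
	fmt.Println("99.99% ->", downtimeBudget(99.99, month)) // ~4 minutes per month
	// A multi-hour outage exceeds either budget many times over.
}
```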

The Root Cause: A Dormant Bug

The question arises: how did this disaster happen from a programming standpoint?

While some speculated that AI-generated code (specifically from Gemini) was responsible, the cause appears to stem from human-written code that controls a critical part of Google Cloud.

  • When customers send API requests to Google Cloud, they are routed to an API management service.

  • This service verifies authorization and accesses a data store for quota and policy information.

  • The service is deployed across Google's 42 data center regions.
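To make that flow concrete, here is a minimal, hypothetical sketch of an API front end that authorizes a request and consults a policy store before letting it through. All names (PolicyStore, X-Project-ID, and so on) are invented for illustration and are not Google's actual implementation:

```go
// Illustrative sketch of an API-management front end: every request is
// authorized, then checked against quota/policy metadata from a regional
// data store. All names here are hypothetical.
package main

import (
	"fmt"
	"net/http"
)

// Policy is the quota/policy metadata replicated to each region.
type Policy struct {
	QuotaLimit int
}

// PolicyStore abstracts the regional data store the service reads from.
type PolicyStore interface {
	Lookup(project string) (*Policy, error)
}

type memStore map[string]*Policy

func (m memStore) Lookup(project string) (*Policy, error) {
	p, ok := m[project]
	if !ok {
		return nil, fmt.Errorf("no policy for project %q", project)
	}
	return p, nil
}

func handler(store PolicyStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		project := r.Header.Get("X-Project-ID")
		if project == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		policy, err := store.Lookup(project)
		if err != nil {
			// Fail closed on lookup errors instead of crashing the binary.
			http.Error(w, "policy lookup failed", http.StatusServiceUnavailable)
			return
		}
		if policy.QuotaLimit <= 0 {
			http.Error(w, "quota exceeded", http.StatusTooManyRequests)
			return
		}
		fmt.Fprintln(w, "request forwarded to backend")
	}
}

func main() {
	store := memStore{"demo-project": {QuotaLimit: 100}}
	http.ListenAndServe(":8080", handler(store))
}
```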

On May 29th, 2025, a new quota policy check was added as a feature, but it contained a bug. The code path was never exercised during staging because the kind of policy change needed to trigger it had not yet been made.
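As a hypothetical illustration of how such a path can lie dormant, the new check below only runs for a specific kind of policy, so environments that never see that policy never execute it. The type and field names (QuotaPolicy, SpendCap, Kind) are invented:

```go
// Hypothetical sketch of a dormant code path; not Google's actual code.
package quota

import "fmt"

type SpendCap struct{ Max int }

type QuotaPolicy struct {
	Kind     string
	Limit    int
	SpendCap *SpendCap // only populated by the new kind of policy change
}

// CheckQuota contains the branch added on May 29th: unless a "spend-cap"
// policy is ever replicated, the new path never runs, which is why staging
// never exercised it.
func CheckQuota(p *QuotaPolicy, used int) error {
	if p.Kind == "spend-cap" {
		// Latent bug: assumes SpendCap is always populated for this kind,
		// so a policy with blank fields dereferences a nil pointer.
		if used > p.SpendCap.Max {
			return fmt.Errorf("quota exceeded")
		}
		return nil
	}
	if used > p.Limit {
		return fmt.Errorf("quota exceeded")
	}
	return nil
}
```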

The Trigger and the Crash

The buggy code path lacked proper error handling, so when it finally ran it hit a null pointer dereference that crashed the binary. The bug remained dormant until June 12th, when a policy change of the triggering kind was introduced and replicated globally. That change exercised the flawed code, sending the API management binary into a crash loop.
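Continuing the hypothetical sketch above, this is roughly the kind of guard the buggy path was missing: validate the policy and return an error instead of dereferencing a nil pointer and taking down the process.

```go
// Hardened version of the hypothetical check above: a malformed policy is
// rejected with an error instead of crashing the whole binary, which is what
// turns one bad record into a crash loop (the process re-reads the same
// replicated data on every restart).
func CheckQuotaSafe(p *QuotaPolicy, used int) error {
	if p == nil {
		return fmt.Errorf("missing policy")
	}
	if p.Kind == "spend-cap" {
		if p.SpendCap == nil {
			// Fail this one check gracefully rather than crashing the process.
			return fmt.Errorf("malformed spend-cap policy: no cap set")
		}
		if used > p.SpendCap.Max {
			return fmt.Errorf("quota exceeded")
		}
		return nil
	}
	if used > p.Limit {
		return fmt.Errorf("quota exceeded")
	}
	return nil
}
```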

The Rollback and Recovery

Google developers had a rollback mechanism in place (a "big red button"), but it took approximately 40 minutes to initiate the rollback and four hours for the system to fully stabilize.
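The summary does not describe how the button works internally; one common shape for such a mechanism is a replicated kill-switch flag that disables the offending code path everywhere without redeploying binaries. A minimal sketch with invented names, offered as an assumption rather than Google's actual design:

```go
// Hypothetical "big red button": a flag that gates the risky code path.
// The names and the atomic-flag mechanism are assumptions for illustration.
package rollout

import "sync/atomic"

var quotaCheckDisabled atomic.Bool

// PressBigRedButton is what an operator would flip during an incident; the
// value would be replicated to every region, much like the policy data itself.
func PressBigRedButton() { quotaCheckDisabled.Store(true) }

// NewQuotaCheckEnabled gates the new code path; once the switch is thrown,
// callers fall back to the old, known-good behavior.
func NewQuotaCheckEnabled() bool { return !quotaCheckDisabled.Load() }
```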

Building Better Products with PostHog

PostHog is presented as an all-in-one platform for building better products, featuring Max, an AI-powered product analyst and assistant. Max allows users to:

  • Research answers using natural language.

  • Generate data visualizations.

  • Get tasks done within the PostHog UI.

  • Access documentation.

PostHog combines these AI features with existing tools like analytics, feature flags, and session replays.
