Post Mortem: DB Connection Failure in Passwd Production

Date: May 7, 2025
Duration: 4h 48m (05:48 โ€“ 10:36 CEST)

Summary

Between 05:48 and 10:36 on May 7, 2025, the passwd production environment experienced a service disruption due to database connection failures. This affected critical operations such as:

  • Login
  • Subscription checks
  • Metrics reporting

As the system encountered repeated DB connection errors, Cloud Run health checks began to block requests, making login functionality unavailable. Logged-in users were not affected.


Timeline

  • 05:48 โ€“ Monitoring alert triggered: DB connection failure
  • ~06:00 โ€“ Confirmed impact on login and subscription functionality
  • 06:15 โ€“ Mitigation began: system adjusted to bypass DB check and return valid subscriptions
  • 10:36 โ€“ Google Cloud services recovered autonomously; DB connectivity restored
  • Post-incident โ€“ No root cause disclosed by Google. No deploys or infra changes occurred on our side

Root Cause

The root cause remains unknown. Connectivity issues were observed across multiple unrelated GCP projects, not just passwd-production. No configuration or deployment changes were made on our end.

We suspect a transient outage in Google Cloudโ€™s internal network or database services. Google has not provided a formal RCA.


Impact

Affected Users:

  • All users attempting to log in (across all platforms)

Degraded Functionality:

  • Subscription validation (false negatives)
  • Login failures due to failing health checks
  • Failed BigQuery metrics logging

Unaffected:

  • Logged-in sessions continued to function normally

Business Risk:

  • Temporary access granted to users with invalid subscriptions

Mitigation & Recovery

Mitigation Measures:

  • Temporarily bypassed DB call for subscription validation
  • All users were treated as having valid subscriptions to preserve access

Recovery:

  • GCP services recovered autonomously at 10:36
  • No manual recovery actions required beyond the initial workaround

Follow-up Actions

Short-term:

  • Keep the subscription validation bypass workaround ready for future DB failures