Date: May 7, 2025
Duration: 4h 48m (05:48 โ 10:36 CEST)
Summary
Between 05:48 and 10:36 on May 7, 2025, the passwd production environment experienced a service disruption due to database connection failures. This affected critical operations such as:
- Login
- Subscription checks
- Metrics reporting
As the system encountered repeated DB connection errors, Cloud Run health checks began to block requests, making login functionality unavailable. Logged-in users were not affected.
Timeline
- 05:48 โ Monitoring alert triggered: DB connection failure
- ~06:00 โ Confirmed impact on login and subscription functionality
- 06:15 โ Mitigation began: system adjusted to bypass DB check and return valid subscriptions
- 10:36 โ Google Cloud services recovered autonomously; DB connectivity restored
- Post-incident โ No root cause disclosed by Google. No deploys or infra changes occurred on our side
Root Cause
The root cause remains unknown. Connectivity issues were observed across multiple unrelated GCP projects, not just passwd-production
. No configuration or deployment changes were made on our end.
We suspect a transient outage in Google Cloudโs internal network or database services. Google has not provided a formal RCA.
Impact
Affected Users:
- All users attempting to log in (across all platforms)
Degraded Functionality:
- Subscription validation (false negatives)
- Login failures due to failing health checks
- Failed BigQuery metrics logging
Unaffected:
- Logged-in sessions continued to function normally
Business Risk:
- Temporary access granted to users with invalid subscriptions
Mitigation & Recovery
Mitigation Measures:
- Temporarily bypassed DB call for subscription validation
- All users were treated as having valid subscriptions to preserve access
Recovery:
- GCP services recovered autonomously at 10:36
- No manual recovery actions required beyond the initial workaround
Follow-up Actions
Short-term:
- Keep the subscription validation bypass workaround ready for future DB failures