Post-mortem: Google Directory API slowness impacting Passwd production

Date: Friday, August 8, 2025
Impact: Elevated latency and 504 errors for some customers in specific regions
Duration: ~2 days (resolved without intervention over the weekend)

Summary

On Friday morning, we observed a spike in 504 Gateway Timeout errors affecting a subset of Passwd customers. Affected customers were concentrated in certain regions, primarily the Americas, while European customers appeared unaffected. For affected customers, listing secrets slowed from a few seconds to several minutes, and the slowest requests hit the default Cloud Run timeout of 5 minutes and were returned as 504 errors.

The root cause was traced to unusually slow responses from the Google Directory API, which Passwd calls during the list secrets operation to verify group membership and determine access rights. The slowdown appeared isolated to specific Google infrastructure regions.

Impact

Affected customers: Subset of customers in certain regions, especially in the Americas.

Symptoms: Listing secrets took minutes instead of seconds. Many requests were terminated by Cloud Run after 5 minutes with 504 errors.

Duration: The issue persisted from Friday morning until Sunday, when it resolved without changes on our side.

Root Cause: Passwd verifies a user’s right to view secrets by checking group membership via the Google Directory API. During the incident, requests to this API became abnormally slow for some customers, likely due to an upstream Google issue such as degraded service or silent rate limiting.

Because this API call is in the critical path for listing secrets, the slowdown cascaded into overall poor performance for affected customers.

Detection: We detected the incident via elevated error rate monitoring and latency metrics. The spike in 504 errors for specific customers triggered investigation.

Response

Investigation

  • Confirmed slowness was limited to certain regions and not reproducible in European instances.
  • Narrowed down delays to Google Directory API calls.
  • Could not determine whether the underlying cause was a Google outage, a misconfiguration, or rate limiting.
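
Narrowing a delay down to one dependency usually comes down to timing each step of the request separately. The following is a minimal sketch of that idea, not the instrumentation Passwd actually uses; the `timed` helper, the `durations` store, and the step labels are hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical in-memory store: step label -> list of durations in seconds.
durations: Dict[str, List[float]] = {}


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Record how long the wrapped block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped block raises, so failed calls count too.
        durations.setdefault(label, []).append(time.perf_counter() - start)
```

Wrapping the group-membership lookup in `with timed("directory_api"):` and the remaining steps under their own labels would show which step accounts for the latency.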

Mitigation attempts

  • Increased caching of Google Directory API responses.
  • Removed unnecessary API calls in the listing flow.
  • These optimizations had no significant effect on the incident while it was active.
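
For illustration, caching membership lookups can be sketched as a small time-to-live cache. This is a sketch under assumptions, not Passwd's implementation: `MembershipCache`, the `fetch` callable (which stands in for the real Directory API call), and the TTL value are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class MembershipCache:
    """TTL cache for group-membership lookups (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str, str], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # stand-in for the upstream membership check
        self._ttl = ttl_seconds
        self._clock = clock
        # (user, group) -> (time stored, membership result)
        self._entries: Dict[Tuple[str, str], Tuple[float, bool]] = {}

    def is_member(self, user: str, group: str) -> bool:
        key = (user, group)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh entry: no upstream call
        result = self._fetch(user, group)  # miss or expired: call upstream
        self._entries[key] = (now, result)
        return result
```

A cache of this shape only shortens repeat lookups within the TTL; a miss or an expired entry still waits on the upstream call.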

Resolution

  • No direct fix was applied by us.
  • The issue disappeared over the weekend, likely due to a change or fix on Google's side.
    This is visible on the GCP graph that plots request latencies alongside our releases (marked with grey dots).

Timeline

  • Friday AM: Monitoring alerts triggered by elevated latency and 504 errors.
  • Friday: Root cause isolated to Google Directory API slowness for affected regions.
  • Friday–Saturday: Implemented caching and request reduction; no improvement observed.
  • Sunday: Issue resolved without changes on our side, likely due to an upstream Google fix.

Lessons Learned

  • Our dependency on the Google Directory API in the critical path makes us vulnerable to upstream slowdowns.
  • Region-specific incidents can be challenging to diagnose without comprehensive geo-distributed testing.
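
One common way to limit exposure to a slow dependency in the critical path is to give the call its own deadline, so a request fails quickly instead of running until the platform timeout. The sketch below illustrates that pattern only; it is not a fix that was applied during this incident, and `call_with_deadline`, `UpstreamTimeout`, the pool size, and any deadline value are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared pool; the size is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=8)


class UpstreamTimeout(Exception):
    """Raised when the upstream call does not finish within the deadline."""


def call_with_deadline(call: Callable[[], T], deadline_seconds: float) -> T:
    """Run `call` in a worker thread and stop waiting after the deadline."""
    future = _executor.submit(call)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeout:
        # The worker thread keeps running in the background; only the
        # caller stops waiting and can return an error right away.
        raise UpstreamTimeout(
            f"upstream call exceeded {deadline_seconds}s"
        ) from None
```

With a wrapper like this, a slow membership check would surface as a fast, explicit error for that request rather than a 504 after 5 minutes. Whether to deny access or fall back to a previously cached answer on timeout is a security trade-off this sketch leaves open.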