Services Offline

Incident Report for VAULT

Postmortem

Timeline

  • 11:49 AM ET – Initial alert triggered
  • 5:24 PM ET – Authentication services restored
  • 5:58 PM ET – Full service restoration

During the outage window, there were intermittent 2–5 minute periods where authentication briefly recovered as AWS updates were applied. However, most users experienced continuous disruption throughout the event.

Root Cause
The incident was caused by an internal configuration error during routine preparation of new infrastructure. A single AWS flag was incorrectly set, which applied non-production-ready changes to the production environment. Because these changes were propagated across all resources, our regional failover did not activate as expected. 
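The failure mode described above, a single misconfigured flag pushing non-production-ready changes into production, is the kind of error a pre-deployment guard can catch. The sketch below is a minimal illustration, not the actual deployment tooling; all flag names (`environment`, `prerelease_changes`, `propagate_all_resources`) are hypothetical.

```python
# Hypothetical pre-deployment guard: reject configurations that would apply
# non-production-ready changes to production. Flag names are illustrative,
# not the actual AWS settings involved in this incident.

def validate_deploy_config(config: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the config passes."""
    violations = []
    if config.get("environment") == "production":
        if config.get("prerelease_changes", False):
            violations.append("prerelease changes must not target production")
        if config.get("propagate_all_resources", False) and not config.get(
            "change_approved", False
        ):
            violations.append("global propagation requires explicit approval")
    return violations


# A guard like this would block the deploy rather than let the error propagate:
assert validate_deploy_config(
    {"environment": "production", "prerelease_changes": True}
) == ["prerelease changes must not target production"]
```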

Why Resolution Took Time
We identified the root cause within minutes and initiated recovery procedures. However, AWS resource errors prevented restoration from completing successfully. Each recovery attempt required:

  • 20–30 minutes to process before ultimately failing
  • 10–20 minutes for cleanup
  • 5–10 minutes to patch the failure and retry
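The per-attempt figures above imply each failed recovery cycle consumed roughly 35–60 minutes, which is how a handful of attempts can fill most of the outage window. A quick sketch of that arithmetic (the attempt count is illustrative; the report does not state how many cycles were run):

```python
# Rough arithmetic on the recovery loop described above.
ATTEMPT_MIN = 20 + 10 + 5   # best case: process + cleanup + patch (minutes)
ATTEMPT_MAX = 30 + 20 + 10  # worst case

def total_minutes(attempts: int) -> tuple[int, int]:
    """Best- and worst-case time consumed by a given number of failed cycles."""
    return attempts * ATTEMPT_MIN, attempts * ATTEMPT_MAX

print(total_minutes(5))  # → (175, 300), i.e. roughly 3 to 5 hours
```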

This significantly extended the recovery timeline. Following AWS guidance, we fully rebuilt the affected resources and databases, which restored authentication at 5:24 PM ET and allowed full service to be restored shortly after.

Mitigation & Next Steps
We are conducting a full review of both the deployment safeguards and the recovery processes involved in this incident. We have already engaged AWS infrastructure experts to assist with this process.

Our focus is on:

  • Preventing unintended changes from reaching production environments
  • Eliminating single points of failure, while maintaining the ability to rapidly deploy updates when necessary
  • Improving recovery reliability and reducing time to restore service
Posted Apr 23, 2026 - 17:32 UTC

Resolved

The incident was caused by an internal configuration error during routine preparation of new infrastructure. A single AWS flag was incorrectly set, which applied non-production-ready changes to the production environment. Because these changes were propagated across all resources, our regional failover did not activate as expected.
Posted Apr 20, 2026 - 16:00 UTC