Date: March 30, 2026
Duration: approx. 7 hours
Status: Resolved
Impact: Schedules were either displayed incorrectly or failed to load. While the System of Record (Database) remained accurate and intact, the Presentation Layer (Cache) served stale or corrupted data.
Users experienced issues viewing or updating shift schedules. The underlying cause was that our caching servers ran out of disk space, which prevented the cache from updating or persisting data correctly. We restored service by deploying a fresh cache instance in our Kubernetes cluster and have implemented long-term monitoring to prevent recurrence.
We recently updated our infrastructure and migrated it to new EU-owned providers. As with all major changes, the new infrastructure has presented some new challenges. While most of these were addressed well before they could affect users, this issue in particular degraded the service users experienced.
In more detail, our caching servers use disk space for AOF (Append Only File) persistence and snapshots. Once the disk reached capacity, the cache could no longer log new write operations. This left the cache out of sync with the primary database, a "split-brain"-like state in which crews were shown incorrect schedules.
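To illustrate the kind of check that catches this failure mode early, here is a minimal sketch in Python that measures how full the volume holding the persistence files is and maps it to an alert level. The function names, the 80% and 90% thresholds, and the path are assumptions made for illustration only; they are not the monitoring configuration we actually deployed.

```python
import shutil

# Hypothetical thresholds; this report does not state the values used in production.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.90


def classify_usage(used_fraction, warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Map a used-space fraction (0.0 to 1.0) to an alert level."""
    if used_fraction >= crit:
        return "critical"
    if used_fraction >= warn:
        return "warning"
    return "ok"


def check_volume(path):
    """Return (used_fraction, level) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    used_fraction = usage.used / usage.total
    return used_fraction, classify_usage(used_fraction)


if __name__ == "__main__":
    # "/" stands in for the cache's data directory, which is an assumption here.
    fraction, level = check_volume("/")
    print(f"{fraction:.1%} used: {level}")
```

Alerting at a warning level well below full capacity leaves time to extend the volume or trim persistence files before writes start to fail.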
While we were already automatically monitoring many cache-related metrics to ensure stable operation of the cache, disk usage was not yet one of them. To mitigate this: