Schedule Caching Issue

Incident Report for FireServiceRota

Postmortem

Schedule Inconsistency

Date: March 30, 2026

Duration: approximately 7 hours

Status: Resolved

Impact: Schedules were either displayed incorrectly or failed to load. While the System of Record (Database) remained accurate and intact, the Presentation Layer (Cache) served stale or corrupted data.

Executive Summary

Users experienced issues viewing or updating shift schedules. The underlying cause was a disk-space failure in our caching servers, which prevented the cache from updating or persisting data correctly. We restored service by deploying a fresh cache instance in our Kubernetes cluster and have implemented long-term monitoring to prevent recurrence.

The Timeline

  • 01:00 UTC (approx.): Cache disk volume reached 100% capacity, causing intermittent write failures.
  • 07:50 UTC: Monitoring alerts triggered and the first user tickets came in.
  • 08:01 UTC: DevOps team alerted.
  • 08:07 UTC: Decision made to bypass the stalled instance and deploy a clean cache installation via Kubernetes.
  • 08:10 UTC: Cache repopulated; schedule accuracy verified across all stations.

Technical Root Cause

We recently migrated our infrastructure to new EU-owned providers. As with any major change, the new environment has presented new challenges. Most of these were addressed before they could affect users; this one, however, resulted in degraded service.

In more detail, our caching servers use disk space for AOF (Append Only File) persistence and for snapshots. Once the disk reached capacity, the cache could no longer log new write operations. The cache then fell out of sync with the primary database, which caused incorrect schedules to be displayed for crews.
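Assuming the cache is Redis (the AOF terminology above suggests it), this failure mode is visible directly in the cache's own persistence status fields before user-facing symptoms appear. The sketch below is illustrative, not our production code: the field names are real Redis INFO fields, but the health-check function itself is hypothetical.

```python
# Hedged sketch: detect the incident's failure mode from a Redis-style
# INFO "persistence" snapshot. When a full disk makes AOF appends fail,
# Redis reports aof_last_write_status != "ok".

def persistence_healthy(info: dict) -> bool:
    """Return True when the cache can still persist writes to disk."""
    return (
        info.get("aof_last_write_status", "ok") == "ok"
        and info.get("rdb_last_bgsave_status", "ok") == "ok"
    )

# Example snapshots, e.g. from redis.Redis().info("persistence"):
healthy = {"aof_last_write_status": "ok", "rdb_last_bgsave_status": "ok"}
disk_full = {"aof_last_write_status": "err", "rdb_last_bgsave_status": "err"}

print(persistence_healthy(healthy))    # True
print(persistence_healthy(disk_full))  # False
```

A check like this, polled on a schedule, turns a silent persistence failure into an actionable signal.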

Corrective Actions

1. Immediate Mitigation (Completed)

  • Kubernetes Redeployment: We provisioned a fresh caching cluster within our K8s environment to bypass the corrupted storage volume and restore immediate service.
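Deploying an empty cache was safe because the database remained the system of record. The sketch below (hypothetical names, not our production code) shows the cache-aside read path that makes this work: every cache miss falls back to the database and rewrites the entry, so a fresh cache repopulates itself with correct data.

```python
# Hedged sketch of a cache-aside read path: misses are served from the
# database (the system of record) and the result is written back to the
# cache, so an empty cache converges to correct data on its own.

class ScheduleService:
    def __init__(self, db: dict, cache: dict):
        self.db = db        # system of record (always accurate)
        self.cache = cache  # presentation layer (may be empty or stale)

    def get_schedule(self, station: str):
        hit = self.cache.get(station)
        if hit is not None:
            return hit
        value = self.db[station]     # authoritative read on a miss
        self.cache[station] = value  # repopulate the fresh cache
        return value

db = {"station-1": ["A-shift", "B-shift"]}
svc = ScheduleService(db, cache={})   # freshly deployed, empty cache
print(svc.get_schedule("station-1"))  # miss -> served from DB, cached
print(svc.cache)                      # entry now repopulated
```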

2. Root Cause Prevention (Completed)

While we were already monitoring many cache-related metrics automatically to ensure stable operation, disk usage was not yet among them. To close this gap:

  • Storage Expansion: We have increased the provisioned IOPS and disk size for all caching layers to provide a 3x buffer over current peak usage.
  • Advanced Monitoring: We implemented granular disk-usage tracking.
  • Automated Alerting: We have configured a "Critical" alert in our 24/7 Alerting system that triggers when disk usage hits 75%, allowing the DevOps team to intervene hours before a failure occurs.
Posted Mar 31, 2026 - 22:20 UTC

Resolved

Users experienced issues viewing or updating shift schedules. The underlying cause was a disk-space failure in our caching servers, which prevented the cache from updating or persisting data correctly. We restored service by deploying a fresh cache instance in our Kubernetes cluster and have implemented long-term monitoring to prevent recurrence.
Posted Mar 30, 2026 - 05:00 UTC