Major Outage

Incident Report for FireServiceRota

Postmortem

This is the postmortem of the outage of Wednesday, April 20th, 2022.

What went wrong:

At approximately 13:00 MX / 19:00 UK, FireServiceRota went down. Any attempt to access the system resulted in server errors.
Our hardware infrastructure located in London experienced a failure. Usually our system is able to recover automatically when failures happen, but this time it was unable to do so.
The system was down for approximately 20 minutes and was restored at around 13:21 MX / 19:21 UK.
According to our hosting provider, the issue was a widespread problem on their own platforms. We maintained close contact with them to raise awareness of the issue and to identify mitigating actions on our end.

What actions were taken to mitigate the issue:

We migrated our infrastructure from the faulty servers (hardware) to new ones.
In collaboration with our hosting provider, we were able to identify a lead on the root cause of the failure. To corroborate the root cause, further investigation by our provider was needed.
The root cause was confirmed and reported back to us on 2022-04-21 at 16:59 UK time.
We’ve identified some measures to minimize the impact should another external failure affect us:
- Harden our protocol checklist for acting on major outages. This will allow a larger group of team members to take action faster.
- Upgrade the technology stack of our infrastructure to the latest version.

Posted Apr 22, 2022 - 16:16 UTC

Resolved

This incident has been resolved.

Posted Apr 20, 2022 - 18:28 UTC

Monitoring

A fix has been implemented and we are monitoring the system.

Posted Apr 20, 2022 - 18:22 UTC

Identified

The issue has been identified and we are working on implementing a fix.

Posted Apr 20, 2022 - 18:08 UTC

Investigating

We are currently investigating the issue.

Posted Apr 20, 2022 - 18:03 UTC

This incident affected: FireServiceRota (FireServiceRota Primary Systems).