FireServiceRota and Brandweerrooster are down
Incident Report for FireServiceRota
Postmortem

This is a postmortem about the two separate outages of Brandweerrooster (our Dutch system) and FireServiceRota on 1/8/2022 and 2/8/2022.

Brandweerrooster was offline for 1h38 on 1/8/2022 and 0h58 on 2/8/2022.

FireServiceRota was offline for 1h15 on 1/8/2022 and 0h58 on 2/8/2022.

We know this caused our users significant problems and stress, as it did for our team. Our sincere apologies for this. We have planned several mitigating actions (see below) to prevent this from happening again.

What happened?

Our team was automatically alerted on both days as soon as availability was lost, and we promptly declared an incident on this status page. After going through our emergency checklists, we discovered the issue was with the London datacenter of our provider. All our systems located in that datacenter were inaccessible.

We notified our provider immediately and, while waiting for updates, got our checklists ready for failover to our standby location in Frankfurt. Unfortunately, while we have practiced this process many times, we encountered several difficulties in this specific situation:

  1. We access individual servers through a so-called bastion server, which gives us extra security (multi-factor authentication), auditing, and access control. However, this server is located in London, so we weren’t able to use it. As a workaround, our engineers used direct SSH access.
  2. We have tooling in place to automatically update our firewall with the IP addresses of our fleet. This uses the API of our provider, which was producing errors due to the failures in London.
  3. Our team was hesitant to fail over to Frankfurt, because of the significant cleanup involved in moving back to our primary location. Because of this, we waited too long to initiate failover.
  4. Only on 2/8/2022: our Frankfurt failover location was not accessible. This was partly due to human error on our side when updating the firewall settings during a cleanup on 1/8/2022, and partly due to network issues with our provider.

For a technical description of the problem, please see the postmortem published by our provider.

What will we do to prevent this?

Every time we experience a significant outage, we do our best to learn from this and improve. We have identified the following mitigating actions:

  • We are setting up a failover location with a different provider in a different datacenter. In addition to this, we’ll ensure that failover can take place with less fear of the cleanup activities at a later stage. This means that we’ll be able to initiate failover much quicker, and return to our primary location with minimal loss of availability or data.
  • We are currently in serious conversations with the top-level management of our provider. They know the criticality of our business, and acknowledge that their level of service was far below acceptable. We are looking for assurances that this type of issue never happens again, and will convey the difficulties experienced by you, our users.
  • To mitigate item 1 above, we will ensure that all our engineers have direct access to servers in case of an outage of our bastion server.
  • To mitigate item 2 above, we are making this tooling resilient to this type of failure.
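
As an illustration of the last mitigation, one common way to make this kind of tooling tolerate provider API errors is to retry with backoff and then fall back to the last list of addresses that was fetched successfully. The sketch below is not our actual tooling: the function name `fetch_fleet_ips`, the `fetch` callable standing in for the provider API call, and the cache file are all hypothetical and only show the general idea.

```python
import json
import time
from pathlib import Path
from typing import Callable, List


def fetch_fleet_ips(
    fetch: Callable[[], List[str]],   # hypothetical wrapper around the provider API
    cache_path: Path,                 # hypothetical location of the last good list
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return fleet IPs from the API, or the last cached list if the API keeps failing."""
    for attempt in range(attempts):
        try:
            ips = sorted(set(fetch()))
            if not ips:
                # Treat an empty answer as a failure so the firewall is never emptied.
                raise ValueError("provider API returned an empty fleet list")
            cache_path.write_text(json.dumps(ips))  # remember the last good answer
            return ips
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    if cache_path.exists():
        # The API is still failing: keep the firewall on the last known fleet.
        return json.loads(cache_path.read_text())
    raise RuntimeError("provider API unavailable and no cached fleet list")
```

With a fallback like this, a firewall update during a provider outage would reuse the last known addresses instead of failing outright. The trade-off is that the cached list can be stale, so servers added since the last successful fetch would still have to be allowed manually.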

For more information, please contact info@fireservicerota.com. We’re happy to answer any questions or doubts you may have.

Posted Aug 03, 2022 - 17:05 UTC

Resolved
This incident has been resolved.
Posted Aug 02, 2022 - 22:40 UTC
Monitoring
Linode's data centres are back online. We will continue monitoring the situation.
Posted Aug 02, 2022 - 22:28 UTC
Update
Our technical team is attempting to perform the fail-over to the Frankfurt data centre, but it is currently inaccessible too. We will post updates as soon as we hear more from Linode.
Posted Aug 02, 2022 - 21:50 UTC
Identified
The issue has been acknowledged by Linode and is under investigation: https://status.linode.com/incidents/8qnvzrfz7hg2
Posted Aug 02, 2022 - 21:42 UTC
Investigating
We are currently investigating the issue. Linode Data Centres in London and Frankfurt are unreachable.
Posted Aug 02, 2022 - 21:35 UTC
This incident affected: FireServiceRota (FireServiceRota Primary Systems, FireServiceRota Standby Systems) and Brandweerrooster (Netherlands) (Brandweerrooster Primary Systems).