March 10 incident post mortem
- Buster Styren
This week we experienced a major outage that affected the service of all our users.
Firstly, we want to apologize profusely for the impact this outage may have caused. We’re a small provider with a small but growing team at an early stage. We pride ourselves in offering a competitive service and being reliable is therefore critical to establishing trust with you as an end user.
This post-mortem details how we were made aware of the outage, the steps we took to solve it and the mitigations and learnings that came out of it. We have attached a detailed timeline of all events and the steps we took to resolve the issue. All timestamps are in CET.
From 12:19 the 10th of March until 05:13 the 11th of March there was an outage of our entire storage cluster, which prohibited any read or writes to nodes and PVs. This was caused by fragmentation of multiple drives that prevented storage nodes from allocating additional 64K memory blocks for metadata, despite affected drive having more than 115GiB of capacity.
The fragmentation and the low amount of drives in the cluster caused the allocation failure to bring the cluster into an inoperable state. We have doubled the amount of drives and we are adding monitoring to track the level of fragmentation.
On 9th March we got alerted that our storage cluster was over 85% capacity. In response, we planned to increase the capacity the next day.
At 12:19 on 10th March, we added a first additional volume to the storage cluster. At this point the storage cluster was at 86% capacity.
Shortly after the first volume was added, two-thirds of our storage nodes went down.
At 12:36 the storage cluster was completely unresponsive to any requests.
We had two initial hypothesis at this point:
- A networking failure was prohibiting hosts from communicating
- The new node caused a rebalancing that moved objects onto already full storage nodes
A networking issue was quickly ruled out as the BGP peering was active and root servers between the affected nodes were reachable over the network used by the storage cluster.
The logs for the storage managers turned up dry, only reporting on the degraded storage pools.
Next we looked at the status for the affected storage nodes. They were all in a restart crash loop. The stack trace indicated an allocation failure for BlueStore, which manages the write-ahead log and the metadata store.
We had now identified the problem to be the inability to allocate more memory for metadata. We had to find a way to either free up space on the storage node, move objects in order to defragment it or find a way to start the node while allocating less memory for metadata.
A storage node is a normal block device, therefore the metadata database uses a free-list to figure out where it can allocate space on the device. If it requests too big of an allocation it may fail if there are no empty spaces to put it. Our goal was to see if:
- we could reduce the amount it allocated
- start it without any allocation being made at all or,
- whether we could defragment the list to introduce additional blocks for allocation.
Our first attempt was to change the type of memory allocator.
At 13:12 our first attempt was to change the allocation algorithm. We hoped that we could start the nodes and quickly reshuffle storage over to the new node while removing superfluous or redundant objects.
However changing the allocator didn’t have any impact, the nodes were completely jammed and neither defragmentation nor removing objects were possible due to the lack of allocatable memory.
Next, we looked into using Bluestores tooling for repairing and defragmenting the metadata database. We had to be careful to not do any destructive operations that might jeopardise or corrupt user data.
At 14:22 we called our hosting provider to have additional disks installed on the affected hosts with the goal of copying the complete node over to the new disk and expand it. Our hosting provider required 2 hours to get the new disks installed.
The nodes have a ceiling on the amount of memory that can be written to it before it fails, this is by default set to 95% and is the value that we use as well. At 15:48 we tried to increase the value to 97% in order to try to create a wedge that would get it to start. It didn’t work as the limitation wasn’t imposed by software but rather by the allocatable memory on the device.
At 16:02 we had access to new disks on the affected hosts. We cloned the failing nodes onto the new disks, extend the drive and finally expand the metadata database.
Since the storage nodes are placed on logical volumes we had a problem: cloning the devices would mean two identical logical partitions which would confuse the OS. We would have to tread carefully in order for that confusion to not corrupt the disks or even impact the ability for the servers to boot normally.
At 18:22 we had defined a migration plan set and we booted the servers into rescue mode in order to configure the LVM partitions.
At 18:45 we restarted the affected servers with the new storage nodes configured. At this point the cluster started properly.
At 19:10 the storage cluster initialised a global recovery event in order to reshuffle and backfill the storage nodes with the missing data. However, at 19:15 the third and previously unaffected host had a node put in the same crash loop due to fragmentation.
Now, this should normally be okay if there’s just a single node that is down. But the recovery only succeeded for one of the nodes, which meant that any objects that were only replicated on the node that failed to recover, and the now crashed node, were unavailable. The cluster could not resume operations until these objects were made available. The cluster was once again put out of service.
The recovery event stalled at circa 20:23.
At this point it was not entirely clear why the recovery had stopped and whether or not there was any objects missing from the active storage nodes. We had to figure out both why it stopped, if there was any missing data and if there was a way for us to jump start the recovery again.
At 21:12 we wanted to verify that the recovery wasn’t stalled due to any IO bottleneck. Therefore we stopped any running VMs and other processes that used the storage cluster, however this did not impact the recovery of the degraded, missing or misplaced objects.
We were able to conclusively identify that 28% of all objects in the clusters were missing at 21:59. This meant that these objects were not stuck on the failing node.
At 23:05 we were sure that we now had objects stuck in the failing node without any redundancy at all. We decided to look into how we could backup this volume so that any mistakes during recovery would not lead to data loss.
However, it would take too long to transfer the data securely to another host. So at 23:18 we called our hosting provider to ask them to install additional disks in order to perform the same recovery operation as we did for the previously broken nodes. We were told we had to wait up to two more hours in order to get them installed.
At 00:48 we had received the additional disk and proceeded with cloning the node volume onto it and extending it in order to be able to access it without running into any allocation issues.
At 01:36 we had identified all objects that were currently unavailable and at risk of being lost. We backed up all missing objects on the root filesystem.
At 02:25 we attempted to move the extracted objects and import them on one of the healthy nodes. However after moving all objects the import tool incorrectly reported the objects as already present.
At this point we were confident that even though we couldn’t import the objects that they were still properly exported, therefore we could perform the same recovery procedure as the previous time around, as even if we encountered a failure we could still hopefully restore the missing objects later on.
At 03:19 we booted the host in rescue mode. Just like the previous time we cloned the disk and configured the LVM metadata in order for the storage cluster to properly identify it.
At 04:18 the cluster was finally reporting itself as healthy.
At 05:13, after seeing that nodes started to boot successfully and that there was a low risk of data loss, we sent an email to affected users to notify that the core issue had been resolved.
The adding of a new drive created new backfill jobs, this led to the existing storage nodes allocating additional memory to handle the metadata necessary for the backfill to complete.
Despite the drives having more than 100GiB of storage left they were so heavily fragmented that the allocator wasn’t able to allocate the 64K block it needed, leading to a crash loop.
debug 2023-03-10T12:15:42.412+0000 7ff3e72b0080 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type bitmap, capacity 0xdf90000000, block size 0x1000, free 0x1c9e9c0000, fragmentation 1, allocated 0x30000
Due to the small size of the storage cluster the fragmentation affected multiple nodes at the same time which led to an outage of the full storage cluster. Since nodes and PVs both rely on the storage cluster it caused all IO operations on Kubernetes nodes and control planes to fail.
When all nodes and control planes were brought up simultaneously it caused some clusters to not boot properly. The reason is that the control planes connect to CoreDNS to resolve service names. However, when CoreDNS was offline its IP was still routable, this resulted in DNS lookups taking very long time before timing out. As a result the cluster couldn’t bootstrap properly.
Also, some nodes that were brought up failed to attach their persistent volumes after the service was brought back up and had to be manually recycled.
In hindsight this problem would have been avoided if there were more nodes in the storage clusters. We have doubled the amount of disks since the incident.
We have added monitoring to automatically track the fragmentation of our storage nodes and alert if the fragmentation crosses a threshold.
We have changed our alerting to trigger once the capacity reaches 80%, as opposed to the previous threshold of 75%.
We are also lowering the timeout of DNS lookups against CoreDNS to ensure that the control planes can quickly default to a working nameserver in case of a cold boot.
This incident could have been avoided if the storage cluster was larger, in which case two nodes failing wouldn’t have caused this global outage. This is a failure on our end. We should’ve made sure the cluster was big enough to have the durability necessary for business critical workloads, even if the overall usage would’ve been low.
We’ve set ourselves out to offer a simpler, faster and cheaper k8s service. And in order to meet that goal we under-provisioned volumes in order to keep costs and prices for users down. We should have evaluated carefully whether the small number of physical volumes could have an impact on the durability of the storage cluster.
Also, we never access root servers directly under normal circumstances. Instead we rely on IaC and CaC to set-up servers and deploy a server daemon to provision and manage VMs, routing and other configuration. In this case we had to “break the glass” and access the servers directly in order to solve the issue. As a result we had to be extremely careful in order to avoid any potential data loss as we were dealing with a very sensitive part of our stack, therefore the incident took a long time to properly resolve.