On Monday, April 27th, we had a production incident that didn't fit our normal resolution path: two network incidents, separated by approximately 55 minutes, caused periods of degraded performance for our application. At the time, the two appeared to be unrelated.
The first incident presented at 11:59 PDT, when our on-call engineer was paged with an Elasticsearch health issue. Shortly after that initial page, we verified that this was not a false alarm and notified our support team that we had an outage. The dashboard was inaccessible, returning timeouts and 502 responses, and we saw a spike in 504 errors from our GraphQL server. As we worked through triage, we found that a significant number of outbound network requests were timing out and failing, alongside abnormally high CPU usage. Then, before we could intervene, our systems recovered on their own. At 12:14 PDT, with all pods and services healthy again, we changed the incident status to "monitoring" and kept digging through our monitoring and logging tools for the source of the issue.
Approximately 55 minutes later, at 13:36 PDT, one of our engineers noticed increased delays in some of our systems, and we declared a second incident. At this point, we were fairly sure the two incidents were related. Suspecting a memory leak, we rolled the Kubernetes pods on our production cluster, but saw no improvement. Again, a significant number of outbound network requests were failing and CPU utilization was high. At 14:04 PDT, while updating the scaling limits on our cluster, one of our engineers noticed an abnormally high count of network requests coming from one of our internal systems. We quickly isolated the source of the issue and saw all systems gradually recover. By 14:28 PDT, we had deployed a fix for the overly verbose component, and at 14:45 PDT we declared the incident resolved.
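For readers unfamiliar with the "roll the pods" step above, it amounts to a rolling restart of a deployment's pods. A minimal sketch of the commands involved follows; the deployment and namespace names here are placeholders, not our actual services:

```shell
# Trigger a rolling restart: pods are replaced one by one, so the
# service stays available while every pod gets a fresh process.
# "api-server" and "production" are placeholder names for illustration.
kubectl rollout restart deployment/api-server -n production

# Block until the rollout completes and all new replicas report healthy.
kubectl rollout status deployment/api-server -n production
```

A rolling restart clears any per-process state (such as a leaked heap), which is why it is a reasonable first response to a suspected memory leak, and also why it told us nothing here: the problem was external traffic, not pod state.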
We learned a few things during the course of this incident. First, we had a gap in our monitoring and had to rely on secondary effects to narrow down the source of the problem. As a matter of priority, we will work to close those gaps in the coming days. Second, our communication, both to customers and internally, needs to improve. We regularly conduct outage drills to practice recovery scenarios, and we'll add a level of realism by practicing how we communicate as well. Additionally, we are going to spend more time formalizing our incident command guidelines so that everyone knows what to do and how to help during an incident.
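As a hypothetical illustration of the kind of gap we intend to close, an alert on per-pod network egress would have pointed us at the chatty component directly instead of via secondary effects. A Prometheus-style rule might look like the sketch below; the metric is a standard cAdvisor metric, but the thresholds, group name, and labels are assumptions for illustration, not our actual configuration:

```yaml
# Hypothetical alerting rule: page when a pod's short-term outbound
# traffic rate far exceeds its own recent baseline.
groups:
  - name: network-egress
    rules:
      - alert: AbnormalPodEgress
        # Fire when the 5-minute egress rate is more than 10x the
        # 1-hour average for the same pod. Multipliers are examples.
        expr: |
          sum by (pod) (rate(container_network_transmit_bytes_total[5m]))
            > 10 * sum by (pod) (rate(container_network_transmit_bytes_total[1h]))
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Pod {{ $labels.pod }} egress is >10x its hourly baseline"
```

Comparing a pod against its own baseline, rather than a fixed byte threshold, keeps the alert meaningful across services with very different normal traffic levels.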
We apologize for any inconvenience caused by this outage. The primary responsibility of our engineering team is to provide a product and technology experience our customers can rely on, and we will use this outage as a learning opportunity.