On Monday, April 27th, we had a production incident that didn't fit our normal resolution path: two network incidents, separated by approximately 55 minutes, caused periods of degraded performance for our application. At the time, the two appeared to be unrelated.
The first incident presented at 11:59 PDT, when our on-call engineer was paged with an Elasticsearch health issue. Shortly after that initial page, we verified that this was not a false alarm and notified our support team that we had an outage. The dashboard was inaccessible, returning timeouts and 502 responses, and we saw a spike in 504 errors from our GraphQL server. As we worked through triage, we found that a significant number of outbound network requests were timing out and failing, alongside abnormally high CPU usage. Then, before we could intervene, our systems recovered on their own. At 12:14 PDT, with all pods and services healthy again, we changed the incident status to "monitoring" and kept digging through our monitoring and logging tools for the source of the issue.
Approximately 55 minutes later, at 13:36 PDT, one of our engineers noticed increased delays in some of our systems, and we declared a second incident. At this point, we were fairly sure the two incidents were related. Suspecting a memory leak, we rolled the Kubernetes pods on our production cluster, but saw no improvement. Again, a significant number of outbound network requests were failing and CPU utilization was high. At 14:04 PDT, while updating the scaling limits on our cluster, one of our engineers noticed an abnormally high count of network requests coming from one of our internal systems. We quickly isolated the source of the issue and saw all systems gradually recover. By 14:28 PDT, we had deployed a fix for the overly verbose component, and at 14:45 PDT we declared the incident resolved.
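For readers unfamiliar with the "roll the pods" step above, it amounts to a rolling restart of a deployment's pods. A minimal sketch of the commands involved follows; the deployment and namespace names here are placeholders, not our actual services:

```shell
# Trigger a rolling restart: pods are replaced one by one, so the
# service stays available while every pod gets a fresh process.
# "api-server" and "production" are placeholder names for illustration.
kubectl rollout restart deployment/api-server -n production

# Block until the rollout completes and all new replicas report healthy.
kubectl rollout status deployment/api-server -n production
```

A rolling restart clears any per-process state (such as a leaked heap), which is why it is a reasonable first response to a suspected memory leak, and also why it told us nothing here: the problem was external traffic, not pod state.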
We learned a few things during the course of this incident. First, we had a gap in our monitoring and had to rely on secondary effects to narrow down the source of the problem. As a matter of priority, we will work to close those gaps in the coming days. Second, our communication, both to customers and internally, needs to improve. We regularly conduct outage drills to practice recovery scenarios, and we'll add a level of realism by practicing how we communicate as well. Additionally, we are going to spend more time formalizing our incident command guidelines so that everyone knows what to do and how to help during an incident.
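As a hypothetical illustration of the kind of gap we intend to close, an alert on per-pod network egress would have pointed us at the chatty component directly instead of via secondary effects. A Prometheus-style rule might look like the sketch below; the metric is a standard cAdvisor metric, but the thresholds, group name, and labels are assumptions for illustration, not our actual configuration:

```yaml
# Hypothetical alerting rule: page when a pod's short-term outbound
# traffic rate far exceeds its own recent baseline.
groups:
  - name: network-egress
    rules:
      - alert: AbnormalPodEgress
        # Fire when the 5-minute egress rate is more than 10x the
        # 1-hour average for the same pod. Multipliers are examples.
        expr: |
          sum by (pod) (rate(container_network_transmit_bytes_total[5m]))
            > 10 * sum by (pod) (rate(container_network_transmit_bytes_total[1h]))
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Pod {{ $labels.pod }} egress is >10x its hourly baseline"
```

Comparing a pod against its own baseline, rather than a fixed byte threshold, keeps the alert meaningful across services with very different normal traffic levels.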
We apologize for any inconvenience caused by this outage. The primary responsibility of our engineering team is to provide a product and technology experience our customers can rely on, and we will use this outage as a learning opportunity.