Webflow editor outage
Incident Report for Webflow
Postmortem

What Happened

At approximately 1230 UTC on 24 January 2018, one of our Redis database servers ran out of memory while saving new data to disk. This put Redis into a data protection mode in which it accepted no new write requests, which cascaded into authentication errors for users who tried to open the Editor from the Dashboard or change their password. Webflow engineers were alerted by the Support team, who had received a few complaints from customers about the issue. The engineers quickly diagnosed the problem and added server capacity to the database cluster to resolve it. No data was lost or endangered during the outage, and the service is running normally again.
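The write-rejecting state described above resembles Redis's documented behavior when a background save fails while the `stop-writes-on-bgsave-error` option is enabled; whether that exact setting was involved here is an assumption, not something this report states. As a minimal sketch, a health check could read the output of the Redis `INFO` command and look at the `rdb_last_bgsave_status` field. The function names and the sample text below are hypothetical and for illustration only.

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the text returned by the Redis INFO command into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def last_save_failed(info: dict) -> bool:
    """Return True if the most recent background save reported an error.

    A missing field is treated as "ok" (an assumption made for this sketch).
    """
    return info.get("rdb_last_bgsave_status", "ok") != "ok"


# Hypothetical INFO excerpt, not taken from the incident.
sample = "# Persistence\r\nrdb_last_bgsave_status:err\r\n"
if last_save_failed(parse_redis_info(sample)):
    print("ALERT: last Redis background save failed; writes may be rejected")
```

In practice the raw text would come from a Redis client's `INFO` call, and the alert would go to a paging system rather than standard output.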

What Went Wrong

Several engineering failures allowed this outage to happen:

  • Our monitoring of application function (from the customer's perspective) is insufficient. Because of this, we did not have any automated notification that customers were experiencing issues with the app.

  • We did not have sufficient monitoring in place to detect the low memory situation. Had we known of the impending problem, we could have easily remedied it without causing an outage.
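The low-memory gap in the second point could be covered by a simple headroom check. The sketch below is illustrative only and is not Webflow's monitoring: the `MemoryReading` type, the `classify` function, and both thresholds are hypothetical. The readings are assumed to come from something like the `used_memory` field of Redis `INFO` and the host's total memory.

```python
from dataclasses import dataclass


@dataclass
class MemoryReading:
    used_bytes: int   # e.g. Redis INFO "used_memory" (assumed source)
    total_bytes: int  # memory available to the server (assumed source)


def headroom(reading: MemoryReading) -> float:
    """Fraction of memory still free, between 0.0 and 1.0."""
    if reading.total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (reading.total_bytes - reading.used_bytes) / reading.total_bytes


def classify(reading: MemoryReading,
             warn_below: float = 0.30,
             page_below: float = 0.15) -> str:
    """Map a reading to 'ok', 'warn', or 'page'.

    The thresholds are placeholders; a server that forks to save data to
    disk may need considerably more free memory than these defaults imply.
    """
    free = headroom(reading)
    if free < page_below:
        return "page"
    if free < warn_below:
        return "warn"
    return "ok"


print(classify(MemoryReading(used_bytes=80, total_bytes=100)))
```

Using two levels lets a warning reach engineers while there is still time to add capacity, which is the outcome this point says was missed.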

Next Steps

We're taking the following steps to prevent this issue from happening in the future:

  • We are improving our monitoring system so that we have proper fault detection around all services that are critical to the app's function.

  • We're also going to improve our server-level monitoring to ensure that we can detect and remedy resource exhaustion issues before they cause service outages.

We sincerely apologize for the downtime and inconvenience this issue caused for those affected. We're working hard on making your Webflow experience fast and reliable. Thank you for your continued support!

Posted Jan 25, 2018 - 04:36 UTC

Resolved
At approximately 1230 UTC on 24 January 2018, Webflow experienced a database service outage that caused users to experience authentication errors when accessing various functions of the app. Webflow engineers quickly traced the outage to a memory exhaustion issue on a database server and had a fix in place by 1330 UTC. All systems have been functioning normally since 1330 UTC and no data was lost. We sincerely apologize for any issues you may have experienced.

Postmortem to follow.
Posted Jan 25, 2018 - 03:26 UTC