At approximately 12:30 UTC on 24 January 2018, one of our Redis database servers ran out of memory while saving new data to disk. This put Redis into a data protection mode in which it accepted no new write requests, which cascaded into authentication errors for users who tried to open the Editor from the Dashboard or change their password. Webflow engineers were alerted by the Support team, which had received a few complaints from customers about the issue. The engineers quickly diagnosed the problem and added server capacity to the database cluster to resolve it. No data was lost or endangered during the outage, and the service is running normally again.
Two engineering failures allowed this outage to happen:
Our monitoring of application function (from the customer's perspective) is insufficient. Because of this, we did not have any automated notification that customers were experiencing issues with the app.
We did not have sufficient monitoring in place to detect the low-memory condition. Had we known about the impending problem, we could have remedied it easily without an outage.
We're taking the following steps to prevent this issue from happening in the future:
We are improving our monitoring system so that we have proper fault detection around every service that is critical to the app's function.
We're also going to improve our server-level monitoring to ensure that we can detect and remedy resource exhaustion issues before they cause service outages.
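As an illustration of the kind of server-level check described above, here is a minimal sketch in Python. It is not our production monitoring. It assumes a monitoring agent has already fetched the text output of the Redis `INFO` command and knows the host's total memory. The function names, the `warn_ratio` threshold of 0.5, and the alert wording are all hypothetical; `used_memory`, `used_memory_rss`, and `rdb_last_bgsave_status` are standard `INFO` fields.

```python
from typing import Dict, List


def parse_redis_info(raw: str) -> Dict[str, str]:
    """Parse the text returned by the Redis INFO command into a dict."""
    info: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def check_redis_health(
    info: Dict[str, str],
    system_memory_bytes: int,
    warn_ratio: float = 0.5,  # hypothetical threshold, tune per deployment
) -> List[str]:
    """Return a list of alert messages; an empty list means no alerts."""
    alerts: List[str] = []

    # Prefer resident set size, fall back to the allocator's own figure.
    used = int(info.get("used_memory_rss", info.get("used_memory", "0")))
    if system_memory_bytes > 0 and used / system_memory_bytes >= warn_ratio:
        alerts.append(
            f"Redis is using {used} of {system_memory_bytes} bytes; "
            "a background save may not have enough headroom"
        )

    # Redis reports "ok" or "err" for the most recent background save.
    status = info.get("rdb_last_bgsave_status")
    if status is not None and status != "ok":
        alerts.append(f"Last Redis background save failed (status: {status})")

    return alerts
```

A scheduler could run this check every minute and page an engineer whenever the returned list is non-empty, so that a low-memory condition is caught before a save fails.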
We sincerely apologize for the downtime and inconvenience this issue caused for those affected. We're working hard on making your Webflow experience fast and reliable. Thank you for your continued support!