CMS database degraded performance

Incident Report for Webflow

Postmortem

WHAT HAPPENED

Around 7:10 AM PST, we were notified of a failing health check via PagerDuty. We discovered that a majority of non-cached CMS pageviews had begun timing out. After investigation, we determined that the problem was due to extreme database load. At this point we began moving our database to a larger instance size while simultaneously trying to determine the underlying cause.

Between 8:30 AM and 11:00 AM PST, most pageviews were no longer timing out, but they were significantly slower than usual.

At 11:00 AM PST, we confirmed that our diagnosis of the underlying problem was correct and began putting a more permanent solution in place. A bug in our Site Search indexing logic was causing multiple indexing jobs to run at the same time, which produced significantly more database load than we were prepared for.

IMPACT OF OUTAGE

Hosted sites making use of CMS data were impacted. However, we aggressively cache content with Fastly, and roughly 90% of all hosted pageviews are served from the cache. Whenever a site is published, we clear the cache for that site. This means that if a site was not re-published during this time period, it likely was not impacted.
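To illustrate why sites that were not re-published were largely shielded, here is a minimal in-memory sketch of a per-site page cache that is cleared on publish. It is a toy model under stated assumptions, not Fastly's API or our production code; the names `SiteCache`, `get`, `publish`, and `render` are all hypothetical.

```python
class SiteCache:
    """Toy per-site page cache (hypothetical; not Fastly's API)."""

    def __init__(self, render):
        # `render` stands in for an expensive, database-backed page render.
        self._render = render
        self._pages = {}  # (site_id, path) -> rendered HTML

    def get(self, site_id, path):
        """Return a cached page, rendering it only on a cache miss."""
        key = (site_id, path)
        if key not in self._pages:
            self._pages[key] = self._render(site_id, path)
        return self._pages[key]

    def publish(self, site_id):
        """Clear every cached page for one site, leaving other sites alone."""
        for key in [k for k in self._pages if k[0] == site_id]:
            del self._pages[key]
```

In this model, a site that is never published during the incident keeps serving its already-cached pages and does not touch the database, while a freshly published site must re-render each page on its next request.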

NEXT STEPS

To prevent these issues from happening again, we have taken a number of steps. We put a fix in place to prevent multiple Site Search indexing jobs from running at the same time. We are running our database cluster on larger instances that can handle significantly more load. Additionally, we have upgraded to a newer version of our database software, which lets us create more performant indexes.
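As a rough illustration of the first step, the sketch below shows one common way to keep indexing jobs from overlapping: a guard that refuses to start a job while another one is still running. This is not our actual fix. It assumes jobs are keyed per site and run in a single process; `IndexJobGuard`, `run_index_job`, and `index_fn` are hypothetical names, and a multi-worker deployment would need a shared lock instead (for example, one held in the database).

```python
import threading


class IndexJobGuard:
    """Allows at most one indexing job per site at a time (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()  # site ids with a job in progress

    def try_start(self, site_id):
        """Claim the site; return False if a job is already running for it."""
        with self._lock:
            if site_id in self._running:
                return False
            self._running.add(site_id)
            return True

    def finish(self, site_id):
        """Release the site so a later job can run."""
        with self._lock:
            self._running.discard(site_id)


def run_index_job(guard, site_id, index_fn):
    """Run index_fn(site_id) unless a job for that site is in progress."""
    if not guard.try_start(site_id):
        return False  # skipped: a duplicate job would only add database load
    try:
        index_fn(site_id)
        return True
    finally:
        guard.finish(site_id)  # always release, even if indexing raises
```

Skipping the duplicate rather than queueing it is a design choice for this sketch: the job already in progress is assumed to pick up the latest content, so a second run would add load without adding value.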

Posted Mar 20, 2018 - 22:14 UTC

Resolved

As of 6:59 PM UTC, we have not received any additional reports or error logs of CMS content failing to render on published sites, and we now consider this issue fully resolved. We'll be publishing a detailed postmortem soon, which will include the steps we are taking to prevent similar issues in the future.

Posted Mar 07, 2018 - 21:42 UTC

Monitoring

Our engineers have identified and resolved the database performance issue, and we're now monitoring our infrastructure to ensure that all CMS pages continue rendering correctly. We will publish a more detailed postmortem once we've fully confirmed that this issue is resolved.

Posted Mar 07, 2018 - 19:07 UTC

Investigating

We're currently investigating a database performance issue that is causing problems reading from and writing to our CMS database. Some uncached page renders on Webflow-hosted sites that contain CMS content may time out and return a 514 error code.

Posted Mar 07, 2018 - 15:25 UTC