Degraded performance and downtime on CMS sites
Incident Report for Webflow

What happened

A denial-of-service (DoS) attack against a Webflow site began around 7pm CET on 2/11. It affected sites that were published during this time, causing 503 "First Byte Errors", which meant our render cluster could not render dynamic (CMS) content and return it to our caching layer.

How we fixed it

We applied fixes to our caching layer so that it no longer passes DoS traffic on to our render cluster. We also added more capacity to our render cluster.
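
As an illustration only, here is a minimal Python sketch of how a caching layer might shield a render cluster from a flood of cache misses. It is not Webflow's actual fix, which this report does not describe. The class `ShieldingCache`, its parameters, and the idea that the bypass relied on cache-busting query strings are all assumptions made for the example.

```python
import time
from urllib.parse import urlsplit


class ShieldingCache:
    """Hypothetical cache front end that caps how many misses reach the origin."""

    def __init__(self, render, max_misses_per_window=5, window_seconds=1.0,
                 clock=time.monotonic):
        self.render = render                  # callable that renders a page (the "render cluster")
        self.max_misses = max_misses_per_window
        self.window = window_seconds
        self.clock = clock                    # injectable so the behavior can be tested
        self.store = {}                       # cache key -> rendered body
        self.window_start = clock()
        self.misses_in_window = 0

    def cache_key(self, url):
        # Assumption: ignore query string and fragment so that cache-busting
        # parameters (?x=1, ?x=2, ...) all map to the same cache entry.
        parts = urlsplit(url)
        return (parts.netloc.lower(), parts.path or "/")

    def get(self, url):
        key = self.cache_key(url)
        if key in self.store:
            return 200, self.store[key]

        # Start a new counting window once the current one has elapsed.
        now = self.clock()
        if now - self.window_start >= self.window:
            self.window_start = now
            self.misses_in_window = 0

        # Over budget: answer from the cache layer without touching the origin.
        if self.misses_in_window >= self.max_misses:
            return 503, "origin shielded"

        self.misses_in_window += 1
        body = self.render(key)
        self.store[key] = body
        return 200, body
```

In this sketch, repeated requests for the same path with different query strings hit the cache after the first render, and a burst of distinct uncached paths is answered with a 503 at the cache once the per-window miss budget is used up. A real deployment would also need cache expiry and would have to keep query parameters that genuinely change the page.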

What went wrong

Certain metrics were not being properly checked. We've amended our infrastructure monitoring and system alerts to properly notify on-call engineers in the future, reducing customer impact and downtime.
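
To make the idea of alerting on sudden load concrete, the following Python sketch flags a sample that far exceeds a rolling baseline. The report does not say which metrics or thresholds were involved, so the class `LoadSpikeDetector`, the requests-per-minute metric, and the `spike_factor` threshold are hypothetical.

```python
from collections import deque


class LoadSpikeDetector:
    """Hypothetical check that flags samples far above a rolling baseline."""

    def __init__(self, baseline_size=5, spike_factor=3.0):
        # Keep only the most recent non-spike samples as the baseline.
        self.history = deque(maxlen=baseline_size)
        self.spike_factor = spike_factor

    def observe(self, requests_per_minute):
        """Return True when the sample exceeds spike_factor times the rolling mean."""
        is_spike = False
        # Only judge samples once the baseline window is full.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = requests_per_minute > self.spike_factor * baseline
        # Spikes are kept out of the baseline so a sustained attack
        # does not become the new "normal".
        if not is_spike:
            self.history.append(requests_per_minute)
        return is_spike
```

A monitoring job could call `observe` once per minute with the render cluster's request count and page the on-call engineer whenever it returns True. Keeping spike samples out of the baseline is a design choice for this sketch: it keeps the alert firing for the whole duration of an attack.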

Posted Feb 23, 2017 - 21:53 UTC

Resolved
We encountered a large number of requests that bypassed our caching layer, which caused downtime and degraded performance on our render cluster. This lasted for about 60 minutes on 2/11, starting around 7pm CET, but only affected sites that were published during this window. We have added protections and more capacity to our render cluster. We have also amended our alerting and monitoring so that our engineers are notified earlier of sudden load.
Posted Feb 23, 2017 - 21:49 UTC