Redis failover

Incident Report for Webflow

Postmortem

For approximately 30 minutes on 19 Feb 2020, the Webflow dashboard and designer were unavailable. The incident started when the primary node in our Redis cluster failed and a failover occurred. Our main webapp was configured with an incorrect Redis URL that pointed directly at the now-failed node, so once that node was removed, the Kubernetes pods for the webapp crashed repeatedly.

We first noticed this when one of our engineers saw a spike in bug reports via BugSnag. After a few seconds to verify the issue, he sounded the alarm and we began responding to the incident. Within 3 minutes of sounding the alarm, the team was on a Zoom call and had started the incident response process. About 6 minutes later, we began to suspect a Redis issue. We ultimately found the misconfigured Redis URL, updated the configuration for the Kubernetes pods, and restarted them. At this point, the dashboard and designer responded normally and we considered the incident over.

The API and all hosted sites were unaffected during this incident as they were using the correct Redis host URL.

The troubleshooting process revealed two primary causes of the outage:

The URL for the webapp's Redis host pointed at a single node rather than the cluster load balancer.
We were not catching errors raised outside of the request/response cycle, in the context of the long-running Redis connection.
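
To illustrate the second cause, here is a minimal sketch that assumes a Node.js-style event emitter. The report does not say which runtime or Redis client was involved, so `FakeRedisConnection` and `survivesFailover` are hypothetical names used only for illustration. In Node.js, an "error" event emitted on an emitter with no "error" listener is thrown as an exception. When that happens on a long-lived connection, no request handler is on the stack to catch it, and the process can exit.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for a long-lived Redis client. This class is an
// illustration only and is not any real client library's API.
class FakeRedisConnection extends EventEmitter {
  // Simulate the connection dropping when the primary node is removed.
  simulateFailover(): void {
    this.emit("error", new Error("connection lost: primary node removed"));
  }
}

// Returns true when the simulated failover is absorbed by a listener, and
// false when the error escapes as an exception. In a real process nothing
// would be wrapping the emit call, so the exception would go uncaught.
function survivesFailover(attachHandler: boolean): boolean {
  const conn = new FakeRedisConnection();
  if (attachHandler) {
    conn.on("error", (err: Error) => {
      // Degrade gracefully: log the failure and keep the process alive.
      console.warn(`redis unavailable, degrading: ${err.message}`);
    });
  }
  try {
    conn.simulateFailover();
    return true;
  } catch {
    // With no "error" listener, emit() throws the error itself.
    return false;
  }
}
```

Calling `survivesFailover(false)` reproduces the crash path in miniature, while `survivesFailover(true)` shows the same event handled by a listener.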

Since this incident, we have taken a few steps. First, we've audited all the config values for high-availability services to ensure those URLs point to the proper place. Second, we quickly opened and merged a pull request to wrap the Redis connection in an event handler and fail gracefully. As a final step, we are planning an exercise in our staging environment to replicate the event and positively verify that the app, API, and hosted sites are not affected by a Redis failure in the future.
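
A config audit of this kind could be partly automated. The sketch below is one hypothetical approach, not the process we actually used: it flags any config value whose URL host matches a known individual node rather than a load balancer. The function name, config keys, and hostnames are all invented for illustration.

```typescript
// Hypothetical audit helper. All names and hostnames here are assumptions.
interface AuditFinding {
  key: string;
  host: string;
}

// Returns every config entry whose URL host is a known single node.
function findSingleNodeUrls(
  config: Record<string, string>,
  knownNodeHosts: Set<string>,
): AuditFinding[] {
  const findings: AuditFinding[] = [];
  for (const [key, value] of Object.entries(config)) {
    let host: string;
    try {
      host = new URL(value).hostname;
    } catch {
      continue; // Value is not a parseable URL, so skip it.
    }
    if (knownNodeHosts.has(host)) {
      findings.push({ key, host });
    }
  }
  return findings;
}

// Example with invented values: only the webapp entry targets a single node.
const findings = findSingleNodeUrls(
  {
    WEBAPP_REDIS_URL: "redis://redis-node-1.internal:6379",
    API_REDIS_URL: "redis://redis-lb.internal:6379",
  },
  new Set(["redis-node-1.internal", "redis-node-2.internal"]),
);
```

A check like this only catches hosts that are already listed as nodes, so it complements rather than replaces a failover exercise in staging.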

Posted Feb 26, 2020 - 05:44 UTC

Resolved

The dashboard and designer are unavailable. After login, a 502 error is displayed. App is down.

Posted Feb 19, 2020 - 17:46 UTC