For approximately 30 minutes on 19 Feb 2020, the Webflow dashboard and designer were unavailable. The incident started when the primary node in our Redis cluster failed and a failover occurred. Our main webapp was configured with an incorrect Redis URL that pointed at that node, which had now failed and been removed, so the Kubernetes pods for the webapp crashed repeatedly.
We first noticed this when one of our engineers saw a spike in error reports in BugSnag. After taking a few seconds to verify the issue, he sounded the alarm and we began responding to the incident. Within 3 minutes of the alarm, the team was on a Zoom call and starting the incident response process. About 6 minutes later, we began to suspect a Redis issue. We ultimately found the misconfigured Redis URL, updated the configuration for the Kubernetes pods, and restarted them. At that point, the dashboard and designer responded normally and we considered the incident over.
The API and all hosted sites were unaffected during this incident as they were using the correct Redis host URL.
The troubleshooting process revealed two primary causes of the outage:
Since this incident, we have taken a few steps. First, we audited all the config values for high-availability services to ensure those URLs point to the proper place. Second, we quickly opened and merged a pull request to wrap the Redis connection in an event handler so that it fails gracefully. As a final step, we are planning an exercise in our staging environment to replicate the event and positively verify that the app, API, and hosted sites are not affected by a Redis failure in the future.
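To illustrate the second step, here is a minimal TypeScript sketch of the general pattern. It is not the actual pull request. `FakeRedisClient`, `RedisHealth`, and `attachGracefulHandlers` are hypothetical names used only for illustration, and the client is a stand-in built on Node's `EventEmitter` rather than a real Redis library. The sketch relies on one documented Node.js behavior: emitting an `"error"` event with no listener attached throws, which in a running service typically crashes the process.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for a Redis client. It only models the fact that
// event-based clients report connection problems through an "error" event.
class FakeRedisClient extends EventEmitter {
  // Simulate the connection dropping, e.g. during a primary failover.
  simulateConnectionFailure(): void {
    this.emit("error", new Error("ECONNREFUSED: primary node unreachable"));
  }
}

// Hypothetical health record the rest of the app could consult before
// relying on Redis-backed features.
interface RedisHealth {
  available: boolean;
  lastError: Error | null;
}

// Attach listeners so a connection error marks Redis as unavailable
// instead of surfacing as an unhandled "error" event.
function attachGracefulHandlers(client: EventEmitter): RedisHealth {
  const health: RedisHealth = { available: true, lastError: null };
  client.on("error", (err: Error) => {
    health.available = false;
    health.lastError = err;
  });
  // Assumed event name for "connection is usable again".
  client.on("ready", () => {
    health.available = true;
  });
  return health;
}

// Demonstrates the failure mode: with no "error" listener, emit() throws.
function crashesWithoutHandler(): boolean {
  const bare = new FakeRedisClient();
  try {
    bare.simulateConnectionFailure();
    return false;
  } catch {
    return true;
  }
}
```

With the handler attached, a simulated failure flips `health.available` to `false` and the process keeps running, so callers can fall back to degraded behavior until the client reports that it is ready again.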