Around 7:10 AM PST, we were first notified of a failing health check via PagerDuty. We discovered that a majority of non-cached CMS pageviews had begun timing out. After investigating, we determined that the problem was due to extreme database load. At this point we began moving our database to a larger instance size while simultaneously trying to determine the underlying cause.
Between 8:30 AM and 11:00 AM PST, most pageviews were no longer timing out, but they were significantly slower than usual.
At 11:00 AM PST, we confirmed that our diagnosis of the underlying problem was correct and began putting a more permanent solution in place. A bug in our Site Search indexing logic was causing multiple indexing jobs to run at the same time, which generated significantly more database load than we were prepared for.
Hosted sites making use of CMS data were impacted. However, we aggressively cache content with Fastly, and roughly 90% of all hosted pageviews are served from the cache. Whenever a site is published, we clear the cache for that site. This means that if a site was not re-published during this time period, it likely would not have been impacted.
To prevent these issues from happening again, we have taken a number of steps. We put a fix in place to prevent multiple Site Search indexing jobs from running at the same time. We are running our database cluster on larger instances that can handle significantly more load. Additionally, we have upgraded to a newer version of our database software, which lets us create more performant indexes.
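For readers curious about what preventing concurrent indexing jobs can look like, here is a minimal sketch of a per-site guard. It is an illustration only and not a description of our actual fix: `SingleIndexingJobGuard`, `run_indexing_job`, `site_id`, and `index_fn` are hypothetical names, and the guard only coordinates threads within one process.

```python
import threading
from contextlib import contextmanager


class SingleIndexingJobGuard:
    """Hypothetical helper: allows at most one indexing job per site at a time."""

    def __init__(self):
        self._lock = threading.Lock()  # protects the set below
        self._running = set()          # site IDs with a job in progress

    @contextmanager
    def claim(self, site_id):
        # Atomically check whether a job is already running and, if not, claim the site.
        with self._lock:
            acquired = site_id not in self._running
            if acquired:
                self._running.add(site_id)
        try:
            yield acquired
        finally:
            # Release the claim only if this caller actually holds it.
            if acquired:
                with self._lock:
                    self._running.discard(site_id)


def run_indexing_job(guard, site_id, index_fn):
    """Run index_fn(site_id) unless a job for that site is already in progress.

    Returns True if the job ran, False if it was skipped as a duplicate.
    """
    with guard.claim(site_id) as acquired:
        if not acquired:
            return False
        index_fn(site_id)
        return True
```

With several worker processes or machines, the same idea would need a shared lock (for example, one held in the database or a cache) rather than an in-memory set.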