Hosted Site and Uploaded Image Outage

Incident Report for Webflow

Postmortem

AWS S3 Outage Postmortem

What happened

Around 9:45 a.m. PST, we detected anomalies in our site publishing. We discovered that the AWS S3 service was returning errors, and the Webflow app started having trouble uploading new files to S3. This caused issues with file uploads and site snapshots, as we rely on AWS S3 as a file system for user sites.

Around 10:15 a.m. PST, AWS reported elevated S3 error rates, and it was not until 1:33 p.m. PST that S3 started working more predictably.

Impact of outage

This was a prolonged outage that prevented thousands of the largest sites on the internet from working properly. Sites hosted with Webflow that were recently published may have failed to load properly, or would have shown 504 errors. The Webflow Designer also failed to load properly for many users, as we use S3 to save and load backups of sites. Many sites were still operating without issue, but only because Fastly was serving cached versions of them. However, as the TTLs on those sites and images expired on Fastly and Amazon CloudFront (where images are served), and requests to retrieve the assets from the source (S3) failed, more and more sites and images failed to load as the S3 outage continued into the afternoon.
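
The failure mode described above can be illustrated with a minimal sketch of a TTL cache sitting in front of an origin. All names here (`TTLCache`, `OriginDown`, the `fetch` callable, and the explicit `now` clock) are hypothetical and are not Webflow, Fastly, or CloudFront code; the sketch only assumes a cache that contacts the origin once an entry has expired.

```python
from typing import Callable, Dict, Tuple


class OriginDown(Exception):
    """Raised by the hypothetical origin when it cannot serve a request."""


class TTLCache:
    """Toy edge cache: serves a cached copy until its TTL expires."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        # Maps key -> (expiry time, body).
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str, now: float) -> bytes:
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: the origin is never contacted
        # Expired or missing: the request must go to the origin,
        # so an origin outage surfaces here as an error.
        body = self._fetch(key)
        self._entries[key] = (now + self._ttl, body)
        return body
```

In this sketch, with a 60-second TTL, a request made 30 seconds after the entry was cached is still served even if the origin is down, while a request made at 61 seconds reaches the failing origin and raises `OriginDown`.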

Next steps

The Webflow infrastructure and engineering teams have learned that using S3 (which is designed for 99.99% availability) as a single source of truth for all customer assets is not enough to meet the needs of Webflow customers. After the outage subsided, the engineering team met and put in place a plan to add redundancy measures such as:

Multi-region asset transfer and mirroring. This will help switch our CDN to pull assets from another AWS S3 region that is not affected.
Multi-region proxy & render servers. This will help our proxy and render server fleets remain stable in case one region goes down completely.
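
The idea behind pulling assets from an unaffected region can be sketched as an ordered fallback across origins. This is a minimal illustration under stated assumptions, not Webflow's implementation: `OriginError`, `fetch_with_fallback`, and the per-region fetch callables are hypothetical names, and the sketch assumes every region holds a mirrored copy of the asset.

```python
from typing import Callable, Optional, Sequence


class OriginError(Exception):
    """Raised when a (hypothetical) regional origin cannot serve an asset."""


def fetch_with_fallback(
    key: str, origins: Sequence[Callable[[str], bytes]]
) -> bytes:
    """Try each regional origin in order and return the first success."""
    last_error: Optional[OriginError] = None
    for fetch in origins:
        try:
            return fetch(key)
        except OriginError as exc:
            last_error = exc  # remember the failure, then try the next region
    raise OriginError(f"all origins failed for {key!r}") from last_error
```

Listing the primary region first keeps normal traffic unchanged; the mirror is only contacted when the primary raises an error.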

Situations like these are difficult for everyone involved, and we apologize for the downtime and inconvenience this outage caused. We're working hard on making Webflow Hosting an even more reliable place to host your sites. Thank you for your continued support!

In the meantime, please make sure your custom domains are using the most up-to-date DNS records, which you can find in our support article on how to set up custom domain hosting for your Webflow site.

Posted Mar 01, 2017 - 22:51 UTC

Resolved

Amazon has confirmed that the S3 issue has been resolved, and we're seeing hosting, image uploading, and site publishing come back to normal. If you're still seeing issues with your site, please contact us at support@webflow.com - thank you for your patience, and happy designing!

Posted Feb 28, 2017 - 23:11 UTC

Update

All systems should be back to normal now, but we're still monitoring the situation and waiting for Amazon to give the all-clear. If you're seeing any downtime or issues during publishing or image upload, please contact us at support@webflow.com

Posted Feb 28, 2017 - 22:14 UTC

Update

Site publishing and image uploads should be working as normal now.

Posted Feb 28, 2017 - 21:55 UTC

Update

Hosted sites should now be back online, with images/videos/styles loading correctly. If your site is not rendering properly, please contact our support team at support@webflow.com with the domain name and our team will investigate.

The Amazon Web Services team is still working on restoring the ability to write new files to S3, which means uploading new images/videos is not yet working. Also, publishing sites could fail with a "We encountered an internal error. Please try again." error. We will update this page as soon as we have confirmed that uploads and publishing are fully operational.

Posted Feb 28, 2017 - 21:45 UTC

Update

Amazon has reported that retrieving objects from S3 is functional and fully recovered. This means that your hosted sites should be back online, and images, background videos, CSS, and JS are also loading. Amazon is still working on restoring the ability to create new objects on S3, which means uploads of images and new background videos may still be degraded.

Posted Feb 28, 2017 - 21:20 UTC

Update

More assets are being returned by S3. However, because our CDN (CloudFront) caches S3's error responses, we're adding rules to prevent that from happening. We will also do a complete cache invalidation on CloudFront to ensure all images are loaded correctly from the source.
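
The effect of not caching error responses can be shown with a small, generic sketch. It is not CloudFront configuration and does not use any AWS API; `SuccessOnlyCache`, `Response`, and the `fetch` callable are hypothetical names, and the sketch assumes the origin returns an HTTP status code with each body.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, bytes]  # (HTTP status code, body)


class SuccessOnlyCache:
    """Toy cache that stores 2xx responses and passes errors through uncached."""

    def __init__(self, fetch: Callable[[str], Response]) -> None:
        self._fetch = fetch
        self._entries: Dict[str, Response] = {}

    def get(self, key: str) -> Response:
        if key in self._entries:
            return self._entries[key]
        status, body = self._fetch(key)
        if 200 <= status < 300:
            # Only successful responses are cached, so an origin error
            # is retried on the next request instead of being replayed.
            self._entries[key] = (status, body)
        return status, body

    def invalidate_all(self) -> None:
        """Drop every cached entry so later requests go back to the origin."""
        self._entries.clear()
```

If the origin answers 503 during the outage and 200 once it recovers, the second request in this sketch already returns the recovered asset, because the 503 was never stored.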

Posted Feb 28, 2017 - 21:06 UTC

Update

We're currently seeing valid responses from S3 and will be slowly invalidating the caches. Some sites may still be down or rendering improperly. The Webflow design tool is also not 100% functional, with image uploads likely failing and certain images not loading properly. We'll be keeping a very close eye on AWS to see whether the outage has affected other services.

Posted Feb 28, 2017 - 20:47 UTC

Update

AWS has reported that they have identified a potential fix for S3 and will be implementing it now.

Posted Feb 28, 2017 - 19:48 UTC

Monitoring

We have increased the cache timeout to improve the performance of sites that are still rendering. Certain sites (or certain pages of sites) that are not rendering won't work until AWS resolves the S3 issues. Users in different locations may still see pages loading, as they may be hitting caches that have not been invalidated yet.

The Webflow design tool is also currently experiencing issues loading and syncing. We recommend holding off on accessing your Webflow projects until the AWS issues are resolved.

Posted Feb 28, 2017 - 19:44 UTC

Identified

AWS is currently working on bringing back service to S3 and we're closely monitoring our services. Certain images in the designer tool also may not be loading as they are being loaded from S3.

Posted Feb 28, 2017 - 18:42 UTC

Monitoring

We're closely monitoring AWS's status page, where Amazon has identified elevated error rates on their S3 service. http://status.aws.amazon.com/

Posted Feb 28, 2017 - 18:03 UTC

Investigating

We're looking into an issue with our upstream file hosting provider (AWS S3). This is currently affecting site publishing and requests to reload fresh content on your Webflow site. You may experience 503 errors if you've recently published your site.

Posted Feb 28, 2017 - 17:55 UTC