Monday, April 25, 2011

An unofficial EC2 outage postmortem - the sky is not falling

Last week Amazon Web Services (AWS) experienced a high profile outage affecting Elastic Compute Cloud (EC2) and Elastic Block Storage (EBS) in one of the four availability zones in the US East region. The outage took down several high profile websites, including Reddit, Quora and FourSquare, and generated a wave of negative press. In the days that followed, media outlets and bloggers published hundreds of articles such as Amazon's Trouble Raises Cloud Computing Doubts (New York Times), The Day The Cloud Died (Forbes), Amazon outage sparks frustration, doubts about cloud (Computerworld), and many others.

EC2 and EBS in a nutshell

In case you are not familiar with the technical jargon and acronyms, EBS is one of two methods AWS provides for attaching storage volumes (basically cloud hard drives) to an EC2 instance (essentially a server). Unlike a traditional hard drive located physically inside a computer, an EBS volume is stored externally on dedicated storage boxes and connected to EC2 instances over a network. The second storage option provided by EC2 is called ephemeral storage, which uses the more traditional method of hard drives located physically inside the same hardware that an EC2 instance runs on. AWS encourages the use of EBS, which provides some unique benefits not available with ephemeral storage. One such benefit is the ability to recover quickly from a host failure (a host is the hardware an EC2 instance runs on). If the host of an EBS-backed EC2 instance fails, the instance can quickly be restarted on another host because its storage does not reside on the failed host. By contrast, if the host of an ephemeral EC2 instance fails, that instance and all of the data stored on it are permanently lost. EBS-backed instances can also be shut down temporarily and restarted later, whereas ephemeral instances are deleted when shut down. EBS also, in theory, provides better performance and reliability than ephemeral storage.
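
To make the distinction more concrete, here is a minimal sketch using the boto Python library that launches an instance and attaches an EBS volume living on separate storage hardware. The AMI ID, instance type and device name are hypothetical placeholders, not a recommended configuration.

    # Minimal sketch (boto, Python): external EBS storage attached to an EC2
    # instance. The AMI ID and device below are hypothetical placeholders.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Launch an instance; an EBS-backed AMI keeps its root volume on EBS, so
    # the instance can be stopped and later restarted on a different host.
    reservation = conn.run_instances('ami-12345678',       # placeholder AMI
                                     instance_type='m1.large',
                                     placement='us-east-1a')
    instance = reservation.instances[0]

    # Create a 100 GB EBS volume in the same availability zone and attach it.
    # The volume lives on dedicated storage hardware, not inside the host.
    # (In practice, wait until the instance is running before attaching.)
    volume = conn.create_volume(100, 'us-east-1a')
    conn.attach_volume(volume.id, instance.id, '/dev/sdf')

    # An EBS-backed instance can be stopped (the volumes persist and only
    # storage fees accrue) and started again later; an ephemeral instance cannot.
    conn.stop_instances(instance_ids=[instance.id])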

Other technical terms you may hear and should understand regarding EC2 are virtualization and multi-tenancy. Virtualization allows AWS to run multiple EC2 instances on a single physical host by creating a simulated "virtual" hardware environment for each instance. Without virtualization, AWS would have to maintain a 1-to-1 ratio between EC2 instances and physical hardware, and the economics just wouldn't work. Multi-tenancy is a consequence of virtualization: multiple EC2 instances share access to the same physical hardware. Multi-tenancy often causes performance degradation in virtualized environments because instances may need to wait briefly for access to physical resources like CPU, hard disk or network. The term noisy neighbor is often used to describe very busy environments where virtual instances wait frequently for physical resources, causing noticeable declines in performance.

EC2 is generally a very reliable service. Without a strong track record, high profile websites like Netflix would not use it. We conduct ongoing independent outage monitoring of over 100 cloud services, which shows 3 of the 5 AWS EC2 regions with 100% availability over the past year. In fact, our own EBS-backed EC2 instance in the affected US East region remained online throughout last week's outage.

AWS endorses a different architectural philosophy called designing for failure. Instead of deploying highly redundant, fault tolerant (and very expensive) "enterprise" hardware, AWS uses low cost commodity hardware and designs its infrastructure to expect and deal gracefully with failure. AWS deals with failure using replication. For example, each EBS volume is stored on 2 separate storage arrays. In theory, if one storage array fails, its volumes are quickly replaced with the backup copies. This approach provides many of the benefits of enterprise hardware, such as fault tolerance and resiliency, at substantially lower hardware cost, enabling AWS to price its services competitively.
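
As a rough illustration of the idea (a toy model in Python, not AWS's actual implementation), the sketch below keeps a primary and a mirror copy of each volume on different storage arrays, promotes the mirror when an array fails, and then "re-mirrors" onto a healthy array:

    # Toy illustration of replication-based failure handling. Each volume has
    # a primary and a mirror copy on different storage arrays; an array
    # failure promotes the surviving copy and creates a fresh backup elsewhere.

    class Volume:
        def __init__(self, name, primary, mirror):
            self.name = name
            self.primary = primary   # array holding the primary copy
            self.mirror = mirror     # array holding the backup copy

    def handle_array_failure(volumes, failed_array, healthy_arrays):
        """Promote mirrors of affected volumes and create new backup copies."""
        for vol in volumes:
            if failed_array not in (vol.primary, vol.mirror):
                continue
            if vol.primary == failed_array:
                vol.primary = vol.mirror          # promote the backup copy
            # "Re-mirror": place a fresh backup copy on a healthy array. When
            # many volumes do this at once, the storage pool and network can
            # be overwhelmed -- the heart of last week's outage.
            vol.mirror = next(a for a in healthy_arrays if a != vol.primary)

    volumes = [Volume('vol-1', 'array-A', 'array-B'),
               Volume('vol-2', 'array-B', 'array-C')]
    handle_array_failure(volumes, 'array-B', ['array-A', 'array-C', 'array-D'])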

The outage - what went wrong?

Disclaimer: This is our own opinion of what occurred during last week's EC2 outage based on our interpretation of the comments provided on the AWS Service Health Dashboard and basic knowledge of the EC2/EBS architecture.

At about 1AM PST on Thursday, April 21st, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and EBS. This event triggered a failover sequence wherein EC2 automatically swapped out the EBS volumes that had lost connectivity with their backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (AWS refers to this as "re-mirroring"). While this procedure works fine for a few isolated EBS failures, this event was far more widespread, which created very high load on the EBS infrastructure and the network that connects it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own. All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and exhausted the available EBS physical storage in this availability zone. Because EBS performance depends on network latency and throughput to EC2, and because those networks were saturated with activity, EBS performance became severely degraded or, in many cases, failed completely. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances there. Overall, a very bad day for AWS and EC2.

The sky is not falling

Despite what some media outlets, bloggers and AWS competitors are claiming, we do not believe this event is reason to question the viability of AWS, external instance storage, or the cloud in general. AWS has stated it will closely evaluate the events that triggered this outage and apply appropriate remedies. The end result will be a more robust and battle hardened EBS architecture. For users of AWS affected by this outage, it should be cause to re-evaluate their cloud architecture. There are many techniques suggested by AWS and prominent AWS users that will help deal with these types of outages in the future without incurring significant downtime. These include deploying load balanced servers across multiple availability zones (sketched below) and using more than one AWS region.
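
As a rough sketch of the first technique, the following uses the boto Python library to launch instances in two availability zones and place them behind an Elastic Load Balancer; the AMI ID and load balancer name are hypothetical placeholders. Spanning a second region would mean repeating this setup elsewhere and failing over at the DNS level.

    # Minimal sketch (boto, Python) of a multi-availability-zone deployment:
    # web servers in two zones behind an Elastic Load Balancer, so a failure
    # in a single zone does not take the site down.
    import boto.ec2
    import boto.ec2.elb

    ec2 = boto.ec2.connect_to_region('us-east-1')
    elb = boto.ec2.elb.connect_to_region('us-east-1')

    instance_ids = []
    for zone in ('us-east-1a', 'us-east-1b'):
        res = ec2.run_instances('ami-12345678',        # placeholder AMI
                                instance_type='m1.large',
                                placement=zone)
        instance_ids.append(res.instances[0].id)

    # HTTP listener forwarding port 80 to port 80 on the instances; newly
    # registered instances serve traffic once they pass health checks.
    lb = elb.create_load_balancer('www-multi-az',
                                  zones=['us-east-1a', 'us-east-1b'],
                                  listeners=[(80, 80, 'http')])
    lb.register_instances(instance_ids)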

Netflix is a large and very visible AWS customer that was not affected by this outage, because it has learned to design for failure. In a recent blog post, Adrian Cockcroft (Netflix's Cloud Architect) wrote about some of the technical details and shortcomings of EBS. At a high level, the takeaway points from his post are:

  • EC2, EBS and the network that attaches them are all shared resources. As such, performance will vary significantly depending on multi-tenancy and shared load. Performance variance is greater on smaller EC2 instances and EBS volumes, where multi-tenancy is a bigger factor
  • Users can reduce the potential effects of multi-tenancy by using larger EC2 instances and EBS volumes. To reduce EBS multi-tenancy, Netflix uses the largest possible volume size, 1TB (a sketch of provisioning such a volume follows this list). Because each EBS storage array has a limited amount of storage capacity, using larger volumes reduces the number of other users that may share that hardware. The same is true of larger EC2 instances. In fact, the largest EC2 instances (any of the 4xlarges) run on dedicated hardware. Because each physical EC2 host has one shared network interface, use of larger EBS volumes and EC2 instances also has the added benefit of increased network throughput
  • Use ephemeral storage on EC2 instances where predictable and consistent performance is necessary. Netflix uses ephemeral storage for its Cassandra datastore and has found it to be more consistently reliable than EBS
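
A minimal sketch of the volume-sizing tactic, again using boto; the instance ID and device name are hypothetical placeholders:

    # Sketch of the Netflix-style tactic of provisioning the largest EBS
    # volume (1TB) to reduce the number of tenants sharing the storage array.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # 1024 GB is the maximum EBS volume size; fewer, larger volumes mean
    # fewer neighbors on the same underlying storage hardware.
    volume = conn.create_volume(1024, 'us-east-1a')
    conn.attach_volume(volume.id, 'i-0123abcd', '/dev/sdf')   # placeholder instance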

Too early to throw in the towel

AWS is not alone in experiencing performance and reliability issues with external storage. Based on our independent monitoring, Visi, GigeNet, Tata InstaCompute, Flexiscale, Ninefold and VPS.NET have all experienced similar outages. Our monitoring shows that external storage failures are a very significant cause of cloud outages, and when external storage systems fail, vendors often have a very difficult time recovering quickly. Designing fault tolerant and performant external storage for the cloud is a very complex problem, so much so that many vendors including Rackspace Cloud and Joyent avoid it entirely. Joyent, for example, recently documented its unsuccessful attempt to deploy external storage in its cloud service. However, despite the complexity of this problem, we believe it is far too early for cloud vendors and users to throw in the towel. There are significant advantages to external storage versus ephemeral storage, including:

  • Host failure tolerance: If the power supply, motherboard, or any component of a host system fails, the instances running on it can be quickly migrated to another host
  • Shutdown capability: With most providers, external storage instances can be shut down temporarily and then incur only storage fees
  • Greater flexibility: External storage offers features and flexibility generally unavailable with ephemeral storage. These may include the ability to back up volumes, create snapshots, clone, create custom OS templates, resize partitions and attach multiple storage volumes to a single instance (see the snapshot sketch following this list)
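
As an example of the flexibility point above, here is a brief boto sketch of snapshotting an EBS volume and restoring it into a new volume in another availability zone; the volume ID and sizes are hypothetical placeholders:

    # Sketch of one external-storage feature: point-in-time EBS snapshots,
    # which have no direct equivalent for ephemeral storage.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Snapshot an existing volume; the snapshot is stored durably and can
    # later seed new volumes, even in a different availability zone.
    snapshot = conn.create_snapshot('vol-0123abcd', description='nightly backup')
    new_volume = conn.create_volume(100, 'us-east-1b', snapshot=snapshot.id)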

Innovation in external storage

Besides AWS, there are other providers innovating in the external storage space. OrionVM, a cloud startup in Australia, has developed its own distributed, horizontally scalable, external storage architecture based on a high performance interconnect called InfiniBand. Instead of using dedicated storage hardware, OrionVM uses the same hardware for both storage and server instances. Server instances use storage located on multiple external hosts connected to them via redundant 40 Gb/s InfiniBand links. If a physical host fails, the instances running on it can be restored on another host because their storage resides externally. OrionVM also replicates storage across multiple host systems, providing fault tolerance should a storage host fail. This hybrid approach combines the benefits of ephemeral storage (i.e. lower multi-tenancy ratio, faster IO throughput) with those of external storage (i.e. host failure tolerance). Multi-tenancy performance degradation is also not a significant factor because OrionVM uses a distributed, non-centralized storage architecture. The approach scales well horizontally because adding a new host increases both instance and storage capacity, and the 40 Gb/s InfiniBand links provide very high instance-to-storage throughput.

Our own benchmarking shows very good IO performance with OrionVM. Complete results for these benchmarks are available on our website. A summary is provided below comparing OrionVM to both external and ephemeral instances with EC2, GoGrid, Joyent, Rackspace and SoftLayer. In these results, OrionVM performed very well, as did EC2's cluster compute instance using ephemeral or RAID 0 EBS volumes (a sketch of provisioning such a RAID 0 set follows the legend below). GoGrid also performed well running on its new Westmere hardware and ephemeral storage. Details on the IO metric are available here. We are including these benchmark results to demonstrate that external storage can perform as well as or better than ephemeral storage.

Legend

Label                               | Storage Type | Description
ec2-us-east.cc1.4xlarge-raid0-local | Ephemeral    | EC2 cluster instance cc1.4xlarge, RAID 0, 2 ephemeral volumes
ec2-us-east.cc1.4xlarge-raid0x4-ebs | External     | EC2 cluster instance cc1.4xlarge, RAID 0, 4 EBS volumes
ec2-us-east.cc1.4xlarge-local       | Ephemeral    | EC2 cluster instance cc1.4xlarge, single ephemeral volume
gg-16gb-us-east                     | Ephemeral    | 16GB GoGrid instance
or-16gb                             | External     | 16GB OrionVM instance
jy-16gb-linux                       | Ephemeral    | 16GB Joyent Linux Virtual Machine
ec2-us-east.cc1.4xlarge             | External     | EC2 cluster instance cc1.4xlarge, single EBS volume
ec2-us-east.m2.4xlarge-raid0x4-ebs  | External     | EC2 high memory instance m2.4xlarge, RAID 0, 4 EBS volumes
rs-16gb                             | Ephemeral    | 16GB Rackspace Cloud instance
ec2-us-east.m2.4xlarge              | External     | EC2 high memory instance m2.4xlarge, single EBS volume
sl-16gb-wdc                         | External     | 16GB SoftLayer CloudLayer instance
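
For reference, the following is a rough boto sketch of how a 4-volume EBS set like the "ec2-us-east.cc1.4xlarge-raid0x4-ebs" configuration above might be provisioned; the volume size and instance ID are hypothetical placeholders, and the RAID 0 striping itself happens inside the instance.

    # Sketch of provisioning four EBS volumes for a RAID 0 set on a cluster
    # compute instance. Instance ID and sizes are hypothetical placeholders.
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    devices = ['/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi']
    for device in devices:
        volume = conn.create_volume(100, 'us-east-1a')       # 100 GB per volume
        conn.attach_volume(volume.id, 'i-0123abcd', device)  # placeholder instance

    # Inside the instance, something like:
    #   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    #         /dev/sdf /dev/sdg /dev/sdh /dev/sdi
    # stripes the four EBS volumes into a single RAID 0 device.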

Summary

Last week's EBS outage has shed some light on what we consider one of the cloud's thorniest problems: external storage. However, we see this event as a glass half full. First, we believe AWS will thoroughly dissect this outage and use it to improve the fault tolerance and reliability of EBS. Next, cloud users affected by the outage will re-evaluate their own cloud architectures and adopt a more failure tolerant approach. Finally, we hope that AWS and other vendors like OrionVM will continue to innovate in the external storage space.