Thursday, November 17, 2011

Is Joyent Really 14X Faster than EC2 and Azure the "Fastest Cloud"? Questions to Ask About Benchmark Studies

Many are skeptical of claims that involve benchmarks. Over the years benchmarks have been manipulated and misrepresented. Benchmarks aren't inherently bad or created in bad faith. To the contrary, when understood and applied correctly, benchmarks can often provide useful insight for performance analysis and capacity planning. The problem with benchmarks is that they are often misunderstood or misrepresented, frequently resulting in bold assertions and questionable claims. Oftentimes there are also extraneous factors involved, such as agenda-driven marketing organizations. In fact, the term "benchmarketing" was coined to describe questionable marketing-driven, benchmark-based claims. This post will discuss a few questions one might consider when reading benchmark-based claims. We'll then apply these questions to two recent cloud-related, benchmark-based studies.

Questions to consider

The following are 7 questions one might ask when considering benchmark-based claims. Answering these questions will help to provide a clearer understanding of the validity and applicability of the claims.
  1. What is the claim? Typically the bold-face, attention grabbing headline like Service Y is 10X faster than Service Z
  2. What is the claimed measurement? Usually implied by the headline. For example the claim Service Y is 10X faster than Service Z implies a measurement of system performance
  3. What is the actual measurement? To answer this question, look at the methodology and benchmark(s) used. This may require some digging, but can usually be found somewhere in the article body. Once found, do some research to determine what was actually measured. For example, if Geekbench was used, you would discover the actual measurement is processor and memory performance, but not disk or network IO
  4. Is it an apples-to-apples comparison? The validity of a benchmark-based claim ultimately depends on the fairness of the testing methodology. Claims involving comparisons should compare similar things. For example, Ford could compare a Mustang Shelby GT500 (top speed 190 MPH) to a Chevy Aveo (top speed 100 MPH) and claim their cars are nearly twice as fast, but the Aveo is not a comparable vehicle and therefore the claim would be invalid. A fairer, apples-to-apples comparison would be a Mustang GT500 and a Chevy Camaro ZL1 (top speed 186 MPH).
  5. Is the playing field level? Another important question to ask is whether or not there are any extraneous factors that provided an unfair advantage to one test subject over another. For example, using the top speed analogy, Ford could compare a Mustang with 92 octane fuel and a downhill course to a Camaro with 85 octane fuel and an uphill course. Because there are extraneous factors (fuel and angle of the course) which provided an unfair advantage to the Mustang, the claim would be invalid. To be fair, the top speeds of both vehicles should be measured on the same course, with the same fuel, fuel quantity, driver and weather conditions.
  6. Was the data reported accurately? Benchmarking often results in large datasets. Summarizing the data concisely and accurately can be challenging. Things to watch out for include lack of good statistical analysis (e.g. reporting only the average), math errors, and sloppy calculations. For example, if large, highly variable data is collected, it is generally a best practice to report the median value in place of the mean (average) to mitigate the effects of outliers (see the short example following this list). Standard deviation is also a useful metric to include to identify data consistency.
  7. Does it matter to you? The final question to ask is, assuming the results are valid, does it actually mean anything to you? For example, purchasing a vehicle based on a top speed comparison is not advisable if fuel economy is what really matters to you.
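
To illustrate question 6, here is a small Python sketch (with invented response-time samples) showing how a single outlier inflates the mean while the median and standard deviation tell the real story:

    import statistics

    # Invented response times (seconds) from two hypothetical services;
    # service B is usually faster but has one 9.8 second outlier
    service_a = [1.9, 2.0, 2.1, 2.0, 1.8, 2.2, 2.0]
    service_b = [1.2, 1.3, 1.1, 1.2, 1.4, 1.3, 9.8]

    for name, samples in (("A", service_a), ("B", service_b)):
        print(f"Service {name}: mean={statistics.mean(samples):.2f}s "
              f"median={statistics.median(samples):.2f}s "
              f"stdev={statistics.stdev(samples):.2f}s")

    # Service B's mean (~2.5s) makes it look slower than A, yet its median (1.3s)
    # shows it is usually faster -- reporting only the average would be misleading.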

Case Study #1: Joyent Cloud versus AWS EC2

In this case study, Joyent sponsored a third party benchmarking study to compare Joyent Cloud to AWS EC2. The study utilized our own (CloudHarmony) benchmarking methodology to compare 3 categories of performance: CPU, Disk IO and Memory. The results of the study are published on the Joyent website, available here. In the table below, we'll apply the questions listed above to this study. Answers will be color coded green where the study provided a positive response to the question, and red where the results are misleading or misrepresented.
Questions & Answers

What is the claim? Joyent Cloud is 3x - 14x Faster than AWS EC2
The claims are broken down by measurement type (CPU, Disk IO, Memory), and OS type (SmartOS/Open Solaris, Linux). The resulting large, colorful icons on the Joyent website claim that Joyent Cloud is faster than EC2 by a margin of 3x - 14x
What is the claimed measurement? CPU, Disk IO, Memory Performance
Our benchmarking methodology was used to measure these different categories of performance. This methodology consists of running multiple benchmarks per category and creating a composite measurement based on a summary of the results for all benchmarks in each category. The methodology is described in more detail on our blog here (CPU), here (Disk IO) and here (Memory).
What is the actual measurement? CPU, Disk IO, Memory Performance
Is it an apples-to-apples comparison? Dissimilar instance types were compared
In the Linux comparison, Joyent claims 5x faster CPU, 3x faster Disk IO, and 4x faster memory. Based on the report details, it appears those ratios originate from comparing a 1GB Joyent VM to an EC2 m1.small. This selection provided the largest performance differential and hence the biggest claim. While these instance types are similar price-wise (disregarding m1.small spot and reserve pricing, where it is 1/2 the cost), that is where the similarities stop. At the time of this report, m1.small was the slowest EC2 instance with a single core and older CPU, while Joyent's 1GB instance type has 2 burstable cores and a newer CPU. The m1.small is not intended for compute intensive tasks. For that type of workload EC2 offers other options with newer CPUs and more cores. To provide an apples-to-apples comparison of performance, the claim should be based on two instance types that are intended for such a purpose (e.g. an EC2 m2 or cc1).
Is the playing field level? Operating system and storage type were different
The study compares Joyent Cloud VMs running SmartOS or Ubuntu 10.04 to AWS EC2 VMs running CentOS 5.4. Joyent's SmartOS is based on Open Solaris and highly optimized for the Joyent environment. Ubuntu 10.04 uses Linux Kernel 2.6.32 (release date: Dec 2009), which is over 3 years newer than the 2.6.18 kernel (release date: Sep 2006) in CentOS 5.4. Newer and more optimized operating systems will almost always perform better for similar tasks on identical hardware. This provided an advantage to the Joyent VMs from the outset.

Additionally, the tests compared EC2 instances running on networked storage (EBS) to Joyent instances running on local storage, which also provided an advantage to the Joyent VMs for the disk IO benchmarks.
Was the data reported accurately? Mistakes were made in calculations
This study was based on a cloud performance comparison methodology we (CloudHarmony) developed for a series of blog posts in 2010. For CPU performance, we developed an algorithm that combined the results of 19 different CPU benchmarks to provide a single performance metric that attempts to approximate the AWS ECU (Elastic Compute Unit). To do so, we utilized EC2 instances and their associated ECU values as a baseline. We called this metric CCU and the algorithm for producing it was described in this blog post. Part of the algorithm involved calculating CCU when performance exceeded the largest baseline EC2 instance type, the 26 ECU m2.4xlarge. In our algorithm we used the performance differential ratio between an m1.small (1 ECU) and an m2.4xlarge (26 ECUs). The third party, however, used the ratio between an m2.2xlarge (13 ECUs) and an m2.4xlarge (26 ECUs). Because m2s run on the same hardware type, the performance difference between an m2.2xlarge and an m2.4xlarge is not very great, but the difference in ECUs is large. The end result was that their calculations produced a very high CCU value for the Joyent instances (in the range of 58-67 CCUs). Had the correct algorithm been used, the reported CCUs would have been much lower.
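
To see why the choice of baseline pair matters so much, here is a simplified, hypothetical sketch of a linear ECU extrapolation. The benchmark scores below are invented, and this is not the exact CCU algorithm (which is described in the blog post linked above); it only illustrates the effect of the baseline choice.

    def extrapolate_ccu(score, low_ecu, low_score, high_ecu, high_score):
        """Linearly extrapolate an ECU-like value for a benchmark score that
        exceeds the largest baseline instance (a simplification for illustration)."""
        ecus_per_point = (high_ecu - low_ecu) / (high_score - low_score)
        return high_ecu + (score - high_score) * ecus_per_point

    # Invented composite benchmark scores for illustration only
    m1_small, m2_2xl, m2_4xl, joyent = 100, 2300, 2600, 3000

    # Baseline pair we used: m1.small (1 ECU) to m2.4xlarge (26 ECUs)
    print(extrapolate_ccu(joyent, 1, m1_small, 26, m2_4xl))    # ~30 CCUs

    # Baseline pair the third party used: m2.2xlarge (13 ECUs) to m2.4xlarge (26 ECUs);
    # the scores are close together but the ECU gap is 13, so the slope explodes
    print(extrapolate_ccu(joyent, 13, m2_2xl, 26, m2_4xl))     # ~43 CCUs

The same benchmark score produces a much larger CCU simply because of the baseline pair chosen.
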
Does it matter to you? Probably not
There isn't much value or validity to the data provided in these reports. The bold headlines which state Joyent Cloud is 3X - 14X faster than EC2 rest on very shaky ground. In fact, with Joyent's approval, we recently ran our benchmarks in their environment, resulting in the following CPU, disk IO and memory performance metrics:
CloudHarmony Generated Joyent/EC2 Performance Comparison
CPU Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type | Memory | Cost | CCU
EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 33.5
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 28.44
EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 26
EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 13
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 10.94
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 6.82
EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 6.5
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 6.19
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 5.53
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 5.45
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 4.66
EC2 | m1.large | 7.5 GB | $0.34/hr | 4
EC2 | m1.small | 1.7 GB | $0.085/hr | 1
Disk IO Performance: AWS EC2 vs Joyent
View Full Report - Note: the EC2 instances labeled EBS utilized a single networked storage volume - better performance may be possible using local storage or multiple EBS volumes. All Joyent instances utilized local storage (networked storage is not available).
Provider | Instance Type | Memory | Cost | IOP
EC2 | cc1.4xlarge (local storage - raid 0) | 23 GB | $1.30/hr | 212.06
EC2 | cc1.4xlarge (local storage) | 23 GB | $1.30/hr | 194.29
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 187.38
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 144.71
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 142.19
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 130.84
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 110.78
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 109.2
EC2 | m2.2xlarge (EBS) | 34.2 GB | $1.00/hr | 87.58
EC2 | m2.xlarge (EBS) | 17.1 GB | $0.50/hr | 83.62
EC2 | m2.4xlarge (EBS) | 68.4 GB | $2.00/hr | 82.79
EC2 | m1.large (EBS) | 7.5 GB | $0.34/hr | 56.82
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 56.08
EC2 | m1.small (EBS) | 1.7 GB | $0.085/hr | 27.08
Memory Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type | Memory | Cost | CCU
EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 137.2
EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 109.41
EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 109.14
EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 103.35
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 100.87
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 92.5
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 90.79
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 90.37
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 84.2
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 78.51
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 76.04
EC2 | m1.large | 7.5 GB | $0.34/hr | 61.8
EC2 | m1.small | 1.7 GB | $0.085/hr | 22.24

Case Study #2: Microsoft Azure Named Fastest Cloud Service

In October 2011, Compuware published a blog post related to cloud performance. The post was picked up by various media outlets, producing headlines that declared Microsoft Azure the "fastest cloud".
Here's how the test worked in a nutshell (a rough sketch of the measurement approach follows this list):
  • Two sample e-commerce web pages were created. The first with item descriptions and 40 thumbnails (product list page), and the second with a single 1.75 MB image (product details page)
  • These pages were made accessible using a Java application server (Tomcat 6) running in each cloud environment. The exceptions were Microsoft Azure and Google AppEngine (platform-as-a-service/PaaS environments), which required the pages to be bundled and deployed using their specific technology stacks
  • 30 monitoring servers/nodes were instructed to request these 2 pages in succession every 15 minutes and record the amount of time it took to render both in their entirety (including the embedded images)
  • The 30 monitoring nodes are located in data centers in North America (19), Europe (5), Asia (3), Australia (1) and South America (2) - they are part of the Gomez Performance Network (GPN) monitoring service
  • After 1 year an average response time was calculated for each service (response times above 10 seconds were discarded)
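
Conceptually, each GPN node's measurement is just a timed full-page fetch with slow samples discarded. The following rough Python sketch shows the idea only; the URL is a placeholder, embedded images and the 15-minute interval are omitted for brevity, and this is not the actual Gomez agent:

    import time
    import statistics
    import urllib.request

    PAGE_URL = "http://example-cloud-host.test/product-list.html"   # hypothetical test page

    def timed_fetch(url, timeout=10):
        """Return seconds to download the page body, or None on error/timeout."""
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
        except Exception:
            return None
        return time.time() - start

    # One monitoring node: collect samples, discard failures and anything over 10 seconds
    raw = [timed_fetch(PAGE_URL) for _ in range(4)]
    samples = [t for t in raw if t is not None and t <= 10]
    if samples:
        print(f"average response time: {statistics.mean(samples):.2f}s over {len(samples)} samples")
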
Now let's dig a little deeper...
Questions & Answers

What is the claim? Microsoft Azure is the "fastest cloud"
What is the claimed measurement? Overall performance (it's fastest)
What is the actual measurement? Network Latency & Throughput
Rendering two HTML pages and some images is not CPU intensive and as such is not a measure of system performance. The main bottleneck is network latency and throughput, particularly to distant monitoring nodes (e.g. Australia to US)
Is it an apples-to-apples comparison? Types of services tested are different (IaaS vs PaaS) and the instance types are dissimilar
Microsoft Azure and Google AppEngine are platform-as-a-service (PaaS) environments, very different from infrastructure-as-a-service (IaaS) environments like EC2 and GoGrid. With PaaS, users must package and deploy applications using custom tools and more limited capabilities. Applications are deployed to large clustered, multi-tenant environments. Because of the greater structure and more limited capabilities of PaaS, providers are able to better optimize and scale those applications, often resulting in better performance and availability when compared to a single server IaaS deployment. Not much information is disclosed regarding the sizes of instances used for the IaaS services. With some IaaS providers, network performance can vary depending on instance size. For example, with Rackspace Cloud, a 256MB cloud server is capped with a 10 Mbps uplink. With EC2, bandwidth is shared across all instances deployed to a physical host. Smaller instance sizes generally have less, and more variable, bandwidth. This test was conducted using nearly the smallest EC2 instance size, an m1.small.
Is the playing field level? Services may have an unfair advantage due to network proximity and uplink performance
Because network latency is the main bottleneck for this test, and only a handful of monitoring nodes were used, the results are highly dependent on network proximity and latency between the services tested and the monitoring nodes. For example, the Chicago monitoring node might be sitting in the same building as the Azure US Central servers, giving Azure an unfair advantage in the test. Additionally, the IaaS services where uplinks are capped on smaller instance types would be at a disadvantage to uncapped PaaS and IaaS environments.
Was the data reported accurately? Simple average was reported - no median, standard deviation or regional breakouts were provided
The CloudSleuth post provided only a single metric: the average response time for each service across all monitoring nodes. A better way to report this data would involve breaking the data down by region, for example, average response time for eastern US monitoring nodes. Reporting median, standard deviation and 90th percentile statistical calculations would also be very helpful in evaluating the data.
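
As a sketch of the kind of reporting we mean, assuming the raw per-node samples were tagged by region (the numbers and structure below are invented):

    import statistics
    from collections import defaultdict

    # Invented raw samples: (monitoring region, response time in seconds)
    samples = [("us-east", 1.1), ("us-east", 1.3), ("us-east", 4.9), ("us-east", 1.2),
               ("europe", 2.2), ("europe", 2.4), ("europe", 2.3), ("europe", 2.5),
               ("asia", 3.8), ("asia", 3.5), ("asia", 9.1), ("asia", 3.6)]

    by_region = defaultdict(list)
    for region, rt in samples:
        by_region[region].append(rt)

    for region, values in sorted(by_region.items()):
        values.sort()
        p90 = values[int(0.9 * (len(values) - 1))]   # simple nearest-rank 90th percentile
        print(f"{region}: median={statistics.median(values):.2f}s "
              f"stdev={statistics.stdev(values):.2f}s 90th={p90:.2f}s")
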
Does it matter to you? Probably not
Unless your users are sitting in the same 30 data centers as the GPN monitoring nodes, this study probably means very little. It does not represent a real world scenario where static content like images would be deployed to a distributed content delivery network like CloudFront or Edgecast. It attempts to compare two different types of cloud services, PaaS and IaaS. It may use IaaS instance types like the EC2 m1.small that represent the worst case performance scenario. The 30 node test population is also very small and not indicative of a real end user population (end users don't sit in data centers). Finally, reporting only a single average value ignores most statistical best practices.

Monday, October 17, 2011

Encoding Performance: Comparing Zencoder, Encoding.com, Sorenson & Panda

A few months ago we were approached by Zencoder to conduct a sponsored performance comparison of 4 encoding services including their own. The purpose was to validate their claims of faster encoding times using an independent, credible external source. This was a new frontier for us. Our primary focus has been performance analysis of infrastructure as a service (IaaS). However, we are curious about all things related to cloud and benchmarking and we felt this could be useful data to make available publicly, so we accepted.

Testing Methodology

This is a description of the methodology we used for conducting this performance analysis.

Source Media

Following discussions with Zencoder, we opted to test encoding performance using 4 distinct media types. We were tasked with finding samples for each media type; they were not provided by Zencoder. All source media was stored in AWS S3 using the US East region (the same AWS region that each of the 4 encoding services is hosted in). The 4 media types we used for testing are:
  • HD Video: We chose an HD 1080P trailer for the movie Avatar. This file was 223.1 MB in size and 3 mins, 30 secs in duration.
  • SD Video: We chose a 480P video episode from a cartoon series. The file was 519.2 MB in size and about 23 mins in duration.
  • Mobile Video: We created a 568x320 video using an iPhone (source here). The file was 2.9 MB in size, 30 secs in duration.
  • MP3 Audio: We used an MP3 file we found on the web about Yoga (source here). The file was 42.2 MB in size, 58 mins 41 secs in duration.

Encode Settings

We used the same encode options across all of the services tested. The following is a summary of the encode options used for each corresponding media type (an equivalent ffmpeg invocation for the HD profile is sketched after the table):
Media Type | Video Codec | Video Bitrate | Audio Codec | Audio Bitrate | Encode Passes
HD Video | H.264 | 3000 Kb/s | AAC | 96 Kb/s | 2
SD Video | H.264 | 1000 Kb/s | AAC | 96 Kb/s | 2
Mobile Video | H.264 | 500 Kb/s | AAC | 96 Kb/s | 2
MP3 Audio | NA | NA | AAC | 96 Kb/s | 2
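
Each service was driven through its own API rather than a local encoder, but for reference the HD profile above roughly corresponds to a two-pass x264/AAC encode like the following sketch (file names are placeholders):

    import subprocess

    SRC, OUT = "source_1080p.mov", "output_hd.mp4"   # placeholder file names

    # Pass 1: analysis only (no audio, output discarded)
    subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:v", "libx264", "-b:v", "3000k",
                    "-pass", "1", "-an", "-f", "null", "/dev/null"], check=True)

    # Pass 2: final encode at 3000 Kb/s H.264 video and 96 Kb/s AAC audio
    subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:v", "libx264", "-b:v", "3000k",
                    "-pass", "2", "-c:a", "aac", "-b:a", "96k", OUT], check=True)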

Test Scheduling

Testing was conducted during a span of 1 week. We built a test harness that integrated with the APIs of each of the 4 encoding services. The test harness invoked 2 test iterations daily with each service. Testing included both single request and 4 parallel requests. Each test iteration consisted of the following 8 test scenarios:
  • Single HD Video Request
  • Single SD Video Request
  • Single Mobile Video Request
  • Single MP3 Audio Request
  • 4 Parallel HD Video Requests
  • 4 Parallel SD Video Requests
  • 4 Parallel Mobile Video Requests
  • 4 Parallel MP3 Audio Requests
The order of the test scenarios was randomized, but the same tests were always requested at the same time for each service. Each test was run to completion on all services before the next test was invoked. The start times for 2 daily test iterations were separated by about 12 hours and incremented by 100 minutes each day. The end result was a distribution of test scenarios during many different times of the day. A total of 112 tests were performed during the 1 week test span, followed by an additional 24 test scenarios on encoding.com to test different combinations of service specific encoding options (described below).
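
Our harness integrated with each provider's API directly; the sketch below only illustrates the timing pattern for the parallel scenarios, against a hypothetical client exposing submit_job and wait_for_completion methods:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def run_encode(client, source_url, profile):
        """Submit one encode job and block until it completes, returning elapsed seconds.
        `client` is a hypothetical wrapper around one service's REST API."""
        start = time.time()
        job_id = client.submit_job(source_url, profile)   # assumed method name
        client.wait_for_completion(job_id)                # assumed method (polls job status)
        return time.time() - start

    def run_parallel_scenario(client, source_url, profile, n=4):
        """Launch n identical jobs at once (the '4 parallel requests' scenarios)."""
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(run_encode, client, source_url, profile) for _ in range(n)]
            return [f.result() for f in futures]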

Performance Metrics

During testing, we captured the following metrics:
  • Encode Time: The amount of time in seconds required to encode the media (excludes transfer time)
  • Transfer Time: The amount of time in seconds required to transfer the source media from AWS S3 into the service processing queue
  • Failures: When a service failed to complete a job

Test Results

The following is a summary of the results for each of the 8 test scenarios. Result tables have the following columns:
  • Avg Encode Time: The mean (average) encoding time in seconds for all jobs in this test scenario
  • Standard Deviation %: Standard deviation as a percentage of the mean. A lower value indicates more consistent performance
  • Median Encode Time: The median encoding time in seconds for all jobs in this test scenario
  • Avg Total Time: The mean (average) total job time in seconds. The total time is the sum of transfer (source files were hosted in AWS S3 US East), queue and encoding times, essentially the total turnaround time from the moment the job was submitted. Sorenson and Panda may cache source media files, thus reducing transfer time for future requests

A graph is displayed below each results table depicting sorted average encode and total times using a dual series horizontal bar chart. Also depicted on the graph is a line indicating the actual duration of the source media. Encode time bars that terminate to the left of this line signify faster than realtime encoding performance.

encoding.com offers a few different encoding job options that can affect encode performance. These options include the following:

  • instant: using the instant option, encoding.com will begin encoding before the source media has been fully downloaded. For larger source files, this can decrease encode times
  • twin turbo: this setting causes jobs to be delegated to faster servers in exchange for paying a $2/GB premium for encoding. encoding.com states that this option will deliver 6-8X faster encoding time over standard servers

Test Results: Single HD Video Request

The following are the results from a total of 14 HD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 141.08 | 6 | 137 | 155.08
encoding.com 2 (max+TT) | 213.38 | 16 | 193 | 244.53
encoding.com 1 (max+TT+instant) | 213.5 | 11 | 213.5 | 238
encoding.com 4 (plus+TT+instant) | 226.75 | 11 | 235.5 | 258.75
encoding.com 3 (max) | 618.75 | 12 | 643.5 | 653.25
sorenson | 974.31 | 2 | 975 | 982.46
panda | 1246.31 | 4 | 1247 | 1255.93

Failures - None

Test Results: Single SD Video Request

The following are the results from a total of 14 SD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 229.14 | 9 | 225.5 | 281.64
encoding.com 4 (plus+TT+instant) | 453 | 1 | 453.5 | 718.75
encoding.com 2 (max+TT) | 454.07 | 11 | 458.5 | 689.21
encoding.com 1 (max+TT+instant) | 474 | 7 | 474 | 732.5
encoding.com 3 (max) | 1137.75 | 12 | 1162 | 1429
panda | 1649.36 | 7 | 1640.5 | 1673.93
sorenson | 2087.23 | 8 | 2052 | 2101

Failures

  • Sorenson: 1

Test Results: Single Mobile Video Request

The following are the results from a total of 14 mobile video encode jobs submitted at various times of the day over a period of 1 week. NOTE: During our testing, encoding.com jobs would occasionally experience excessive queue times with the status "Waiting for encoder". This is the reason for the long green section representing the transfer/queue time delta on the graph below.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 19.38 | 12 | 20 | 30.23
panda | 25.69 | 4 | 25 | 26.31
encoding.com 4 (plus+TT+instant) | 35 | 7 | 34 | 122.75
encoding.com 3 (max) | 40.25 | 9 | 40.5 | 51.25
sorenson | 47 | 6 | 46 | 68.31
encoding.com 2 (max+TT) | 76.92 | 139 | 31 | 88.23
encoding.com 1 (max+TT+instant) | 97 | 67 | 97 | 112

Failures - None

Test Results: Single MP3 Audio Request

The following are the results from a total of 14 audio encode jobs submitted at various times of the day over a period of 1 week. In this test scenario, we again experienced some long queue times during one of the encoding.com test phases.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 120.31 | 11 | 115 | 131
sorenson | 189.38 | 11 | 182 | 196.23
panda | 218.58 | 3 | 217 | 222.16
encoding.com 1 (max+TT+instant) | 221 | 0 | 221 | 255
encoding.com 4 (plus+TT+instant) | 224.75 | 9 | 225 | 248.75
encoding.com 3 (max) | 240.5 | 7 | 238 | 355.75
encoding.com 2 (max+TT) | 328.36 | 52 | 254 | 360.45

Failures

  • encoding.com max+TT+instant: 1
  • encoding.com max+TT: 2

Test Results: 4 Parallel HD Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) HD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 141.23 | 4 | 140 | 153.71
encoding.com 1 (max+TT+instant) | 236.88 | 19 | 250 | 280.13
encoding.com 2 (max+TT) | 292.21 | 46 | 259.5 | 321.96
encoding.com 4 (plus+TT+instant) | 298.69 | 37 | 256.5 | 354.44
encoding.com 3 (max) | 723.44 | 25 | 751.5 | 796.38
panda | 1322.19 | 17 | 1249 | 1336.9
sorenson | 1751.44 | 3 | 1745 | 1769.71

Failures - None

Test Results: 4 Parallel SD Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) SD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 240.48 | 17 | 224.5 | 327.98
encoding.com 2 (max+TT) | 439.5 | 21 | 445 | 700.15
encoding.com 4 (plus+TT+instant) | 461.56 | 14 | 459.5 | 757.87
encoding.com 1 (max+TT+instant) | 484.88 | 10 | 459 | 770.01
encoding.com 3 (max) | 1274.63 | 11 | 1274 | 1552.69
panda | 1680.96 | 10 | 1648.5 | 1710.9
sorenson | 3777.54 | 10 | 3835 | 3808.94

Failures - None

Test Results: 4 Parallel Mobile Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) mobile video encode jobs submitted at various times of the day over a period of 1 week. In this test scenario, we again experienced some long queue times during one of the encoding.com test phases.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 20.4 | 15 | 20 | 27.27
panda | 26.65 | 7 | 26 | 27.21
encoding.com 1 (max+TT+instant) | 34 | 8 | 34 | 49.75
sorenson | 55.2 | 15 | 53 | 77.54
encoding.com 4 (plus+TT+instant) | 68 | 60 | 44.5 | 106.63
encoding.com 2 (max+TT) | 87.46 | 100 | 37 | 104.92
encoding.com 3 (max) | 115.88 | 197 | 46 | 149.32

Failures

  • sorenson: 8

Test Results: 4 Parallel MP3 Audio Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) audio encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 115.62 | 6 | 114 | 126.16
sorenson | 187.24 | 5 | 184 | 204.57
panda | 225.15 | 17 | 214 | 227.92
encoding.com 4 (plus+TT+instant) | 234.38 | 17 | 228.5 | 266.63
encoding.com 2 (max+TT) | 265.43 | 19 | 261 | 313.56
encoding.com 1 (max+TT+instant) | 273.67 | 17 | 274 | 344
encoding.com 3 (max) | 285.69 | 31 | 259 | 317.15

Failures

  • sorenson: 3
  • encoding.com max+TT+instant: 5
  • encoding.com max+TT: 5
  • encoding.com max: 3

Encoding Service Accounts

The following encoding services are included in this analysis: Zencoder, encoding.com, Sorenson Media, and Panda. Each service offers different account and pricing options. We set up an account with each service (including Zencoder) using their standard signup process. Because 4 parallel job requests were part of the testing, we opted for service plans that would limit the effects of queue time under such conditions. The pricing data below is for informational purposes only. The purpose of this post is not to compare service pricing, as there are simply too many variations between services to be able to do so. The following is a summary of the account options we selected and pricing with each service:
Service | Plan Used | Plan Cost | Plan Usage Included | Encoding Cost | Total Test Costs
Zencoder | Launch | $40/mo | 1000 Encoding Mins | $0.08/min HD; $0.04/min SD; $0.01/min audio | $91
encoding.com | Max | $299/mo | 75GB Encoded Media | NA | $299
Sorenson Media | Squeeze Managed Server | $199/mo | 1200 Encoding Mins | $8 per 60 Mins extra | $375 (incl addl encoding mins)
Panda | 4 dedicated encoders | $396 | Unlimited encoding - $0.15/GB to upload encoded media | NA | $436 (incl bandwidth)

Disclaimer

This comparison was sponsored by Zencoder. In order to sustain our ability to provide useful, free & publicly accessible analysis, we frequently take on paid engagements. However, in doing so, we try to maintain our credibility and objectivity by using reliable tests, being transparent about our test methods, and attempting to represent the data in a fair way.

Summary

In order to maintain our objectivity and independence, we generally do not recommend one service over another. We prefer to simply present the data as it stands, and let readers draw their own conclusions.

Monday, April 25, 2011

An unofficial EC2 outage postmortem - the sky is not falling

Last week Amazon Web Services (AWS) experienced a high profile outage affecting Elastic Compute Cloud (EC2) and Elastic Block Storage (EBS) in 1 of 4 data centers in the US East region. This outage took down some high profile websites including Reddit, Quora and FourSquare and generated scores of negative PR. In the days that followed, media outlets and bloggers wrote literally hundreds of articles such as Amazon's Trouble Raises Cloud Computing Doubts (New York Times), The Day The Cloud Died (Forbes), Amazon outage sparks frustration, doubts about cloud (Computerworld), and many others.

EC2 and EBS in a nutshell

In case you are not familiar with the technical jargon and acronyms, EBS is one of two methods provided by AWS for attaching storage volumes (basically cloud hard drives) to an EC2 instance (an EC2 instance is essentially a server). Unlike a traditional hard drive that is located physically inside of a computer, EBS is stored externally on dedicated storage boxes and connected to EC2 instances over a network. The second storage option provided by EC2 is called ephemeral, which uses the more traditional method of hard drives located physically inside the same hardware that an EC2 instance runs on. Using EBS is encouraged by AWS and provides some unique benefits not available with ephemeral storage. One such benefit is the ability to recover quickly from a host failure (a host is the hardware that an EC2 instance runs on). If the host fails for an EBS EC2 instance, it can quickly be restarted on another host because its storage does not reside on the failed host. By contrast, if the host fails for an ephemeral EC2 instance, that instance and all of the data stored on it will be permanently lost. EBS instances can also be shut down temporarily and restarted later, whereas ephemeral instances are deleted if shut down. EBS also theoretically provides better performance and reliability when compared to ephemeral storage.

Other technical terms you may hear and should understand regarding EC2 are virtualization and multi-tenancy. Virtualization allows AWS to run multiple EC2 instances on a single physical host by creating simulated "virtual" hardware environments for each instance. Without virtualization, AWS would have to maintain a 1-to-1 ratio between EC2 instance and physical hardware, and the economics just wouldn't work. Multi-tenancy is a consequence of virtualization in that multiple EC2 instances share access to physical hardware. Multi-tenancy often causes performance degradation in virtualized environments because instances may need to wait briefly to obtain access to physical resources like CPU, hard disk or network. The term noisy neighbor is often used to describe this scenario in very busy environments where virtual instances are waiting frequently for physical resources causing noticeable declines in performance.

EC2 is generally a very reliable service. Without a strong track record, high profile websites like Netflix would not use it. We conduct ongoing independent outage monitoring of over 100 cloud services, which shows 3 of the 5 AWS EC2 regions having 100% availability over the past year. In fact, our own EBS backed EC2 instance in the affected US East region remained online throughout last week's outage.

AWS endorses a different type of architectural philosophy called designing for failure. In this context, instead of deploying highly redundant and fault tolerant (and very expensive) "enterprise" hardware, AWS uses low cost commodity hardware and designs their infrastructure to expect and deal gracefully with failure. AWS deals with failure using replication. For example, each EBS volume is stored on 2 separate storage arrays. In theory, if one storage array fails, its volumes are quickly replaced with the backup copies. This approach provides many of the benefits of enterprise hardware, such as fault tolerance and resiliency, while at the same time providing substantially lower hardware costs, enabling AWS to price their services competitively.

The outage - what went wrong?

Disclaimer: This is our own opinion of what occurred during last week's EC2 outage based on our interpretation of the comments provided on the AWS Service Health Dashboard and basic knowledge of the EC2/EBS architecture.

At about 1AM PST on Thursday April 21st, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and EBS. This event triggered a failover sequence wherein EC2 automatically swapped the EBS volumes that had lost connectivity for their backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (they refer to this as "re-mirroring"). While this procedure works fine for a few isolated EBS failures, this event was more widespread, which created a very high load on the EBS infrastructure and the network that connects it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own. All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and exhausted the available EBS physical storage in this availability zone. Because EBS performance is dependent on network latency and throughput to EC2, and because those networks were saturated with activity, EBS performance became severely degraded, or in many cases completely failed. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances in those availability zones. Overall, a very bad day for AWS and EC2.

The sky is not falling

Despite what some media outlets, bloggers and AWS competitors are claiming, we do not believe this event is reason to question the viability of AWS, external instance storage, or the cloud in general. AWS has stated they will evaluate closely the events that triggered this outage, and apply appropriate remedies. The end result will be a more robust and battle hardened EBS architecture. For users of AWS affected by this outage, this should be cause to re-evaluate their cloud architecture. There are many techniques suggested by AWS and prominent AWS users that will help to deal with these types of outages in the future without incurring significant downtime. These include deploying load balanced servers across multiple availability zones and using more than one AWS region.

Netflix is a large and very visible client of AWS that was not affected by this outage. The reason for this is that they have learned to design for failure. In a recent blog post, Adrian Cockcroft (Netflix's Cloud Architect) wrote about some of the technical details and shortcomings of EBS. At a high level, the take away points from his post are:

  • EC2, EBS and the network that attaches them are all shared resources. As such, performance will vary significantly depending on multi-tenancy and shared load. Performance variance will be greater on smaller EC2 instances and EBS volumes where multi-tenancy is a greater factor
  • Users can reduce the potential effects of multi-tenancy by using larger EC2 instances and EBS volumes. To reduce EBS multi-tenancy, Netflix uses the largest possible volume size, 1TB. Because each EBS storage array has a limited amount of storage capacity, using larger sized volumes reduces the number of other users that may share that hardware. The same is true of larger EC2 instances. In fact, the largest EC2 instances (any of the 4xlarges) run on dedicated hardware. Because each physical EC2 host has one shared network interface, use of larger EBS volumes and EC2 instances also has the added benefit of increased network throughput
  • Use ephemeral storage on EC2 instances where predictable and consistent performance is necessary. Netflix uses ephemeral storage for their Cassandra datastore and has found it to be more consistently reliable compared to EBS

Too early to throw in the towel

AWS is not alone in experiencing performance and reliability issues with external storage. Based on our independent monitoring, Visi, GigeNet, Tata InstaCompute, Flexiscale, Ninefold and VPS.NET have all experienced similar outages. Our monitoring shows that external storage failures are a very significant cause of cloud outages. When external storage systems fail, vendors often have a very difficult time recovering quickly. Designing fault tolerant and performant external storage for the cloud is a very complex problem, so much so that many vendors including Rackspace Cloud and Joyent avoid it entirely. Joyent, for example, recently documented their unsuccessful attempt to deploy external storage in their cloud service. However, despite the complexity of this problem, we believe it is far too early for cloud vendors and users to throw in the towel. There are significant advantages to external storage versus ephemeral, including:

  • Host failure tolerance: If the power supply, motherboard, or any component of a host system fails, the instances running on it can be quickly migrated to another host
  • Shutdown capability: With most providers, external storage instances can be shut down temporarily and then incur only storage fees
  • Greater flexibility: External storage offers features and flexibility generally unavailable with ephemeral storage. These may include the ability to backup volumes, create snapshots, clone, create custom OS templates, resize partitions and attach multiple storage volumes to a single instance

Innovation in external storage

Besides AWS, there are other providers innovating in the external storage space. OrionVM, a cloud startup in Australia, has developed their own distributed, horizontally scalable, external storage architecture based on a high performance communication link called InfiniBand. Instead of using dedicated storage hardware, OrionVM uses the same hardware for both storage and server instances. The server instances use storage located on multiple external hosts connected to them via redundant 40 Gb/s InfiniBand links. If a physical host fails, the instances running on it can be restored on another host because their storage resides externally. OrionVM also replicates storage across multiple host systems allowing for fault tolerance should a storage host fail. This hybrid approach combines the benefits of ephemeral storage (i.e. lower multi-tenancy ratio, faster IO throughput) with those of external storage (i.e. host failure tolerance). Multi-tenancy performance degradation is also not a significant factor because OrionVM uses a distributed, non-centralized storage architecture. This approach scales well horizontally because adding a new host increases both instance and storage capacity. Use of 40 Gb/s InfiniBand also provides very high instance to storage throughput. Our own benchmarking shows very good IO performance with OrionVM. Complete results for these benchmarks are available on our website. A summary is provided below comparing OrionVM to both external and ephemeral instances with EC2, GoGrid, Joyent, Rackspace and SoftLayer. In these results, OrionVM performed very well as did EC2's cluster compute instance using ephemeral or EBS raid 0 volumes. GoGrid also performed well running on their new Westmere hardware and ephemeral storage. Details on the IO metric are available here. We are including these benchmark results to demonstrate that external storage can perform as well or better than ephemeral storage.

Legend

Label | Storage Type | Description
ec2-us-east.cc1.4xlarge-raid0-local | Ephemeral | EC2 cluster instance cc1.4xlarge, Raid 0, 2 ephemeral volumes
ec2-us-east.cc1.4xlarge-raid0x4-ebs | External | EC2 cluster instance cc1.4xlarge, Raid 0, 4 EBS volumes
ec2-us-east.cc1.4xlarge-local | Ephemeral | EC2 cluster instance cc1.4xlarge, single ephemeral volume
gg-16gb-us-east | Ephemeral | 16GB GoGrid instance
or-16gb | External | 16GB OrionVM instance
jy-16gb-linux | Ephemeral | 16GB Joyent Linux Virtual Machine
ec2-us-east.cc1.4xlarge | External | EC2 cluster instance cc1.4xlarge, single EBS volume
ec2-us-east.m2.4xlarge-raid0x4-ebs | External | EC2 high memory instance m2.4xlarge, Raid 0, 4 EBS volumes
rs-16gb | Ephemeral | 16GB Rackspace Cloud instance
ec2-us-east.m2.4xlarge | External | EC2 high memory instance m2.4xlarge, single EBS volume
sl-16gb-wdc | External | 16GB SoftLayer CloudLayer instance

Summary

Last week's EBS outage has shed some light on what we consider to be one of the biggest challenges of the cloud: the problem of external storage. However, we see this event more in terms of the glass half full. First, we believe that AWS will thoroughly dissect this outage and use it to improve the fault tolerance and reliability of EBS in the future. Next, cloud users affected by this outage will re-evaluate their own cloud architecture and adopt a more failure tolerant approach. Finally, we hope that AWS and other vendors like OrionVM will continue to innovate in the external storage space.

Saturday, January 15, 2011

Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process wherein each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages greater than 5 minutes using our vendor contacts or the provider's status page (if available). Outages triggered due to scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
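
Once outages have been confirmed and filtered as described above, the availability figure itself is simple arithmetic; a minimal sketch of that calculation:

    def availability(minutes_monitored, outage_minutes):
        """Percent availability after maintenance windows and self-inflicted
        outages have already been filtered out."""
        return 100.0 * (minutes_monitored - sum(outage_minutes)) / minutes_monitored

    MINUTES_PER_YEAR = 365 * 24 * 60
    # e.g. a single confirmed 2.3 minute outage over a full year of monitoring
    print(f"{availability(MINUTES_PER_YEAR, [2.3]):.4f}")   # 99.9996, i.e. the 99.999% rows below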

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching various vendor SLA policies for this post, we discovered a few general themes with regard to SLA credit policies we'd like to mention here (a short sketch comparing the three follows this list). These include the following:

  • Pro-rated Credit (Pro-rated): Credit is based on a simple pro-ration of the amount of downtime that exceeded the SLA guarantee. Credit is issued based on that calculated exceedance and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1 hour outage gets a 100 hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that: service credit, not redeemable for a refund
  • Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a network 100% SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes
  • Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%
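
To make the differences between these three policy styles concrete, here is a simplified sketch of how each would translate downtime into a credit. The numbers are illustrative only; real SLAs add caps, claim windows and other fine print.

    def prorated_credit(monthly_fee, downtime_hrs, hours_in_month, sla_pct, multiple):
        """Pro-rated: credit = multiple x the downtime beyond the SLA allowance, capped at 100%."""
        allowed_hrs = hours_in_month * (1 - sla_pct / 100.0)
        excess_hrs = max(0.0, downtime_hrs - allowed_hrs)
        return min(monthly_fee, monthly_fee * (excess_hrs * multiple) / hours_in_month)

    def threshold_credit(monthly_fee, outage_minutes, threshold_min=30, pct_per_outage=5):
        """Threshold: only continuous outages longer than the threshold earn any credit."""
        qualifying = sum(1 for m in outage_minutes if m > threshold_min)
        return min(monthly_fee, monthly_fee * qualifying * pct_per_outage / 100.0)

    def percentage_credit(monthly_fee, downtime_pct, pct_per_unit=5, unit_pct=0.1):
        """Percentage: X% of the invoice per unit of downtime (e.g. 5% per 0.1% downtime)."""
        return min(monthly_fee, monthly_fee * (downtime_pct / unit_pct) * pct_per_unit / 100.0)

    # A single 1 hour outage in a 720 hour month on a $100/mo server:
    print(prorated_credit(100, 1, 720, 100.0, 100))   # 100x multiple (GoGrid-style) -> ~$13.89
    print(threshold_credit(100, [60]))                # one 60 min outage past a 30 min threshold -> $5.00
    print(percentage_credit(100, 100 * 1 / 720))      # ~0.14% downtime at 5% per 0.1% -> ~$6.94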

The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method gives the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.

SLAs versus Measured Availability

The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS services in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.

Provider | Data Center | Total # Outages / Mins Down | Credit Policy | Credit Terms | SLA | Actual
AWS EC2 | US East | 0/0 | Percentage | 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100%
AWS EC2 | US West | 0/0 | Percentage | 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100%
GoGrid | US West | 0/0 | Pro-rated | 100x credit for any downtime | 100% | 100%
Linode VPS | London | 0/0 | Pro-rated | 1x credit for downtime exceeding 0.1% | 99.9% | 100%
OpSource Cloud | VA, US | 0/0 | Percentage | 5% invoice credit for 60 minutes downtime, 10% for up to 120 minutes, and so on | 100% | 100%
Storm on Demand | MI, US | 0/0 | Pro-rated | 10x credit for any downtime | 100% | 100%
VoxCLOUD | EU | 0/0 | Percentage | 5% invoice credit per 0.1% downtime up to 100% | 100% | 100%
GoGrid | US East | 1/2.3 | Pro-rated | 100x credit for any downtime | 100% | 99.999%
Joyent Smart Machines | Andover, MA | 1/3 | Percentage | 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.999%
VoxCLOUD | Singapore | 1/5.5 | Percentage | 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.999%
Speedyrails VPS | Peer1 Quebec | 1/2.2 | Percentage | 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.999%
Rackspace Cloud | Dallas, TX | 1/8.7 | Threshold/Percentage | 5% of the fees for each 30 minutes of network downtime (1 hour for hardware), up to 100%; host hardware failures guaranteed to be fixed within 1 hour of problem identification | 100%¹ | 99.998%
SoftLayer CloudLayer | Dallas, TX | 4/13.9 | Threshold/Percentage | 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100%¹ | 99.997%
Hosting.com | Colorado | 1/1.4 | Percentage | 1/30th monthly invoice credit for every 30 minutes of network downtime; 1/30th monthly invoice credit for every 30 minutes of hardware downtime after a 1 hour buffer for hardware repair | 100%¹ | 99.997%
AWS EC2 | APAC | 5/14.8 | Percentage | 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.996%
Linode | Atlanta | 10/26.9 | Pro-rated | 1x credit for downtime exceeding 0.1% | 99.9% | 99.995%
Joyent Smart Machines | Emeryville, CA | 4/15.2 | Percentage | 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.994%
Terremark vCloud | FL, US | 7/37.9 | Unique | $1 for every 15 minute downtime period, up to a maximum equal to 50% of the usage fees | 100% | 99.993%
AWS EC2 | EU West | 3/36 | Percentage | 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.993%
Speedyrails VPS | Canix Quebec | 9/38.7 | Percentage | 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.992%
Linode | Fremont, CA | 13/71.9² | Pro-rated | 1x credit for downtime exceeding 0.1% | 99.9% | 99.986%
Zerigo | CO, US | 9/66.8 | Pro-rated | 4x the total (starting from 100%, not 99.99%) non-compliant time | 99.99% | 99.985%
SoftLayer CloudLayer | DC, US | 31/86.7 | Threshold/Percentage | 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100%¹ | 99.984%
SoftLayer CloudLayer | WA, US | 13/106.8 | Threshold/Percentage | 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100%¹ | 99.980%
Linode | NJ, US | 14/145.7 | Pro-rated | 1x credit for downtime exceeding 0.1% | 99.9% | 99.972%
VoxCLOUD | NY, US | 12/146.3³ | Percentage | 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.972%
CloudSigma | Switzerland | 22/59.9 | Threshold/Percentage | 50x credit for any downtime (network or hardware) over 15 minutes | 100% | 99.972%
Hosting.com | KY, US | 4/38.7⁴ | Percentage | 1/30th monthly invoice credit for every 30 minutes of network downtime; 1/30th monthly invoice credit for every 30 minutes of hardware downtime after a 1 hour buffer for hardware repair | 100%¹ | 99.955%
ThePlanet Cloud Servers | TX, US | 34/144.3 | Threshold/Percentage | 5% monthly invoice credit for the first 5 minute continuous outage (hardware or network); then 5% additional credit for each additional 30 minute continuous outage | 100% | 99.955%
Gandi VPS | France | 4/147.7 | Pro-rated | 1 day credit for every outage over 7 minutes within a single day | 99.95% | 99.955%
Linode | Dallas | 21/258.2 | Pro-rated | 1x credit for downtime exceeding 0.1% | 99.9% | 99.951%
NewServers | FL, US | 39/288.7 | Pro-rated | 24x credit for every 1 hour of downtime exceeding 0.001% | 99.999% | 99.945%
VPS.NET | UK | 8/250.3 | Percentage | 10% monthly invoice credit for each hour of downtime | 100%⁵ | 99.921%
VPS.NET | US Central | 12/342.9 | Percentage | 10% monthly invoice credit for each hour of downtime | 100%⁵ | 99.892%
Flexiant | UK | 83/820.3⁶ | Percentage | 5% monthly invoice credit for each 30 minutes of downtime | 100% | 99.844%
VPS.NET | US West | 32/576.5 | Percentage | 10% monthly invoice credit for each hour of downtime | 100%⁵ | 99.819%
ReliaCloud | MN, US | 23/1941.5⁷ | Pro-rated | 30x hourly credit for each hour of downtime | 100% | 99.626%
VPS.NET | US East | 6/1224.1⁸ | Percentage | 10% monthly invoice credit for each hour of downtime | 100%⁵ | 99.616%

¹ Applies to network connectivity only, not hardware outages

² Linode does not own or operate this data center (or any of its data centers, to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode

³ A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010

⁴ A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved

⁵ Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.

⁶ Approximately 560 minutes of these outages occurred due to failure of their SAN

⁷ A majority of these outages (1811 minutes) occurred between Jan-Feb 2010 immediately following ReliaCloud's public launch (post beta). Most of the downtime seems to have been due to SAN failures

⁸ The explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running"

Is there a correlation between SLA and actual availability?

The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:

Total # of services analyzed: 38
Services that met or exceeded SLA: 15/38 [39%]
Services that did not meet SLA: 23/38 [61%]
Vendors with 100% SLAs: 23/38 [61%]
Vendors with 100% SLAs achieving their SLA: 4/23 [17%]
Mean availability of vendors with 100% SLAs: 99.929% [6.22 hrs/yr downtime]
Median availability of vendors with 100% SLAs: 99.982% [1.58 hrs/yr downtime]

It is very interesting to observe that the bottom 6 vendors all provide 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were achieved by only a minority (39%) of the services. This is particularly applicable to vendors with 100% SLAs, where only 4 of 23 (17%) actually achieved 100% availability.

Vendors with generous SLA credit policies

In most cases SLA credit policies provide extremely minimal financial recourse, to say nothing of all the hoops you'll have to jump through to get them. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:

  • GoGrid: provides a 100x credit policy combined with 100% SLA for any hardware and network outages and no minimum thresholds (e.g. 1 hour outage = 100 hour credit). This is by far the most generous of the 38 IaaS vendors we evaluated. GoGrid's service is also one of the most reliable IaaS services we currently monitor (100% US West and 99.999% US East)
  • Joyent: provides a 5% invoice SLA credit for each 30 minutes of monthly (non-continuous) downtime (equates to about 72x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
  • VoxCloud: provides a 5% invoice credit per 0.1% of monthly (non-continuous) downtime (about every 45 minutes - equates to about 48x pro-rated credit) combined with 100% SLA and no minimum outage thresholds

Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed

We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.

Cloud Outages RSS Feed

This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.

getAvailability Web Service

This post includes a small snapshot of the data we maintain on cloud availability. We have released a new web service that allows users to calculate availability and retrieve outage details (including supporting comments) for any of the 90 cloud services we currently monitor. Monitoring for many of these services began between October 2009 and January 2010, but we are also continually adding new services to the list. This web service allows users to calculate availability and retrieve outage information for any time frame, service type, vendor, etc. To get you started, we have provided a few example RESTful request URLs. These example requests return JSON formatted data. To request XML formatted data append &ws-format=xml to any of these URLs. Full API documentation for this web service is provided here. A SOAP WSDL is also provided here. You may invoke this web service for free up to 5 times daily. To purchase a web service token allowing additional web service invocations click here.
Retrieve availability for all IaaS vendors for the past year (first 10 of 46 results)
Retrieve availability for all IaaS vendors for the past year (results 11-20 of 46)
Retrieve availability for all CDNs for 2010 (first 10 of 13 results)
Retrieve availability for all CDNs for 2010 (results 11-13 of 13)
Retrieve availability for all AWS services (EC2, S3, CloudFront) for the past 6 months
Retrieve availability for GoGrid Cloud Servers for the past 2 weeks
Retrieve availability for VPS.net's US East data center since 1/1/2010 - include full outage documentation
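
As an example of consuming the getAvailability service from a script, here is a minimal sketch. The host, path and query parameter below are placeholders (substitute one of the example request URLs linked above); only the default JSON output and the ws-format parameter described in this post are assumed.

    import json
    import urllib.request

    # Placeholder URL -- substitute one of the example getAvailability request URLs above
    REQUEST_URL = "http://api.example.test/ws/getAvailability?serviceId=example"

    with urllib.request.urlopen(REQUEST_URL) as resp:
        data = json.loads(resp.read().decode("utf-8"))   # JSON is the default response format

    print(json.dumps(data, indent=2)[:500])
    # Appending &ws-format=xml to the same URL would return XML instead, per the post above.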

Summary

Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than by technical merit or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet they provide some of the best actual availability (100% for 2 regions, 99.996% and 99.993%). Beware of the fine print: many cloud vendors utilize minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit, regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.