Thursday, November 17, 2011

Is Joyent Really 14X Faster than EC2 and Azure the "Fastest Cloud"? Questions to Ask About Benchmark Studies

Many are skeptical of claims that involve benchmarks, and with good reason: over the years benchmarks have been manipulated and misrepresented. Benchmarks aren't inherently bad or created in bad faith. To the contrary, when understood and applied correctly, benchmarks can provide useful insight for performance analysis and capacity planning. The problem is that benchmarks are often misunderstood or misrepresented, frequently resulting in bold assertions and questionable claims. Oftentimes there are also extraneous factors involved, such as agenda-driven marketing organizations. In fact, the term "benchmarketing" was coined to describe questionable marketing-driven, benchmark-based claims. This post will discuss a few questions one might consider when reading benchmark-based claims. We'll then apply these questions to two recent cloud-related, benchmark-based studies.

Questions to consider

The following are 7 questions one might ask when considering benchmark-based claims. Answering these questions will help provide a clearer understanding of the validity and applicability of the claims.
  1. What is the claim? Typically the bold-faced, attention-grabbing headline, like Service Y is 10X faster than Service Z
  2. What is the claimed measurement? Usually implied by the headline. For example, the claim Service Y is 10X faster than Service Z implies a measurement of overall system performance
  3. What is the actual measurement? To answer this question, look at the methodology and benchmark(s) used. This may require some digging, but can usually be found somewhere in the article body. Once found, do some research to determine what was actually measured. For example, if Geekbench was used, you would discover the actual measurement is processor and memory performance, but not disk or network IO
  4. Is it an apples-to-apples comparison? The validity of a benchmark-based claim ultimately depends on the fairness of the testing methodology. Claims involving comparisons should compare similar things. For example, Ford could compare a Mustang Shelby GT500 (top speed 190 MPH) to a Chevy Aveo (top speed 100 MPH) and claim their cars are nearly twice as fast, but the Aveo is not a comparable vehicle and therefore the claim would be invalid. A fairer, apples-to-apples comparison would be a Mustang GT500 and a Chevy Camaro ZL1 (top speed 186 MPH).
  5. Is the playing field level? Another important question to ask is whether or not there are any extraneous factors that provided an unfair advantage to one test subject over another. For example, using the top speed analogy, Ford could compare a Mustang with 92 octane fuel and a downhill course to a Camaro with 85 octane fuel and an uphill course. Because there are extraneous factors (fuel and angle of the course) which provided an unfair advantage to the Mustang, the claim would be invalid. To be fair, the top speeds of both vehicles should be measured on the same course, with the same fuel, fuel quantity, driver and weather conditions.
  6. Was the data reported accurately? Benchmarking often results in large datasets. Summarizing the data concisely and accurately can be challenging. Things to watch out for include lack of good statistical analysis (i.e. reporting an average only), math errors, and sloppy calculations. For example, if large, highly variable data is collected, it is generally a best practice to report the median value in place of the mean (average) to mitigate the effects of outliers. Standard deviation is also a useful metric to include to identify data consistency. The short sketch following this list illustrates how these statistics can tell very different stories about the same data.
  7. Does it matter to you? The final question to ask is, assuming the results are valid, does it actually mean anything to you? For example, purchasing a vehicle based on a top speed comparison is not advisable if fuel economy is what really matters to you.
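
To make the statistics point concrete, here is a minimal sketch in Python. The response times are made up purely for illustration and are not from either study:

    import statistics

    # Hypothetical response times in seconds: mostly fast, with two slow outliers
    samples = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 9.8, 1.2, 1.1, 8.5]

    mean   = statistics.mean(samples)    # pulled upward by the two outliers (~2.8s)
    median = statistics.median(samples)  # largely unaffected by the outliers (~1.25s)
    stdev  = statistics.stdev(samples)   # a large standard deviation flags inconsistent data
    p90    = sorted(samples)[int(0.9 * len(samples)) - 1]  # rough 90th percentile

    print(f"mean={mean:.2f}s median={median:.2f}s stdev={stdev:.2f}s p90={p90:.2f}s")

Reporting only the mean (~2.8s) would overstate the typical response time by more than 2x compared to the median (~1.25s), which is exactly the kind of distortion a good summary should expose rather than hide.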

Case Study #1: Joyent Cloud versus AWS EC2

In this case study, Joyent sponsored a third-party benchmarking study to compare Joyent Cloud to AWS EC2. The study utilized our own (CloudHarmony) benchmarking methodology to compare 3 categories of performance: CPU, Disk IO and Memory. The results of the study are published on the Joyent website and available here. In the table below, we'll apply the questions listed above to this study. Answers are color coded green where the study provided a positive response to the question, and red where the results are misleading or misrepresented.
Questions & Answers

What is the claim? Joyent Cloud is 3x - 14x Faster than AWS EC2
The claims are broken down by measurement type (CPU, Disk IO, Memory), and OS type (SmartOS/Open Solaris, Linux). The resulting large, colorful icons on the Joyent website claim that Joyent Cloud is faster than EC2 by a margin of 3x - 14x
What is the claimed measurement? CPU, Disk IO, Memory Performance
Our benchmarking methodology was used to measure these different categories of performance. This methodology consists of running multiple benchmarks per category and creating a composite measurement based on a summary of the results for all benchmarks in each category. The methodology is described in more detail on our blog here (CPU), here (Disk IO) and here (Memory).
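
As a rough illustration of that composite approach, individual benchmark results can be normalized against a reference system and combined, for example with a geometric mean. This is a simplified sketch, not our exact weighting or benchmark set, and the benchmark names and scores are hypothetical:

    from math import prod

    def composite_score(results, baseline):
        # Normalize each benchmark against a baseline system and combine the
        # ratios with a geometric mean (illustrative only -- the actual
        # CloudHarmony weighting and benchmark set differ)
        ratios = [results[name] / baseline[name] for name in baseline]
        return prod(ratios) ** (1.0 / len(ratios))

    # Hypothetical scores for 3 benchmarks (higher = better)
    baseline = {"benchmark_a": 100.0, "benchmark_b": 250.0, "benchmark_c": 80.0}
    system_x = {"benchmark_a": 180.0, "benchmark_b": 400.0, "benchmark_c": 150.0}

    print(round(composite_score(system_x, baseline), 2))  # ~1.75x the baseline
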
What is the actual measurement? CPU, Disk IO, Memory Performance
Is it an apples-to-apples comparison? Dissimilar instance types were compared
In the Linux comparison, Joyent claims 5x faster CPU, 3x faster Disk IO, and 4x faster memory. Based on the report details, it appears those ratios originate from comparing a 1GB Joyent VM to an EC2 m1.small. This selection provided the largest performance differential and hence the biggest claim. While price-wise these instance types are similar (disregarding m1.small spot and reserve pricing, where it is 1/2 the cost), that is where the similarities stop. At the time of this report, the m1.small was the slowest EC2 instance with a single core and an older CPU, while Joyent's 1GB instance type has 2 burstable cores and a newer CPU. The m1.small is not intended for compute-intensive tasks. For that type of workload EC2 offers other options with newer CPUs and more cores. To provide an apples-to-apples comparison of performance, the claim should be based on 2 instance types that are intended for such a purpose (e.g. an EC2 m2 or cc1).
Is the playing field level? Operating system and storage type were different
The study compares Joyent Cloud VMs running SmartOS or Ubuntu 10.04 to AWS EC2 VMs running CentOS 5.4. Joyent's SmartOS is based on Open Solaris and highly optimized for the Joyent environment. Ubuntu 10.04 uses Linux kernel 2.6.32 (release date: Dec 2009), which is over 3 years newer than the 2.6.18 kernel (release date: Sep 2006) in CentOS 5.4. Newer and more optimized operating systems will almost always perform better for similar tasks on identical hardware. This provided an advantage to the Joyent VMs from the outset.

Additionally, the tests compared EC2 instances running on networked storage (EBS) to Joyent instances running on local storage, which also provided an advantage to the Joyent VMs for the disk IO benchmarks.
Was the data reported accurately? Mistakes were made in calculations
This study was based on a cloud performance comparison methodology we (CloudHarmony) developed for a series of blog posts in 2010. For CPU performance, we developed an algorithm that combines the results of 19 different CPU benchmarks to provide a single performance metric that attempts to approximate the AWS ECU (Elastic Compute Unit). To do so, we utilized EC2 instances and their associated ECU values as a baseline. We called this metric CCU, and the algorithm for producing it was described in this blog post. Part of the algorithm involves calculating CCU when performance exceeds the largest baseline EC2 instance type, the 26 ECU m2.4xlarge. In our algorithm we used the performance differential ratio between an m1.small (1 ECU) and an m2.4xlarge (26 ECUs). The third party, however, used the ratio between an m2.2xlarge (13 ECUs) and an m2.4xlarge (26 ECUs). Because the m2s run on the same hardware type, the performance difference between an m2.2xlarge and an m2.4xlarge is not very great, but the difference in ECUs is large, so extrapolating with that pair greatly overstates the ECUs earned per unit of additional performance. The end result was that their calculations produced a very high CCU value for the Joyent instances (in the range of 58-67 CCUs). Had the correct algorithm been used, the reported CCUs would have been much lower.
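
To see why the choice of baseline pair matters so much, here is a simplified sketch of that kind of linear extrapolation. The benchmark scores below are hypothetical, and the real CCU algorithm described in our blog post is more involved, but the effect of choosing the wrong baseline pair is the same:

    def extrapolate_ccu(score, low_score, low_ecu, high_score, high_ecu):
        # Linearly extrapolate an ECU-like value for a score that exceeds the
        # largest baseline instance (simplified for illustration)
        slope = (high_ecu - low_ecu) / (high_score - low_score)  # ECUs per benchmark point
        return high_ecu + (score - high_score) * slope

    # Hypothetical aggregate benchmark scores for the baseline instances
    m1_small   = 10.0   # 1 ECU
    m2_2xlarge = 230.0  # 13 ECUs
    m2_4xlarge = 260.0  # 26 ECUs (score doesn't double because the m2s share the same hardware)

    subject = 300.0     # a system somewhat faster than an m2.4xlarge

    # Correct baseline pair (m1.small -> m2.4xlarge): gentle slope, modest result
    print(extrapolate_ccu(subject, m1_small, 1, m2_4xlarge, 26))     # ~30 CCUs
    # Wrong baseline pair (m2.2xlarge -> m2.4xlarge): steep slope, inflated result
    print(extrapolate_ccu(subject, m2_2xlarge, 13, m2_4xlarge, 26))  # ~43 CCUs
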
Does it matter to you? Probably not
There isn't much value or validity to the data provided in these reports. The bold headlines stating Joyent Cloud is 3X - 14X faster than EC2 are based on very shaky ground. In fact, with Joyent's approval, we recently ran our benchmarks in their environment, resulting in the following CPU, disk IO and memory performance metrics:

CloudHarmony Generated Joyent/EC2 Performance Comparison
CPU Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type      | Memory  | Cost      | CCU
EC2      | cc1.4xlarge        | 23 GB   | $1.30/hr  | 33.5
Joyent   | XXXL 48GB (8 CPU)  | 48 GB   | $1.68/hr  | 28.44
EC2      | m2.4xlarge         | 68.4 GB | $2.00/hr  | 26
EC2      | m2.2xlarge         | 34.2 GB | $1.00/hr  | 13
Joyent   | XL 16GB (3 CPU)    | 16 GB   | $0.64/hr  | 10.94
Joyent   | XXL 32GB (4 CPU)   | 32 GB   | $1.12/hr  | 6.82
EC2      | m2.xlarge          | 17.1 GB | $0.50/hr  | 6.5
Joyent   | Large 8GB (2 CPU)  | 8 GB    | $0.36/hr  | 6.19
Joyent   | Medium 4GB (1 CPU) | 4 GB    | $0.24/hr  | 5.53
Joyent   | Medium 2GB (1 CPU) | 2 GB    | $0.17/hr  | 5.45
Joyent   | Small 1GB (1 CPU)  | 1 GB    | $0.085/hr | 4.66
EC2      | m1.large           | 7.5 GB  | $0.34/hr  | 4
EC2      | m1.small           | 1.7 GB  | $0.085/hr | 1
Disk IO Performance: AWS EC2 vs Joyent
View Full Report - Note: the EC2 instances labeled EBS utilized a single networked storage volume - better performance may be possible using local storage or multiple EBS volumes. All Joyent instances utilized local storage (networked storage is not available).
Provider | Instance Type                        | Memory  | Cost      | IOP
EC2      | cc1.4xlarge (local storage - raid 0) | 23 GB   | $1.30/hr  | 212.06
EC2      | cc1.4xlarge (local storage)          | 23 GB   | $1.30/hr  | 194.29
Joyent   | XXXL 48GB (8 CPU)                    | 48 GB   | $1.68/hr  | 187.38
Joyent   | XL 16GB (3 CPU)                      | 16 GB   | $0.64/hr  | 144.71
Joyent   | XXL 32GB (4 CPU)                     | 32 GB   | $1.12/hr  | 142.19
Joyent   | Large 8GB (2 CPU)                    | 8 GB    | $0.36/hr  | 130.84
Joyent   | Medium 4GB (1 CPU)                   | 4 GB    | $0.24/hr  | 110.78
Joyent   | Medium 2GB (1 CPU)                   | 2 GB    | $0.17/hr  | 109.2
EC2      | m2.2xlarge (EBS)                     | 34.2 GB | $1.00/hr  | 87.58
EC2      | m2.xlarge (EBS)                      | 17.1 GB | $0.50/hr  | 83.62
EC2      | m2.4xlarge (EBS)                     | 68.4 GB | $2.00/hr  | 82.79
EC2      | m1.large (EBS)                       | 7.5 GB  | $0.34/hr  | 56.82
Joyent   | Small 1GB (1 CPU)                    | 1 GB    | $0.085/hr | 56.08
EC2      | m1.small (EBS)                       | 1.7 GB  | $0.085/hr | 27.08
Memory Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type      | Memory  | Cost      | CCU
EC2      | cc1.4xlarge        | 23 GB   | $1.30/hr  | 137.2
EC2      | m2.2xlarge         | 34.2 GB | $1.00/hr  | 109.41
EC2      | m2.4xlarge         | 68.4 GB | $2.00/hr  | 109.14
EC2      | m2.xlarge          | 17.1 GB | $0.50/hr  | 103.35
Joyent   | XL 16GB (3 CPU)    | 16 GB   | $0.64/hr  | 100.87
Joyent   | XXXL 48GB (8 CPU)  | 48 GB   | $1.68/hr  | 92.5
Joyent   | XXL 32GB (4 CPU)   | 32 GB   | $1.12/hr  | 90.79
Joyent   | Large 8GB (2 CPU)  | 8 GB    | $0.36/hr  | 90.37
Joyent   | Medium 2GB (1 CPU) | 2 GB    | $0.17/hr  | 84.2
Joyent   | Small 1GB (1 CPU)  | 1 GB    | $0.085/hr | 78.51
Joyent   | Medium 4GB (1 CPU) | 4 GB    | $0.24/hr  | 76.04
EC2      | m1.large           | 7.5 GB  | $0.34/hr  | 61.8
EC2      | m1.small           | 1.7 GB  | $0.085/hr | 22.24

Case Study #2: Microsoft Azure Named Fastest Cloud Service

In October 2011, Compuware published a blog post related to cloud performance. This post was picked up by various media outlets, resulting in headlines declaring Microsoft Azure the "fastest cloud."
Here's how the test worked in a nutshell:
  • Two sample e-commerce web pages were created: the first with item descriptions and 40 thumbnails (product list page), and the second with a single 1.75 MB image (product details page)
  • These pages were made accessible using a Java application server (Tomcat 6) running in each cloud environment. The exceptions to this were Microsoft Azure and Google AppEngine (platform-as-a-service/PaaS environments), which required the pages to be bundled and deployed using their specific technology stacks
  • 30 monitoring servers/nodes were instructed to request these 2 pages in succession every 15 minutes and record the amount of time it took to render both in their entirety (including the embedded images)
  • The 30 monitoring nodes are located in data centers in North America (19), Europe (5), Asia (3), Australia (1) and South America (2) - they are part of the Gomez Performance Network (GPN) monitoring service
  • After 1 year, an average response time was calculated for each service (response times above 10 seconds were discarded) - a simplified sketch of this aggregation follows below
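
Here is a minimal sketch in Python of the aggregation as described. The sample values are hypothetical, and this is our reading of the methodology, not Compuware's actual code:

    def average_response_time(samples, cutoff=10.0):
        # Discard anything above the 10 second cutoff, then take a simple
        # mean across all remaining samples
        kept = [s for s in samples if s <= cutoff]
        return sum(kept) / len(kept)

    # Hypothetical response times (seconds) from a few monitoring nodes
    samples = [2.1, 1.8, 2.4, 9.2, 11.6, 2.0, 3.5]  # the 11.6s sample is discarded
    print(round(average_response_time(samples), 2))  # 3.5

Note how the discard rule silently removes a genuinely slow page load from the average, while the single 9.2 second sample still drags the mean well above the typical ~2 second response - two more reasons a lone average is a poor summary.
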
Now let's dig a little deeper...
Questions & Answers

What is the claim? Microsoft Azure is the "fastest cloud"
What is the claimed measurement? Overall performance (it's fastest)
What is the actual measurement? Network Latency & Throughput
Rendering 2 HTML pages and some images is not CPU intensive, and as such this is not a measure of system performance. The main bottleneck is network latency and throughput, particularly to distant monitoring nodes (e.g. Australia to the US).
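
To put rough numbers on that: round-trip latency from Australia to a US data center is typically in the neighborhood of 200 ms, so the handful of sequential round trips needed to fetch a page and its embedded images can easily add a second or more before any server-side work is counted - far more than the few milliseconds of CPU time needed to serve two static pages. (These latency figures are approximate and ours, not from the study.)
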
Is it an apples-to-apples comparison? Types of services tested are different (IaaS vs PaaS) and the instance types are dissimilar
Microsoft Azure and Google AppEngine are platform-as-a-service (PaaS) environments, very different from infrastructure-as-a-service (IaaS) environments like EC2 and GoGrid. With PaaS, users must package and deploy applications using custom tools and more limited capabilities. Applications are deployed to large, clustered, multi-tenant environments. Because of the greater structure and more limited capabilities of PaaS, providers are able to better optimize and scale those applications, often resulting in better performance and availability when compared to a single-server IaaS deployment.

Not much information is disclosed regarding the sizes of the instances used for the IaaS services. With some IaaS providers, network performance can vary depending on instance size. For example, with Rackspace Cloud, a 256MB cloud server is capped with a 10 Mbps uplink. With EC2, bandwidth is shared across all instances deployed to a physical host, and smaller instance sizes generally have less, and more variable, bandwidth. This test was conducted using nearly the smallest EC2 instance size, an m1.small.
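
To illustrate the impact of such a cap: the 1.75 MB product details image is roughly 14 megabits, so over a 10 Mbps capped uplink the transfer alone takes on the order of 1.4 seconds before any latency is counted - a penalty an uncapped environment never pays. (This is our back-of-the-envelope arithmetic, not a figure from the study.)
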
Is the playing field level? Services may have an unfair advantage due to network proximity and uplink performance
Because network latency is the main bottleneck for this test, and only a handful of monitoring nodes were used, the results are highly dependent on network proximity and latency between the services tested and the monitoring nodes. For example, the Chicago monitoring node might be sitting in the same building as the Azure US Central servers, giving Azure an unfair advantage in the test. Additionally, the IaaS services where uplinks are capped on smaller instance types would be at a disadvantage to uncapped PaaS and IaaS environments.
Was the data reported accurately? Simple average was reported - no median, standard deviation or regional breakouts were provided
The CloudSleuth post provided only a single metric: the average response time for each service across all monitoring nodes. A better way to report this data would involve breaking the data down by region - for example, the average response time for eastern US monitoring nodes. Reporting median, standard deviation and 90th percentile statistics would also be very helpful in evaluating the data. A brief sketch of what a regional breakdown might look like follows below.
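
As a minimal illustration of such a breakdown (the regions and response times below are hypothetical):

    import statistics
    from collections import defaultdict

    # Hypothetical (region, response time in seconds) samples from monitoring nodes
    samples = [
        ("us-east", 1.9), ("us-east", 2.1), ("us-east", 2.4),
        ("europe",  3.2), ("europe",  3.6),
        ("asia",    6.8), ("asia",    7.4),
    ]

    by_region = defaultdict(list)
    for region, rt in samples:
        by_region[region].append(rt)

    for region, values in sorted(by_region.items()):
        print(f"{region}: mean={statistics.mean(values):.1f}s "
              f"median={statistics.median(values):.1f}s")

A single global average of these samples (roughly 3.9 seconds) hides the fact that the hypothetical Asian nodes see response times more than 3x those of the US nodes - exactly the kind of detail a per-region, percentile-based report would surface.
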
Does it matter to you? Probably not
Unless your users are sitting in the same 30 data centers as the GPN monitoring nodes, this study probably means very little. It does not represent a real-world scenario where static content like images would be deployed to a distributed content delivery network like CloudFront or EdgeCast. It attempts to compare two different types of cloud services, PaaS and IaaS. It may use IaaS instance types, like the EC2 m1.small, that represent the worst-case performance scenario. The 30-node test population is also very small and not indicative of a real end-user population (end users don't sit in data centers). Finally, reporting only a single average value ignores most statistical best practices.