Many are skeptical of claims that involve benchmarks. Over the years benchmarks have been manipulated and misrepresented. Benchmarks aren't inherently bad or created in bad faith. To the contrary, when understood and applied correctly, benchmarks can often provide useful insight for performance analysis and capacity planning. The problem with benchmarks is they are often misunderstood or misrepresented, frequently resulting in bold assertions and questionable claims. Oftentimes there are also extraneous factors involved such as agenda-driven marketing organizations. In fact, the term "benchmarketing" was coined to describe questionable marketing-driven, benchmark-based claims.
This post will discuss a few questions one might consider when reading benchmark-based claims. We'll then apply these questions to 2 recent cloud-related, benchmark-based studies.
Questions to consider
The following are 7 questions one might ask when considering benchmark-based claims. Answering these questions will help to provide a clearer understanding of the validity and applicability of the claims.
- What is the claim? Typically the bold-face, attention-grabbing headline, like "Service Y is 10X faster than Service Z".
- What is the claimed measurement? Usually implied by the headline. For example, the claim "Service Y is 10X faster than Service Z" implies a measurement of system performance.
- What is the actual measurement? To answer this question, look at the methodology and benchmark(s) used. This may require some digging, but can usually be found somewhere in the article body. Once found, do some research to determine what was actually measured. For example, if Geekbench was used, you would discover the actual measurement is processor and memory performance, but not disk or network IO.
- Is it an apples-to-apples comparison? The validity of a benchmark-based claim ultimately depends on the fairness of the testing methodology. Claims involving comparisons should compare similar things. For example, Ford could compare a Mustang Shelby GT500 (top speed 190 MPH) to a Chevy Aveo (top speed 100 MPH) and claim their cars are nearly twice as fast, but the Aveo is not a comparable vehicle and therefore the claim would be invalid. A fairer, apples-to-apples comparison would be a Mustang GT500 and a Chevy Camaro ZL1 (top speed 186 MPH).
- Is the playing field level? Another important question to ask is whether or not there are any extraneous factors that provided an unfair advantage to one test subject over another. For example, using the top speed analogy, Ford could compare a Mustang with 92 octane fuel and a downhill course to a Camaro with 85 octane fuel and an uphill course. Because there are extraneous factors (fuel and angle of the course) which provided an unfair advantage to the Mustang, the claim would be invalid. To be fair, the top speeds of both vehicles should be measured on the same course, with the same fuel, fuel quantity, driver and weather conditions.
- Was the data reported accurately? Benchmarking often results in large datasets. Summarizing the data concisely and accurately can be challenging. Things to watch out for include lack of good statistical analysis (e.g. reporting the average only), math errors, and sloppy calculations. For example, if a large, highly variable dataset is collected, it is generally a best practice to report the median value in place of the mean (average) to mitigate the effects of outliers. Standard deviation is also a useful metric to include, since it shows how consistent the data is.
- Does it matter to you? The final question to ask is, assuming the results are valid, does it actually mean anything to you? For example, purchasing a vehicle based on a top speed comparison is not advisable if fuel economy is what really matters to you.
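To make the point about averages and outliers concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration (they are not from either study): nine similar measurements and one large outlier.

```python
import statistics

# Hypothetical response times in milliseconds; the last sample is an outlier.
samples_ms = [102, 98, 105, 99, 101, 97, 103, 100, 104, 2500]

mean_ms = statistics.mean(samples_ms)      # pulled upward by the outlier
median_ms = statistics.median(samples_ms)  # stays near the typical value
stdev_ms = statistics.stdev(samples_ms)    # large value signals inconsistent data

print(f"mean:   {mean_ms:.1f} ms")
print(f"median: {median_ms:.1f} ms")
print(f"stdev:  {stdev_ms:.1f} ms")
```

With this data the mean lands well above every typical sample while the median does not, which is why a report that gives only the average can be hard to interpret.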
Case Study #1: Joyent Cloud versus AWS EC2
In this case study, Joyent sponsored a third-party benchmarking study to compare Joyent Cloud to AWS EC2. The study utilized our own (CloudHarmony) benchmarking methodology to compare 3 categories of performance: CPU, Disk IO and Memory. The end results of the study are published on the Joyent website, available here. In the table below, we'll apply the questions listed above to this study. Answers will be color coded green where the study provided a positive response to the question, and red where the results are misleading or misrepresented.
Questions & Answers
What is the claim? |
Joyent Cloud is 3x - 14x Faster than AWS EC2 |
The claims are broken down by measurement type (CPU, Disk IO, Memory), and OS type (SmartOS/Open Solaris, Linux). The resulting large, colorful icons on the Joyent website claim that Joyent Cloud is faster than EC2 by a margin of 3x - 14x |
What is the claimed measurement? |
CPU, Disk IO, Memory Performance |
Our benchmarking methodology was used to measure these different categories of performance. This methodology consists of running multiple benchmarks per category and creating a composite measurement based on a summary of the results for all benchmarks in each category. The methodology is described in more detail on our blog here (CPU), here (Disk IO) and here (Memory). |
What is the actual measurement? |
CPU, Disk IO, Memory Performance |
|
Is it an apples-to-apples comparison? |
Dissimilar instance types were compared |
In the Linux comparison, Joyent claims 5x faster CPU, 3x faster Disk IO, and 4x faster memory. Based on the report details, it appears those ratios originate from comparing a 1GB Joyent VM to an EC2 m1.small. This selection provided the largest performance differential and hence the biggest claim. While these instance types are similar in price (disregarding m1.small spot and reserved pricing, where it is 1/2 the cost), that is where the similarities stop. At the time of this report, m1.small was the slowest EC2 instance, with a single core and an older CPU, while Joyent's 1GB instance type has 2 burstable cores and a newer CPU. The m1.small is not intended for compute-intensive tasks. For that type of workload EC2 offers other options with newer CPUs and more cores. To provide an apples-to-apples comparison on performance, the claim should be based on 2 instance types that are intended for such a purpose (e.g. an EC2 m2 or cc1). |
Is the playing field level? |
Operating system and storage type were different |
The study compares Joyent Cloud VMs running SmartOS or Ubuntu 10.04 to AWS EC2 VMs running CentOS 5.4. Joyent's SmartOS is based on Open Solaris and highly optimized for the Joyent environment. Ubuntu 10.04 uses Linux Kernel 2.6.32 (release date: Dec 2009), which is over 3 years newer than the 2.6.18 kernel (release date: Sep 2006) in CentOS 5.4. Newer and more optimized operating systems will almost always perform better for similar tasks on identical hardware. This provided an advantage to the Joyent VMs from the outset.
Additionally, the tests compared EC2 instances running on networked storage (EBS) to Joyent instances running on local storage, which also provided an advantage to the Joyent VMs for the disk IO benchmarks. |
Was the data reported accurately? |
Mistakes were made in calculations |
This study was based on a cloud performance comparison methodology we (CloudHarmony) developed for a series of blog posts in 2010. For CPU performance, we developed an algorithm that combined the results of 19 different CPU benchmarks to provide a single performance metric that attempts to approximate the AWS ECU (EC2 Compute Unit). To do so, we utilized EC2 instances and their associated ECU values as a baseline. We called this metric CCU, and the algorithm for producing it was described in this blog post. Part of the algorithm involved calculating CCU when performance exceeded the largest baseline EC2 instance type, the 26 ECU m2.4xlarge. In our algorithm we used the performance differential ratio between an m1.small (1 ECU) and m2.4xlarge (26 ECUs). The third party, however, used the ratio between an m2.2xlarge (13 ECUs) and m2.4xlarge (26 ECUs). Because m2s run on the same hardware type, the performance difference between an m2.2xlarge and an m2.4xlarge is not very great, but the difference in ECUs is very high. The end result was that their calculations produced a very high CCU value for the Joyent instances (in the range of 58-67 CCUs). Had the correct algorithm been used, the reported CCUs would have been much lower. |
Does it matter to you? |
Probably not |
There isn't much value or validity to the data provided in these reports. The bold headlines which state Joyent Cloud is 3X - 14X faster than EC2 rest on very shaky ground. In fact, with Joyent's approval, we recently ran our benchmarks in their environment, resulting in the following CPU, disk IO and memory performance metrics:
CloudHarmony Generated Joyent/EC2 Performance Comparison
CPU Performance: AWS EC2 vs Joyent
View Full Report

| Provider | Instance Type | Memory | Cost | CCU |
|---|---|---|---|---|
| EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 33.5 |
| Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 28.44 |
| EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 26 |
| EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 13 |
| Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 10.94 |
| Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 6.82 |
| EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 6.5 |
| Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 6.19 |
| Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 5.53 |
| Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 5.45 |
| Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 4.66 |
| EC2 | m1.large | 7.5 GB | $0.34/hr | 4 |
| EC2 | m1.small | 1.7 GB | $0.085/hr | 1 |

Disk IO Performance: AWS EC2 vs Joyent
View Full Report - Note: the EC2 instances labeled EBS utilized a single networked storage volume - better performance may be possible using local storage or multiple EBS volumes. All Joyent instances utilized local storage (networked storage is not available).

| Provider | Instance Type | Memory | Cost | IOP |
|---|---|---|---|---|
| EC2 | cc1.4xlarge (local storage - raid 0) | 23 GB | $1.30/hr | 212.06 |
| EC2 | cc1.4xlarge (local storage) | 23 GB | $1.30/hr | 194.29 |
| Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 187.38 |
| Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 144.71 |
| Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 142.19 |
| Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 130.84 |
| Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 110.78 |
| Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 109.2 |
| EC2 | m2.2xlarge (EBS) | 34.2 GB | $1.00/hr | 87.58 |
| EC2 | m2.xlarge (EBS) | 17.1 GB | $0.50/hr | 83.62 |
| EC2 | m2.4xlarge (EBS) | 68.4 GB | $2.00/hr | 82.79 |
| EC2 | m1.large (EBS) | 7.5 GB | $0.34/hr | 56.82 |
| Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 56.08 |
| EC2 | m1.small (EBS) | 1.7 GB | $0.085/hr | 27.08 |

Memory Performance: AWS EC2 vs Joyent
View Full Report

| Provider | Instance Type | Memory | Cost | CCU |
|---|---|---|---|---|
| EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 137.2 |
| EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 109.41 |
| EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 109.14 |
| EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 103.35 |
| Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 100.87 |
| Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 92.5 |
| Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 90.79 |
| Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 90.37 |
| Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 84.2 |
| Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 78.51 |
| Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 76.04 |
| EC2 | m1.large | 7.5 GB | $0.34/hr | 61.8 |
| EC2 | m1.small | 1.7 GB | $0.085/hr | 22.24 |

|
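The calculation error described in this case study (extending the ECU scale past the m2.4xlarge using the wrong pair of baselines) can be illustrated with a rough sketch. This is not the actual CCU algorithm: the linear extrapolation, the function name `extrapolate_ccu`, and every benchmark score below are assumptions chosen for illustration. Only the ECU values (1, 13 and 26) come from the text.

```python
def extrapolate_ccu(score, low, high):
    """Linearly extend the ECU scale past the top baseline (an assumed model).

    low and high are (benchmark_score, ecu) pairs; high is the top baseline.
    """
    low_score, low_ecu = low
    high_score, high_ecu = high
    ecu_per_point = (high_ecu - low_ecu) / (high_score - low_score)
    return high_ecu + (score - high_score) * ecu_per_point


# Hypothetical composite benchmark scores; only the ECU values come from the text.
M1_SMALL = (10.0, 1.0)
M2_2XLARGE = (95.0, 13.0)   # same hardware type as m2.4xlarge, so a similar score
M2_4XLARGE = (100.0, 26.0)

subject_score = 110.0  # hypothetical instance that outperforms the m2.4xlarge

# Ratio taken across the full m1.small to m2.4xlarge range.
ccu_wide = extrapolate_ccu(subject_score, M1_SMALL, M2_4XLARGE)
# Ratio taken across the narrow m2.2xlarge to m2.4xlarge range.
ccu_narrow = extrapolate_ccu(subject_score, M2_2XLARGE, M2_4XLARGE)

print(f"wide baseline:   {ccu_wide:.2f}")
print(f"narrow baseline: {ccu_narrow:.2f}")
```

Under these assumptions, the narrow baseline assigns 13 ECUs to a 5-point score difference, so every point above the top baseline is worth far more than it is with the wide baseline. That is the mechanism by which a small performance lead can turn into a very large reported CCU.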
Case Study #2: Microsoft Azure Named Fastest Cloud Service
In October 2011, Compuware published a blog post related to cloud performance. This post was picked up by various media outlets, resulting in headlines naming Microsoft Azure the fastest cloud service.
Here's how the test worked in a nutshell:
- Two sample e-commerce web pages were created: the first with item descriptions and 40 thumbnails (product list page), and the second with a single 1.75 MB image (product details page)
- These pages were made accessible using a Java application server (Tomcat 6) running in each cloud environment. The exception to this is Microsoft Azure and Google AppEngine (platform-as-a-service/PaaS environments) which required the pages to be bundled and deployed using their specific technology stack
- 30 monitoring servers/nodes were instructed to request these 2 pages in succession every 15 minutes and record the amount of time it took to render both in their entirety (including the embedded images)
- The 30 monitoring nodes are located in data centers in North America (19), Europe (5), Asia (3), Australia (1) and South America (2) - they are part of the Gomez Performance Network (GPN) monitoring service
- After 1 year, an average response time was calculated for each service (response times above 10 seconds were discarded)
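The last step above (discard anything over 10 seconds, then average) can be sketched as follows. The sample values are hypothetical and the variable names are my own. Only the 10 second cutoff comes from the description of the test.

```python
from statistics import mean

CUTOFF_S = 10.0  # the study discarded response times above 10 seconds

# Hypothetical response times in seconds for one service.
response_times_s = [1.2, 1.4, 1.1, 12.5, 1.3, 30.0, 1.5, 1.0]

# Keep only the samples at or below the cutoff, as the study describes.
kept = [t for t in response_times_s if t <= CUTOFF_S]
discarded = len(response_times_s) - len(kept)

reported_avg = mean(kept)                # what a study like this would report
unfiltered_avg = mean(response_times_s)  # for comparison only

print(f"discarded samples:  {discarded}")
print(f"reported average:   {reported_avg:.2f} s")
print(f"unfiltered average: {unfiltered_avg:.2f} s")
```

One thing the sketch makes visible is that the discarded samples never show up in the reported number, so it is worth knowing how many requests were dropped for each service.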
Now let's dig a little deeper...
Questions & Answers
What is the claim? |
Microsoft Azure is the "fastest cloud" |
|
What is the claimed measurement? |
Overall performance (it's fastest) |
|
What is the actual measurement? |
Network Latency & Throughput |
Rendering 2 HTML pages and some images is not CPU intensive and as such is not a measure of system performance. The main bottleneck is network latency and throughput, particularly to distant monitoring nodes (e.g. Australia to US). |
Is it an apples-to-apples comparison? |
Types of services tested are different (IaaS vs PaaS) and the instance types are dissimilar |
Microsoft Azure and Google AppEngine are platform-as-a-service (PaaS) environments, very different from infrastructure-as-a-service (IaaS) environments like EC2 and GoGrid. With PaaS, users must package and deploy applications using custom tools and more limited capabilities. Applications are deployed to large clustered, multi-tenant environments. Because of the greater structure and more limited capabilities of PaaS, providers are able to better optimize and scale those applications, often resulting in better performance and availability when compared to a single server IaaS deployment. Not much information is disclosed regarding the sizes of instances used for the IaaS services. With some IaaS providers, network performance can vary depending on instance size. For example, with Rackspace Cloud, a 256MB cloud server is capped with a 10 Mbps uplink. With EC2, bandwidth is shared across all instances deployed to a physical host. Smaller instance sizes generally have less, and more variable, bandwidth. This test was conducted using nearly the smallest EC2 instance size, an m1.small. |
Is the playing field level? |
Services may have an unfair advantage due to network proximity and uplink performance |
Because network latency is the main bottleneck for this test, and only a handful of monitoring nodes were used, the results are highly dependent on network proximity and latency between the services tested and the monitoring nodes. For example, the Chicago monitoring node might be sitting in the same building as the Azure US Central servers, giving Azure an unfair advantage in the test. Additionally, the IaaS services where uplinks are capped on smaller instance types would be at a disadvantage to uncapped PaaS and IaaS environments. |
Was the data reported accurately? |
Simple average was reported - no median, standard deviation or regional breakouts were provided |
The CloudSleuth post provided only a single metric: the average response time for each service across all monitoring nodes. A better way to report this data would involve breaking the data down by region, for example the average response time for eastern US monitoring nodes. Reporting median, standard deviation and 90th percentile statistical calculations would also be very helpful in evaluating the data. |
Does it matter to you? |
Probably not |
Unless your users are sitting in the same 30 data centers as the GPN monitoring nodes, this study probably means very little. It does not represent a real world scenario where static content like images would be deployed to a distributed content delivery network like CloudFront or Edgecast. It attempts to compare two different types of cloud services, PaaS and IaaS. It may use IaaS instance types like the EC2 m1.small that represent the worst case performance scenario. The 30 node test population is also very small and not indicative of a real end user population (end users don't sit in data centers). Finally, reporting only a single average value ignores most statistical best practices.
|
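As a rough illustration of the kind of reporting suggested in this case study (a regional breakout with median, standard deviation and 90th percentile), here is a minimal sketch. The regions loosely follow the monitoring node locations listed earlier, but every response time is made up, and the nearest-rank `percentile` helper is just one of several common percentile definitions.

```python
import math
from collections import defaultdict
from statistics import median, stdev

# Hypothetical (region, response time in seconds) samples.
samples = [
    ("North America", 1.1), ("North America", 1.3), ("North America", 1.2),
    ("Europe", 2.0), ("Europe", 2.4), ("Europe", 2.2),
    ("Australia", 4.5), ("Australia", 5.1), ("Australia", 4.8),
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Group the samples by region before summarizing.
by_region = defaultdict(list)
for region, seconds in samples:
    by_region[region].append(seconds)

summary = {
    region: {
        "median": median(times),
        "stdev": stdev(times),
        "p90": percentile(times, 90),
    }
    for region, times in by_region.items()
}

for region, stats in summary.items():
    print(f"{region}: median={stats['median']:.2f}s "
          f"stdev={stats['stdev']:.2f}s p90={stats['p90']:.2f}s")
```

A per-region table like this lets a reader look up the rows closest to their own users instead of relying on one global average.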