Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor, and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process in which each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages longer than 5 minutes using our vendor contacts or the provider's status page (if available). Outages caused by scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
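The exclusion rules above can be sketched roughly as a filter over recorded outages. This is only an illustration, not Panopta's or our actual implementation: the `Outage` record, the cause labels, the requirement that all 4 nodes agree, and the 15-minute DoS cut-off are all assumptions made for the example (the rules above only say "a short period of time").

```python
from dataclasses import dataclass

# Hypothetical cut-off: the post only says "a short period of time".
DOS_GRACE_MINUTES = 15.0
# Each outage is verified by 4 nodes; requiring all 4 to agree is an assumption.
NODES_REQUIRED = 4


@dataclass
class Outage:
    minutes: float
    confirming_nodes: int  # how many monitoring nodes observed the failure
    cause: str             # hypothetical labels: "unknown", "maintenance", "dos", "self"


def counts_toward_downtime(o: Outage) -> bool:
    """Return True if the outage should be included in availability figures."""
    if o.confirming_nodes < NODES_REQUIRED:
        return False  # not confirmed by every monitoring node
    if o.cause in ("maintenance", "self"):
        return False  # scheduled maintenance, or triggered by us (e.g. a reboot)
    if o.cause == "dos" and o.minutes <= DOS_GRACE_MINUTES:
        return False  # DoS outage that the vendor resolved quickly
    return True


outages = [
    Outage(2.3, 4, "unknown"),
    Outage(12.0, 4, "maintenance"),
    Outage(6.0, 4, "dos"),
    Outage(45.0, 4, "dos"),
    Outage(1.0, 2, "unknown"),
]
counted = [o for o in outages if counts_toward_downtime(o)]
print(len(counted), "outages counted,", sum(o.minutes for o in counted), "minutes")
```

Only the first and fourth sample outages survive the filter here; the rest are unconfirmed, scheduled, or short DoS events.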

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching vendor SLA policies for this post, we noticed a few general themes in how SLA credits are issued. These include the following:

Pro-rated Credit (Pro-rated): Credit is a simple pro-ration of the amount of downtime that exceeded the SLA guarantee. Credit is issued based on that calculated excess and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1 hour outage gets a 100 hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that: service credit, not redeemable for a refund
Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a network 100% SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes
Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%
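To make the three policies concrete, here is a minimal sketch of how each credit might be calculated. The function names and parameters are our own illustration, not any vendor's actual billing logic, and the 720-hour month (30 days) is an assumption.

```python
def prorated_credit_hours(downtime_hours, allowed_hours, multiple, hours_paid_for):
    """Pro-rated: credit = multiple x downtime beyond the SLA allowance,
    capped at 100% of what was paid for (expressed in hours of service)."""
    excess = max(0.0, downtime_hours - allowed_hours)
    return min(excess * multiple, hours_paid_for)


def threshold_credit_fraction(outage_minutes, threshold_minutes, credit_per_outage):
    """Threshold: only continuous outages longer than the threshold earn credit.
    Returns the fraction of the invoice credited, capped at 100%."""
    qualifying = [m for m in outage_minutes if m > threshold_minutes]
    return min(len(qualifying) * credit_per_outage, 1.0)


def percentage_credit_fraction(uptime_pct, sla_pct, credit_fraction):
    """Percentage: a flat invoice discount once uptime drops below the SLA."""
    return credit_fraction if uptime_pct < sla_pct else 0.0


# GoGrid-style: 100x credit, 100% SLA (no allowance), 720-hour month assumed
print(prorated_credit_hours(1.0, 0.0, 100, 720.0))      # 100.0
# SoftLayer-style: 5% per continuous network outage over 30 minutes
print(threshold_credit_fraction([5, 12, 29], 30, 0.05))  # 0.0
print(threshold_credit_fraction([45], 30, 0.05))         # 0.05
# EC2-style: 10% credit when uptime falls below 99.5%
print(percentage_credit_fraction(99.4, 99.5, 0.10))      # 0.1
```

Note how three outages totalling 46 minutes earn nothing under the threshold policy, while a single 45-minute outage does.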

The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.
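As a rough check on the claim that most outages are shorter than vendor thresholds, the snippet below divides total minutes down by outage count for the three SoftLayer rows in the table below, and compares the result with SoftLayer's 30-minute network threshold. This is only a sketch: a mean can hide an individual outage that did exceed the threshold, and the helper name is ours.

```python
# (number of outages, total minutes down) for the SoftLayer rows in the table
softlayer_rows = {
    "Dallas, TX": (4, 13.9),
    "DC, US": (31, 86.7),
    "WA, US": (13, 106.8),
}
THRESHOLD_MINUTES = 30  # continuous network outage threshold from SoftLayer's SLA


def mean_outage_minutes(count, total_minutes):
    """Average outage length; says nothing about the longest single outage."""
    return total_minutes / count


means = {loc: mean_outage_minutes(c, m) for loc, (c, m) in softlayer_rows.items()}
for location, mean in means.items():
    print(f"{location}: mean outage {mean:.1f} min, "
          f"below threshold: {mean < THRESHOLD_MINUTES}")
```

All three means come out well under 30 minutes, which is consistent with few, if any, of these outages qualifying for credit.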

SLAs versus Measured Availability

The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS providers in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.

Provider	Data Center	Total # Outages / Mins Down	SLA Credit Policy	SLA	Actual
AWS EC2	US East	0/0	Percentage 10% invoice credit anytime annual uptime falls below 99.5%	99.5%	100%
AWS EC2	US West	0/0	Percentage 10% invoice credit anytime annual uptime falls below 99.5%	99.5%	100%
GoGrid	US West	0/0	Pro-rated 100x credit for any downtime	100%	100%
Linode VPS	London	0/0	Pro-rated 1x credit for downtime exceeding 0.1%	99.9%	100%
OpSource Cloud	VA, US	0/0	Percentage 5% invoice credit for 60 minutes downtime 10% for up to 120 minutes and so on	100%	100%
Storm on Demand	MI, US	0/0	Pro-rated 10x credit for any downtime	100%	100%
VoxCLOUD	EU	0/0	Percentage 5% invoice credit per 0.1% downtime up to 100%	100%	100%
GoGrid	US East	1/2.3	Pro-rated 100x credit for any downtime	100%	99.999%
Joyent Smart Machines	Andover, MA	1/3	Percentage 5% of the monthly fee for each 30 minutes of downtime	100%	99.999%
VoxCLOUD	Singapore	1/5.5	Percentage 5% invoice credit per 0.1% downtime up to 100%	100%	99.999%
Speedyrails VPS	Peer1 Quebec	1/2.2	Percentage 3% of monthly fees for every 0.1% of downtime	99.9%	99.999%
Rackspace Cloud	Dallas, TX	1/8.7	Threshold/ Percentage 5% of the fees for each 30 minutes of network downtime (1 hour for hardware) up to 100% Host hardware failures guaranteed to be fixed within 1 hour of problem identification	100%¹	99.998%
SoftLayer CloudLayer	Dallas, TX	4/13.9	Threshold/ Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes 20% monthly invoice credit for failed hardware not replaced within 2 hours Max 100% credit	100%¹	99.997%
Hosting.com	Colorado	1/1.4	Percentage 1/30th monthly invoice credit for every 30 minutes network downtime 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair	100%¹	99.997%
AWS EC2	APAC	5/14.8	Percentage 10% invoice credit anytime annual uptime falls below 99.5%	99.5%	99.996%
Linode	Atlanta	10/26.9	Pro-rated 1x credit for downtime exceeding 0.1%	99.9%	99.995%
Joyent Smart Machines	Emeryville, CA	4/15.2	Percentage 5% of the monthly fee for each 30 minutes of downtime	100%	99.994%
Terremark vCloud	FL, US	7/37.9	Unique $1 for every fifteen (15) minute downtime period up to a maximum amount equal to 50% of the usage fees	100%	99.993%
AWS EC2	EU West	3/36	Percentage 10% invoice credit anytime annual uptime falls below 99.5%	99.5%	99.993%
Speedyrails VPS	Canix Quebec	9/38.7	Percentage 3% of monthly fees for every 0.1% of downtime	99.9%	99.992%
Linode	Fremont, CA	13/71.9²	Pro-rated 1x credit for downtime exceeding 0.1%	99.9%	99.986%
Zerigo	CO, CA	9/66.8	Pro-rated 4x the total (starting from 100%, not 99.99%) non-compliant time	99.99%	99.985%
SoftLayer CloudLayer	DC, US	31/86.7	Threshold/ Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes 20% monthly invoice credit for failed hardware not replaced within 2 hours Max 100% credit	100%¹	99.984%
SoftLayer CloudLayer	WA, US	13/106.8	Threshold/ Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes 20% monthly invoice credit for failed hardware not replaced within 2 hours Max 100% credit	100%¹	99.980%
Linode	NJ, US	14/145.7	Pro-rated 1x credit for downtime exceeding 0.1%	99.9%	99.972%
VoxCLOUD	NY, US	12/146.3³	Percentage 5% invoice credit per 0.1% downtime up to 100%	100%	99.972%
CloudSigma	Switzerland	22/59.9	Threshold/ Percentage 50x credit for any downtime (network or hardware) over 15 minutes	100%	99.972%
Hosting.com	KY, US	4/38.7⁴	Percentage 1/30th monthly invoice credit for every 30 minutes network downtime 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair	100%¹	99.955%
ThePlanet Cloud Servers	TX, US	34/144.3	Threshold/ Percentage 5% monthly invoice credit for first 5 minute continuous outage (hardware or network) Then, 5% additional credit for each additional 30 minute continuous outage	100%	99.955%
Gandi VPS	France	4/147.7	Pro-rated 1 day credit for every outage over 7 minutes within a single day	99.95%	99.955%
Linode	Dallas	21/258.2	Pro-rated 1x credit for downtime exceeding 0.1%	99.9%	99.951%
NewServers	FL, US	39/288.7	Pro-rated 24x credit for every 1 hour of downtime exceeding 0.001%	99.999%	99.945%
VPS.NET	UK	8/250.3	Percentage 10% monthly invoice credit for each hour of downtime	100%⁵	99.921%
VPS.NET	US Central	12/342.9	Percentage 10% monthly invoice credit for each hour of downtime	100%⁵	99.892%
Flexiant	UK	83/820.3⁶	Percentage 5% monthly invoice credit for each 30 minutes of downtime	100%	99.844%
VPS.NET	US West	32/576.5	Percentage 10% monthly invoice credit for each hour of downtime	100%⁵	99.819%
ReliaCloud	MN, US	23/1941.5⁷	Pro-rated 30x hourly credit for each hour downtime	100%	99.626%
VPS.NET	US East	6/1224.1⁸	Percentage 10% monthly invoice credit for each hour of downtime	100%⁵	99.616%

1 Applies to network connectivity only, not hardware outages

2 Linode does not own or operate this data center (or any of its data centers to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode

3 A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010

4 A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved

5 Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.

6 Approximately 560 minutes of these outages occurred due to failure of their SAN

7 A majority of these outages (1811 minutes) occurred between Jan-Feb 2010 immediately following ReliaCloud's public launch (post beta). A majority of the downtime seems to have occurred due to SAN failures

8 Explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running"

Is there a correlation between SLA and actual availability?

The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:

Total # of services analyzed:	38
Services that met or exceeded SLA:	15/38 [39%]
Services that did not meet SLA:	23/38 [61%]
Vendors with 100% SLAs:	23/38 [61%]
Vendors with 100% SLAs achieving their SLA:	4/23 [17%]
Mean availability of vendors with 100% SLAs:	99.929% [6.22 hrs/yr]
Median availability of vendors with 100% SLAs:	99.982% [1.58 hrs/yr]
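The bracketed hours-per-year figures above follow from a simple conversion of an availability percentage into downtime. A minimal sketch, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into hours of downtime per year."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


print(round(downtime_hours_per_year(99.929), 2))  # 6.22 (mean)
print(round(downtime_hours_per_year(99.982), 2))  # 1.58 (median)
```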

It is very interesting to observe that the bottom 6 vendors all provide 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were achieved by only a minority (39%) of the vendors. This is especially true of vendors with 100% SLAs, where only 4 of 23 (17%) actually achieved 100% availability.

Vendors with generous SLA credit policies

In most cases SLA credit policies provide extremely minimal financial recourse, not to mention all of the hoops you'll have to jump through to get the credit. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:

GoGrid: provides a 100x credit policy combined with 100% SLA for any hardware and network outages and no minimum thresholds (e.g. 1 hour outage = 100 hour credit). This is by far the most generous of the 38 IaaS vendors we evaluated. GoGrid's service is also one of the most reliable IaaS services we currently monitor (100% US West and 99.999% US East)
Joyent: provides a 5% invoice SLA credit for each 30 minutes of monthly (non-continuous) downtime (equates to about 72x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
VoxCloud: provides a 5% invoice credit per 0.1% of monthly (non-continuous) downtime (about every 45 minutes - equates to about 48x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
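The "equivalent pro-rated credit" multiples quoted above can be reproduced with a little arithmetic: divide the minutes of service credited by the minutes of downtime that trigger the credit. The sketch below assumes a 30-day month; the function name is ours.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def equivalent_multiple(credit_fraction, downtime_minutes):
    """Minutes of service credited per minute of downtime."""
    return credit_fraction * MINUTES_PER_MONTH / downtime_minutes


# Joyent: 5% of the monthly fee per 30 minutes of downtime
joyent = equivalent_multiple(0.05, 30)
# VoxCloud: 5% per 0.1% of monthly downtime (43.2 minutes in a 30-day month)
vox_exact = equivalent_multiple(0.05, 0.001 * MINUTES_PER_MONTH)
# VoxCloud using the rounded "about every 45 minutes" figure
vox_rounded = equivalent_multiple(0.05, 45)

print(round(joyent, 1), round(vox_exact, 1), round(vox_rounded, 1))
```

Under these assumptions Joyent works out to 72x, and VoxCloud to 50x with the exact 43.2-minute interval or 48x with the rounded 45-minute interval.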

Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed

We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.

Cloud Outages RSS Feed

This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.

getAvailability Web Service

This post includes a small snapshot of the data we maintain on cloud availability. We have released a new web service that allows users to calculate availability and retrieve outage details (including supporting comments) for any of the 90 cloud services we currently monitor. Monitoring for many of these services began between October 2009 and January 2010, but we are also continually adding new services to the list. This web service allows users to calculate availability and retrieve outage information for any time frame, service type, vendor, etc. To get you started, we have provided a few example RESTful request URLs. These example requests return JSON formatted data. To request XML formatted data append &ws-format=xml to any of these URLs. Full API documentation for this web service is provided here. A SOAP WSDL is also provided here. You may invoke this web service for free up to 5 times daily. To purchase a web service token allowing additional web service invocations click here.

Retrieve availability for all IaaS vendors for the past year (first 10 of 46 results)

Retrieve availability for all IaaS vendors for the past year (results 11-20 of 46)

Retrieve availability for all CDNs for 2010 (first 10 of 13 results)

Retrieve availability for all CDNs for 2010 (results 11-13 of 13)

Retrieve availability for all AWS services (EC2, S3, CloudFront) for the past 6 months

Retrieve availability for GoGrid Cloud Servers for the past 2 weeks

Retrieve availability for VPS.net's US East data center since 1/1/2010 - include full outage documentation
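As noted above, any of these request URLs can be switched from JSON to XML by adding the ws-format=xml parameter. Here is a small sketch of doing that programmatically. The example URL and its `serviceType` parameter are hypothetical placeholders, not the real endpoint; consult the API documentation for actual request URLs and parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_xml_format(url: str) -> str:
    """Return the URL with ws-format=xml set in its query string."""
    parts = urlsplit(url)
    # Drop any existing ws-format value so the parameter is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "ws-format"]
    query.append(("ws-format", "xml"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL for illustration only.
example = "https://api.example.com/getAvailability?serviceType=iaas"
print(with_xml_format(example))
```

Keep in mind that each request counts against the free limit of 5 invocations per day unless you have purchased a token.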

Summary

Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than by any basis in technical merit or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet they provide some of the best actual availability (100% for 2 regions, 99.996% and 99.993% for the other two). Beware of the fine print. Many cloud vendors utilize minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit, regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.

5 comments:

Unknown (February 7, 2011 at 6:49 AM)
They matter if you're pitching a proposal to an enterprise, however in the real world they don't matter that much!
ushasaj (March 7, 2011 at 8:12 AM)
MAKE THE DAY WORTHY AS MUCH AS POSSIBLE
Bangon Kali (April 22, 2011 at 1:30 PM)
Well, Amazon is out today. :(
diparud2009 (December 9, 2011 at 12:15 PM)
I was looking for this chart of various cloud services since a long time. At last i found it here.
Anonymous (November 13, 2012 at 11:32 AM)
Great article. Thanks.

Saturday, January 15, 2011