Saturday, January 15, 2011

Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process wherein each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages greater than 5 minutes using our vendor contacts or the provider's status page (if available). Outages triggered due to scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching vendor SLA policies for this post, we found a few general themes in how SLA credits are issued:

  • Pro-rated Credit (Pro-rated): Credit is based on a simple pro-ration on the amount of downtime that exceeded the SLA guarantee. Credit is issued based on that calculated exceedance and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1 hour outage gets a 100 hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that, service credit and not redeemable for a refund
  • Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a network 100% SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes
  • Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%
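
To make the three policy types concrete, here is a minimal Python sketch of how each credit might be computed. The function names, parameters, and the 30-day billing period are our own illustrative assumptions, not any vendor's actual formula; real SLA terms carry more conditions than this.

```python
def prorated_credit(fee, period_minutes, downtime_minutes, sla_pct, multiple):
    """Pro-rated: credit the downtime beyond the SLA allowance, times a multiple."""
    allowed = period_minutes * (1 - sla_pct / 100)
    excess = max(0.0, downtime_minutes - allowed)
    credit = fee * multiple * excess / period_minutes
    return min(credit, fee)  # credit is capped at 100% of service fees


def threshold_credit(fee, outage_minutes, threshold_minutes, pct_per_outage):
    """Threshold: only continuous outages longer than the threshold earn credit."""
    qualifying = [m for m in outage_minutes if m > threshold_minutes]
    return min(fee * pct_per_outage / 100 * len(qualifying), fee)


def percentage_credit(fee, uptime_pct, sla_pct, invoice_pct):
    """Percentage: a fixed share of the invoice once uptime falls below the SLA."""
    return fee * invoice_pct / 100 if uptime_pct < sla_pct else 0.0


MONTH = 30 * 24 * 60  # assumed 30-day billing period, in minutes

# 100x multiple, 100% SLA, one 60 minute outage on a $100 invoice
print(round(prorated_credit(100, MONTH, 60, 100, 100), 2))
# 5% per continuous outage over 30 minutes: only the 45 minute outage counts
print(threshold_credit(100, [5, 12, 45], 30, 5))
# 10% invoice credit when uptime falls below 99.5%
print(percentage_credit(100, 99.4, 99.5, 10))
```

Note how the threshold example ignores the two short outages entirely, which is the behavior that protects the vendor when most outages are brief.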

The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.

SLAs versus Measured Availability

The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS providers in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.

| Provider | Data Center | Total # Outages / Mins Down | SLA Credit Policy | SLA | Actual |
| --- | --- | --- | --- | --- | --- |
| AWS EC2 | US East | 0/0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
| AWS EC2 | US West | 0/0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
| GoGrid | US West | 0/0 | Pro-rated: 100x credit for any downtime | 100% | 100% |
| Linode VPS | London | 0/0 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 100% |
| OpSource Cloud | VA, US | 0/0 | Percentage: 5% invoice credit for 60 minutes downtime, 10% for up to 120 minutes and so on | 100% | 100% |
| Storm on Demand | MI, US | 0/0 | Pro-rated: 10x credit for any downtime | 100% | 100% |
| VoxCLOUD | EU | 0/0 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 100% |
| GoGrid | US East | 1/2.3 | Pro-rated: 100x credit for any downtime | 100% | 99.999% |
| Joyent Smart Machines | Andover, MA | 1/3 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.999% |
| VoxCLOUD | Singapore | 1/5.5 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.999% |
| Speedyrails VPS | Peer1 Quebec | 1/2.2 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.999% |
| Rackspace Cloud | Dallas, TX | 1/8.7 | Threshold/Percentage: 5% of the fees for each 30 minutes of network downtime (1 hour for hardware) up to 100%; host hardware failures guaranteed to be fixed within 1 hour of problem identification | 100% [1] | 99.998% |
| SoftLayer CloudLayer | Dallas, TX | 4/13.9 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.997% |
| Hosting.com | Colorado | 1/1.4 | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.997% |
| AWS EC2 | APAC | 5/14.8 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.996% |
| Linode | Atlanta | 10/26.9 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.995% |
| Joyent Smart Machines | Emeryville, CA | 4/15.2 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.994% |
| Terremark vCloud | FL, US | 7/37.9 | Unique: $1 for every fifteen (15) minute downtime period up to a maximum amount equal to 50% of the usage fees | 100% | 99.993% |
| AWS EC2 | EU West | 3/36 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.993% |
| Speedyrails VPS | Canix Quebec | 9/38.7 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.992% |
| Linode | Fremont, CA | 13/71.9 [2] | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.986% |
| Zerigo | CO, CA | 9/66.8 | Pro-rated: 4x the total (starting from 100%, not 99.99%) non-compliant time | 99.99% | 99.985% |
| SoftLayer CloudLayer | DC, US | 31/86.7 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.984% |
| SoftLayer CloudLayer | WA, US | 13/106.8 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.980% |
| Linode | NJ, CA | 14/145.7 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.972% |
| VoxCLOUD | NY, US | 12/146.3 [3] | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.972% |
| CloudSigma | Switzerland | 22/59.9 | Threshold/Percentage: 50x credit for any downtime (network or hardware) over 15 minutes | 100% | 99.972% |
| Hosting.com | KY, US | 4/38.7 [4] | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.955% |
| ThePlanet Cloud Servers | TX, US | 34/144.3 | Threshold/Percentage: 5% monthly invoice credit for first 5 minute continuous outage (hardware or network); then 5% additional credit for each additional 30 minute continuous outage | 100% | 99.955% |
| Gandi VPS | France | 4/147.7 | Pro-rated: 1 day credit for every outage over 7 minutes within a single day | 99.95% | 99.955% |
| Linode | Dallas | 21/258.2 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.951% |
| NewServers | FL, US | 39/288.7 | Pro-rated: 24x credit for every 1 hour of downtime exceeding 0.001% | 99.999% | 99.945% |
| VPS.NET | UK | 8/250.3 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.921% |
| VPS.NET | US Central | 12/342.9 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.892% |
| Flexiant | UK | 83/820.3 [6] | Percentage: 5% monthly invoice credit for each 30 minutes of downtime | 100% | 99.844% |
| VPS.NET | US West | 32/576.5 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.819% |
| ReliaCloud | MN, US | 23/1941.5 [7] | Pro-rated: 30x hourly credit for each hour downtime | 100% | 99.626% |
| VPS.NET | US East | 6/1224.1 [8] | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.616% |

1 Applies to network connectivity only, not hardware outages

2 Linode does not own or operate this data center (or any of its data centers to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode

3 A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010

4 A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved

5 Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.

6 Approximately 560 minutes of these outages occurred due to failure of their SAN

7 A majority of these outages (1811 minutes) occurred between Jan-Feb 2010 immediately following ReliaCloud's public launch (post beta). A majority of the downtime seems to have occurred due to SAN failures

8 Explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running"

Is there a correlation between SLA and actual availability?

The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:

Total # of services analyzed: 38
Services that met or exceeded SLA: 15/38 [39%]
Services that did not meet SLA: 23/38 [61%]
Vendors with 100% SLAs: 23/38 [61%]
Vendors with 100% SLAs achieving their SLA: 4/23 [17%]
Mean availability of vendors with 100% SLAs: 99.929% [6.22 hrs/yr]
Median availability of vendors with 100% SLAs: 99.982% [1.58 hrs/yr]
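
For readers who want to move between availability percentages and downtime, the conversion can be sketched as below. This is an illustration under our own assumptions (a 365-day year and a full year of monitoring), not the exact calculation Panopta performs; services monitored for less than a year will not reproduce with this denominator.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability as a percentage of the monitored period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def downtime_hours_per_year(availability):
    """Hours of downtime per 365-day year implied by an availability percentage."""
    return (100 - availability) / 100 * 365 * 24


print(round(downtime_hours_per_year(99.929), 2))  # mean of the 100% SLA vendors: 6.22
print(round(downtime_hours_per_year(99.982), 2))  # median of the 100% SLA vendors: 1.58
print(round(availability_pct(258.2), 3))          # 258.2 minutes down in a year: 99.951
```

The last line uses the 258.2 minutes recorded for Linode Dallas and lands on the 99.951% shown in the table, assuming that service was monitored for the full year.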

It is very interesting to observe that the bottom 6 vendors all provided 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were only achieved for a minority (39%) of the vendors. This is particularly applicable to vendors with 100% SLAs where only 4 of 23 (17%) actually achieved 100% availability.

Vendors with generous SLA credit policies

In most cases SLA credit policies provide minimal financial recourse, even before considering all of the hoops you'll have to jump through to get the credit. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:

  • GoGrid: provides a 100x credit policy combined with 100% SLA for any hardware and network outages and no minimum thresholds (e.g. 1 hour outage = 100 hour credit). This is by far the most generous of the 38 IaaS vendors we evaluated. GoGrid's service is also one of the most reliable IaaS services we currently monitor (100% US West and 99.999% US East)
  • Joyent: provides a 5% invoice SLA credit for each 30 minutes of monthly (non-continuous) downtime (equates to about 72x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
  • VoxCloud: provides a 5% invoice credit per 0.1% of monthly (non-continuous) downtime (about every 45 minutes - equates to about 48x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
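
The pro-rated equivalents quoted above can be reproduced with a short sketch. It assumes a 30-day month (43,200 minutes); the function name is ours. The 48x figure for VoxCloud follows from rounding 0.1% of a month to 45 minutes; using the unrounded 43.2 minutes gives about 50x.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def credit_multiple(invoice_pct, downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Minutes of service credited per minute of downtime."""
    credited_minutes = invoice_pct * period_minutes / 100
    return credited_minutes / downtime_minutes


print(round(credit_multiple(5, 30), 1))    # Joyent: 5% per 30 minutes
print(round(credit_multiple(5, 45), 1))    # VoxCloud, using the 45 minute approximation
print(round(credit_multiple(5, 43.2), 1))  # VoxCloud, using 0.1% of a 30-day month
```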

Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed

We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status Tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.

Cloud Outages RSS Feed

This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.

getAvailability Web Service

This post includes a small snapshot of the data we maintain on cloud availability. We have released a new web service that allows users to calculate availability and retrieve outage details (including supporting comments) for any of the 90 cloud services we currently monitor. Monitoring for many of these services began between October 2009 and January 2010, but we are also continually adding new services to the list. This web service allows users to calculate availability and retrieve outage information for any time frame, service type, vendor, etc. To get you started, we have provided a few example RESTful request URLs. These example requests return JSON formatted data. To request XML formatted data append &ws-format=xml to any of these URLs. Full API documentation for this web service is provided here. A SOAP WSDL is also provided here. You may invoke this web service for free up to 5 times daily. To purchase a web service token allowing additional web service invocations click here.
Retrieve availability for all IaaS vendors for the past year (first 10 of 46 results)
Retrieve availability for all IaaS vendors for the past year (results 11-20 of 46)
Retrieve availability for all CDNs for 2010 (first 10 of 13 results)
Retrieve availability for all CDNs for 2010 (results 11-13 of 13)
Retrieve availability for all AWS services (EC2, S3, CloudFront) for the past 6 months
Retrieve availability for GoGrid Cloud Servers for the past 2 weeks
Retrieve availability for VPS.net's US East data center since 1/1/2010 - include full outage documentation
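
As a small illustration of the format switch described above, the following Python sketch appends ws-format=xml to a request URL. Only the ws-format parameter comes from this post; the host, path, and serviceType parameter in the example are placeholders we made up, so substitute one of the real example request URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url, fmt="xml"):
    """Return the URL with its ws-format query parameter set to fmt."""
    parts = urlsplit(url)
    # Drop any existing ws-format value so the parameter is not duplicated
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "ws-format"]
    query.append(("ws-format", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL for illustration only; not a real endpoint
example = "https://api.example.invalid/getAvailability?serviceType=IaaS"
print(with_format(example))
```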

Summary

Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than by technical merit or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet they provide some of the best actual availability (100% for 2 regions, 99.996% and 99.993% for the other two). Beware of the fine print. Many cloud vendors utilize minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit, regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.