In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process in which each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages longer than 5 minutes using our vendor contacts or the provider's status page (if available). Outages caused by scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.
SLA Credit Policies
In researching vendor SLA policies for this post, we noticed a few general themes in how SLA credits are calculated. These include the following:
- Pro-rated Credit (Pro-rated): Credit is a simple pro-ration of the downtime that exceeded the SLA guarantee. Credit is issued based on that calculated exceedance and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1 hour outage gets a 100 hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that: service credit, not redeemable for a refund.
- Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a 100% network SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes.
- Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%.
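The three policy types above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the 30-day billing period and the $100 fee are hypothetical, and real vendor terms carry more conditions than these formulas capture.

```python
def prorated_credit(downtime_min, period_min, sla, multiple, fee):
    """Pro-rated: credit the downtime beyond the SLA allowance, times a multiple."""
    allowed_min = period_min * (1.0 - sla)
    excess_min = max(0.0, downtime_min - allowed_min)
    credit = fee * multiple * excess_min / period_min
    return min(credit, fee)  # credit is capped at 100% of service fees


def threshold_credit(outage_durations_min, threshold_min, pct_per_outage, fee):
    """Threshold: only continuous outages longer than the threshold earn credit."""
    qualifying = [d for d in outage_durations_min if d > threshold_min]
    return min(fee * pct_per_outage * len(qualifying), fee)


def percentage_credit(uptime, sla, pct, fee):
    """Percentage: flat invoice discount when uptime falls below the SLA."""
    return fee * pct if uptime < sla else 0.0


MONTH_MIN = 30 * 24 * 60  # assumed 30-day billing period (43,200 minutes)
FEE = 100.0               # hypothetical monthly fee in dollars

# GoGrid-style: 100% SLA, 100x multiple, one 60 minute outage
print(prorated_credit(60, MONTH_MIN, 1.0, 100, FEE))
# SoftLayer-style: three outages, only the 45 minute one exceeds 30 minutes
print(threshold_credit([10, 12, 45], 30, 0.05, FEE))
# EC2-style: 10% credit because uptime fell below 99.5%
print(percentage_credit(0.994, 0.995, 0.10, FEE))
```

Note how the threshold example earns credit for only one of three outages, which is why short, frequent outages can go entirely uncompensated under that policy.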
The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.
SLAs versus Measured Availability
The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS providers in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.
Provider | Data Center | Total # Outages / Mins Down | SLA Credit Policy | SLA | Actual |
---|---|---|---|---|---|
AWS EC2 | US East | 0/0 | Percentage 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
AWS EC2 | US West | 0/0 | Percentage 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
GoGrid | US West | 0/0 | Pro-rated 100x credit for any downtime | 100% | 100% |
Linode VPS | London | 0/0 | Pro-rated 1x credit for downtime exceeding 0.1% | 99.9% | 100% |
OpSource Cloud | VA, US | 0/0 | Percentage 5% invoice credit for 60 minutes downtime 10% for up to 120 minutes and so on | 100% | 100% |
Storm on Demand | MI, US | 0/0 | Pro-rated 10x credit for any downtime | 100% | 100% |
VoxCLOUD | EU | 0/0 | Percentage 5% invoice credit per 0.1% downtime up to 100% | 100% | 100% |
GoGrid | US East | 1/2.3 | Pro-rated 100x credit for any downtime | 100% | 99.999% |
Joyent Smart Machines | Andover, MA | 1/3 | Percentage 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.999% |
VoxCLOUD | Singapore | 1/5.5 | Percentage 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.999% |
Speedyrails VPS | Peer1 Quebec | 1/2.2 | Percentage 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.999% |
Rackspace Cloud | Dallas, TX | 1/8.7 | Threshold/Percentage 5% of the fees for each 30 minutes of network downtime (1 hour for hardware) up to 100%; host hardware failures guaranteed to be fixed within 1 hour of problem identification | 100% [1] | 99.998% |
SoftLayer CloudLayer | Dallas, TX | 4/13.9 | Threshold/Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.997% |
Hosting.com | Colorado | 1/1.4 | Percentage 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.997% |
AWS EC2 | APAC | 5/14.8 | Percentage 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.996% |
Linode | Atlanta | 10/26.9 | Pro-rated 1x credit for downtime exceeding 0.1% | 99.9% | 99.995% |
Joyent Smart Machines | Emeryville, CA | 4/15.2 | Percentage 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.994% |
Terremark vCloud | FL, US | 7/37.9 | Unique $1 for every 15 minute downtime period up to a maximum amount equal to 50% of the usage fees | 100% | 99.993% |
AWS EC2 | EU West | 3/36 | Percentage 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.993% |
Speedyrails VPS | Canix Quebec | 9/38.7 | Percentage 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.992% |
Linode | Fremont, CA | 13/71.9 [2] | Pro-rated 1x credit for downtime exceeding 0.1% | 99.9% | 99.986% |
Zerigo | CO, CA | 9/66.8 | Pro-rated 4x the total (starting from 100%, not 99.99%) non-compliant time | 99.99% | 99.985% |
SoftLayer CloudLayer | DC, US | 31/86.7 | Threshold/Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.984% |
SoftLayer CloudLayer | WA, US | 13/106.8 | Threshold/Percentage 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.980% |
Linode | NJ, US | 14/145.7 | Pro-rated 1x credit for downtime exceeding 0.1% | 99.9% | 99.972% |
VoxCLOUD | NY, US | 12/146.3 [3] | Percentage 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.972% |
CloudSigma | Switzerland | 22/59.9 | Threshold/ Percentage 50x credit for any downtime (network or hardware) over 15 minutes | 100% | 99.972% |
Hosting.com | KY, US | 4/38.7 [4] | Percentage 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.955% |
ThePlanet Cloud Servers | TX, US | 34/144.3 | Threshold/ Percentage 5% monthly invoice credit for first 5 minute continuous outage (hardware or network) Then, 5% additional credit for each additional 30 minute continuous outage | 100% | 99.955% |
Gandi VPS | France | 4/147.7 | Pro-rated 1 day credit for every outage over 7 minutes within a single day | 99.95% | 99.955% |
Linode | Dallas | 21/258.2 | Pro-rated 1x credit for downtime exceeding 0.1% | 99.9% | 99.951% |
NewServers | FL, US | 39/288.7 | Pro-rated 24x credit for every 1 hour of downtime exceeding 0.001% | 99.999% | 99.945% |
VPS.NET | UK | 8/250.3 | Percentage 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.921% |
VPS.NET | US Central | 12/342.9 | Percentage 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.892% |
Flexiant | UK | 83/820.3 [6] | Percentage 5% monthly invoice credit for each 30 minutes of downtime | 100% | 99.844% |
VPS.NET | US West | 32/576.5 | Percentage 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.819% |
ReliaCloud | MN, US | 23/1941.5 [7] | Pro-rated 30x hourly credit for each hour downtime | 100% | 99.626% |
VPS.NET | US East | 6/1224.1 [8] | Percentage 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.616% |
[1] Applies to network connectivity only, not hardware outages.
[2] Linode does not own or operate this data center (or any of its data centers to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode.
[3] A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010.
[4] A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved.
[5] Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.
[6] Approximately 560 minutes of these outages occurred due to failure of their SAN.
[7] A majority of these outages (1811 minutes) occurred between Jan-Feb 2010 immediately following ReliaCloud's public launch (post beta). A majority of the downtime seems to have occurred due to SAN failures.
[8] Explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running".
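For readers who want to check the Actual column against the minutes of downtime, here is a minimal sketch. It assumes a full, non-leap year of monitoring (525,600 minutes); the function name is hypothetical, and this is not necessarily how Panopta computes its metrics. Services monitored for less than a year will not match this formula exactly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a full non-leap year


def availability_pct(minutes_down, period_min=MINUTES_PER_YEAR):
    """Availability as a percentage of the monitoring period."""
    return 100.0 * (1.0 - minutes_down / period_min)


# Linode Dallas: 258.2 minutes down over the year
print(round(availability_pct(258.2), 3))  # 99.951
# Rackspace Cloud Dallas: 8.7 minutes down over the year
print(round(availability_pct(8.7), 3))    # 99.998
```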
Is there a correlation between SLA and actual availability?
The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:
Total # of services analyzed: | 38 |
---|---|
Services that met or exceeded SLA: | 15/38 [39%] |
Services that did not meet SLA: | 23/38 [61%] |
Vendors with 100% SLAs: | 23/38 [61%] |
Vendors with 100% SLAs achieving their SLA: | 4/23 [17%] |
Mean availability of vendors with 100% SLAs: | 99.929% [6.22 hrs/yr] |
Median availability of vendors with 100% SLAs: | 99.982% [1.58 hrs/yr] |
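The hours-per-year figures in brackets follow directly from the availability percentages. A small sketch, assuming an 8,760 hour (non-leap) year; the helper names are hypothetical:

```python
from statistics import mean, median

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_pct):
    """Convert an availability percentage into hours of downtime per year."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


def summarize(availabilities):
    """Return (mean, median) for a list of availability percentages."""
    return mean(availabilities), median(availabilities)


print(round(downtime_hours_per_year(99.929), 2))  # 6.22 (mean, 100% SLA vendors)
print(round(downtime_hours_per_year(99.982), 2))  # 1.58 (median, 100% SLA vendors)
```

The gap between the mean and the median reflects a handful of vendors with very large outages pulling the mean down.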
It is very interesting to observe that the bottom 6 vendors all provide 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were achieved by only a minority (39%) of the vendors. This is particularly true of vendors with 100% SLAs, where only 4 of 23 (17%) actually achieved 100% availability.
Vendors with generous SLA credit policies
In most cases SLA credit policies provide extremely minimal financial recourse, even before considering all of the hoops you'll have to jump through to get the credit. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:
- GoGrid: provides a 100x credit policy combined with 100% SLA for any hardware and network outages and no minimum thresholds (e.g. 1 hour outage = 100 hour credit). This is by far the most generous of the 38 IaaS vendors we evaluated. GoGrid's service is also one of the most reliable IaaS services we currently monitor (100% US West and 99.999% US East)
- Joyent: provides a 5% invoice SLA credit for each 30 minutes of monthly (non-continuous) downtime (equates to about 72x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
- VoxCloud: provides a 5% invoice credit per 0.1% of monthly (non-continuous) downtime (about every 45 minutes, which equates to about 48x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
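The "equivalent pro-rated multiple" quoted for these vendors is the fraction of the invoice credited divided by the fraction of the month that was down. A minimal sketch, assuming a 30-day billing month (43,200 minutes); the function name is hypothetical:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200; assumes a 30-day billing month


def prorated_multiple(credit_fraction, downtime_min, period_min=MINUTES_PER_MONTH):
    """Credit multiple: share of fees credited per share of time down."""
    return credit_fraction / (downtime_min / period_min)


# Joyent: 5% of the monthly fee per 30 minutes of downtime
print(round(prorated_multiple(0.05, 30), 1))
# VoxCloud: 5% per 0.1% of the month (43.2 minutes in a 30-day month)
print(round(prorated_multiple(0.05, 43.2), 1))
# VoxCloud using the rounded 45 minute figure
print(round(prorated_multiple(0.05, 45), 1))
```

Under these assumptions Joyent works out to 72x. VoxCloud comes to 50x when 0.1% of the month is taken exactly, or 48x with the rounded 45 minute figure.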
Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed
We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status Tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.
Cloud Outages RSS Feed
This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.
getAvailability Web Service
Summary
Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than by technical merit or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service, or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet they provide some of the best actual availability (100% for 2 regions, 99.996% and 99.993% for the other two). Beware of the fine print. Many cloud vendors utilize minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit, regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.
They matter if you're pitching a proposal to an enterprise, however in the real world they don't matter that much!
MAKE THE DAY WORTHY AS MUCH AS POSSIBLE
Well, Amazon is out today. :(
I was looking for this chart of various cloud services since a long time. At last I found it here.
Great article. Thanks.