Amazon's Outage: Winners and Losers
By John Considine
In case you haven’t heard, last week Amazon’s Web Services had an extended outage that affected a lot of cloud users and has created a big stir in the cloud computing community. Here is my take on the outage, what it means, and how it affected both us and our customers.
First, the outage – Amazon’s Elastic Block Store (EBS) system failed in one “small” part of Amazon’s mammoth cloud. The easiest way to identify with this failing system is to think of it as the hard drive in your laptop or home computer dying. Almost everyone has experienced this frustrating failure and it is always a frightening event. Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that are required to give your computer its identity and run properly. When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again? If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.
Well, that is what happened in Amazon – the service that provided disk drives to the virtual machines in the cloud had a failure. Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die”, and the big question, repeated over and over, was – what have I lost, do I have any backups, how long is it going to take me to get running again? At a more pressing level the question becomes: did Amazon lose our data, or just connectivity to that data? In some circles there is no distinction between the two, but for most people, it is just like that laptop – can I get my stuff back, or do I have to start from scratch? The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity or that the data protection scheme that Amazon has in place was good enough to recover the data from the failed systems (or both).
There were three startling revelations from this failure:
- Cloud storage can fail – OK, so this should not be startling, but we have not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems. There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well-known and respected Simple Storage Service (S3). This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.
- People were using this storage without knowing it – Amazon is using their own infrastructure to deliver their other services, as evidenced by other features being degraded or unavailable when the EBS system went down. Some have speculated that, in addition to the RDS service and the new Elastic Beanstalk, some core networking functions could have been affected.
- This storage failure apparently jumped across “data centers” - OK, so this is a big one; Amazon encouraged us to build applications for failures and after an outage early in 2008, they introduced the notion of Availability Zones. These zones were designed to be independent data centers (or at least on separate power supplies, different networks, and not sharing core services). This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in one zone. The fact that this issue “spread” to more than one zone is a big deal - those who designed for failures using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.
This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and lots of companies were impacted by the fault. Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t. Those who continued operation with little or no disruption fall into three groups – 1) Those who were lucky, 2) Those who were not using the affected services, 3) Those who had designed for this level of failure.
Based on our experiences and much of what I have read, the majority of the success cases during this failure came down to luck or to not using the affected service. Keep in mind that Amazon is huge, and they have “regions” all over the world including the East and West coasts of the US, Singapore, Tokyo, and Ireland. Each of these regions has at least two availability zones. The failure was primarily focused on one zone within one region. This means that everything running in other zones and other regions remained up and running during this outage and thus the majority of deployments worldwide were unaffected. I have read a few blogs of major users that stated that they don’t use the EBS service and thus had little or no trouble during this outage.
So what is required to survive this kind of failure? Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud. I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years, and this belief is one of the cornerstones of the CloudSwitch strategy. We believe that it should be your choice where to use new features and solutions and where to use your traditional systems and processes, and it should be easy to blend the two.
To that end, some of our customers are using a technique that extends their environments into the cloud, using the cloud to pick up additional load on their systems. In this failure case, they were able to rely on their internal systems to continue their operations. In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region, or in their existing data center). With the cloud, they can bring up new systems utilizing their backups, and continue with operations. The CloudSwitch system allows them to bring up systems in different regions, or even different clouds, in response to outages; our tight integration with the data center tools allows them to use their existing monitoring systems and adjust for problems encountered in the cloud through automation.
How did we do? We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted. Of the few that were impacted by the outage, a few key systems were “switched” back to our data center, and unfortunately a few went down. On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism; we felt this was sufficient for these applications, and therefore we did not bother to run more traditional backups (or data replication). Given what we have learned from this experience and from observing how the community dealt with this outage, we will now review those decisions. In the end, we’ll have a few more traditionally protected systems, and a few fewer that rely solely on the cloud provider’s infrastructure for data protection.
The outage from Amazon severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are a fact in the compute landscape; the only question is how we ensure that we have adequate protection. The answer lies in the experience that we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.
AWS and Freedom of Speech?
By John McEleney
The blogosphere and Twitter have been in overdrive the past couple of days with the removal of WikiLeaks from AWS. The reaction to and condemnation of Amazon have been swift and often brutal – charging the company with censorship and cowardly behavior. Consider the announcement from WikiLeaks on Twitter:
“WikiLeaks servers at Amazon ousted. Free speech the land of the free — fine our $ are now spent to employ people in Europe.”
Even the New York Times is fanning the flames by suggesting that Amazon yielded to political pressure from Senator Lieberman: “WikiLeaks’ illegal, outrageous, and reckless acts have compromised our national security and put lives at risk around the world,” Mr. Lieberman said. “No responsible company – whether American or foreign – should assist WikiLeaks in its efforts to disseminate these stolen materials.”
It’s very clear that WikiLeaks violated Amazon’s terms of service; in fact, Amazon posted this announcement on their AWS site:
Amazon Web Services (AWS) rents computer infrastructure on a self-service basis. AWS does not pre-screen its customers, but it does have terms of service that must be followed. WikiLeaks was not following them. There were several parts they were violating. For example, our terms of service state that “you represent and warrant that you own or otherwise control all of the rights to the content…, that use of the content you supply does not violate this policy and will not cause injury to any person or entity.” It’s clear that WikiLeaks doesn’t own or otherwise control all the rights to this classified content.
I believe the decision by Amazon was neither censorship nor cowardice. If I had to choose a word to express the action taken, I would call it consistent. It is consistent with the agreement that end users accept when they use AWS. I applaud Amazon for taking this action. While there are valid arguments for both sides of the WikiLeaks issue, these are part of a much broader debate over the democratization of information enabled by the internet and the moral code that journalists in the print media have lived by for so many years. For Amazon, the issue is more specifically related to the nature of the WikiLeaks content.
Sometimes it is useful to examine this type of decision at a personal level. Imagine that tomorrow someone steals some of your own personal property and tries to sell it on eBay. Wouldn’t you expect eBay to respond to your request by removing the offending posting? That’s exactly what Amazon did – once alerted to the fact that WikiLeaks was using AWS to distribute material that did not belong to them, Amazon took the controversial but proper step (consistent with their terms of usage) of discontinuing WikiLeaks’ service.
Conspiracy theorists will say that Obama and half the government called Jeff Bezos and demanded that he stop WikiLeaks or else… My guess is that the AWS team simply looked at the material and made the decision to terminate WikiLeaks’ access because it violated their terms of usage.
This is plain and simple – no major conspiracy, no attack on freedom of speech, just consistent business practices – which is exactly what you and I should expect from a leading cloud provider.
