Planning for Outages in the Cloud
By Dave Armlin, Director of Customer Support at CloudSwitch
The Amazon US East outage of just over a week ago was an eye-opener for many people. Here at CloudSwitch it validated what we know about best practices for using the cloud. Not surprisingly, these reflect traditional IT processes and systems that enterprises know are needed to protect data and ensure that applications remain available to users.
From an enterprise support perspective, I’m very happy about the variety of options we have to protect and back up data, scale and shrink application capacity, and bring applications online and offline. It’s also easier than ever to make sure that you don’t have all of your “eggs in one basket”, thanks to public clouds and products like CloudSwitch.
CloudSwitch was designed to bridge the worlds of the data center and public cloud, making it extremely easy and safe to move virtual machines into the cloud over an encrypted tunnel onto encrypted storage. You can also deploy multiple copies of your virtual machines to different availability zones, regions, or even clouds. Because CloudSwitch acts as a layer-2 bridge between the data center and cloud, virtual machines in the cloud are able to seamlessly access the data center and vice versa, allowing data and applications to live in either the data center or the cloud. This provides some great opportunities to continue to use your existing IT management tools while taking advantage of cloud storage and cloud compute power in new ways.
We encourage our customers to make full use of the opportunities that CloudSwitch enables:
- Deploy virtual machines to multiple availability zones, regions and/or clouds.
- Clone existing virtual machines or create hot “point in time” snapshots.
- Continue to utilize traditional backup methods.
- Employ traditional file system and/or database replication and load balancing to make your applications as available as possible.
- Automate scripted deployments and life cycle actions on virtual machines.
When reviewing DR strategies with customers, we recommend the following approach:
- Review all applications and associated virtual machines and prioritize them to determine the appropriate DR strategy.
- Review and test monitoring of each application to ensure that you can detect and be alerted to application failures as quickly as possible.
- Eliminate single points of failure for critical applications by providing multiple points of presence in the cloud.
- If you are using replication technology to keep copies of virtual machines in sync, ensure that you have appropriate alerts in place to detect synchronization failures.
- Determine if failover and failback must be automated or manual processes.
- If load balancing is required, weigh the options and limitations of the different load balancers. Amazon’s ELB, for instance, can balance across availability zones but not across regions.
- For lower-priority systems that don’t require HA, consider scheduled automated clones/snapshots, traditional backups, or both.
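As a rough illustration of the review above, the following Python sketch assigns each application a DR strategy by priority and flags replicas that have fallen out of sync. The `App` record, the three priority tiers, the strategy descriptions, and the 300-second lag threshold are all hypothetical choices made for the example; none of them comes from CloudSwitch or from any cloud provider's API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tiers: 1 = critical (needs HA), 2 = important, 3 = low priority.
STRATEGY_BY_PRIORITY = {
    1: "replicate to multiple zones/regions/clouds with failover",
    2: "scheduled clones/snapshots plus traditional backups",
    3: "traditional backups",
}


@dataclass
class App:
    name: str
    priority: int
    # Seconds the replica is behind the primary; None means no replication.
    replication_lag_s: Optional[float] = None


def plan_dr(apps):
    """Map each application name to a DR strategy based on its priority."""
    return {app.name: STRATEGY_BY_PRIORITY[app.priority] for app in apps}


def sync_alerts(apps, max_lag_s=300.0):
    """List critical apps without replication and replicas that lag too far."""
    alerts = []
    for app in apps:
        if app.priority == 1 and app.replication_lag_s is None:
            alerts.append(f"{app.name}: critical but not replicated")
        elif app.replication_lag_s is not None and app.replication_lag_s > max_lag_s:
            alerts.append(f"{app.name}: replica is {app.replication_lag_s:.0f}s behind")
    return alerts


# Made-up inventory for illustration only.
apps = [
    App("web-portal", 1, replication_lag_s=12.0),
    App("support-db", 1, replication_lag_s=900.0),
    App("license-server", 1),
    App("build-archive", 3),
]

for name, strategy in plan_dr(apps).items():
    print(f"{name}: {strategy}")
for alert in sync_alerts(apps):
    print("ALERT", alert)
```

In practice the lag figures would come from whatever replication or monitoring tool is already in place; the point of the sketch is only that prioritization and synchronization alerts can be checked mechanically and on a schedule.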
Eating Our Own Dogfood
In our own internal operations, we’ve deployed www.cloudswitch.com to both Amazon’s US-East Region and US-West as well as Terremark’s Enterprise cloud. We utilize an open source file system synchronizer to keep these copies in sync. This application also has an automated backup process that backs up data to Amazon S3.
CloudSwitch’s web portal for download/activation and support consists of database and application servers which are deployed to multiple regions within Amazon utilizing database (master-slave) replication.
As our CTO John Considine noted in last week’s blog post, we and our customers had a range of experiences during the Amazon outage, which only reinforced the importance of planning and implementing the procedures outlined above. We relied too heavily on snapshots for some of our internal systems, and recognized after the fact that we needed a careful DR review and prioritization of the many applications we have running in the cloud.
Any DR/HA plan should be routinely validated. We’ve taken this recent event as a good opportunity to do this internally and have been working with our customers to remind them of the options that are available to them with CloudSwitch to protect data and make systems more resilient.
Amazon's Outage: Winners and Losers
By John Considine
In case you haven’t heard, last week Amazon’s Web Services had an extended outage that affected a lot of cloud users and has created a big stir in the cloud computing community. Here is my take on the outage, what it means, and how it affected both us and our customers.
First, the outage – Amazon’s Elastic Block Store (EBS) system failed in one “small” part of Amazon’s mammoth cloud. The easiest way to relate to this failing system is to think of it as the hard drive in your laptop or home computer dying. Almost everyone has experienced this frustrating failure and it is always a frightening event. Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that are required to give your computer its identity and run properly. When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again? If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.
Well, that is what happened in Amazon – the service that provided disk drives to the virtual machines in the cloud had a failure. Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die”, and the big question (repeated over and over) was: what have I lost, do I have any backups, how long is it going to take me to get running again? At a more pressing level, the question becomes: did Amazon lose our data, or just connectivity to that data? In some circles there is no distinction between the two, but for most people it is just like that laptop – can I get my stuff back, or do I have to start from scratch? The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity, or that the data protection scheme that Amazon has in place was good enough to recover the data from the failed systems (or both).
There were three startling revelations from this failure:
- Cloud storage can fail – Ok, so this should not be startling, but we have not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems. There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well known and respected Simple Storage Service (S3). This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.
- People were using this storage without knowing it – Amazon uses its own infrastructure to deliver its other services, as evidenced by other features being degraded or unavailable when the EBS system went down. Some have speculated that, in addition to the RDS service and the new Elastic Beanstalk, some core networking functions could have been affected.
- This storage failure apparently jumped across “data centers” - OK, so this is a big one; Amazon encouraged us to build applications for failures and after an outage early in 2008, they introduced the notion of Availability Zones. These zones were designed to be independent data centers (or at least on separate power supplies, different networks, and not sharing core services). This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in one zone. The fact that this issue “spread” to more than one zone is a big deal - those who designed for failures using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.
This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and lots of companies were impacted by the fault. Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t. Those who continued operation with little or no disruption fall into three groups – 1) those who were lucky, 2) those who were not using the affected services, and 3) those who had designed for this level of failure.
Based on our experiences and much of what I have read, most of the success stories during this failure came down to luck or to not using the affected service. Keep in mind that Amazon is huge, and they have “regions” all over the world, including the East and West coasts of the US, Singapore, Tokyo, and Ireland. Each of these regions has at least two availability zones. The failure was primarily focused on one zone within one region. This means that everything running in other zones and other regions remained up and running during this outage, and thus the majority of deployments worldwide were unaffected. I have read a few blogs by major users stating that they don’t use the EBS service and thus had little or no trouble during this outage.
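To make the region and zone picture concrete, here is a small Python sketch that checks whether a deployment keeps at least one copy running when a zone, a whole region, or a whole cloud fails. The `Placement` tuple, the `survives` function, and the cloud, region, and zone labels are illustrative names invented for this example, not identifiers from Amazon or CloudSwitch.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    cloud: str
    region: str
    zone: str


def survives(placements, failed_cloud, failed_region=None, failed_zone=None):
    """Return True if at least one placement lies outside the failed domain.

    Omitting failed_zone models a region-wide failure; omitting both
    failed_region and failed_zone models the loss of an entire cloud.
    """
    def is_down(p):
        if p.cloud != failed_cloud:
            return False
        if failed_region is not None and p.region != failed_region:
            return False
        if failed_zone is not None and p.zone != failed_zone:
            return False
        return True

    return any(not is_down(p) for p in placements)


# Illustrative deployments; the labels are placeholders.
single_zone = [Placement("amazon", "us-east", "a")]
multi_zone = single_zone + [Placement("amazon", "us-east", "b")]
multi_region = single_zone + [Placement("amazon", "us-west", "a")]
multi_cloud = multi_region + [Placement("terremark", "enterprise", "main")]

print(survives(single_zone, "amazon", "us-east", "a"))  # one zone fails
print(survives(multi_zone, "amazon", "us-east", "a"))   # one zone fails
print(survives(multi_zone, "amazon", "us-east"))        # whole region fails
print(survives(multi_region, "amazon", "us-east"))      # whole region fails
print(survives(multi_cloud, "amazon"))                  # whole cloud fails
```

The multi-zone deployment rides out a single-zone failure but not a fault that spreads across the region, which is the situation that surprised those who had designed around zones alone.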
So what is required to survive this kind of failure? Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud. I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years, and this belief is one of the cornerstones of the CloudSwitch strategy. We believe that it should be your choice where to use new features and solutions and where to use your traditional systems and processes, and that it should be easy to blend the two.
To that end, some of our customers are using a technique that extends their environments into the cloud; using the cloud to pick up additional load on their systems. In this failure case, they were able to rely on their internal systems to continue their operations. In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region, or in their existing data center). With the cloud, they can bring up new systems utilizing their backups, and continue with operations. The CloudSwitch system allows them to bring up systems in different regions, or even different clouds in response to outages; our tight integration with the data center tools allows them to use their existing monitoring systems and adjust for problems encountered in the cloud through automation.
How did we do? We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted. Of those that were impacted by the outage, a few key systems were “switched” back to our data center, and unfortunately a few went down. On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism; we felt this was sufficient for these applications, and therefore we did not bother to run more traditional backups (or data replication). Given what we have learned from this experience and from observing how the community dealt with this outage, we will now review those decisions. In the end, we’ll have a few more traditionally protected systems, and a few fewer that rely solely on the cloud provider’s infrastructure for data protection.
The Amazon outage severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are a fact of the compute landscape; the only question is how we ensure that we have adequate protection. The answer lies in the experience that we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.
