Enterprise Cloud Computing Blog

cloud storage

Amazon's Outage: Winners and Losers

By John Considine

In case you haven’t heard, last week Amazon’s Web Services had an extended outage that affected a lot of cloud users and has created a big stir in the cloud computing community.  Here is my take on the outage, what it means, and how it affected both us and our customers.

First, the outage – Amazon’s Elastic Block Storage (EBS) system failed in one “small” part of Amazon’s mammoth cloud.  The easiest way to identify with this failing system is to think of it as the hard drive in your laptop or home computer dying.  Almost everyone has experienced this frustrating failure and it is always a frightening event.  Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that are required to give your computer its identity and run properly.  When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again?  If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.

Well, that is what happened in Amazon – the service that provided disk drives to the virtual machines in the cloud had a failure.  Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die”, and the big question was (repeated over and over)– what have I lost, do I have any backups, how long is it going to take me to get running again?  At a more pressing level the question becomes did Amazon lose our data, or just connectivity to that data?  In some circles there is no distinction between the two, but for most people, it just like that laptop – can I get my stuff back, or do I have to start from scratch?  The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity or that the data protection scheme that Amazon has in place was good enough to recover the data from the failed systems (or both).

There were three startling revelations from this failure:

  1. Cloud storage can fail – Ok, so this should not be startling, but we have not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems.  There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well known and respected Simple Storage Service (S3).  This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.

  2. People were using this storage without knowing it – Amazon is using their own infrastructure to deliver their other services as evidenced by other features being degraded or un-available when the EBS system went down.  Some have speculated that in addition to the RDS service and the new Elastic Beanstalk, that some core networking functions could have been affected.

  3. This storage failure apparently jumped across “data centers” - OK, so this is a big one; Amazon encouraged us to build applications for failures and after an outage early in 2008, they introduced the notion of Availability Zones.  These zones were designed to be independent data centers (or at least on separate power supplies, different networks, and not sharing core services).  This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in one zone.  The fact that this issue “spread” to more than one zone is a big deal - those who designed for failures using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.

This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and lots of companies were impacted by the fault.  Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t.  Those who continued operation with little or no disruption fall into three groups – 1) Those who were lucky, 2) Those who were not using the effected services, 3) Those who had designed for this level of failure.

Based on our experiences and much of what I have read, the majority of the success cases during this failure were related to luck and those who didn’t use the service.  Keep in mind that Amazon is huge, and they have “regions” all over the world including East and West coast of the US, Singapore, Tokyo, and Ireland.  Each of these regions has at least two availability zones.  The failure was primarily focused on one zone within one region.  This means that everything running in other zones and other regions remained up and running during this outage and thus the majority of deployments worldwide were unaffected.  I have read a few blogs of major users that stated that they don’t use the EBS service and thus had little or no trouble during this outage.

So what is required to survive this kind of failure?  Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud. I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years, and it is one of the cornerstones of the CloudSwitch strategy.  We believe that it should be your choice on where to use new features and solutions, and where to use your traditional systems and processes, and it should be easy to blend the two. 

To that end, some of our customers are using a technique that extends their environments into the cloud; using the cloud to pick up additional load on their systems.  In this failure case, they were able to rely  on their internal systems to continue their operations.  In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region, or in their existing data center).  With the cloud, they can bring up new systems utilizing their backups, and continue with operations.  The CloudSwitch system allows them to bring up systems in different regions, or even different clouds in response to outages; our tight integration with the data center tools allows them to use their existing monitoring systems and adjust for problems encountered in the cloud through automation.

How did we do?  We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted.  Of the few that were impacted by the outage, a few key systems were “switched” back to our data center, and a unfortunately a few went down.  On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism;  we felt this was sufficient for these applications, and therefore  we did not bother to run more traditional backups (or data replication).  Given what we have learned from this experience and from observing how the community dealt with this outage we will now review those decisions.  In the end, we’ll have a few more traditionally protected systems, and a few less that rely solely on the cloud providers infrastructure for data protection.

The outage from Amazon severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are a fact in the compute landscape, the only question is how do we insure that we have adequate protection? The answer lies in the experience that we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.

File under: ,
0 comment(s) so far...

Moving to the Cloud: Key Considerations for Cloud Storage

This post is part of a series examining the issues involved when moving applications between internal data centers and public clouds.

The true challenges in storage and data management in the cloud result from the diverse and often unfamiliar processes and infrastructures offered by the cloud providers, including: new provisioning methods, storage properties, data population and transfer, and systems for data management (snapshots, clones, replication, backup). The cloud providers define the relationship between servers and storage and often impose constraints on everything from allocation size limits to the ways in which storage is managed. These are just some of the things you’ll want to consider as you start to think about integrating cloud computing into your existing IT environments.

I’d like to focus in detail on the complexity and variability of cloud provisioning and storage properties. There are different models for storage in existing compute clouds, with the most common being an “inclusive” storage model. In this model, each server comes with a certain amount of storage attached to it. The storage is a fixed capacity that is provisioned when you create the server from the pre-existing templates.

For example, Rackspace gives you disk space that is proportional to the memory (RAM) size you select.  The smallest memory/disk combination is 256MB of memory with 10GB of disk. With each doubling of memory, the disk space is also doubled until you get to roughly 16GB of memory and 640GB of disk.  With the new Terremark vCloud Express, you get a system disk that is predefined for each “template” server you select.  For a standard Linux distribution, you get a 10GB system disk, for Windows 2K3 you get a 20GB disk and for W2K8, you get a 40GB disk. Terremark’s vCloud Express allows you to add additional storage as new disks, while others (like Rackspace) allow to “resize” your servers and storage to create a new server with a larger disk and copy your data into it. 

Amazon offers several distinct types of storage within EC2. The default storage you get with each server you create in the cloud is called “ephemeral” storage. You then have the option of allocating and attaching Elastic Block Storage (EBS), and there is also an object store system called Simple Storage Service (S3).  Ephemeral and EBS are standard “block storage” devices – meaning they are viewed and used as disks attached to your server (/dev/sdg in Linux, D: in Windows) while S3 requires an API or other tools to integrate with your systems. The good part about the EC2 storage offerings is that you have some powerful options as you build for the cloud; the hard part is mapping the proper resources to your applications and integrating this with your existing processes. Specifically, the base storage is ephemeral, which means that if you power-off the server, or it has a hard fault, all the data on that storage is lost. This means that everything on these drives (boot parameters, application updates, user data, logs, etc.) is subject to loss when you power off the machine. There are several methods of handling this situation: 1) Build your servers every time you start them from a formula or other sources such that you don’t depend on the base storage being persistent; 2) Use Amazon or third party tool sets to periodically “bundle” your servers into S3 (effectively taking a snapshot of the server); or 3) Attach EBS storage to your image and store your important data on persistent storage.

Turning to granularity, we find a wide range in the units or increments of available storage in the various cloud providers. There is the “included” storage mentioned above that is often based on the size of the server and the requested OS type. To add storage, we find cloud providers (such as Amazon) allowing 1GB increments up to 1TB, and others (like Flexiscale) allowing only fixed increments of 50GB/100GB/250GB. For Rackspace, you can resize both the server and storage according to the defined fixed ratios, but these are bound to memory and CPU so there is no independent scaling of storage. The bare-metal cloud provider NewServers allows iSCSI storage to be attached to your servers in 250GB increments. In the cloud these varied increments really matter, because you are paying by the GB/month and if you need just a little more storage, you could end up having to purchase 10x more storage than you need, or having to pay for more memory and compute than you need.

The conclusion we can draw is that there are numerous storage configuration options in the cloud, and these options become linked to the server “flavors” defined by individual cloud providers. Because you don’t have the same control or even mechanisms in the cloud as you do in the local data center, the manner in which you allocate, populate and manage data in the cloud will be different. The work you do to understand and map your applications’ requirements into cloud-based storage requires changing your processes to match those of the cloud, and often this work is cloud-specific. 

Beyond configuration issues, of course, there are many other concerns. For example, with data management, you have to determine how you will get your data into the system, how to grow your systems and how to protect your data. Right now most clouds use template servers that you have to build up from a pre-installed base operating system using update mechanisms and then re-installing the application components. As for protecting your data, there are also many cloud-specific options available – from RAID-protected EBS in Amazon, to data snapshots and cloning, to backup services offered by companies like Rackspace.

The bottom line is that cloud storage can be simultaneously simple and complex – just like cloud computing in general. It’s simple to use if you just want to try something new; complex if you want to integrate cloud storage into your existing processes and infrastructures.

Next: Networking in the Cloud

0 comment(s) so far...