Enterprise Cloud Computing Blog


Planning for Outages in the Cloud

By Dave Armlin, Director of Customer Support at CloudSwitch

The Amazon US East outage of just over a week ago was an eye-opener for many people. Here at CloudSwitch it validated what we know about best practices for using the cloud. Not surprisingly, these reflect traditional IT processes and systems that enterprises know are needed to protect data and ensure that applications remain available to users.

From an enterprise support perspective, I’m very happy about the variety of options we have to protect and back up data, scale and shrink application capacity, and bring applications online and offline. It’s also easier than ever to make sure that you don’t have all of your “eggs in one basket,” thanks to public clouds and products like CloudSwitch.

CloudSwitch was designed to bridge the worlds of the data center and public cloud, making it extremely easy and safe to move virtual machines into the cloud over an encrypted tunnel onto encrypted storage. You can also deploy multiple copies of your virtual machines to different availability zones, regions, or even clouds. Because CloudSwitch acts as a layer-2 bridge between the data center and cloud, virtual machines in the cloud are able to seamlessly access the data center and vice versa, allowing data and applications to live in either the data center or the cloud. This provides some great opportunities to continue to use your existing IT management tools while taking advantage of cloud storage and cloud compute power in powerful new ways.

We encourage our customers to make full use of the opportunities that CloudSwitch enables:

  • Deploy virtual machines to multiple availability zones, regions and/or clouds.
  • Clone existing virtual machines or create hot “point in time” snapshots.
  • Continue to utilize traditional backup methods.
  • Employ traditional file system and/or database replication and load balancing to make your applications as available as possible.
  • Automate scripted deployments and life cycle actions on virtual machines.

When reviewing DR strategies with customers, we recommend the following approach:

  • Review all applications and associated virtual machines and prioritize them to determine the appropriate DR strategy.
  • Review and test monitoring of each application to ensure that you can detect and be alerted to application failures as quickly as possible.
  • Eliminate single points of failure for critical applications by providing multiple points of presence in the cloud.           
  • If you are using replication technology to keep copies of virtual machines in sync, ensure that you have appropriate alerts in place to detect synchronization failures.
  • Determine if failover and failback must be automated or manual processes.
  • If load balancing is required, weigh the options and limitations of the different load balancers. Amazon’s ELB, for instance, can balance across availability zones but not across regions.
  • For lower-priority systems that don’t require HA, consider scheduled automated clones/snapshots, traditional backups, or both.
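The first two review steps above can be captured in a very small script. What follows is only a sketch: the three tiers, the strategy wording, and the `Application` fields are hypothetical names made up for illustration, not part of CloudSwitch or any cloud API.

```python
from dataclasses import dataclass

# Hypothetical tiers and strategies, chosen only for illustration.
STRATEGY_BY_TIER = {
    1: "multiple points of presence (regions or clouds) with replication",
    2: "multiple availability zones with replication",
    3: "scheduled clones/snapshots, traditional backups, or both",
}


@dataclass
class Application:
    name: str
    tier: int        # 1 = most critical, 3 = does not require HA
    monitored: bool  # True once failure alerting has actually been tested


def dr_review(apps):
    """Return (name, strategy, warnings) rows, most critical first."""
    rows = []
    for app in sorted(apps, key=lambda a: a.tier):
        warnings = []
        if not app.monitored:
            warnings.append("monitoring not yet tested")
        rows.append((app.name, STRATEGY_BY_TIER[app.tier], warnings))
    return rows


if __name__ == "__main__":
    inventory = [
        Application("internal wiki", tier=3, monitored=True),
        Application("support portal", tier=1, monitored=False),
    ]
    for name, strategy, warnings in dr_review(inventory):
        print(f"{name}: {strategy} {warnings}")
```

Even a list this simple forces the prioritization conversation and makes untested monitoring visible before an outage does.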

Eating Our Own Dogfood

In our own internal operations, we’ve deployed www.cloudswitch.com to both Amazon’s US-East Region and US-West as well as Terremark’s Enterprise cloud.  We utilize an open source file system synchronizer to keep these copies in sync. This application also has an automated backup process that backs up data to Amazon S3.

CloudSwitch’s web portal for download/activation and support consists of database and application servers which are deployed to multiple regions within Amazon utilizing database (master-slave) replication.

As our CTO John Considine noted in last week’s blog post, we and our customers had a range of experiences during the Amazon outage, which only reinforced the importance of planning and implementing the procedures outlined above.  In some cases, we relied too heavily on snapshots for some of our internal systems, and recognized after the fact that we needed a careful DR review and prioritization of the many applications we have running in the cloud.

Any DR/HA plan should be routinely validated. We’ve taken this recent event as a good opportunity to do this internally and have been working with our customers to remind them of the options that are available to them with CloudSwitch to protect data and make systems more resilient.


Amazon's Outage: Winners and Losers

By John Considine

In case you haven’t heard, last week Amazon’s Web Services had an extended outage that affected a lot of cloud users and has created a big stir in the cloud computing community.  Here is my take on the outage, what it means, and how it affected both us and our customers.

First, the outage – Amazon’s Elastic Block Store (EBS) system failed in one “small” part of Amazon’s mammoth cloud.  The easiest way to identify with this failing system is to think of it as the hard drive in your laptop or home computer dying.  Almost everyone has experienced this frustrating failure and it is always a frightening event.  Your hard disk contains all of the data and programs you know and love (as well as all of the complex configurations that few really understand) that are required to give your computer its identity and run properly.  When it fails, you get that sinking feeling – what have I lost, do I have any backups, how long is it going to take me to get up and running again?  If you are lucky, it’s actually something else that failed, and you can remove the disk drive and plug it into a new computer, or at least recover the data from it.

Well, that is what happened in Amazon – the service that provided disk drives to the virtual machines in the cloud had a failure.  Thousands (or perhaps tens of thousands) of computers suddenly had their hard drives “die”, and the big question (repeated over and over) was – what have I lost, do I have any backups, how long is it going to take me to get running again?  At a more pressing level the question becomes: did Amazon lose our data, or just connectivity to that data?  In some circles there is no distinction between the two, but for most people, it is just like that laptop – can I get my stuff back, or do I have to start from scratch?  The good news is that it appears that most, if not all, of the data was recovered from this failure – indicating that the failure was in connectivity or that the data protection scheme that Amazon has in place was good enough to recover the data from the failed systems (or both).

There were three startling revelations from this failure:

  1. Cloud storage can fail – OK, so this should not be startling, but we have not seen a failure like this in Amazon before, and we took it for granted that Amazon had “protected” the data and systems.  There are nice features like “snapshot to S3 storage” that let users make copies of their disks into Amazon’s well-known and respected Simple Storage Service (S3).  This feature made people feel safe about their backups – right up to the point that they could not access the snapshots during the failure.

  2. People were using this storage without knowing it – Amazon uses its own infrastructure to deliver its other services, as evidenced by other features being degraded or unavailable when the EBS system went down.  Some have speculated that, in addition to the RDS service and the new Elastic Beanstalk, some core networking functions could have been affected.

  3. This storage failure apparently jumped across “data centers” - OK, so this is a big one; Amazon encouraged us to build applications for failures and after an outage early in 2008, they introduced the notion of Availability Zones.  These zones were designed to be independent data centers (or at least on separate power supplies, different networks, and not sharing core services).  This would allow companies deploying into Amazon to place servers and applications into different zones (or data centers) to account for the inevitable faults in one zone.  The fact that this issue “spread” to more than one zone is a big deal - those who designed for failures using different zones in Amazon’s east region were surprised to find that they could not “recover” from this failure.

This outage is a major event in cloud computing – the leader in cloud computing had a failure, a service went down for an extended period of time, and lots of companies were impacted by the fault.  Now everyone is looking for the winners and losers – those who survived this outage and those who didn’t.  Those who continued operation with little or no disruption fall into three groups – 1) Those who were lucky, 2) Those who were not using the affected services, 3) Those who had designed for this level of failure.

Based on our experiences and much of what I have read, the majority of the success cases during this failure were related to luck and those who didn’t use the service.  Keep in mind that Amazon is huge, and they have “regions” all over the world including East and West coast of the US, Singapore, Tokyo, and Ireland.  Each of these regions has at least two availability zones.  The failure was primarily focused on one zone within one region.  This means that everything running in other zones and other regions remained up and running during this outage and thus the majority of deployments worldwide were unaffected.  I have read a few blogs of major users that stated that they don’t use the EBS service and thus had little or no trouble during this outage.

So what is required to survive this kind of failure?  Many would say new architectures and designs are required to deal with the inherent unreliability of the cloud. I believe that customers can keep the same techniques, architectures, and designs that have been developed over the last 30 or more years, and it is one of the cornerstones of the CloudSwitch strategy.  We believe that it should be your choice on where to use new features and solutions, and where to use your traditional systems and processes, and it should be easy to blend the two. 

To that end, some of our customers are using a technique that extends their environments into the cloud; using the cloud to pick up additional load on their systems.  In this failure case, they were able to rely  on their internal systems to continue their operations.  In other cases, customers want to use their existing backup systems to create an independent copy of their critical data (either in a different region, or in their existing data center).  With the cloud, they can bring up new systems utilizing their backups, and continue with operations.  The CloudSwitch system allows them to bring up systems in different regions, or even different clouds in response to outages; our tight integration with the data center tools allows them to use their existing monitoring systems and adjust for problems encountered in the cloud through automation.

How did we do?  We’re very heavy users of the cloud, and many of our servers in Amazon were not impacted.  Of the few that were impacted by the outage, a few key systems were “switched” back to our data center, and unfortunately a few went down.  On the servers that went down, we had decided to use Amazon’s snapshot feature as the data protection mechanism; we felt this was sufficient for these applications, and therefore we did not bother to run more traditional backups (or data replication).  Given what we have learned from this experience and from observing how the community dealt with this outage, we will now review those decisions.  In the end, we’ll have a few more traditionally protected systems, and fewer that rely solely on the cloud provider’s infrastructure for data protection.

The outage from Amazon severely impacted many businesses and has caused many others to question the wisdom of clouds. The reality is that public and private clouds are a fact in the compute landscape; the only question is how we ensure that we have adequate protection. The answer lies in the experience that we have gained over the past couple of decades in building robust systems – in other words: what’s old is new.


CloudFormation – Cool, But Who Is It For?

By John Considine

A few weeks ago Amazon released a new feature for Amazon Web Services (AWS) called CloudFormation.  This allows a user to organize the process for provisioning and operating resources in the AWS environment and is an evolution of the AWS model of “some assembly required.”  We have often viewed the features and functions within AWS as a box of parts, from which users are left to build their own creations.  This model is highly biased towards developers, the kind of people who like to have a box of parts and are willing to put in the effort to build new and interesting creations from them.

CloudFormation allows a user to coordinate a number of features within Amazon’s environment, such as: launch a set of AMIs (virtual machine images with applications), configure a security group (pseudo firewall), set up an ELB (Amazon’s version of a web load balancer), and configure CloudWatch monitoring and alarms.  All of this can be managed from a template that describes each of these setup steps, and is written in easy-to-use JSON.
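To give a feel for what such a template looks like, here is a minimal sketch that builds one as a Python dictionary and prints it as JSON. It covers only two of the pieces mentioned above (an instance and a security group); the logical names `WebServer` and `WebSecurityGroup`, the placeholder image ID, and the instance type are assumptions for illustration, and the result has not been validated against a live AWS account.

```python
import json


def build_template(image_id):
    """Return a small CloudFormation-style template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative sketch: one server behind a security group",
        "Resources": {
            # The "pseudo firewall": allow inbound HTTP only.
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTP",
                    "SecurityGroupIngress": [
                        {
                            "IpProtocol": "tcp",
                            "FromPort": 80,
                            "ToPort": 80,
                            "CidrIp": "0.0.0.0/0",
                        }
                    ],
                },
            },
            # The server, launched from an AMI and tied to the group above.
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,
                    "InstanceType": "m1.small",  # assumed size
                    "SecurityGroups": [{"Ref": "WebSecurityGroup"}],
                },
            },
        },
    }


if __name__ == "__main__":
    # "ami-00000000" is a placeholder, not a real image.
    print(json.dumps(build_template("ami-00000000"), indent=2))
```

Note that the template only says which image to launch and how to wire it up; what is inside the image is decided elsewhere.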

So this new feature is pretty cool, but after working with it for a while, I’ve been wondering who the target user is.  If you are a developer who is interacting with AWS through their API, then you already have a method of coordinating the resources and services in Amazon.  By definition, up to this point, you had no choice.  But more than that, if you are programming to the API, you want to have control over the details of your deployment, and to be able to monitor the steps and process.  CloudFormation is an alternative to your current methods, but is not necessarily better – if you are using the APIs, you still have to monitor the progress and deal with faults during the CloudFormation process.

On the other side of the spectrum, there are the “enterprise-class” users who are looking for full configuration management of their deployments – they want to control the full lifecycle of their system and software deployments including change control of all of the components within the system.  The CloudFormation solution is really a provisioning engine, and even at that, it leaves off the early and late parts of provisioning – the actual configuration of the base servers, and the “customization” aspects of running in Amazon. Configuration and customization include things like creating the base images, controlling the OS configuration (kernels, boot parameters, etc.), selecting device drivers for consistent integration and operation, adjusting for randomly-changing IP addresses in Amazon, configuring load balancing based on the notion of instance ID rather than IP address, etc.  The actual construction of the application and the configuration of the OS is done outside of CloudFormation, with CloudFormation operating as a provisioning engine.

Given that the developers have the tools they need to coordinate the provisioning and the enterprises are looking for full configuration management, where does this leave the target market for CloudFormation?  Clearly the Amazon console users that are interacting with Amazon through the AWS portal are best served by this new feature.  CloudFormation gives these users a simple “portal” for provisioning and managing their cloud deployments – but it comes at the cost of programmatic access and integration with existing application lifecycle tools and processes. Console interaction drives cloud activity into its own silo, and fosters the concept of the cloud as being a separate, foreign, and independent environment.

So what does this feature mean for CloudSwitch customers?  Not much really, since our customers are looking for tight integration with their existing systems and processes, and want to have end-to-end control over their virtual hardware, operating systems, and application configuration.  While CloudFormation is designed to allow a user to coordinate a number of features and functions of AWS, the user still has to use the new and somewhat different components provided by AWS. For example: using AMIs for their VM images, limitations on the kernels, operating systems, and OS configurations, firewall and load balancing configurations that are non-standard, and behaviors in the deployment and operation that deviate from the expected behavior in the enterprise. 

In the CloudSwitch model, if a user wants to configure a firewall, they use a full-featured firewall with full configurability, not an Amazon-specific version; if a user wants to monitor their applications, they use their existing tools and processes; and if a user wants to have full configuration management of their deployments, they can control every detail of their servers’ virtual hardware, operating systems, networking, and applications – and not conform to the restrictions of the cloud provider.  As our customers know, CloudSwitch is about giving the enterprise full control over cloud configurations and processes, rather than coordinating the components that a cloud provider delivers.


AWS and Freedom of Speech?

By John McEleney

The blogosphere and Twitter have been in overdrive the past couple of days with the removal of WikiLeaks from AWS. The reaction and condemnation of Amazon has been swift and often brutal – charging the company with censorship and cowardly behavior. Consider the announcement from WikiLeaks on Twitter:

“WikiLeaks servers at Amazon ousted. Free speech the land of the free — fine our $ are now spent to employ people in Europe.”

Even the New York Times is fanning the flame by suggesting that Amazon yielded to political pressure from Senator Lieberman: “WikiLeaks’ illegal, outrageous, and reckless acts have compromised our national security and put lives at risk around the world,” Mr. Lieberman said. “No responsible company – whether American or foreign – should assist WikiLeaks in its efforts to disseminate these stolen materials.”

It’s very clear that WikiLeaks violated their terms of service; in fact Amazon posted this announcement on their AWS site:

Amazon Web Services (AWS) rents computer infrastructure on a self-service basis. AWS does not pre-screen its customers, but it does have terms of service that must be followed. WikiLeaks was not following them. There were several parts they were violating. For example, our terms of service state that “you represent and warrant that you own or otherwise control all of the rights to the content…, that use of the content you supply does not violate this policy and will not cause injury to any person or entity.” It’s clear that WikiLeaks doesn’t own or otherwise control all the rights to this classified content.

I believe the decision by Amazon was neither censorship nor cowardly. If I had to choose a word to express the action taken, I would call it consistent. It is consistent with the agreement that end users accept when they use AWS.  I applaud Amazon for taking this action. While there are valid arguments for both sides of the WikiLeaks issue, these are part of a much broader debate over the democratization of information enabled by the internet and the moral code that journalists in the print media have lived by for so many years. For Amazon, the issue is more specifically related to the nature of the WikiLeaks content.

Sometimes it is useful to examine this type of decision at a personal level. Imagine that tomorrow someone steals some of your own personal property and tries to sell it on eBay. Wouldn’t you expect eBay to respond to your request by removing the offending posting? That’s exactly what Amazon did – once alerted to the fact that WikiLeaks was using AWS to distribute material that did not belong to them, Amazon took the controversial, but proper step (consistent with their terms of usage) of discontinuing WikiLeaks’ service.

Conspiracy theorists will say that Obama and half the government called Jeff Bezos and demanded that he stop WikiLeaks or else… my guess is that the truth is that the AWS team simply looked at the material and made the decision to terminate their access because it violated their terms of usage.

This is plain and simple – no major conspiracy, no attack on freedom of speech, just consistent business practices – which is exactly what you and I should expect from a leading cloud provider.


What Cloud APIs Show Us About the Emerging Cloud Market

By John Considine

While there is no “official” definition of cloud computing, I believe programmatic access to virtually unlimited network, compute, and storage resources is an essential characteristic.  Even though many users access cloud computing through consoles and third-party applications, the foundation of a cloud is a solid Application Programming Interface (API).

Since CloudSwitch works with many cloud providers, we have the opportunity to interact with a variety of cloud APIs—both active and soon-to-be-released versions.  After working closely with both the APIs and those implementing them, I’d like to share some impressions:

  1. Despite all the discussion about standards, clouds are still very different.  The important takeaway here is that cloud APIs have to cover a lot more than start/stop/delete a server, and once the API crosses into provisioning the infrastructure (network ranges, storage capacity, geography, accounts, etc.), things get more interesting.
  2. A cloud requires a very strong infrastructure to work properly.  For public clouds, the infrastructure needs to be good enough to sell to others.  If you know what to look for, key elements of the cloud API can inform you about the infrastructure, what tradeoffs the cloud provider has made, and the impact on end users.  (More on this later.)
  3. The cloud capabilities, and thus the APIs, are evolving fast.  We see new API calls and expansion of existing functions as cloud providers add new features and capabilities.  At the same time, we are talking with cloud providers about services that are coming soon and what form their API is likely to take.  This is a great place to leverage the experience and work of companies like CloudSwitch to integrate the new capabilities into a coherent data model, and keep up with the changes.

An API can give a good indication of what is going on inside the cloud, particularly when you look at the functions beyond simple virtual machine control.  I like to look at the network and storage APIs to understand how the cloud is built.  For instance, in Amazon, the base network design is that each virtual server receives both a public and a private IP address.  The addresses are assigned from a pool based on where your machine ends up within their infrastructure so that the cloud provider can route network traffic to your servers.  However, even though you get two IP addresses, the public one is actually just routed (or more accurately NAT’ed) to the private address.  In Amazon, you only have a single network interface to your server, which is a simple and scalable architecture for the cloud provider to support, but will cause problems for applications that require at least two NICs (like some cluster applications).
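The public/private arrangement is easier to see in a toy model. The sketch below is not Amazon’s implementation, just an illustration of one-to-one NAT under my own assumptions: the `NatTable` class and the sample addresses are hypothetical, and the server itself only ever knows about its private address.

```python
import ipaddress


class NatTable:
    """Toy one-to-one NAT: each public address maps to one private address."""

    def __init__(self):
        self._public_to_private = {}

    def assign(self, public, private):
        pub = ipaddress.ip_address(public)
        priv = ipaddress.ip_address(private)
        if not priv.is_private:
            raise ValueError(f"{private} is not a private address")
        self._public_to_private[pub] = priv

    def translate_inbound(self, destination):
        """Return the private address behind a public one, or None."""
        return self._public_to_private.get(ipaddress.ip_address(destination))


if __name__ == "__main__":
    nat = NatTable()
    # Sample addresses only; the public one is from a documentation range.
    nat.assign("203.0.113.10", "10.0.0.5")
    print(nat.translate_inbound("203.0.113.10"))
    print(nat.translate_inbound("203.0.113.99"))
```

Because the translation happens outside the server, the server sees a single interface, which is exactly why software that expects two NICs has trouble.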

An interesting contrast to this design is found in Terremark’s cloud offering.  Like Amazon, IP addresses are defined by the provider so they can route traffic to your servers, but instead of the generic pool of addresses used by Amazon, Terremark allocates a range for your use when you first sign up.  The good side of this approach is better control of the assignment of networking addresses; the bad side is potential scaling issues since you only have a limited number of addresses to work with.  In addition, you can assign up to four NICs to each server in Terremark’s Enterprise cloud, which lets you create more complex network topologies and support applications that require multiple networks for proper operation.

Just when you thought this all makes sense, you have to take into account that in the Terremark model, servers only have internal addresses.  Unlike Amazon, there is no default public NAT address for each server.  Rather, Terremark has created a front-end load balancer that can be used to connect a public IP address to a specified set of servers by protocol and port.  For each protocol and port you want to connect to your server, you must first create an “Internet Service” (in Terremark language) that defines a public IP/Port/Protocol and then assign a server and port to the Service, thus creating a connection.  Since this is a load balancer, you can add more than one server to each public IP/Port/Protocol group.  Now that we have opened the discussion on load balancers, I have to mention that Amazon has a load balancer function as well.  And while it is not required to connect public addresses to your cloud servers, it does support connecting multiple servers to a single public IP address.
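As a rough data model, an “Internet Service” is a public IP/Port/Protocol triple plus a list of internal server endpoints. The sketch below is mine, not Terremark’s API: the class and method names are hypothetical, and the round-robin selection is an assumption, since I am not describing how their balancer actually chooses a server.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    ip: str    # internal address of a server
    port: int


@dataclass
class InternetService:
    public_ip: str
    port: int
    protocol: str
    backends: list = field(default_factory=list)

    def add_server(self, ip, port):
        """Assign an internal server and port to this public service."""
        self.backends.append(Endpoint(ip, port))

    def pick(self, request_number):
        """Choose a backend round-robin (an assumed policy)."""
        if not self.backends:
            raise LookupError("no servers assigned to this service")
        return self.backends[request_number % len(self.backends)]


if __name__ == "__main__":
    web = InternetService("203.0.113.20", 80, "tcp")  # sample address
    web.add_server("10.0.0.5", 8080)
    web.add_server("10.0.0.6", 8080)
    for n in range(3):
        print(n, web.pick(n))
```

A service with no servers assigned is unreachable, which mirrors the point above: a server has no public presence until you explicitly connect it to one.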

The key point is that the APIs and the feature sets they define tell a story about the capabilities and design of a cloud infrastructure.  Decisions made at the infrastructure level—like network address allocation, virtual device support, and load balancers—will impact the end user features, flexibility, and scalability of the whole service.  When considering what cloud environment is best for your applications, you need to look down to the API level to understand how the cloud providers’ infrastructure decisions will impact your deployments.

Building a cloud is clearly complicated—but it provides an unbelievably powerful resource when it’s done right.  Cloud providers choose key components and a base architecture for their service which results in clouds with different “sweet spots”.  With CloudSwitch, you can span these different clouds and put the right application in the right environment.


Holiday Presents from the Cloud

As the year winds down, there are a few things I have come to expect: holiday parties, snow, and new features from cloud providers. This year exceeded all of my expectations, starting with a note in early December from our friends at Terremark letting us know that they have fixed their Windows pricing for cloud servers. Until this upgrade, if you started a Windows server in their cloud, you had to pay for a whole month of Windows licensing ($30-$100 depending on the version) no matter how much you used the server. This was rather un-cloudlike, where we want to only pay for what we use. With this new feature, running Windows in Terremark’s cloud only costs a few cents per hour (Linux cost + 20%).
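To see why this matters for short-lived servers, here is a back-of-the-envelope comparison. It is only a sketch under stated assumptions: the Linux hourly rate passed in is hypothetical (not Terremark’s price list), and I am assuming the old monthly license fee was charged on top of the normal server usage.

```python
def old_monthly_cost(hours_used, linux_hourly, monthly_license):
    """Old model (assumed): usage at the Linux rate plus a flat license fee."""
    return hours_used * linux_hourly + monthly_license


def new_monthly_cost(hours_used, linux_hourly, surcharge=0.20):
    """New model: Windows billed hourly at the Linux rate plus 20%."""
    return hours_used * linux_hourly * (1 + surcharge)


if __name__ == "__main__":
    linux_hourly = 0.05      # hypothetical rate in dollars per hour
    monthly_license = 30.00  # low end of the $30-$100 range quoted above
    for hours in (10, 100, 500):
        old = old_monthly_cost(hours, linux_hourly, monthly_license)
        new = new_monthly_cost(hours, linux_hourly)
        print(f"{hours:>4} h: old ${old:.2f}, new ${new:.2f}")
```

The fewer hours a server runs, the more the flat fee dominated the old bill, which is why the change makes Windows behave like any other pay-for-what-you-use resource.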

Then came the snow—I live in New Hampshire, and on December 9th we received a foot of new snow to really get the season going. The very next day, Amazon made a big flurry of announcements—support for Windows 2008, the ability to boot from EBS, and the new US region US-West1.

Each of these features means big things for Amazon and for cloud users. First, support for Windows 2008 is a longstanding request from Amazon users. I think that Amazon was held back from supporting W2K8 because of the design of their boot volumes, which needed to be copied out of S3 into the instance’s local storage in order to boot the operating system. As the boot volume grows, the amount of resources consumed and the boot time of the servers grow significantly, with W2K8 requiring more than 10GB by default. In order to support W2K8, Amazon required another technology advance to make it possible: booting from EBS snapshots.

Perhaps the biggest problem enterprise users had with Amazon was the lack of persistent storage for boot volumes. Amazon has now created a way for users to build persistent boot volumes, coming up to parity with competitors on this feature. Sure, it’s a little different from how enterprises normally think about storage and configure boot volumes, but the ability to use EBS volumes for booting eliminates the window for data loss that most users had to contend with in the original boot methods. (This feature is not huge for CloudSwitch customers because we have always supported booting from EBS as part of our products; however, we can take advantage of this feature to improve boot times for servers in Amazon.)

Another major Amazon announcement is the new west coast region. Many of CloudSwitch’s early customers (not to mention our own development activities) are based on the east coast, so EC2’s primary location has been a good fit for us. Things only improved with the introduction of the Europe region since we have seen a lot of interest in European resources for both locality and compliance reasons. However, for west coast customers, having to hop across the whole country to access your cloud resources was less than ideal. Now these companies have local resources to target, but more importantly, this ongoing expansion shows that the public cloud is doing well. The addition of US-West1 and the soon-to-open Asia region reflects just how quickly the public cloud is growing and how hard Amazon is driving it.

The news from Amazon comes on top of what was already an outstanding year for cloud computing with major announcements from many key players, including: IBM software running in the cloud, new VMware-based public clouds, reduced pricing for servers and storage in the cloud, and Microsoft’s Azure gaining momentum. Each of the cloud providers is growing and maturing its cloud offerings, and we are reaching a tipping point where there are multiple clouds with sufficient features to support enterprise workloads. Get ready for 2010—it’s going to be an exciting year as large-scale enterprise cloud computing takes off.
