5 Things To Do Before the Next AWS Outage

I am a firm believer in cloud computing. In fact, I’m building my business on the expectation that the cloud will be a standard part of software and IT architecture in the coming years. The recent issues with the Northern Virginia AWS data center show that the cloud is still maturing and requires thoughtful architecture and engineering practices to deal with outages. Rather than taking this opportunity to bash AWS, or the cloud in general, I think it is time to review the fundamentals of software and IT architecture and why the cloud is important.

Life Before Cloud Computing

In my recent talks with company founders, I constantly come up against the old myth of the cloud: “Isn’t the cloud just about hosting?”

Well, it is, but the cloud takes hosting to a new extreme. Remember, it was only a few years ago that you were often faced with a decision: shared hosting, dedicated hosting, or co-locating your own servers. While there were multiple data center options, all were tied to specific hardware that you purchased or leased. If that data center went down, you couldn’t move your hardware to a new data center. Instead, you had to wait until the repairs were completed.

With the addition of virtualization and API-based provisioning, the cloud opens up the option to spin up new resources in the data centers of your choice. It even gives you the option of moving to a new vendor, if necessary.

Cloud Computing Is the Means, Not the End

The story of the cloud doesn’t stop at API-based provisioning of virtual resources. Instead, the cloud becomes a more elastic foundation for your infrastructure. It makes disaster recovery easier, but not automatic (at least not yet): you still have to do the work to make it happen.

Because the cloud allows for the provisioning of virtual hardware and services, more automation can be applied to reduce downtime. However, IaaS (and even PaaS) vendors cannot understand how your application works and what to do in case of a failure. You do. This is where disaster recovery planning must take place.
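To make that point concrete, here is a minimal sketch in Python of an application-level failover decision. It assumes you can describe each deployment target with a health probe that you write yourself. The `Region` class, the `choose_active_region` function, and the region names are hypothetical and exist only for illustration; they are not part of any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Region:
    """One deployment target. Names here are illustrative assumptions."""
    name: str
    is_healthy: Callable[[], bool]  # application-specific health probe


def choose_active_region(regions: Sequence[Region]) -> Optional[Region]:
    """Return the first region, in priority order, whose probe passes."""
    for region in regions:
        try:
            if region.is_healthy():
                return region
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            continue
    return None


# Hypothetical probes. Real ones would check whatever matters to your
# application: database reachability, queue depth, a test transaction.
primary = Region("primary", is_healthy=lambda: False)  # simulated outage
standby = Region("standby", is_healthy=lambda: True)

active = choose_active_region([primary, standby])
print(active.name if active else "no healthy region")
```

The vendor can tell you whether a virtual machine is running. Only your own probe can say whether the application on it is actually doing its job, which is why the probe is passed in rather than built in.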

5 Things To Do Before the Next AWS Outage

  1. Build a disaster recovery plan for your organization. It may be one plan, or a series of plans for each of your applications.
  2. Review your systems architecture to determine what needs to be done to execute your disaster recovery plan. Dependencies upon IaaS vendors, third-party APIs, and third-party SaaS vendors are all part of this architecture and must be reviewed.
  3. Review your cloud provider’s architecture to determine possible single points of failure (SPOF). Heroku is built on AWS and (today) resides in only one region, and is therefore subject to major outages. My hope is that they will offer a multi-region option for applications, but none exists today.
  4. Automate recovery processes whenever possible to reduce the number of manual steps. The last thing you want during a crisis is a set of manual processes that become a breeding ground for mistakes.
  5. Practice recovery techniques before an outage happens. If you have ever been in line at a retail store when the computers go down, you immediately know whether the staff have practiced a manual checkout process. Apply the same principle to your own team, including backup team members and new staff.
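Items 4 and 5 can be combined in a scripted runbook that supports a dry-run mode for drills. The following is a rough sketch in Python under stated assumptions: the `Step` and `Runbook` classes, the step descriptions, and the simulated failure are all hypothetical, and a real runbook would call your own provisioning and restore tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]  # raises an exception on failure


@dataclass
class Runbook:
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, dry_run: bool = True) -> List[str]:
        """Run steps in order and stop at the first failure.

        With dry_run=True nothing is executed; the plan is only listed,
        which makes it safe to walk through during a practice drill.
        """
        log: List[str] = []
        for number, step in enumerate(self.steps, start=1):
            if dry_run:
                log.append(f"{number}. WOULD RUN: {step.description}")
                continue
            try:
                step.action()
                log.append(f"{number}. OK: {step.description}")
            except Exception as error:
                log.append(f"{number}. FAILED: {step.description} ({error})")
                break
        return log


def restore_database() -> None:
    # Simulated failure, standing in for a problem a drill might uncover.
    raise RuntimeError("backup not found")


runbook = Runbook("web app failover", [
    Step("Start standby servers", lambda: None),
    Step("Restore database from latest backup", restore_database),
    Step("Point DNS at standby servers", lambda: None),
])

for line in runbook.run(dry_run=True):
    print(line)
```

Calling `runbook.run(dry_run=False)` executes the steps for real and halts at the first failure, so a broken restore step is discovered during practice rather than in the middle of an outage.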

Cloud or no cloud, both humans and machines fail. Prepare for it. Plan for it. Practice for it. But don’t ignore it.