
The Amazon Web Services outage took our website down for almost two days before we were able to migrate Wattpad’s content, user libraries and infrastructure to new servers.

During the outage, we heard very little from Amazon, a lot from users and even more from online media. Some of the loudest voices questioned the viability of cloud providers for business owners.

Are the conveniences of the cloud worth the risk of possible outages? In short, yes.

Wattpad is a young business that is expanding by 300% a year. For us and other new start-ups, hosting our own servers would distract our team from its core competencies, causing our engineers to spend too much time managing hardware rather than improving our product.

As we require more space at often-unpredictable times, cloud servers provide flexible scalability at a value that many start-ups cannot beat on site.

As some journalists suggested, perhaps this is not a question of whether we should move away from the cloud but whether we were managing our portion of the cloud correctly. We will be looking at expanding our recovery plans as we continue to grow and possibly introducing a hybrid on/off site approach for critical data. But we think the benefits of the cloud outweigh its possible growing pains.

The Easter Weekend outage also had some unintended benefits:

In two days, we increased our Twitter followers and Facebook fans by one third. There was so much traffic on these channels, it was impossible to keep status updates near the top, prompting our users to retweet our posts and update users who had not heard about the outage.

Beyond our own channels, we saw hundreds of Wattpad users complain on Amazon’s Facebook page and Twitter feed, encouraging a quick response to our CEO’s requests for information.

Most interesting was how quickly users rallied together to entertain one another while our engineers rebuilt the site, inspiring one user to post on Facebook that they almost missed the camaraderie, funny videos and impromptu games other users were posting. Another reader commented that it was like an extended power failure, with everyone camping out until the site and apps were back to normal.

And like a power outage, the relief of being up and running again outweighed the novelty; but for just a few days, it really wasn’t that bad.

Via the Wattpad Watt’s Up blog

1 COMMENT

  1. Our site was also caught in the outage, but our guys had us moved to a new availability zone by that evening and things fully restored by the next morning.

    What people seem to be forgetting is that this happens all the time with self-managed data centers or 3rd-party data centers. This is the first outage we’ve had in the 2 years we’ve been in Amazon’s cloud. I certainly can’t say that about any other place I’ve worked.

    The only way to truly protect yourself is to have a disaster recovery site in the cloud with a different provider on the opposite side of the country (or the world). That costs money and you have to maintain it exactly as you do your main site, which uses more resources. We don’t have a life-or-death business so the low risks at Amazon are good enough for us.

    Amazon is not perfect, however. They certainly need to improve their cloud service CS up to the same excellent level they have for Kindle CS. Communication was poor: they didn’t update the status page until 40 minutes after we reported the issue. They also need to find out exactly what happened. Those four availability zones were supposed to be physically separate from each other, so how were they connected such that a network failure in one took down the other three?
