Amazon’s cloud outages causing concern

Virtually every day, cloud computing technology is helping new businesses launch and established media organizations expand in cost-effective ways that were not possible five years ago. But the technology is not without its problems, and that has many broadcasters and content distributors concerned. When a remotely operated service goes down, revenue (and sometimes valuable customer data) is lost. In the most recent example, Netflix blamed a failure of its Internet streaming service on Amazon Web Services (AWS), a major remote signal processing and online streaming service based in Northern Virginia.

The outage affected Netflix customers across the United States, Canada and Latin America. It began at 3:30 p.m. Eastern time on Christmas Eve and lasted for some users into Christmas Day. It was at least the third major outage for the company during 2012.

The cause of the failure, according to published reports, was a shutdown of several Elastic Load Balancers (ELBs), which distribute network traffic to Netflix customers to support online streaming. While the ELBs serving Mac and PC streaming were unaffected, those users experienced latency issues and may have needed to reload a stream.

“Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls,” wrote Adrian Cockcroft of Netflix in a blog posting explaining the outage. “Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs.

“Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application. Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.”
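The arrangement Cockcroft describes can be sketched in a few lines of Python. This is only an illustrative model, not Netflix's or Amazon's code: the `LoadBalancer` class, the `route` function, and the device-group and server names are all hypothetical. It shows why the failure of a single balancer cuts off one group of devices while the servers, and every other device group, keep working.

```python
from itertools import cycle


class LoadBalancer:
    """Hypothetical stand-in for one ELB fronting a pool of servers."""

    def __init__(self, name, servers):
        self.name = name
        self.healthy = True
        self._pool = cycle(servers)  # simple round-robin over the pool

    def forward(self, request):
        # A failed balancer loses its ability to pass requests along,
        # even though the servers behind it are still running.
        if not self.healthy:
            raise ConnectionError(f"{self.name} cannot pass requests to its servers")
        return f"{next(self._pool)} handled {request}"


# Assumed mapping: each group of similar devices depends on one balancer.
balancers = {
    "game-consoles": LoadBalancer("elb-consoles", ["srv-a", "srv-b"]),
    "web-browsers": LoadBalancer("elb-web", ["srv-c", "srv-d"]),
}


def route(device_group, request):
    """Send a request through the balancer that the device group depends on."""
    try:
        return balancers[device_group].forward(request)
    except ConnectionError as err:
        return f"FAILED: {err}"


# Simulate the outage: one balancer fails, the other is untouched.
balancers["game-consoles"].healthy = False
```

In this model, `route("game-consoles", "play")` reports a failure, while `route("web-browsers", "play")` is still served by the servers behind the healthy balancer.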

Amazon reported that the failure occurred because “... data was deleted by a maintenance process that was inadvertently run against the production ELB state data.” This caused data to be lost in the ELB service back end, which in turn caused the outage of a number of ELBs in the US-East region across all availability zones.

Netflix said it is still early days for cloud computing, and there is more to do to build resiliency into the system.

“In 2012, we started to investigate running Netflix in more than one AWS region and got a better gauge on the complexity and investment needed to make these changes,” Cockcroft wrote. “We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive, more complex and cause new problems that might make the service less reliable. As always, we are hiring the best engineers we can find to work on these problems, and are open sourcing the solutions we develop as part of our platform.”
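One way to picture the "cascading overload" risk Cockcroft mentions is a failover plan that moves traffic out of a failed region only as far as the surviving regions have spare capacity, and rejects the rest rather than overloading them. The Python sketch below is a simplified illustration under assumed numbers, not a description of how Netflix switches traffic: the `plan_failover` function, the region names other than US-East, and all load and capacity figures are hypothetical.

```python
def plan_failover(load, capacity, failed):
    """Redistribute a failed region's load without overloading survivors.

    load and capacity map region name -> requests per second.
    Returns (new_load, shed), where shed is traffic that had to be rejected.
    """
    new_load = {region: l for region, l in load.items() if region != failed}
    remaining = load[failed]

    # Fill the regions with the most spare capacity first.
    by_spare = sorted(
        new_load, key=lambda r: capacity[r] - new_load[r], reverse=True
    )
    for region in by_spare:
        spare = max(capacity[region] - new_load[region], 0)
        moved = min(spare, remaining)
        new_load[region] += moved
        remaining -= moved

    return new_load, remaining


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
load = {"us-east": 100, "us-west": 40, "eu-west": 30}
capacity = {"us-east": 120, "us-west": 80, "eu-west": 60}

new_load, shed = plan_failover(load, capacity, failed="us-east")
```

With these figures the two surviving regions absorb 70 of the 100 requests per second and the remaining 30 are shed. A naive plan that moved everything would push both regions past their capacity, which is the kind of secondary failure the quote warns about.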

Amazon said it has made a number of changes to protect the ELB service from similar disruptions in the future.