Backup and disaster recovery

It is an engineer's worst nightmare: a backup plan that fails. Whether the initiating event is a complete on-air server failure, a virus infection of the master control computers or a late-night fire that destroys the entire facility, we try hard to anticipate the worst. But sometimes backup plans fail anyway. Here are some suggestions to help you avoid a failed disaster recovery plan.

As you dust off your backup and disaster recovery plans, it may help to first think about the sorts of things you are trying to recover from. Here are some possibilities:

  • Loss of facilities, which could be the result of a tornado, fire, hurricane, etc.
  • Loss of key equipment, including failure of servers, automation, traffic, power systems, HVAC, etc.
  • Loss of personnel.
  • Environmental disaster, such as a gas leak or chemical spill.
  • Virus or other computer attack, including infection of critical systems, a denial-of-service attack on a WAN feed, etc.
  • Physical attack, for example, the Discovery Networks attack earlier this year.

Boxing your recovery plans

Recovery plans can cover a broad range of issues, and these different issues may require completely different responses. This brings me to the first key point: Recovery plans should be developed by a multidisciplinary group with diverse skill sets and good knowledge of the company.

As you consider the different types of events that could negatively affect your facility, put a box around your recovery plans. In other words, decide which events are covered and which are not. For some facilities, being off the air is absolutely not an option. Fox Television, for example, has facilities in Los Angeles and Houston, and the broadcaster has made substantial investments to ensure that its operations can continue even in the event of a major earthquake or hurricane. Other companies may decide that, for some events, the costs are too great and the risks too remote to justify a plan that covers the most extreme cases; they may plan for a loss of servers, but not for a loss of the entire facility. Boxing your recovery plans is critical, but it is just as important that the affected departments help decide what is in the box and what is out. Communicating the plan and developing it as a team is vital: if the plan is ever put into effect, you want those affected by it to have participated in the decisions that are now being carried out.

Recovery plans can be expensive to develop and expensive to deploy. Top management must buy into these plans and participate in their development, at least at a high level, because in many cases recovery plans involve basic decisions about the business.

Testing the recovery plans

When thinking about recovery plans, consider what can be done as part of normal operations to contribute to recovery. For example, if you have an ingest process for on-air content, think about what it would cost to have that ingest create another copy that is stored at a remote location as part of the normal workflow. If you are thinking about building a remote news studio in another part of town, consider what it would take to do a minimal newscast entirely from that remote studio, assuming the main studio was completely inaccessible. Steps in a recovery plan that become a normal part of everyday workflows will get done. Processes that require attention outside of normal operations will get missed over time, especially automated, unattended processes that can fail silently. To the greatest extent possible, activities that are part of recovery plans should be part of normal operations.
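
To make this concrete, here is a minimal sketch of what a post-ingest mirroring step might look like. It assumes the ingest system drops finished files into a local watch folder, that a remote site is reachable over SSH and that rsync is available; the folder paths and host name are placeholders, and your automation or asset-management vendor may already offer an equivalent hook.

    #!/usr/bin/env python3
    """Hypothetical post-ingest hook: mirror newly ingested media to a remote site.

    Assumptions (not from this article): ingest writes finished files into
    INGEST_DIR, the remote host is reachable over SSH, and rsync is installed.
    """
    import subprocess
    from pathlib import Path

    INGEST_DIR = Path("/media/ingest/ready")                      # assumed local watch folder
    REMOTE_TARGET = "backup@dr-site.example.com:/media/mirror/"   # placeholder DR host

    def mirror_to_remote() -> None:
        """Copy everything in the ingest folder to the remote mirror."""
        # -a preserves timestamps and permissions, --partial lets interrupted
        # transfers resume, and --checksum re-verifies files already on the far
        # end instead of trusting size and modification time alone.
        cmd = [
            "rsync", "-a", "--partial", "--checksum",
            str(INGEST_DIR) + "/",   # trailing slash: copy contents, not the folder itself
            REMOTE_TARGET,
        ]
        subprocess.run(cmd, check=True)   # raise on failure so it surfaces in monitoring

    if __name__ == "__main__":
        mirror_to_remote()

Because a step like this runs from the same scheduler that already drives ingest, a failed transfer shows up in day-to-day monitoring instead of being discovered during a disaster.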

I know of a number of companies that put a lot of effort into recovery plans, only to find out that the plans were incomplete and did not work when they were put into effect. This is a common problem. People develop elaborate plans, but they are hesitant to test them because something wrong with the plan might result in lost airtime during a simulated emergency. The fear is well founded; anyone who thinks hard about recovery plans quickly realizes how easy it is to miss a critical item. It can be frightening to pull that critical circuit breaker to find out whether the backup systems really work, but you have to do it. Otherwise, all the planning will have been wasted. That is not to say that every possible scenario must be played out; common sense must be your guide. But if none of the recovery plans are tested, how will you know whether they are adequate? Simulated tests must not be superficial.

Another key point is that recovery plans must be retested periodically. When I worked at Turner, we had an elaborate power distribution system that put two separate power feeds into every critical rack. But over time, someone doing maintenance would plug both cords of a redundant power supply into the same mains source, silently defeating the redundancy. Only periodic retesting catches that kind of drift. Top management must support these tests, with the understanding that a test may reveal a problem that affects air.

Over time, your business changes; this has certainly been the case in the media industry. When your plan was created, the part of your facility that created iPhone feeds was an experiment. This year it is a critical part of your business. Recovery plans must be reevaluated as the business changes.

Thinking through all aspects of a recovery plan can be challenging. That is why it is good to get a group of people with different skill sets together and tackle the activity as a team. For example, when you plan for a loss of power, how long do you plan for the power to be off? A relatively short outage, say one or two hours, is not really a problem. What about an outage that lasts long enough to use up all the diesel fuel? Do you have a contract with a fuel delivery company? What if a significant part of the city is without power? Does your contract guarantee delivery within a certain period of time, regardless of how many other customers are calling for fuel? How hot or cold is it when the power goes out? Do your plans include running the HVAC units at full capacity? Partial capacity? At partial capacity, what equipment would you have to turn off to keep the facility from overheating? Is that enough to keep you on the air?
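
To put rough numbers on the fuel question, runtime is simply usable fuel divided by the burn rate at the expected load. The figures below are placeholders for illustration only; real burn rates come from the generator's data sheet and depend heavily on the connected load.

    # Back-of-the-envelope generator runtime, using placeholder numbers only.
    # Replace the burn rates with the figures from your generator's data sheet.

    usable_fuel_gal = 1000.0        # assumed usable diesel on site, in gallons

    # Assumed burn rates for two load scenarios, in gallons per hour:
    burn_full_load = 70.0           # everything on, HVAC at full capacity
    burn_reduced_load = 45.0        # non-critical gear and part of the HVAC shed

    hours_full = usable_fuel_gal / burn_full_load         # about 14 hours
    hours_reduced = usable_fuel_gal / burn_reduced_load   # about 22 hours

    print(f"Full load:    {hours_full:.1f} hours before refueling")
    print(f"Reduced load: {hours_reduced:.1f} hours before refueling")

If the fuel contract guarantees delivery within, say, 24 hours, a back-of-the-envelope calculation like this tells you which loads would have to be shed to bridge the gap.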

If you back up all the content to a remote facility, are the software systems also backed up so that you have the metadata at the remote location to find the content on a server? Involving people from many different departments in your facility will help you to think more completely about all aspects of your recovery plan and may help you avoid missing something critical.
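
One hedged illustration of how the metadata might be folded into the same everyday workflow: export a small metadata sidecar next to each mirrored file. The lookup function below is a stand-in for whatever query your asset-management or automation database actually supports, and the paths and file extension are assumptions.

    #!/usr/bin/env python3
    """Hypothetical metadata sidecar export for a mirrored content store.

    lookup_metadata() is a stand-in, not a real API; replace it with whatever
    query your asset-management or automation database supports.
    """
    import json
    from pathlib import Path

    MIRROR_DIR = Path("/media/mirror")        # assumed staging copy of the remote mirror

    def lookup_metadata(media_file: Path) -> dict:
        """Placeholder for a real database query keyed on the file name."""
        return {
            "house_id": media_file.stem,
            "title": "UNKNOWN - fill in from the asset database",
            "duration_seconds": None,
        }

    def write_sidecars() -> None:
        """Write a .json sidecar next to every media file in the mirror."""
        for media_file in MIRROR_DIR.glob("*.mxf"):    # assumed MXF essence files
            sidecar = media_file.with_suffix(".json")
            sidecar.write_text(json.dumps(lookup_metadata(media_file), indent=2))

    if __name__ == "__main__":
        write_sidecars()

Even a minimal sidecar like this means the remote site can identify and re-catalog the content if the main database is unreachable.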

Conclusion

I want to leave you with one last thought. The German military leader Helmuth von Moltke said, “No plan survives first contact with the enemy.” We all know this to be true, but it should not keep us from planning. Many aspects of a recovery plan will work perfectly. Since we can anticipate that some aspects will not, recovery plans must be well thought out, yet flexible and adaptable enough to allow for unforeseen events.

Brad Gilmer is president of Gilmer & Associates and executive director of the Advanced Media Workflow Association.

Send questions and comments to: brad.gilmer@penton.com