Building disaster-resistant computer networks

Building disaster-resistant computer networks with fault tolerance and high availability.

For many years, the most important technology layer in a broadcast plant was video and audio routing: Lose your router and you were off the air. Over the years, computer networks have become just as critical. At first, networks were critical because they carried information (automation, traffic, etc.) about programs being broadcast. With the advent of AAF and MXF, networks are poised to become a major part of the content transfer infrastructure within a facility. The fact is, in many facilities today, if you lose your network, you will be off the air. This means it is important to focus on how to keep your network up and running in the event of a disaster.

Crucial issues to consider in designing networks are fault tolerance and high availability. Because you are using the same network to service a number of clients, a failure in the network can impact the entire operation. For years, I thought the only answer to the possibility of system failure was to design systems to be fault tolerant. Fault-tolerant systems are designed so that a single fault does not cause a total system failure. Such designs usually include dual power supplies, redundant disks and automatic changeover software. Systems of this type are designed as a single unit or set of interconnected units; they are sold as a system; and they may be quite expensive. Many are designed so that the only way you know there has been a failure is by checking status monitoring and alarms. For example, if you lose a disk in a redundant RAID array, the array keeps running, and you may not know there has been a failure unless you check its status.

Figure 1. Multiple connections between a server and a switch not only provide backup in case of NIC failure, but also allow users to build “fat pipes” to heavily accessed network equipment.

Another approach, which can be much more economical but may or may not provide the same protection from failure, is high availability. With high availability, the point is not to prevent failures. Instead, a designer uses off-the-shelf components to design a system such that a single failure has little impact or causes only a minimal outage. For example, a high-availability design might incorporate two completely separate Ethernet systems, with two Ethernet cards in each server and client instead of one. High availability typically takes advantage of the low price of consumer computer hardware. It might seem cumbersome to put together two completely separate Ethernet networks, but Ethernet is practically free these days unless you are talking about high-speed technology.

High-availability systems may have a higher fault rate than fault-tolerant systems, although this depends entirely on decisions made by the system designers. The bottom line? Fault-tolerant systems may indeed be more “fault tolerant” than high-availability systems, but there is a cost associated. It is up to the user to decide if fault tolerance is worth the expense.
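As a rough illustration of why duplicating inexpensive components can pay off, the following sketch computes availability for a single network path and for two independent paths in parallel. The 99.9 percent figures for the NIC and the switch are hypothetical, chosen only to make the arithmetic visible, and the calculation assumes the two paths fail independently of each other.

```python
HOURS_PER_YEAR = 8760


def series(*availabilities):
    """Availability of a chain in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant components where any one is enough."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


def downtime_hours(availability):
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical component availabilities (assumptions, not measured values).
nic = 0.999
switch = 0.999

single_path = series(nic, switch)                # one NIC through one switch
dual_path = parallel(single_path, single_path)   # two fully separate paths

print(f"single path: {downtime_hours(single_path):.1f} hours/year down")
print(f"dual path:   {downtime_hours(dual_path) * 60:.1f} minutes/year down")
```

With these assumed numbers, a single path is down roughly 17.5 hours a year, while two independent paths are down together for roughly two minutes a year. If both paths share a power feed, a rack or a cable run, the independence assumption no longer holds and the real benefit is smaller.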

Keep a backup

Figure 2. Connecting NICs to two different switches protects against a switch failure.

The first thing you may want to do is invest in backup hardware that can be put into operation in the event of a major failure. It is important to note that this technology moves fast. It is best to buy the minimum number of extra switches that will do, since next year there will always be a newer, faster technology available. In a typical facility, you'll need the switch you put on the shelf before the year is out anyway. The point is to have a spare available — just as you would have a spare klystron or VTR head wheel. Next, you should consider having a spare server available to be pressed into service on a moment's notice. In one facility where I worked, we planned that if the server went down, we would use a desktop unit as backup. We had the software loaded on a spare hard disk ready to go. One day the server crashed. We pulled the workstation out of an office in the engineering shop, installed the hard disk, and had the new “server” up and running in about five minutes.

You also should consider physically separating critical equipment, if possible. For example, if you have multiple T1 or DSL lines coming into your facility, make sure that at least one of those lines comes onto your property from a different direction. Backhoe fades are more common than you would expect. If you have multiple servers on your network, try to locate the servers in different spots in your building. Keep your tape or CD-ROM backup unit in a different part of the building from the devices it is backing up.

Some of you may need to recover from network outages more quickly than you can install a spare switch — say on the order of one to six seconds. In this case, you will need to look for active solutions. Both open and proprietary solutions are available that will provide failover in case of a network media (wire or fiber), switch or network interface card (NIC) failure.
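How quickly an active solution can react depends largely on how it detects the failure. The sketch below shows one common approach, a heartbeat monitor that declares a device failed after several missed heartbeats. The class name, the 0.5-second interval and the limit of three missed heartbeats are all hypothetical; real products use their own detection mechanisms and timers.

```python
class HeartbeatMonitor:
    """Declares a peer failed after `missed_limit` heartbeat intervals of silence."""

    def __init__(self, interval_s, missed_limit):
        self.interval_s = interval_s      # expected time between heartbeats
        self.missed_limit = missed_limit  # silent intervals tolerated before failover
        self.last_seen = None             # time of the most recent heartbeat

    def heartbeat(self, now):
        """Record that a heartbeat arrived at time `now` (in seconds)."""
        self.last_seen = now

    def is_failed(self, now):
        """True once the peer has been silent for the full timeout."""
        if self.last_seen is None:
            return False  # nothing heard yet, so there is nothing to time out
        return (now - self.last_seen) >= self.worst_case_detection_s()

    def worst_case_detection_s(self):
        """Longest time from an actual failure to its detection."""
        return self.interval_s * self.missed_limit


# Hypothetical settings: heartbeat every 0.5 s, fail over after 3 missed.
monitor = HeartbeatMonitor(interval_s=0.5, missed_limit=3)
monitor.heartbeat(0.0)
monitor.heartbeat(0.5)   # last heartbeat before the device dies

print(monitor.is_failed(1.0))   # still within the timeout
print(monitor.is_failed(2.0))   # 1.5 s of silence has elapsed
print(monitor.worst_case_detection_s())
```

Shorter intervals detect failures sooner but raise the risk of a false failover on a busy network, so the timer settings are a trade-off rather than a value to minimize.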

Hardware-based solutions typically involve NICs and switches. In some cases, the manufacturer allows a server to be connected to a switch through multiple NICs, each using its own port on the switch. In case of NIC failure, the other cards automatically take up the load. Not only does this solution provide protection from failure, it also allows users to aggregate bandwidth across multiple connections, providing a “fat pipe” onto the network for heavily used servers (see Figure 1). Note that this solution does not help if the switch fails.
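The behavior described above, for a server connected to a switch through several NICs, can be sketched in a few lines. This is a simplified model rather than any manufacturer's implementation: the `NicTeam` class, the interface names and the flow identifiers are hypothetical, and a CRC hash stands in for whatever distribution method a real driver uses.

```python
import zlib


class NicTeam:
    """Toy model of several NICs sharing load, with failover between them."""

    def __init__(self, nic_names):
        # Every NIC starts with its link up.
        self.link_up = {name: True for name in nic_names}

    def set_link(self, name, up):
        """Mark a NIC as up or down, as a link monitor would."""
        self.link_up[name] = up

    def healthy(self):
        """NICs that currently have a working link."""
        return [name for name, up in self.link_up.items() if up]

    def pick_nic(self, flow_id):
        """Choose a NIC for a traffic flow, using only healthy NICs."""
        members = self.healthy()
        if not members:
            raise RuntimeError("no healthy NIC left in the team")
        # The same flow always hashes to the same NIC while membership is stable.
        index = zlib.crc32(flow_id.encode("utf-8")) % len(members)
        return members[index]


team = NicTeam(["eth0", "eth1"])            # hypothetical interface names
flows = [f"10.0.0.{i}:5000" for i in range(1, 6)]

print({flow: team.pick_nic(flow) for flow in flows})

team.set_link("eth0", False)                # simulate a NIC failure
print({flow: team.pick_nic(flow) for flow in flows})
```

While both links are up, flows are spread across them, which is where the extra bandwidth comes from. After one link goes down, every flow lands on the surviving NIC. Both NICs still plug into the same switch in this model, so it offers no protection if that switch fails.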

Figure 3. Having multiple connections between switches provides redundancy in the case of the failure of a connector, wire or fiber optic cable.

Another solution is to connect the server to two different Ethernet switches. In this configuration, the goal is to protect the system from a switch failure. Both switches are connected to the same network. If one switch fails, traffic is automatically routed through the remaining switch (see Figure 2).

In some cases, networks — especially networks built to handle broadcast content — must handle heavy traffic. Some manufacturers enable users to build “fat pipes” between switches using multiple connections. This not only provides the user with redundancy, but also allows them to increase the speed of their networks. If a cable between the two switches fails, the switches will automatically reroute traffic to the remaining ports.

Some manufacturers carry this arrangement to its logical next step. They provide switches with multiple redundant physical connections to each port. Should a port fail, the switch changes to a backup port and media (see Figure 3).

Redundant routing

It is relatively easy to keep a spare Ethernet switch on the shelf. It takes more work to keep a backup server at the ready, but there are numerous options available, from ghosting server drives to clusters. But there is another area in large networks that requires some creative thinking. If your network is sufficiently large, you already may have deployed a router. I am not talking about the small DSL or T1 routers frequently deployed as edge devices to connect to the Internet, but about the more full-featured routers typically used in intranets to segment traffic among departments, provide network address translation and port address translation, execute complex firewall rules, and allow tight control of access to critical on-air operations. In many cases, these routers are actually active computer devices rather than dedicated single-board computers. They can have complex configuration files, and they build sophisticated tables as they learn about your network. Because those configurations and tables change in real time, recovering from a failure is not as simple as grabbing a spare box off the shelf. Router manufacturers understand that in some cases, failure is not an option, so they have developed a number of proprietary technologies that offer hot standby and load-balanced configurations.

In a hot standby configuration, the main and backup routers are in constant communication. The backup router is kept up-to-date in near real time. If the main router fails, the backup automatically and almost instantaneously switches online. In load-balancing configurations, there is more than one router in the system. The load is distributed among multiple routers according to user-configured parameters. The system is designed with enough spare capacity that, should a router fail, the others almost instantaneously take over the load. These solutions are not cheap, and installation and configuration are non-trivial. However, if your network needs to remain up no matter what, it might be worth the investment.
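The two configurations can be modeled in a few lines to make the design questions concrete. This is a conceptual sketch only, not any vendor's protocol: the router names, capacities (taken here to be in Mbit/s), route entries and the `HotStandbyPair` class are all hypothetical.

```python
def share_load(capacities, total_load):
    """Split total_load across routers in proportion to their capacity."""
    total_capacity = sum(capacities.values())
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")
    return {name: total_load * cap / total_capacity
            for name, cap in capacities.items()}


def survives_single_failure(capacities, total_load):
    """True if the remaining routers can carry the load after any one fails."""
    total_capacity = sum(capacities.values())
    return all(total_capacity - cap >= total_load
               for cap in capacities.values())


class HotStandbyPair:
    """Main router whose learned routes are mirrored to a backup."""

    def __init__(self):
        self.tables = {"main": {}, "backup": {}}
        self.active = "main"

    def learn_route(self, prefix, next_hop):
        """The main router learns a route and replicates it to the backup."""
        self.tables["main"][prefix] = next_hop
        self.tables["backup"][prefix] = next_hop

    def fail_main(self):
        """Simulate loss of the main router; the backup goes online."""
        self.active = "backup"

    def lookup(self, prefix):
        """Look up a route using whichever router is currently active."""
        return self.tables[self.active].get(prefix)


# Load balancing: three hypothetical routers of 100 Mbit/s each.
capacities = {"r1": 100, "r2": 100, "r3": 100}
print(share_load(capacities, 180))
print(survives_single_failure(capacities, 180))
print(survives_single_failure(capacities, 250))

# Hot standby: the backup already holds the route when the main fails.
pair = HotStandbyPair()
pair.learn_route("10.1.0.0/16", "10.0.0.2")
pair.fail_main()
print(pair.active, pair.lookup("10.1.0.0/16"))
```

The capacity check captures the sizing rule for load balancing: with three 100 Mbit/s routers, a 180 Mbit/s load survives the loss of any one router, but a 250 Mbit/s load does not. The standby model shows why continuous synchronization matters, since the backup can only take over cleanly if it already holds the tables the main router had built.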

Brad Gilmer is president of Gilmer & Associates, executive director of the AAF Association and executive director of the Video Services Forum.
