RAID for broadcast

Figure 1. The vulnerable period begins when a drive in a RAID fails and lasts until the new drive is installed, rebuilt and back online.

By now, most broadcasters have heard the term RAID. RAID stands for redundant array of inexpensive disks. Typically, RAID systems work by storing a little bit of extra information called parity bits along with the regular data. If a disk fails, these parity bits allow the array to rebuild data.
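To make the parity idea concrete, here is a minimal sketch in Python. It assumes a simple single-parity scheme in which the parity block is the byte-wise XOR of the data blocks; the column does not say which scheme a given array uses, and the drive contents and the function name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives, each holding one small block.
data = [b"VIDEO001", b"AUDIO002", b"META0003"]

# The parity block, stored on a fourth drive, is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block. The same arithmetic shows why this kind of scheme cannot recover from two lost drives at once.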

This month we will focus on key features and considerations for RAIDs used in the broadcast and post-production environments.

There are a number of features that are important in generic RAID applications. The following is a list of features and considerations that are especially important for broadcasters who are contemplating using RAID in their facilities for video and audio content.

Easy access to drives

All of us have seen RAID arrays that have large numbers of drives right at the front of the storage device. But I have seen some RAID implementations where the drives were buried deep inside the storage system. I would argue that this design negates the purpose of having the RAID in the first place. If you have to shut the system down entirely, take the equipment out of the rack and unscrew a lot of hardware to get to the drives, then you might as well have a single disk.

Hot-swappable drives

Many RAID arrays are designed so that drives can be replaced while the unit is still in service. If the objective of having a RAID solution is to avoid downtime, then this feature can be key.

Hot-swappable drives have special connectors that are designed to break and make connections in a specific order so that components are not destroyed as the drive is removed or installed. Many people think that hot swappable also means that the drive will automatically rebuild (see next point); however, this is not necessarily the case.

Rebuild in background

This feature allows the data on a drive to be reconstructed from the data on other good drives in the array. As noted above, RAID arrays can recreate missing data on the fly so the application never notices a drive has failed.

RAID arrays can also put this capability to work to recreate the data and write it to a new drive once it is installed in the array. Of course, you would expect this functionality to be available. But be careful. You may or may not be able to rebuild the data on the new drive without having to shut down the array. This used to be an exotic feature, but it is becoming much more common. If this functionality is important to you, be sure to ask for it.

Online spare

Some arrays let you install an extra drive that sits idle as a spare. When a drive fails, the extra drive can be put online, in some cases automatically. This can be a valuable option for critical RAID arrays. Remember, however, that the drive will still have to be rebuilt. Because there is no way to know in advance which drive will fail, it is impossible to have the extra drive ready to go at a moment's notice.

SNMP and other remote monitoring

This is one of my favorite topics when it comes to RAID arrays, and it is one I would encourage you to think about carefully. If one of the characteristics of a RAID array is to keep on working even when a drive fails, how will you know when a drive fails? If you lose a drive in a RAID array and then lose another drive in the same array before the first drive is replaced and rebuilt, will the RAID array keep working?

The answer to the first question is that you will not have any notice that a drive has failed if you are not monitoring the RAID's status. The answer to the second question is that the RAID will not keep working if a second drive fails.

It is extremely important that you monitor the health of the RAID array on a regular basis. In some cases, RAID monitoring is built into the application. If a drive quits, the application lets the operator know. But many times, especially when using generic (non-broadcast) applications, there is no notification to the user that a drive has failed. With that said, every RAID array I have ever seen has provisions for monitoring drive status, and almost all of them have remote monitoring provisions.

Be sure to incorporate monitoring of RAID arrays into your overall maintenance system. If you fail to do this, you are wasting the value of RAID. Eventually, two drives will fail. When they do, you will be in the same position you would be in with a single large drive.
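As a rough illustration of what such a check might look like inside a maintenance system, here is a minimal sketch in Python. Everything in it is hypothetical: the `ArrayStatus` record, the array names and the readings stand in for whatever your array actually reports through its remote monitoring interface, and the default of one tolerated failure assumes an array that survives a single drive loss.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    """One status reading from an array (hypothetical record layout)."""
    name: str
    failed_drives: int
    rebuilding: bool = False


def assess(status, tolerated_failures=1):
    """Classify an array; the default assumes it survives one drive failure."""
    if status.failed_drives > tolerated_failures:
        return "FAILED"
    if status.failed_drives > 0 or status.rebuilding:
        return "VULNERABLE"
    return "OK"


def alerts(statuses):
    """Return one alert line for every array that is not fully healthy."""
    return [f"{s.name}: {assess(s)}" for s in statuses if assess(s) != "OK"]


# Made-up readings; a real system would poll the arrays on a schedule.
readings = [
    ArrayStatus("playout-a", failed_drives=0),
    ArrayStatus("playout-b", failed_drives=1),
    ArrayStatus("archive", failed_drives=0, rebuilding=True),
]

for line in alerts(readings):
    print(line)
```

The point of the sketch is the middle state. An array that is still serving data with one drive down, or that is partway through a rebuild, looks fine to the application, so it has to be flagged by the monitoring system.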

As Figure 1 shows, the vulnerable period begins when a drive in an array fails and lasts until the drive has been replaced, rebuilt and is back online. If, during this vulnerable period, another drive fails, then the RAID array will fail and data will not be available. If you are not monitoring the health of your RAID systems on a regular basis, the time from drive failure to replacement may become exceedingly long, exposing you to a potential complete RAID failure.

Dual power supplies and redundant cooling

Dual power supplies and redundant cooling are quite common in RAID arrays. If you have an important service that you want to protect with a RAID array, be sure your array has these features.

The array should continue to work with a failed cooling fan. I have seen a number of RAID arrays that incorporate fan speeds and temperature alarms into their remote monitoring facilities. Because drives can fail when they get too hot, it can be important to have remote monitoring on cooling systems.

Redundant controllers

If you are looking to eliminate downtime, then you should look at the entire system, not just at the disks. RAIDs were developed to provide redundancy for rotating storage media, and they perform this task well (especially when vigilantly monitored).

However, electrical components do fail, and a RAID array connected to a single controller will not protect you from a controller failure. The disk controller is the electrical interface between the computer and the storage unit. In many higher-end systems, you can purchase a second controller as an option. If one of the controllers quits, the other takes over. Occasionally, you will see this described in literature as dual redundant controllers, but this seems a little redundant to me.

Rebuild time

It's important to know how long it takes your system to rebuild after a drive replacement. The more time it takes to rebuild and place a new drive online, the longer you're exposed.
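To get a feel for the numbers, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not from any particular product: drives fail independently at a constant rate set by a quoted MTBF, the rebuild runs at a steady throughput, and all of the figures below are invented for illustration. Real drives in the same array often share age and operating conditions, so treat the result as optimistic.

```python
import math


def rebuild_hours(capacity_gb, rebuild_mb_per_s):
    """Hours needed to rewrite one drive at a steady rebuild rate."""
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives, window_hours, mtbf_hours):
    """Chance that at least one surviving drive fails during the window.

    Assumes independent drives with a constant failure rate of 1 / MTBF.
    """
    rate_per_hour = surviving_drives / mtbf_hours
    return 1 - math.exp(-rate_per_hour * window_hours)


# Hypothetical example: 24 hours pass before anyone notices and swaps the
# drive, then a 500 GB drive rebuilds at 50 MB/s.
window = 24 + rebuild_hours(capacity_gb=500, rebuild_mb_per_s=50)
risk = second_failure_probability(
    surviving_drives=7, window_hours=window, mtbf_hours=500_000
)
print(f"Vulnerable window: {window:.1f} hours, second-failure risk: {risk:.4%}")
```

Changing the 24-hour figure to a week or a month shows how an unmonitored array stretches the window far more than the rebuild itself does.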

If during the rebuild time you lose another drive, the entire RAID system will fail. While the chances of experiencing two drive failures within a few hours of each other are remote, it is important to know that during the rebuild time, your system is vulnerable.

Testing

The best way to know for sure that your RAID system supports the features you expect is to test it. Assuming your system supports hot swapping and rebuilding in background, try the following:

Pull out a drive while the system is in operation (hopefully during commissioning tests, and not when it is actually on-air!). Does the system keep running normally? Do you see any indication that the drive has failed either in the application running on the RAID or on your monitoring system?
Plug the drive back in. Do you see any indication that the drive has been installed either in the application running on the array or on your monitoring system? Does the drive begin rebuilding by itself, or do you have to do something to initiate a rebuild? How long does it take before the drive is online again and ready to use?

RAID systems are great examples of how broadcasters can leverage IT technology to create more reliable platforms for their applications. But it is important to understand the features available and how they work to be sure that you get the expected performance from your array.

Brad Gilmer is president of Gilmer & Associates, executive director of the AAF Association and executive director of the Video Services Forum.

Send questions and comments to: brad_gilmer@primediabusiness.com