Troubleshooting RAIDs

Data storage is an ever-growing part of broadcasting, and RAID (redundant array of independent disks) systems are an important aspect of storage. From video servers to NLE (nonlinear editing) systems, RAID systems provide high-speed transfers as well as failure protection when the correct RAID level is used. RAID systems come in many different versions for a variety of applications. This tutorial addresses the levels of RAID protection in which data is preserved even with the loss of one or more hard drives.

RAID protection

RAID 1 uses two (or more) hard disk drives where all the data stored on one is copied, or mirrored, on the second drive. This provides complete protection but requires double the storage space.

RAID 5 stores data across several disks, a technique called striping, and keeps redundant (parity) data for each drive distributed across the other drives. If a drive is lost and a new one is inserted, the lost drive’s data can be rebuilt using the data stored on the other drives. In this way, no data is lost when a single disk drive fails. But if a second drive fails before the first one is rebuilt, all the data stored on the RAID will be lost. The advantage of RAID 5 is that it requires only one extra drive beyond what is needed for the required storage size. A 2TB RAID 5 made up of 500GB hard disks will require five hard disks, for a total of 2.5TB.
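The rebuild described above can be illustrated with a minimal Python sketch. It assumes XOR parity, the usual basis for RAID 5, and simplifies a stripe to four data blocks plus one parity block; real controllers rotate the parity block across the drives and work on much larger blocks. The function names are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def drives_needed(usable_gb, drive_gb):
    """Drive count for a RAID 5 of the given usable size: data drives plus one."""
    data_drives = -(-usable_gb // drive_gb)  # ceiling division
    return data_drives + 1


# One stripe spread across four data drives (illustrative contents).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing the third drive, then rebuild it from the survivors and parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt)                   # b'CCCC'
print(drives_needed(2000, 500))  # 5, matching the 2TB example above
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. If a second block were missing as well, the calculation would have nothing to solve for, which is why a second failure before the rebuild completes loses the array.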

RAID 10 is a combination of RAID 1 and RAID 0 in which pairs of drives mirror each other, and data is striped across a series of these pairs (RAID 0). In this case, as many as half of the drives could fail, and the system would not lose any data as long as both drives in a pair are not lost. This configuration adds fault tolerance and speeds data transfers.
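The failure rule for mirrored pairs can be expressed as a short check. This is only a sketch: the drive numbering and the function name are assumptions made for illustration, not part of any controller's interface.

```python
def raid10_survives(mirror_pairs, failed_drives):
    """Return True if every mirror pair still has at least one working drive."""
    failed = set(failed_drives)
    return all(not set(pair) <= failed for pair in mirror_pairs)


# Eight drives arranged as four mirrored pairs (hypothetical numbering).
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]

print(raid10_survives(pairs, [0, 2, 4, 6]))  # True: half the drives, one per pair
print(raid10_survives(pairs, [0, 1]))        # False: both drives of one pair
```

The two calls show why the number of failed drives matters less than which ones fail: four failures spread across the pairs are survivable, while two failures in the same pair are not.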

Less common RAID configurations are also in use, but they essentially share the same goals: high transfer speeds and fault tolerance, whereby one or more disk drives can fail without any data being lost. As long as a failed hard disk is replaced quickly, the chance of a second one failing in the meantime is very small. But the longer the wait to replace the failed disk drive, the greater the risk of losing all the data on the RAID system.

NAS systems

One of the many storage solutions in use today is the network attached storage (NAS) device. NAS devices are used for common office network storage as well as for video file and video server storage. NAS systems are easy to set up and use because they connect through the network and are usually controlled using a browser on another computer. They are self-contained computers designed for fast, reliable data transfers. NAS systems are very reliable, but as with any equipment, they can fail. And if an operator is unfamiliar with how they work, they can be very difficult to repair.

At one facility where a NAS was connected to the video server, a hard drive failed in the RAID system of the NAS, and the engineer thought it would be a simple job of swapping out drives. Not having a spare on hand, he ordered a replacement from the NAS manufacturer and promptly installed it when it arrived. But the RAID system would not rebuild with the new drive, so the manufacturer sent another one, and it suffered the same fate. The system recognized the new hard drive but would not use it. The RAID controller card was then suspected as the culprit; the manufacturer was contacted and a new card was sent, but it was the wrong type. Finally, the engineer received the correct RAID controller card and installed it. It too would not use the new hard drive. Now, the NAS manufacturer and the RAID controller card company were stumped. After two weeks, a solution was worked out, and the RAID was rebuilt, but only after three different hard drives and three different RAID controller cards had been tried. The solution was in knowing how RAID systems work and how they are structured.

Inside a RAID system

When a RAID system is initialized, it writes metadata to Block 1 on all the disks; this identifies each drive as part of the RAID and records which part of the array the particular drive belongs to. Because there may be different RAID systems, each RAID controller manufacturer has its own data structure for this metadata, and even then it sometimes changes the structure between product lines. This means that if a broadcaster has two RAID systems with controllers from the same manufacturer, and drives are swapped between them (by accident, of course), each controller will read the metadata and will not start rebuilding its array with the swapped drive because it knows the drive is part of a different array.

In the aforementioned example, both replacement disk drives had metadata on them, probably from quality control testing. If this metadata had come from a different RAID controller manufacturer, the NAS would not have recognized it and would have ignored it. But because it did recognize the metadata, it knew the disk drive had come from a different RAID array and would not write over it. Even formatting the disk drive would not have fixed this because formatting does not affect Block 1. The solution was to pay attention to what the system was telling the engineer.
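The controller behavior described above can be summarized as a small decision function. This is a sketch of the logic only: the field names, format labels and return values are hypothetical, and actual metadata layouts are specific to each controller manufacturer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveMetadata:
    vendor_format: str  # hypothetical label for the manufacturer's metadata layout
    array_id: str       # hypothetical identifier of the array the drive belongs to


def classify_drive(controller_format: str, controller_array_id: str,
                   meta: Optional[DriveMetadata]) -> str:
    """Decide how a controller treats a newly inserted drive."""
    if meta is None:
        return "available"  # no metadata: free to use for a rebuild
    if meta.vendor_format != controller_format:
        return "available"  # unrecognized metadata is ignored
    if meta.array_id == controller_array_id:
        return "member"     # already part of this array
    return "foreign"        # recognized, but from another array: not overwritten


# A replacement drive carrying leftover metadata in the same vendor's format.
leftover = DriveMetadata(vendor_format="vendor-a", array_id="test-array")
print(classify_drive("vendor-a", "array-0", leftover))  # foreign
print(classify_drive("vendor-a", "array-0", None))      # available
```

The "foreign" case corresponds to the replacement drives in the example: the controller understood the metadata, concluded the drive belonged elsewhere and declined to write over it.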

In the RAID controller’s BIOS, there was a screen informing the engineer that there were two arrays: one was “degraded” (array 0), and the other was “unusable” (array 1). Array 1 was unusable because it consisted of only one disk drive, the new one. What the engineer needed to do was delete this extra array; the new disk drive would then become available and could be rebuilt as part of the original RAID array.

Television engineering has become increasingly complex. While larger stations with their own IT departments may have an easier time with these sorts of problems, many stations rely on their engineering departments for all technical needs and solutions. This is one more reason engineers need to keep pace with the evolving technology of modern TV broadcasting.

Next time

The next “Transition to Digital” tutorial will continue the discussion about storage systems.