Data storage myths

Data storage reliability

The last “Transition to Digital” covered how data storage is quickly becoming the key element in the broadcast facility. Soon, most archive videotapes will be converted to digital media and catch up with current productions being recorded, edited and stored in digital form. As broadcasters rely more and more on digital storage, it becomes increasingly important for the broadcast engineer to understand the benefits and the limitations of modern magnetic disk and tape storage. RAID (redundant array of independent disks) storage systems are in widespread use today, and RAID 5 especially has become the de facto standard for data storage and reliability; but as with all systems, it has its weaknesses.

RAID 5 protection

Take a group of hard disk drives, write data across them in stripes, and then add parity information from which lost data can be reconstructed. By distributing both the data and the parity across all the drives, a level of redundancy is built in: if one hard disk drive fails, everything stored on it can be reconstructed from the surviving data by using the parity information.

The key to RAID 5’s redundancy is that the parity information is never stored on the same disk drive as the data stripe it is protecting. The parity is “distributed” across all the disk drives in the array. Once a disk drive has failed, it must be quickly replaced — having a spare hard disk drive on hand is the best plan (some systems incorporate a spare drive into the array). Once the spare is in place, the reconstruction of the data can begin; this is a processor-intensive operation because many calculations must be made to re-create the lost data. Rebuilding the data onto the new disk drive can take several hours.
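The parity arithmetic behind all of this is plain XOR. As a minimal sketch (the stripe values and names here are illustrative, not taken from any real controller), this shows how a lost stripe is rebuilt from the survivors plus parity:

```python
from functools import reduce

# Four data "drives", each holding one stripe of a block (illustrative values).
stripes = [b"\x10\x20", b"\x33\x44", b"\x0f\xf0", b"\xaa\x55"]

def xor_blocks(blocks):
    """XOR same-sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Parity is the XOR of all the data stripes (stored on another drive).
parity = xor_blocks(stripes)

# Simulate drive 2 failing: XOR the surviving stripes with the parity.
# Every surviving stripe cancels itself out, leaving the missing stripe.
survivors = stripes[:2] + stripes[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == stripes[2])  # True
```

A real controller repeats this calculation for every stripe in the array, which is why a rebuild must read every bit on every surviving drive, and why the operation is so processor-intensive and slow.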

RAID 5 is used where data must be available 24/7, such as video server storage or nearline storage. During data reconstruction, performance is degraded, which will manifest as slower read times, because whenever data that was lost is called for, it must be reconstructed on the fly.

What they don’t talk about

RAID 5 is a great system that has worked well for years, but there is a possible fault that is rarely talked about: read errors. Modern hard disk drives carry a specification for unrecoverable read error (URE) rate, stated in relation to the number of bits read. With disks now available in 1TB capacities, a URE becomes much more likely. What happens when a URE is encountered while a replaced RAID 5 drive is being rebuilt? Most systems will simply stop the rebuild, possibly losing all the data stored on the RAID system. Remember that the parity information needed to correct a read error on the other disks may have been stored on the failed drive, in which case it is not available to the RAID system to correct the error.

This is a main reason that high-reliability storage systems use much smaller hard disk drives, while most companies and individuals purchase larger ones. The less data stored on a drive, the smaller the chance of a URE. Next year, 2TB hard disk drives are expected to be available. By some calculations, a 14TB RAID 5 made up of 2TB disk drives has a 50 percent chance of hitting a URE, based on the URE rate of one per 10^14 bits that manufacturers specify for most hard disk drives (10^14 bits is about 12TB).
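Figures like these are easy to check. A simple model, which treats bit errors as independent (real drives do not strictly behave this way, so published estimates vary, and this toy model lands somewhat above the 50 percent figure cited here), gives the chance of at least one URE while reading back the 12TB held on the six surviving 2TB drives:

```python
import math

URE_PER_BIT = 1e-14     # one error per 10^14 bits, the common spec
BITS_PER_TB = 1e12 * 8  # bits in a decimal terabyte

def p_ure(tb_read):
    """Chance of at least one URE while reading tb_read terabytes."""
    bits = tb_read * BITS_PER_TB
    # P(no error) = (1 - p)^bits, computed via log1p for accuracy
    return 1.0 - math.exp(bits * math.log1p(-URE_PER_BIT))

# Rebuilding a 14TB RAID 5 of seven 2TB drives means reading the 12TB
# stored on the six surviving drives.
print(f"{p_ure(12):.0%}")  # prints 62%
```

Whatever the exact model, the trend is the same: once the data read during a rebuild approaches the reciprocal of the URE rate, a failed rebuild stops being a freak event and becomes the expected outcome.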

As the size of hard disk drives increases, the URE rate remains the same, so the chance of encountering a URE increases right along with the size of the disk drive. RAID 6 was developed to survive the failure of two disk drives by storing two independent sets of parity information in different parts of the RAID, but it now looks as if RAID 6 will be needed simply to counter the effects of UREs as disk drives keep growing.

Of course, if a RAID 5 or 6 encounters a URE, causing a rebuild to fail, the entire RAID can be rebuilt using the backup copy that was recently made — if there is a backup. Although all disk drive storage systems should be backed up on a regular basis, few are because of the associated cost and time.

Hard disk drive reliability

Last year, Google published a paper documenting the failure rates of disk drives in actual use. The study covers more than 100,000 disk drives from almost every manufacturer, and while the results weren't broken out by individual manufacturer, it does give some insight into how and why disk drives fail. The first statistic confirms something that has been known for years: if a disk drive is going to fail, it is most likely to do so in its first year of use, but only if it sees a lot of usage. At moderate to low usage, there is no difference in failure rates.

A surprising result of Google's study was the effect of temperature on disk drive failure. It seems that the colder a drive runs, the more likely it is to fail; the lowest failure rates occurred at temperatures of 95-105 degrees Fahrenheit. So cooler is not better for disk drives.

Today, many disk drives come with SMART (Self-Monitoring, Analysis and Reporting Technology), which allows the disk drive to report on its own health to the drive controller. Many different parameters are monitored and reported, and one purpose of the system is to detect drives that are failing but have not yet failed, alerting the user so the drive can be replaced before a complete failure occurs. The Google study found that SMART flagged only about 50 percent of the drives before they failed. That makes SMART a poor predictor of disk drive failure, and engineers should always keep a spare disk drive on hand.

These are just the highlights of the study. For more of the Google study, visit research.google.com/archive/disk_failures.pdf.

What all this means is that as broadcasting moves into the digital age, engineers are still needed to understand and monitor the technology that allows sound and pictures to flow out from the transmitter, over cable and, now, over the Internet.