Storage and redundancy

Today, 2TB hard disks sell for less than $100, and they come with USB 2 or USB 3 connections. You can plug them into your system, and they just work. It sounds great, but what happens when you fill that 2TB drive up? With just one drive failure, you can lose a lot of data. (See Figure 1.)

With legacy NTSC (uncompressed), 2TB could equal 26 hours of material. Or, if it’s DVCPRO HD at 720p, it would equal more than 37 hours. So, how often do these drives fail, and what can you do?
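To put those hour counts in perspective, the short sketch below backs out the average data rate each one implies. It assumes decimal units (1TB = 10^12 bytes) and treats the 26- and 37-hour figures as exact; the function name is made up for illustration.

```python
# Back out the average data rate implied by "N hours fit on a 2TB drive".
# Assumption: decimal units (1TB = 10**12 bytes), as drive makers advertise.

DRIVE_BYTES = 2 * 10**12  # the 2TB drive from the text


def implied_rate_mbps(hours: float, capacity_bytes: int = DRIVE_BYTES) -> float:
    """Average data rate, in megabits per second, that fills the drive in `hours`."""
    seconds = hours * 3600
    return capacity_bytes * 8 / seconds / 1e6


for label, hours in [("uncompressed NTSC", 26), ("DVCPRO HD 720p", 37)]:
    print(f"{label}: about {implied_rate_mbps(hours):.0f} Mb/s")
```

The results come out at roughly 171 Mb/s and 120 Mb/s, which is the total rate including audio and any file overhead, not just the video essence.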

Drive failures

Most of us experience a hard drive failure as the disk failing to spin or making unusual noises, after which the computer can’t see the drive, but there are other types of failure. Remember dropouts on videotape? Dropouts occurred when the magnetic oxide came off the tape, leaving no information on that section to be recorded or played back. VTRs had built-in dropout compensators to try to cover up the lost information, but this was all analog, so there were no fancy algorithms to recover lost data; the VTR could only copy nearby information to cover up what was missing.

This type of error can happen to hard disks as well. They may lose their oxide, or dirt may get in the way of the head. Sometimes a hard disk cannot read a certain sector on a platter. When this happens, whatever data was there is lost unless the error-correcting code (ECC) can recover it. ECC is similar to the forward error correction (FEC) used in satellite communications as well as our own ATSC terrestrial broadcast system. If the data cannot be recovered, the failure is labeled an unrecoverable read error (URE). In reality, the ECC written to the disk is just a checksum used to ensure that the data retrieved is good. The size of the sector determines the amount of data lost, and larger drives tend to use larger sectors. The ECC takes place within the hard drive, which then informs the operating system of the URE. At this point, if the OS cannot reconstruct the missing data (most are not designed for this job), it is lost and the file is corrupted.

A URE occurs about once every 10^14 bits read. That equals about once every 12TB, or the contents of six of those 2TB drives or four 3TB drives. And, when storing video, 12TB is really not that large. Drive manufacturers today are moving from the traditional 512-byte sector size to the much larger 4K-byte sector size for large hard disks. This means that if you encounter a URE, even if you have a way to recover from it, you have lost that much more data. One reason manufacturers moved toward the 4K sector size is to get around the small disk-addressing capability of an older OS such as Windows XP, which will only address about 2.2TB of data on a disk with 512-byte sectors. By making the sectors larger, the drive manufacturers can fool the OS into thinking it’s working with smaller drives.
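The arithmetic behind the 12TB figure can be checked with a few lines of code. This is a minimal sketch that assumes the commonly quoted rate of one URE per 10^14 bits read and decimal terabytes; the function names are made up for illustration.

```python
URE_RATE_BITS = 10**14  # assumed: one URE per 10^14 bits read
TB = 10**12             # decimal terabyte, in bytes


def terabytes_per_ure(bits_per_error: int = URE_RATE_BITS) -> float:
    """Average amount of data read, in TB, for each unrecoverable read error."""
    return bits_per_error / 8 / TB


def expected_ures(terabytes_read: float, bits_per_error: int = URE_RATE_BITS) -> float:
    """Average number of UREs expected when reading `terabytes_read` TB."""
    return terabytes_read * TB * 8 / bits_per_error


print(terabytes_per_ure())   # TB read per URE, on average
print(expected_ures(6 * 2))  # six 2TB drives read end to end
```

The first value works out to 12.5TB, and reading six full 2TB drives yields an expected 0.96 errors. In other words, one bad sector is the norm rather than the exception at that volume.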

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a system that lets hard drives report their health back to the disk controller. Using the data from S.M.A.R.T.-enabled drives, a system can learn more about the drives attached to it, such as temperature, number of read errors (even corrected ones) and many other parameters. With this data, there has been some success in predicting drive failure, but failure is often sudden and mechanical in nature.

Data redundancy

The only way to protect your data is to store it on more than one hard disk, and this can be done in several ways. An expensive but reliable way to keep redundant copies of your data is to use a RAID storage system. Most engineers are familiar with RAID and its various levels, but the one most identified with data reliability is RAID 5. In this configuration, the data is spread across all the drives in the array in a way that provides enough redundancy that if one drive fails, all of its data can be retrieved from the other drives. But if a URE is encountered while the surviving drives are being read to rebuild the data onto the replacement drive, all the data on the RAID is compromised and the rebuild will stop. (See Figure 2.)
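The redundancy in RAID 5 is commonly provided by XOR parity: each stripe carries one parity block that is the XOR of its data blocks, so any single missing block can be recomputed from the rest. The sketch below illustrates that idea on tiny byte strings. The block contents are hypothetical, and it ignores details of real controllers such as rotating the parity block across drives.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (hypothetical contents) plus one parity block.
data = [b"VID1", b"VID2", b"VID3"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1], then rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # the block that was on the failed drive
```

Note that the rebuild needs every surviving block in the stripe. If one of them cannot be read, the XOR cannot be completed, which is why a URE during a rebuild is so damaging.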

After that drive failed, the RAID 5 system was still working; it was just slower and vulnerable to a second drive failure. Now that the rebuild has halted due to the URE, all data needs to be transferred to another storage system until the array can be restored to full capacity. The probability of a URE after a single-drive failure has increased with the use of increasingly larger hard disks, because a rebuild has to read every surviving drive in full.
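A rough estimate shows how quickly that probability grows with drive size. The sketch below assumes a rate of one URE per 10^14 bits, treats every bit as failing independently (a simplification of real drive behavior), and assumes the controller reads each surviving drive from end to end. The function name and the four-drive array are illustrative only.

```python
import math


def rebuild_ure_probability(drive_count: int, drive_tb: float,
                            bits_per_error: float = 1e14) -> float:
    """Chance that a RAID 5 rebuild reads at least one unrecoverable sector.

    Rebuilding one failed drive means reading every surviving drive in full,
    so (drive_count - 1) * drive_tb terabytes must be read without error.
    """
    bits_to_read = (drive_count - 1) * drive_tb * 1e12 * 8
    # 1 - (1 - p)^n with p = 1/bits_per_error, written to stay accurate for tiny p
    return -math.expm1(bits_to_read * math.log1p(-1.0 / bits_per_error))


for size_tb in (0.5, 1, 2, 3):
    p = rebuild_ure_probability(drive_count=4, drive_tb=size_tb)
    print(f"4 x {size_tb}TB array: {p:.0%} chance of a URE during rebuild")
```

Under these assumptions, a four-drive array of 2TB disks has roughly a 38 percent chance of hitting a URE during a rebuild, and the figure keeps climbing as the drives get larger.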

RAID 6 is another option that can tolerate the loss of two disk drives and not lose any data. But as soon as it loses one drive, it becomes essentially a RAID 5, with all its vulnerabilities, until the failed drive is replaced.

Mirrored drives are another way to obtain redundancy, but they require doubling the number of hard drives and, thus, the cost.
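The cost trade-off among these options comes down to how much of the purchased capacity is left for data. As a simple illustration, assuming identical drives and the usual overhead of one drive for RAID 5, two for RAID 6 and half of the drives for mirroring, usable capacity can be compared like this (the function and level names are made up for the example):

```python
def usable_tb(level: str, drive_count: int, drive_tb: float) -> float:
    """Usable capacity in TB for identical drives at a given redundancy level."""
    if level == "raid5":      # one drive's worth of parity
        if drive_count < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drive_count - 1) * drive_tb
    if level == "raid6":      # two drives' worth of parity
        if drive_count < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drive_count - 2) * drive_tb
    if level == "mirror":     # every drive has a twin
        if drive_count < 2 or drive_count % 2:
            raise ValueError("mirroring needs an even number of drives")
        return drive_count / 2 * drive_tb
    raise ValueError(f"unknown level: {level}")


for level in ("raid5", "raid6", "mirror"):
    print(f"{level}: {usable_tb(level, 6, 2)}TB usable from six 2TB drives")
```

With six 2TB drives, RAID 5 leaves 10TB for data, RAID 6 leaves 8TB and mirroring leaves 6TB.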

Of course there are backup systems, software and drives, but these are essentially offline systems and are not helpful when you are running up against a deadline and need the data immediately.

Classes of hard disks

Two different classes of hard disks are on the market today. The first is known as desktop-class, which is the variety that you or I would buy at our local electronics retail outlet. The second is enterprise-class, which is what the likes of Google, Microsoft and any other large, data-dependent company use. They both use platters, motors and heads, but there are other differences besides price. (See Figure 3.)

With desktop drives, manufacturers expect that they will only be used a few hours a day by a single user, on a single computer. If the drive fails, only one user and one computer are affected, thus the reliability requirements are lower. Enterprise drives, however, are expected to be part of a server system and/or RAID attached to a server. This hard disk would be expected to serve many clients on the network, and its failure would affect both the operators and the operating systems of the computers attached to the network. The consequences of an enterprise drive failure are more widespread than those of a failure in a drive attached to a single computer.

Because of these different expectations, the mechanical parts of enterprise-class drives are manufactured to much tighter tolerances than those of their desktop cousins. Even the actuators and motors are designed for much heavier use because the drive is expected to work 24/7.

The higher expectations of enterprise-class drives also affect their cost. A desktop-class drive is priced lower to keep it accessible to the average computer user, whereas an enterprise-class drive’s cost is commensurate with its reliability. The mean time between failures (MTBF) is about 700,000 hours for desktop drives and about 1.2 million hours for enterprise drives, an extra factor in their cost.
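MTBF figures are easier to compare when converted to a yearly failure probability. The sketch below does that conversion under two assumptions that are not from the article: a constant failure rate (the exponential model usually implied by MTBF) and a drive that runs around the clock. The function name is made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def annualized_failure_rate(mtbf_hours: float,
                            powered_hours: float = HOURS_PER_YEAR) -> float:
    """Probability that a drive fails within one year of use.

    Assumes a constant failure rate (exponential lifetime model).
    """
    return 1 - math.exp(-powered_hours / mtbf_hours)


for label, mtbf in [("desktop", 700_000), ("enterprise", 1_200_000)]:
    print(f"{label}: {annualized_failure_rate(mtbf):.2%} per year")
```

On these assumptions, the desktop figure corresponds to roughly a 1.2 percent chance of failure per year and the enterprise figure to roughly 0.7 percent. Real-world failure rates depend heavily on workload and environment, so treat these as a way to compare spec sheets, not as predictions.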

Another difference is how the ECC is used. In a desktop-class hard disk, the ECC starts and stops with the checksum written to the disk, but the enterprise hard disk uses ECC all the way through the complete hard disk subsystem. This means an enterprise disk checks the integrity of the data from the moment it enters the drive from the host and as it passes, in either direction, through the drive’s internal memory or buffer. This ensures a higher degree of data reliability.

Additionally, each handles bad sectors differently. Because a desktop drive is expected to be the only drive, and to hold the only copy of the data being retrieved, it will make multiple attempts to read a bad sector when it encounters one. This takes time, which the enterprise drive does not have with its many clients waiting for their data. The enterprise-class drive will mark that sector as bad the first time around, knowing that there is a redundant copy of the data on the RAID or mirrored storage system. It will not try to reread the sector, thus speeding up performance, and the system then rewrites that sector’s data from the redundant copy onto another sector.

Conclusion

This is just an introduction to storage and redundancy in this digital age. Whether you have an entire IT department or you are the IT department, you should be familiar with this essential aspect of modern broadcasting.

Author’s note

This marks my 72nd “Transition to Digital” newsletter and my final one. I leave hoping that I have helped to educate the broadcast engineering community on a range of topics, many of which I learned more about as I wrote. Thanks for reading.