Disk reliability has long been an issue for broadcasters. When disk failures occur in play-to-air servers — the most critical part of the on-air infrastructure — they can mean going “black to air” for millions of viewers. Even with the redundancy of mirrored or parity-protected configurations, broadcast engineers must still wait for disks to rebuild while hoping the rest of the system stays intact. Now, in a new study from Carnegie Mellon University, “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?,” researchers Bianca Schroeder and Garth A. Gibson confirm what broadcasters have suspected: Disk drives fail at rates six times higher than those reported by vendors.
Things will only get worse with HD. Moving from 15Mb/s SD to 50Mb/s HD, the same TV show will take up more than three times the storage capacity. Three times as many disks makes it roughly three times as likely that a disk-based server will suffer a failure.
To reduce failure rates, vendors have tried replacing disks with solid-state devices, which have no moving parts — the source of almost all disk failures. Flash is an obvious candidate because, like disk, it retains data after the power goes off. Flash also consumes less power than disk, produces less heat and is quiet by comparison.
This is the logic behind flash-assisted storage technology: server clustering in which flash modules replace the disk drives in all play-to-air servers. Because data is striped across all nodes in a cluster and all modules in a node, both I/O throughput and reliability are extremely high. Managed reads and writes optimize performance and avoid the write hot spots that can exhaust flash prematurely. Flash-assisted storage is also economical because it pairs a disk-based nearline cluster for ingest with a flash-based play-to-air cluster for high availability.
Disk's inherent risk
Broadcasters' demands for a better way to store content follow years of dissatisfaction with what many see as high disk failure rates. The Carnegie Mellon University study has confirmed that disks do suffer high failure rates.
Using vendor return merchandise authorization (RMA) data, the researchers measured actual disk failure rates in the field against two key benchmarks: annual failure rate (AFR) and mean time to failure (MTTF). Among the findings:
- Disk AFRs typically exceed 1 percent, with 2 to 4 percent common and up to 13 percent on some systems.
- Field replacement rates of systems are significantly higher than expected based on data sheet MTTFs (by two to 10 times for drives less than five years old).
- The rate at which disk drives fail rises steadily throughout their lifetimes, starting as early as the second year of operation, rather than holding flat as is widely expected.
By comparison, the AFR for flash drives (also from RMA data) is just .04 percent — an improvement of up to 100 times. Broadcasters should therefore expect to replace flash drives far less often than disks. Because the total risk of a storage failure is roughly the per-drive failure risk multiplied by the number of drives, it would take about 100 flash drives to match the annual risk of a single disk drive.
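The multiplication above can be sketched in a few lines. This is an illustration only (it assumes independent drive failures and uses the 4 percent figure from the upper end of the study's "2 to 4 percent common" range; the function name is invented for this example):

```python
# Illustrative only: how per-drive AFR scales with drive count,
# assuming independent failures.

def combined_afr(per_drive_afr: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails in a year."""
    return 1.0 - (1.0 - per_drive_afr) ** n_drives

disk_afr = 0.04     # 4 percent: upper end of the "2 to 4 percent common" range
flash_afr = 0.0004  # .04 percent: the flash figure from RMA data

# One disk carries roughly the same annual risk as 100 flash drives:
print(f"1 disk drive:     {combined_afr(disk_afr, 1):.2%}")
print(f"100 flash drives: {combined_afr(flash_afr, 100):.2%}")
```

For small per-drive rates the exact probability is close to the simple product (number of drives times per-drive AFR), which is why the 100-to-1 rule of thumb holds.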
Some risk-mitigation strategies, such as disk mirroring (RAID 1) and parity-based rebuilding (RAID 5/6), address the problem by adding more disks; they do not reduce each disk's underlying risk. One consequence is that a rebuilding server runs in a “degraded state,” in which a second disk failure may take out the entire on-air operation. Another is that the rebuild itself may fail, because a RAID rebuild assumes 100 percent data integrity on every remaining “good” disk in the server — not always a safe assumption. And even when these strategies do work, there is no compensating value (such as faster I/O) to offset the cost of the extra disks.
High failure rates pose a significant challenge in the SD environment, and even more so as broadcasters move to HD. HD requires three to 10 times as many disk drives as SD to provide HD bandwidth and store the same number of program hours. The likelihood of a play-to-air storage failure will increase in proportion to the number of drives added and the age of the drives. If a single server has an AFR of 25 percent, then a mirrored configuration's AFR is slightly less than 1 percent, assuming a 48-hour mean time to repair. To achieve the five-nines availability (99.999 percent) broadcasters expect, the AFR must be less than .25 percent — a near impossibility in light of recent research.
Flash's inherent risk
The advantage of flash drives is that they start with a much lower AFR per unit: less than .04 percent. Table 1 shows how that low flash failure rate translates to a low annual cluster failure rate (.23 percent), even without RAID 5 protection at the chassis level. Clusters ranging from three to nine nodes show five-nines reliability, with nine nodes being the worst case. A nine-node cluster holds about 10TB of data on 24 flash memory cards per node, each card holding 64GB.
The total failure rate of the cluster equals the sum of the subsystem failure rates; each subsystem's failure rate equals the failure rate of one component multiplied by the number of those components (less any redundant spares). The subsystems within each node are the motherboard (one), GigE I/O cards (three), flash memory drives (24), power supplies (1 + 1 redundant) and fans (5 + 1 redundant). Component failure data is based on either RMA data or MTTF reported by vendors.
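A minimal sketch of that rollup, using the conversion stated with Table 1 (failure rate = 1/MTBF; AFR = 8760 hours × failure rate). Only the non-redundant subsystems are shown, since the redundant spares are handled separately; function names are illustrative:

```python
# Sketch of the subsystem AFR rollup, per the formulas accompanying Table 1.

HOURS_PER_YEAR = 8760

def afr(mtbf_hours: float) -> float:
    """Annual failure rate of one component, as a fraction of 1."""
    return HOURS_PER_YEAR / mtbf_hours

def subsystem_afr(mtbf_hours: float, count: int) -> float:
    """AFR of a subsystem of `count` identical, non-redundant components."""
    return afr(mtbf_hours) * count

# Values from Table 1:
print(f"Motherboard (1):   {subsystem_afr(105_000, 1):.2%}")     # ~8.34 percent
print(f"Flash drives (24): {subsystem_afr(21_900_000, 24):.2%}") # ~.96 percent
```

Summing the subsystem rates (with the redundant power-supply and fan pairs contributing almost nothing) yields the single-node figures in the table.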
In an example of a flash-equipped broadcast server, availability of each server node is 99.873 percent. However, availability of the worst-case, nine-node cluster as a whole is still 99.999 percent — a difference that is directly attributable to the use of flash memory. (See Table 1.)
Clustering for flash-assisted storage
Flash-assisted storage is based on clustering, a technology proven in TV operations since 1996, with flash memory drives replacing disks. The key insight is to carefully manage reads and writes to each flash drive so its performance is optimized. Media content data are striped contiguously in two ways: across all drives in a server and across all servers arrayed as nodes in a cluster. All content has equal and parallel access to I/O, so high ingest bandwidth is achieved while maintaining full playout performance.
With up to 24 64GB flash memory drives in each node, a nine-node flash cluster can scale to more than 500 hours of HD content at 35Mb/s XDCAM video (or 50Mb/s video and audio combined).
Single-copy flash memory storage is shared among all the nodes in a cluster with N + 1 redundancy, which avoids costly mirroring. And because all data, including parity data, is evenly distributed, no dedicated parity drives are needed. Service continuity is protected even if a node fails, or during in-service maintenance, hot-swapping of drives, system upgrades or installation of additional base nodes.
Clustering also solves the problem of flash write hot spots, in which writes repeatedly hit the same flash memory cells, causing them to degrade quickly. Clustering eliminates write hot spots by evenly load-balancing writes across all flash modules, so the same memory location may only see a few writes per day, if that. This extends the memory's lifetime to more than 10 years, versus a typical five years for hard disk drives.
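The load-balancing idea can be shown with a toy round-robin placement. This is a sketch of the concept only, not the vendor's actual algorithm; the node and module counts are taken from the nine-node example above:

```python
# Toy sketch of hot-spot avoidance: stripes are placed round-robin across
# every module in every node, so consecutive writes never pile up on one
# flash location.
from collections import Counter

NODES = 9              # nodes in the worst-case cluster described above
MODULES_PER_NODE = 24  # flash drives per node

def place_stripe(stripe_id: int) -> tuple:
    """Map a stripe number to a (node, module) slot, round-robin."""
    slot = stripe_id % (NODES * MODULES_PER_NODE)
    return slot // MODULES_PER_NODE, slot % MODULES_PER_NODE

# Over three full passes (648 stripes), every one of the 216 modules
# receives exactly three writes — no module is a hot spot.
writes = Counter(place_stripe(i) for i in range(3 * NODES * MODULES_PER_NODE))
assert len(writes) == NODES * MODULES_PER_NODE
assert all(count == 3 for count in writes.values())
```

Because each cell absorbs only its even share of the write load, wear is spread across the whole cluster rather than concentrated in one module.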
Clustered performance and reliability also come with the greater versatility inherent in solid-state devices. Flash, for example, lends itself to highly granular deployments. So, instead of one large, central on-air storage cluster serving all outputs, several smaller clusters can each serve a few channels, further reducing overall risk exposure. These clusters can even support disaster recovery sites at uplink or transmitter locations where day-to-air content is refreshed daily via WAN or satellite. Flash also tolerates environments considered too harsh for disk, so it is more reliable when broadcasting from a moving truck, car, helicopter, airplane or ship. That versatility is further enhanced by flash memory's “green” value, a 10-to-1 advantage in power efficiency over disks.
Broadcast storage architecture
Flash-assisted storage systems complement how most broadcasters already allocate content within their infrastructures: partitioned between nearline and play-to-air storage. The key difference between these storage tiers is that nearline is optimized to store large amounts of content at relatively low cost, while play-to-air is optimized for high bandwidth and high reliability.
In a typical broadcast workflow, content is ingested into low-cost, SATA-based nearline storage. Close to airtime (typically within 24 hours), the automation system moves scheduled content from nearline to play-to-air storage. The link between the two must be sufficiently fast to transfer large volumes at speeds greater than real time. Once the content is played and no longer needed, the automation system deletes it from play-to-air storage.
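The workflow above can be sketched as a pair of operations on the two tiers. This is a hypothetical illustration only — the names are invented for this example and do not correspond to any real automation API:

```python
# Hypothetical sketch of the two-tier workflow: nearline holds everything;
# play-to-air holds only content scheduled within the next 24 hours, and
# clips are deleted from play-to-air once aired.

nearline = {"show_101", "show_102", "promo_7"}   # ingest tier (low-cost SATA)
play_to_air = set()                              # high-reliability tier

def stage_for_air(clip: str) -> None:
    """Close to airtime: copy the clip from nearline, faster than real time."""
    if clip in nearline:
        play_to_air.add(clip)

def purge_after_playout(clip: str) -> None:
    """Played and no longer needed: free play-to-air capacity."""
    play_to_air.discard(clip)

stage_for_air("show_101")
assert "show_101" in play_to_air
purge_after_playout("show_101")
assert "show_101" not in play_to_air  # gone from play-to-air...
assert "show_101" in nearline         # ...but still archived on nearline
```

Note that staging copies rather than moves the clip, so the nearline tier retains the archive copy and play-to-air capacity stays small.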
In a flash-assisted storage architecture, the majority of content is stored on a nearline storage server. (See Figure 1.) Within 24 hours of broadcast, scheduled content is transferred to a flash-based solution over very high-bandwidth I/O paths between nearline and flash storage. In this environment, no disk (and its inherently higher risk) can ever impact the broadcaster's play-to-air performance. Extremely high play-to-air reliability and bandwidth are therefore achieved without sacrificing economy. At the same time, broadcasters benefit from all the advantages they would expect from a flash-based, solid-state solution, such as lower power consumption, less noise, less heat and lower maintenance costs. Perhaps best of all, broadcasters will no longer be delayed by RAID disk rebuilds before they know whether their channel will stay on the air.
Stephane Jauroyou is vice president of broadcast sales and marketing for SeaChange International.
Major subsystem components   MTBF (hours)   AFR (percent)   Number of components   Subsystem AFR (percent)   Subsystem MTBF (hours)
Motherboard                  105,000        8.34            1                      8.34                      105,000
I/O cards                    408,000        2.5             3                      6.44                      408,000
Flash memory                 21,900,000     .04             24                     .96                       912,235
Power supply                 150,000        5.84            2 (1+1)                .002                      468,750,000
Fans                         170,000        5.15            6 (5+1)                .01                       120,416,667

Single-node AFR (percent): 23.14
Single-node MTBF (hours): 37,850
Single-node availability (percent): 99.873
Nine-node cluster AFR (percent): .23
Nine-node cluster MTBF (hours): 3,730,871
Cluster availability (percent): 99.999
Failure rate = 1/MTBF
AFR = (number of hours in a year) × (failure rate), where the number of hours in a year = 8760
Table 1. The reliability of the worst-case, nine-node cluster as a whole is still 99.999 percent — a difference that is directly attributable to the use of flash memory.