Understanding Reliability In Media Server Systems

Early adopters of video servers faced the question of just how much risk they could or would tolerate in their server systems, given the revenue impact on their operations should a failure occur. Companies that deployed video servers as cache engines for existing robotic tape playback systems perceived only marginal risk, their dependence resting on a single server with perhaps a RAID-storage system holding only a few hours of content in total. Braver souls who were abandoning the tape concept entirely might have placed much higher emphasis on protection, often including a dual set of servers configured in mirrored operation, each with RAID-3 or RAID-5 storage arrays to protect each server's data.

INCREASING COMPLEXITY

Video server implementation has recently become far more sophisticated, extending well beyond rudimentary commercial playout devices. Servers are now deployed in every corner of a television station's facility, stretching into news operations, production, program delivery and multichannel playout, and in some facilities supporting an entirely "tapeless" operation. Server architectures must now build in system redundancy with automatic changeover and tertiary protection, secondary browse servers, proxies, remote replication of data, secondary caches for playout or ingest, and near-line/offline data archives. The complexities in these systems have grown to the point where the previous standard metrics for reliability and availability have become meaningless.

The operation of the future will expect a high degree of total system reliability, with near five-nines uptime (99.999 percent). Given that requirement, how do you evaluate reliability figures such as mean time between failures (MTBF) and other industry-standard figures of merit? How do you apply sets of individual performance specifications to an entire system? Honestly, it's not an easy undertaking.

Failure rates, availability and MTBF are all measures that manufacturers use to compare their particular device or product to another. Fortunately, in the data storage and access domain, numerous methods are available to guard against data loss, downtime and reduced availability. These techniques are being extended to the video media domain, but the complications and interactions of secondary products and independent system influences make the overall reliability equation difficult to solve.

In manufacturing, statistical predictions are often used to evaluate and specify the performance of devices, especially devices with overall life expectancies in the hundreds of thousands of hours of operation. Statistical methods employ metrics that consider sample populations of components to derive an average for the overall population. For storage devices such as memory and magnetic hard drives, the populations can be considered fairly large because tens of thousands of these devices are produced regularly throughout the product life cycle. These numbers become ambiguous, however, once the devices are embedded inside subsystems and married to larger components supplied by various other vendors.

Influences outside the memory or storage device itself often have a greater impact on overall system performance than any given device or set of devices. This becomes evident when several disk drives are assembled into an array pack, several arrays make up a single storage volume, and these volumes are in turn addressed by several channels of server I/O. Here, the chance of a system failure increases, even though the individual components may have very high reliability figures, because every element in the chain must keep working.
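To make that chain effect concrete, here is a minimal sketch in Python using purely illustrative MTBF figures (none of them come from this article). It treats the storage path as a series chain: the probability that the whole path runs failure-free for a period is the product of the individual probabilities given by R(T) = exp(-T/MTBF).

```python
import math

def reliability(mtbf_hours: float, t_hours: float) -> float:
    """Probability of failure-free operation for time t,
    assuming a constant failure rate: R(T) = exp(-T/MTBF)."""
    return math.exp(-t_hours / mtbf_hours)

# Hypothetical MTBF figures (hours) for the parts in one storage path.
parts = {
    "disk drive": 500_000,
    "array controller": 300_000,
    "server I/O channel": 200_000,
}

t = 43_800  # five years of continuous operation

individual = {name: reliability(mtbf, t) for name, mtbf in parts.items()}
system = math.prod(individual.values())  # series chain: every part must survive

for name, r in individual.items():
    print(f"{name:>20}: {r:.1%}")
print(f"{'series system':>20}: {system:.1%}")
```

Even though each assumed part is individually quite reliable over five years, the series figure printed at the end is noticeably lower than any single one of them.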

Adding redundancy improves overall system reliability. For data storage subsystems, RAID (with the "R" standing for redundant) has become so prevalent that the problem of data storage reliability has been reduced to a near nonissue. Redundancy in disk drive arrays has been around long enough that most understand at least the concepts, which include variations on mirroring, parity and striping to achieve protection and increased bandwidth and/or throughput. Other forms of redundancy in nonstorage subcomponents, such as dual Fibre Channel switches and duplication of entire server channels or platforms, help keep overall reliability high enough that any single component failure has little to no impact on system performance.
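As a companion to the series-chain sketch above, the following example shows why mirroring helps: a mirrored pair (ignoring in-service repair, which in practice improves the figure further) only loses data when both copies fail within the period. The 500,000-hour MTBF is again an assumed, illustrative number.

```python
import math

def reliability(mtbf_hours: float, t_hours: float) -> float:
    """R(T) = exp(-T/MTBF) for a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

t = 43_800                              # five years of continuous operation
r_single = reliability(500_000, t)      # one drive, assumed 500,000-hour MTBF

# A mirrored pair (no repair assumed) survives unless BOTH copies fail.
r_mirrored = 1 - (1 - r_single) ** 2

print(f"single drive : {r_single:.2%}")
print(f"mirrored pair: {r_mirrored:.2%}")
```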

Given the numerous techniques for redundancy and the many methods for minimizing single points of failure (SPoF), the remaining figure assigned to system performance becomes reliability. Generally, there is little justification, and seldom any explanation, for the reliability figures specified for a system; instead, they yield to more respected and better understood statements on reliability such as MTBF, mean time to data loss (MTDL) or mean time to data availability (MTDA).

THE LONE DISK

Reliability is a major issue in mission-critical operations. When a device stands by itself, with no redundancy, the MTBF becomes an isolated metric that, if reached, places considerable risk on both data availability and data reliability. When that lone disk fails, it immediately affects the system and the user. There is no question that corrective action must be taken, and the consequences become evident should there be no other protective or backup measures.

For data storage subsystems, the methods of "calculating" reliability depend upon some fundamental definitions. The RAID Advisory Board (RAB), in its 1993 RAIDBook, developed some of these definitions, which have remained the metrics for reliability predictions in drive arrays (RAID disk arrays as compared with nonredundant disk storage) for nearly a decade. Yet these figures still depend upon large populations of devices and have little overall bearing once the array is deployed inside a larger system with numerous subcomponents and a variety of control architectures and software. Still, it is useful to understand what the definitions are and how a given manufacturer or integrator applies them to the overall system.

Data availability deals with the ability for an application to correctly access data in a timely manner, while data reliability focuses on the preservation of data's correctness. In other words, an end user hopes that correct data will not only be available when they need it, but that the data will also be reliable (preserved correctly).

Fig. 1: Probability of failure-free operation vs. years in service for MTBF values of 250,000 and 100,000 hours
The MTBF figure is relevant, as stated previously, when large populations (i.e., sample sizes) are considered. MTBF is a measure of how reliable a product is, usually given in units of hours; the higher the MTBF, the more reliable the product. Reliability models for electronic products assume that, over the useful operating life of a component, there is a constant failure rate that follows an exponential law of distribution. The MTBF figure for a product can be derived in various ways: lab test data, actual field failure data or prediction models. It is calculated as:

MTBF = 1/(sum of all the part failure rates)

The probability that the product will operate for some time (T) without failure is then given by:

R(T) = exp(-T/MTBF)

Fig. 1 shows two series of products: Series 1 with an MTBF of 250,000 hours and Series 2 with an MTBF of 100,000 hours. For the calculations, the operating period (the years-in-service axis) is converted to hours (T); thus, for an MTBF of 250,000 hours and an operating period of interest of five years (43,800 hours):

R = exp(-43800/250000) = 0.839289 = 83.9 percent

The chart indicates that at year 6, there is an 81 percent probability that the product with an MTBF of 250,000 hours will operate without a failure, or, equivalently, that 81 percent of the units in the field will still be functioning at the six-year point. When the MTBF is 100,000 hours, that probability falls to 59.1 percent.
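The figures quoted above can be checked directly from R(T) = exp(-T/MTBF). The short Python sketch below assumes continuous operation at 8,760 hours per year, matching the 43,800-hour conversion used for five years.

```python
import math

def r(mtbf_hours: float, years: float) -> float:
    """R(T) = exp(-T/MTBF), with T converted from years of continuous service."""
    t_hours = years * 8_760
    return math.exp(-t_hours / mtbf_hours)

print(f"MTBF 250,000 h, 5 years: {r(250_000, 5):.1%}")   # ~83.9 percent
print(f"MTBF 250,000 h, 6 years: {r(250_000, 6):.1%}")   # ~81.0 percent
print(f"MTBF 100,000 h, 6 years: {r(100_000, 6):.1%}")   # ~59.1 percent
```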

An alternative method states that MTBF is the sum of the mean time to failure (MTTF) and the mean time to repair (MTTR). The MTTF for a component may be obtained by analyzing historical data or by using standard prediction methods (such as MIL-217, Telcordia TR-332 Version 6, RDF 2000 and NSWC). For a constant failure rate, the MTTF is equal to 1/lambda, where lambda is the failure rate of the component.
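A brief illustration of those relationships follows, using an assumed failure rate and repair time rather than figures from any particular product; the availability expression at the end is the commonly used MTTF/(MTTF + MTTR), included only to tie the repair time back to the uptime percentages discussed earlier.

```python
# Illustrative figures only: assume a constant failure rate of
# 2 failures per million operating hours and a 4-hour average repair.
lam = 2e-6                    # failures per hour (lambda)
mttf = 1 / lam                # mean time to failure = 500,000 hours
mttr = 4.0                    # mean time to repair, hours
mtbf = mttf + mttr            # MTBF as the sum of MTTF and MTTR

availability = mttf / (mttf + mttr)   # steady-state availability

print(f"MTTF: {mttf:,.0f} h   MTTR: {mttr} h   MTBF: {mtbf:,.0f} h")
print(f"Availability: {availability:.6%}")
```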

Care must be given to weighing the validity of these figures once the components are placed into a system. At that point, as alluded to previously, additional influences and other factors associated with the individual performance of subsystems carry more weight than individual MTBF figures of disk drives or arrays.

When considering an entire system, one must understand the number of operations or cycles that are sampled. Furthermore, when computing the MTBF, any measure of operating life may be used, such as time, cycles, kilometers, or events. Manufacturers and systems vendors may also include other figures of merit, such as MTDL and MTDA, in the predictions of reliability.

Of course, some things do break before they reach the MTBF point, just as a seemingly identical device will work long after it has passed the MTBF point. Your focus on these numbers should be in direct proportion to your organization's dependence on the functioning of the unit. The loss of a commercial-insertion server can quickly cost more in lost revenue than the price of the server, while the loss of a single drive in an eight-drive RAID server may have no effect on the system's overall operation, depending on the type of RAID system being used.

The bottom line is that video server system components are approaching 99.999 percent reliability on the hardware side, especially when standing alone. Once these components and subsystems are integrated into packages that are addressed by multiple outside services, such as caches, automation systems, archive fetches and editing controllers, the complexion changes; hence the need, and in many cases the requirement, for multiple paths of redundancy. Don't shortchange your revenue stream by trimming components from the system and, in turn, reducing that MTBF.

Karl Paulsen

Karl Paulsen is the CTO for Diversified, the global leader in media-related technologies, innovations and systems integration. Karl provides subject matter expertise and innovative visionary futures related to advanced networking and IP technologies, workflow design and assessment, media asset management, and storage technologies. Karl is a SMPTE Life Fellow, a SBE Life Member & Certified Professional Broadcast Engineer, and the author of hundreds of articles focused on industry advances in cloud, storage, workflow, and media technologies. For over 25 years he has continually featured topics in TV Tech magazine, penning the magazine's Storage and Media Technologies and Cloudspotter's Journal columns.