Tools and Practices for Managing Media Storage

The reliability of media server systems and their storage networks continues to be a growing factor in determining which systems are employed and how. As the systemization of these components becomes more integrated into operational workflow, the complexity increases, and the dependency upon them grows. In turn, the associated storage systems command an even higher reliability factor.

Consequently, managing these storage networks becomes more important, and the tools needed to make critical decisions become all the more necessary.

Last month, this column described the issues surrounding the archive manager segment of an overall media server system; it seems appropriate to continue the discussion, so this month the topic will focus on the management of the storage network component.

Until the last few years, management of data on a video server was governed primarily by resident applications. When storage totals were less than 100 hours, this was not a difficult task. However, as newsroom editorial systems were required to handle a heavier workload, the process of managing storage began to require more complex system-level controls. Optimizing data, disk management, bandwidth, throughput and availability for the overall system became the goal.

Borrowing again from the computer world, video server manufacturers looked to enterprise data- and transaction-server systems to address very large-scale storage arrays, including storage area networks, the fabrics associated with them and distribution over wide area networks.

Short- and long-term management of these systems requires tools to support the administrators and technicians tasked with the upkeep of these ever-growing networks. The tools must be able to identify, diagnose, repair and, ideally, prevent critical faults. They can also help establish the parameters within which the enterprise can successfully operate.

GHOSTS IN THE MACHINE

One group of fundamental issues that can plague any system includes catastrophic, intermittent or pathological failures; hardware and software upgrades; and pure "bugs" in the system.

A catastrophic failure is sudden and unpredictable, but it is for the most part obvious and usually straightforward to deal with. The failure may be small in scale, e.g., a power breaker fails or a cooling fan stops. Or the failure can be monumental, such as an entire controller failing with no backup, in which case all data retrieval ceases until the controller is physically replaced.

Redundancy still seems to be the best insurance policy for avoiding or mitigating a catastrophic failure, but it's not without significant cost. Disk mirroring, redundant arrays or subcomponents, dual system drives, dual power supplies and fans, etc., are well known measures that help protect the revenue-generating machine.

However, redundancy is only a Band-aid that buys the operator time to deal with the resolution. Redundancy can also be valueless if the backup system has not been exercised, and it, too, fails when called to duty.

Preventative maintenance, including trend analysis and tracking, will help mitigate risk. An aggressive budget for spares and personnel training reduces it further, although the effectiveness of these measures can only be assessed during an actual failover and the often-overlooked failback process.

Intermittent failure is by far the most difficult to deal with. Identifying the conditions that cause the problem, finding an opportunity to "test" in a real-time operating environment, and having an expert analyst available to experience (let alone diagnose) the problem are very real difficulties that become unmanageable during a crisis. Intermittent problems are generally viewed as bugs in the software or firmware.

Because bugs are often hard to reproduce, or the conditions in which they arise cannot easily be duplicated, bugs are usually the last thing to be resolved. All the testing in the lab or factory can never duplicate the myriad real-life applications created by end users.

Testing for pathological failure involves anticipating highly improbable conditions. Pathological test systems have been used in exercising serial digital video (SMPTE 259M or 292M) for years. For video servers and associated storage systems, the accepted method for basic pathological testing is to take the system limitations to their extremes. This could mean, for example, all the input encoders running at full real-time ingest and all the decoders playing back full resolution simultaneously, while data is FTP'd between server and backup systems and the shortest possible back-to-back segment play-outs are called for. Then, as a final step, one of the drives is physically yanked from the storage array, forcing a data rebuild once its replacement is installed.

This sort of example will tax the availability of the storage system and carries with it other possible connotations. In its generic sense, "availability" can be defined as the binary metric that describes whether a system is up or down at any given time. Statistically, the extension of this definition is used in computing the percentage of "up time" on a system. Traditionally, this is how percent-availability is defined--for example, when a system has 99.999 percent (i.e., five-9s) availability. However, reliability and availability are not the same, although they are generally perceived as the same thing. Reliability is a long-term metric; availability is a continuous, almost instantaneous metric. A system may be highly reliable because it is seldom used, while a system may be highly available up until the time it becomes unreliable.
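The percent-availability figure reduces to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes a 365-day year. Under that assumption, five-9s availability leaves roughly 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def percent_availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Return availability as a percentage of the total observed time."""
    total = uptime_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("observation window must be positive")
    return 100.0 * uptime_minutes / total


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime permitted by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.999):.2f} minutes per year")
```

Note that this captures availability only. It says nothing about reliability, which would require failure history over a much longer period.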

READING THE TEA LEAVES

Storage systems administrators seldom have the time to evaluate potentially harmful conditions that might affect availability or reliability. As an aid to the administrator, storage vendors are providing data either to the operating or control systems of the server itself or to an external monitoring platform (sometimes through SNMP, often through dedicated or resident analysis routines). This data is accumulated into groups designed to watch the performance of the system at all levels. Often referred to as "trending data," this information is valuable in determining when a drive or other subsystem is nearing the fault point. Trending data can be given thresholds that act as statistical boundaries: when one is crossed, a flag is raised and preventative action can be taken before a failure brings the system down or reduces performance beyond an acceptable level.
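As a rough illustration of thresholds on trending data, the sketch below compares the recent average of a monitored counter against warning and critical limits. The names (`TrendThreshold`, `evaluate_trend`), the window size, the limits and the sample data are all assumptions made for illustration; real monitoring platforms define their own metrics and boundaries.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class TrendThreshold:
    """Hypothetical limits for one monitored metric, per sample interval."""
    warning: float
    critical: float


def evaluate_trend(samples: Sequence[float], threshold: TrendThreshold,
                   window: int = 5) -> str:
    """Classify the average of the most recent `window` samples."""
    if not samples:
        return "no-data"
    level = mean(samples[-window:])
    if level >= threshold.critical:
        return "critical"
    if level >= threshold.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Made-up daily counts of correctable read errors on one drive.
    daily_errors = [0, 1, 0, 2, 3, 5, 8, 12]
    limits = TrendThreshold(warning=4.0, critical=10.0)
    print(evaluate_trend(daily_errors, limits))  # prints "warning"
```

Averaging over a window rather than reacting to a single sample is one simple way to keep an isolated spike from raising a flag while still catching a steady climb.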

Trending data for ascertaining whether the storage system meets its benchmarks is organized into fault categories. These include correctable media errors on reads and writes, which indicate disk sectors starting to go bad; uncorrectable media errors on reads and writes, which identify unrecoverable or damaged disk sectors; hardware errors on any SCSI command, which identify firmware or mechanical errors; parity errors at the SCSI command level, which indicate possible SCSI bus problems; power failures, such as increased current or voltage changes at all levels; and disk hangs, which point to firmware bugs or failures both during and between SCSI commands (these appear as SCSI timeouts to the controller). The majority of these items would seldom be reviewed on a continuous or daily basis. However, with trending thresholds and other indicators fully established, the reliability of the overall system is improved with only a modest level of active administration.
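These fault categories could be tallied per drive before any thresholds are applied. The sketch below assumes a hypothetical event format of (drive identifier, category) pairs; the category names follow the paragraph above, but nothing here reflects a particular vendor's log format.

```python
from collections import Counter, defaultdict
from enum import Enum
from typing import Dict, Iterable, Tuple


class FaultCategory(Enum):
    CORRECTABLE_MEDIA_ERROR = "correctable media error"
    UNCORRECTABLE_MEDIA_ERROR = "uncorrectable media error"
    HARDWARE_ERROR = "hardware error"
    PARITY_ERROR = "parity error"
    POWER_FAULT = "power fault"
    DISK_HANG = "disk hang (SCSI timeout)"


def tally_faults(events: Iterable[Tuple[str, FaultCategory]]
                 ) -> Dict[str, Counter]:
    """Count fault events per drive, broken down by category."""
    totals: Dict[str, Counter] = defaultdict(Counter)
    for drive_id, category in events:
        totals[drive_id][category] += 1
    return dict(totals)


if __name__ == "__main__":
    # Made-up events for illustration only.
    events = [
        ("drive-03", FaultCategory.CORRECTABLE_MEDIA_ERROR),
        ("drive-03", FaultCategory.CORRECTABLE_MEDIA_ERROR),
        ("drive-07", FaultCategory.DISK_HANG),
    ]
    for drive, counts in sorted(tally_faults(events).items()):
        for category, count in counts.items():
            print(f"{drive}: {category.value} x{count}")
```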

Many analysis and monitoring systems are accessed through a Web browser interface, which extends the monitoring platform well beyond the physically hard-wired connections of half a decade ago. Through HTTP, XML or JavaScript, the trending and status information can be made readily available to a variety of third-party external monitoring systems.

NAVIGATING AN UPGRADE

The riskiest task in managing the overall storage network is the systems upgrade. Both hardware and software updates are potentially the most difficult of the administrator's tasks. When performing an upgrade or incremental fix, be mindful of the influence and interplay of all associated systems. Often, editorial components, MOS-enabled third-party systems, facility automation systems and dozens of others can be affected. Furthermore, if operating system parameters are changed, new anomalies may be introduced that must be worked through at all levels of the system. Whenever contemplating an update or upgrade, be sure all associated components remain compatible. Check with each secondary and third-party vendor to ascertain whether they are aware of the new version and whether they have actually qualified it with other components. Keeping hardware and software versions synchronized is a challenge that complicates system performance, reliability and availability.
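One way to picture this compatibility check is as a table of which server software versions each associated vendor has qualified. The sketch below is hypothetical throughout: the component names, the version strings and the `qualified` table are invented for illustration and would in practice come from each vendor's own qualification statements.

```python
from typing import Dict, List, Set


def unqualified_components(target_version: str,
                           qualified: Dict[str, Set[str]]) -> List[str]:
    """Return the components that have not qualified the target version."""
    return sorted(
        name for name, versions in qualified.items()
        if target_version not in versions
    )


if __name__ == "__main__":
    # Invented example data: component -> server versions it has qualified.
    qualified = {
        "newsroom-editor": {"4.1", "4.2"},
        "automation": {"4.1"},
        "archive-manager": {"4.1", "4.2"},
    }
    blockers = unqualified_components("4.2", qualified)
    if blockers:
        print("Hold upgrade; not yet qualified:", ", ".join(blockers))
    else:
        print("All associated components have qualified this version.")
```

With the invented data above, the automation system has not qualified version 4.2, so the check reports it as a reason to hold the upgrade.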

In conclusion, for years the broadcast and media technology industries have been preparing current and new installations for the inevitable near lights-out or hands-free operating environment. In support of this, major vendors of broadcast terminal equipment, on-screen displays and video servers have offered tools to manage their systems on a tiered basis--e.g., initial fault identification, elementary diagnostics, long-term or deep analysis with trending, and in some cases, even automated failover or corrective action. As the entire broadcast chain moves further from baseband video toward a file-based architecture, storage management--coupled with training and simulation--will become a part of this tier-level service practice, especially when integrated into the centralization of both surveillance and operations.