Robotic storage libraries

Designing and integrating an RSL into a complex storage subsystem involves a multitude of issues and tradeoffs.

Television technology is becoming more sophisticated and complex, and broadcasters’ resources are becoming more limited. As broadcasters deploy highly sophisticated media-storage systems, they need to fully understand, evaluate, optimize and manage the content they create and deliver.


CNBC uses the Grass Valley Profile Network Archive (PNA) system as part of its infrastructure. The PNA supports multiple Profile systems per archive with a RAID-protected archive database.

With storage devices, there is always a trade-off between access speed and the costs of the device and the physical media. Magnetic storage devices have rapidly decreased in price and increased in storage capacity, but they still cannot compete economically with tape-based systems. The capital cost for tape storage is less than 3 cents/MB, while that for high-end magnetic-disk arrays is about 30 cents/MB or higher. And that’s just the capital cost; it doesn’t include the cost of occasional maintenance of disk-storage components. (Maintenance on RAID controllers, for example, is reported to be about $7/MB.) And magnetic disks alone cannot satisfy the storage and delivery needs of broadcast applications.

Information-technology (IT) companies developed robotic storage libraries (RSLs) to minimize storage costs and help archive non-essential data files. Now, tape-based RSLs are rapidly gaining popularity with broadcasters because they can reduce the costs of administering and storing the numerous, large files that broadcasters generate when compressing high-resolution video and audio. If you want to take advantage of the low cost of tapes while providing reasonable performance in data-intensive, real-time applications, tape-based RSLs are the best choice.

HSM

In the IT world, data files migrate along a hierarchy of storage subsystems called hierarchical storage management (HSM). HSM provides an intelligent way to move files among storage devices. It ranks the devices in terms of cost per megabyte of storage, speed of storage and retrieval, and overall capacity limits. The HSM rules are tied to the frequency of data access and the archival process. HSM classifies data according to a three-tiered data hierarchy:

Tier One — consists of data that must always be readily accessible and resides on the primary server storage system or attached storage.

Tier Two — consists of data that needs to be accessed occasionally and/or serves as backup for tier one, and resides in a less-expensive data array.


Figure 1. An overall view of a broadcast data-storage system and the RSL’s place within it.

Tier Three — consists of archive data that is accessed infrequently and/or needs to be kept according to customer-defined business rules. Usually, this tier employs a tape or DVD RSL.
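As an illustration of these tiering rules, here is a minimal sketch in Python. The 7-day and 90-day thresholds and the `archive_rule` flag are hypothetical stand-ins for the customer-defined business rules; a real HSM application ranks devices and applies site-specific policies.

```python
def classify(days_since_access: float, archive_rule: bool = False) -> int:
    """Return the HSM tier (1-3) a file would occupy under these
    example rules, based on how recently the file was accessed."""
    if archive_rule or days_since_access >= 90:
        return 3   # deep archive on the tape/DVD RSL
    if days_since_access >= 7:
        return 2   # occasional access / backup array
    return 1       # readily accessible on primary storage
```

A file untouched for 10 days would land in tier two; one flagged by a business rule goes straight to tier three regardless of age.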

In contrast to this three-tiered IT HSM, broadcasters usually implement HSM as a two-tiered system consisting of a high-throughput, on-line RAID storage system on the video server(s) and an off-line tape-storage system used mostly as a temporary, low-cost storage system with some deep archive for news and long-form material. An RSL forms the lowest level of many large-scale data storage and backup systems. Figure 1 shows an overall view of a broadcast data-storage system, and the RSL’s place within it. Figure 2 shows a block diagram of such a system offered by Thomson Grass Valley.

In any distributed- or shared-server environment, the distribution of data among the storage devices can significantly affect the system’s overall performance. It’s important to manage the data properly so that it is stored in the appropriate tier.


Figure 2. The elements that make up the Profile Network Archive system offered by Thomson Grass Valley.

Inside an RSL

An RSL consists of three key resources: drives, robot arms, and media (tapes or optical disks). The robot arms perform three fundamental tasks: load, unload and move the media to and from the shelves inside the library. The drives perform five operations: load, eject, search, read (playback) and write (record).

To accurately evaluate and predict an RSL’s performance, you must understand its workflow and workload. Figure 3 shows the typical workflow of an RSL. Archiving or retrieving a file to or from an RSL requires a number of processing steps. A request from an external managing application puts a job or batch of jobs into the request queue. Each job corresponds to a request to load files from a particular tape. When a job reaches the head of the queue and a drive is free (typically there are multiple drives), a robot arm fetches the requested tape from a storage rack, transfers it to a local drive and mounts it on the drive.


Figure 3. An example of the typical workflow of an RSL.

The drive then positions the tape at the beginning of the requested file. For each file, the drive must seek to the start of the file and then read the file. The system transfers the file being read to the online storage subsystem, which is sometimes referred to as the disk cache since it behaves like a cache or extension of the RSL. Once the system has transferred all the blocks that make up the file, the drive rewinds the tape and the robot removes it and places it back on the storage rack.
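The serial steps just described can be summed into a simple service-time estimate for one retrieval. All timing parameters here are hypothetical examples, not measurements from any particular library:

```python
def retrieve_time(mount_s: float, seek_s: float, file_mb: float,
                  read_mb_s: float, rewind_s: float, dismount_s: float) -> float:
    """Estimate the total time to service one RSL retrieval job:
    fetch/mount the tape, seek to the file, read it to the disk
    cache, rewind, and return the tape to its shelf."""
    read_s = file_mb / read_mb_s            # transfer to on-line storage
    return mount_s + seek_s + read_s + rewind_s + dismount_s
```

With an assumed 30 s mount, 45 s seek, a 600 MB file read at 12 MB/s, a 40 s rewind and 15 s dismount, the job ties up a drive for three minutes even though the data transfer itself takes under a minute.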

As in any other storage subsystem, the file-system management application purges some data whenever storage occupancy exceeds a user-defined high-water mark. The purge selects files according to criteria such as user-defined rules, residence time on the system and file size. The purging process ensures that both the on-line storage and the RSL storage always have a certain amount of free space to process incoming requests.
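A minimal sketch of such a high-water-mark purge, assuming a longest-residence-first selection criterion and a low-water mark to stop at; both thresholds and the criterion are examples, since in practice they are user-defined:

```python
def purge(files, capacity_mb, high_water=0.90, low_water=0.75):
    """files: list of (name, size_mb, residence_days) tuples.
    If occupancy exceeds the high-water mark, evict the longest-
    resident files until occupancy falls below the low-water mark.
    Returns the names of the purged files, in eviction order."""
    used = sum(size for _, size, _ in files)
    purged = []
    if used <= high_water * capacity_mb:
        return purged                       # below the trigger, do nothing
    # Example criterion: longest residence time first.
    for name, size, _ in sorted(files, key=lambda f: -f[2]):
        if used <= low_water * capacity_mb:
            break
        used -= size
        purged.append(name)
    return purged
```

The gap between the two marks keeps the purge from running on every new request.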

Technical difficulties

Broadcasters need lots of bandwidth and storage, and their RSL performance and reliability requirements differ profoundly from those of the IT world. One challenging task in designing end-to-end broadcast storage subsystems is satisfying the real-time requirement of continuously moving a file object from the storage subsystem to the play-to-air server.

When an RSL is integrated into a larger data-storage system, requests coming from different parts of the system can compete for these three resources, which can cause queuing delays. A queuing delay for media occurs when a request for the media cannot be served (even though there may be available drives and robots) because the desired media is already serving another request. A robot queuing delay occurs when a queued request requires a robot arm that is not available. Finally, queuing delays for drives occur when there are no available drives to serve waiting requests.

When integrating tape-based RSLs into a broadcast environment, you must also address the two inherent limitations of tape storage. The first is the limited reliability and convenience of writing to tapes rather than magnetic disks. Unless carefully managed, this limitation can lead to fragmentation of the data written to the tapes. The second limitation is the delay, or latency, in data access.

Latency, also called performance overhead, is the most serious limitation of RSLs. There are three RSL procedures that cause latency. First, there is the latency of loading tapes into the drives (mounting) and preparing them for access (tensioning). This delay can be considerable if no tape drive is available at the time of the request (due to a queuing delay). Future technological advances will probably reduce this latency. But, since it is a mechanical preparation procedure, it will always be enormous relative to computer speeds and magnetic-disk access times.

The other two RSL procedures that cause latency result from the intrinsic linear-access nature of tapes. One is fast-forwarding and rewinding. It takes time to search through the tape and locate the beginning of the file containing the required data. A fair estimate of the latency imposed by this process is the time required to seek halfway through a tape from the beginning. The third latency-causing RSL process is the search inside the file for the required data. The time it takes for this procedure is determined by the read rate of the tape drive and the file offset. With careful organization of data on the tapes, this latency can be minimal.
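The three latency components can be combined into a rough back-of-the-envelope estimate. The parameter values in the usage note are assumptions for illustration only:

```python
def access_latency(mount_s: float, full_seek_s: float,
                   offset_mb: float, read_mb_s: float) -> float:
    """Rough RSL access-latency estimate from the three procedures
    described in the text: mount/tension the tape, locate the file
    (approximated as half a full-tape seek), and read up to the
    required offset within the file."""
    locate_s = full_seek_s / 2.0            # average seek to the file
    in_file_s = offset_mb / read_mb_s       # read to the offset
    return mount_s + locate_s + in_file_s
```

Assuming a 25 s mount, a 120 s end-to-end seek and a 60 MB offset read at 12 MB/s, the estimate is 90 seconds, most of it mechanical.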

In tape-based RSLs, the tape technology employed is either linear or helical-scan recording. Linear tapes have their tracks parallel to the tape’s axis and can be read in batches as the tape moves forward or backward (e.g., serpentine drives). Helical-scan tapes have their tracks at an angle to the tape’s axis and can be read by a drum that rotates in one direction only. This can result in slower search times. Also, tape cartridges differ in storage capacity, with typical values ranging from a few GB up to a few hundred GB. High-capacity tapes are often longer and can add to search time.

The latencies caused by these three procedures will probably improve in the future, but they are likely to remain the primary issue broadcasters must address when integrating RSLs into their facilities. There are a few strategies that broadcasters can employ to address latency problems. The first and easiest is to use caching to reduce the frequency of access to the RSL. The simple way to add caching to the RSL is to use magnetic disks attached to the same host as the RSL or very nearby in the HSM chain. Other strategies for reducing the effect of latency involve adding high-level structure or organization to the data, affording structured access.
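A toy sketch of the caching strategy just described: a magnetic-disk cache in front of the RSL, here with a simple least-recently-used eviction policy (one plausible policy among many). A cache hit avoids the tape mount and seek latency entirely.

```python
from collections import OrderedDict

class DiskCache:
    """Toy LRU model of the magnetic-disk cache that sits in front
    of an RSL in the HSM chain; capacity is counted in files for
    simplicity (a real cache would track bytes)."""

    def __init__(self, capacity_files: int):
        self.capacity = capacity_files
        self._files = OrderedDict()

    def get(self, name: str) -> bool:
        if name in self._files:
            self._files.move_to_end(name)    # mark most recently used
            return True                      # hit: served from disk
        return False                         # miss: must go to the RSL

    def put(self, name: str) -> None:
        self._files[name] = True
        self._files.move_to_end(name)
        if len(self._files) > self.capacity:
            self._files.popitem(last=False)  # evict least recently used
```

Every hit saved is a tape mount, seek and rewind that never happens, which is why even a modest disk cache can mask most of the RSL’s latency for frequently reused material.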

Economic considerations

There is a common misconception that storage systems are cheap. What is getting relatively cheap is the storage space; there is still a great deal of cost in achieving reasonable performance characteristics (such as reducing latencies and increasing throughput) and developing (and maintaining) custom, high-performance systems.

When considering a storage system, be sure to consider the cost of the following items:

  • Capital for hardware — disk drives, controllers, servers, networking equipment, tape drives, cables, etc.
  • Media cost — tape cartridges, magnetic and optical
  • Software — utilities, backup, media management
  • Installation cost — vendor installation cost, internal personnel cost, consultants, etc.
  • Ongoing cost — hardware maintenance, software maintenance and upgrades, technical support, etc.
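The list above maps directly onto a simple life-cycle cost sum. The figures in the usage note are placeholders a planner would replace with quoted prices, not vendor data:

```python
def total_cost(capital: float, media: float, software: float,
               installation: float, annual_ongoing: float,
               years: int) -> float:
    """Total cost of ownership over the system's expected life,
    summing the one-time categories and the recurring ongoing
    cost (maintenance, upgrades, support) per year."""
    one_time = capital + media + software + installation
    return one_time + annual_ongoing * years
```

With assumed figures of $500,000 capital, $50,000 media, $80,000 software, $40,000 installation and $60,000/year ongoing over five years, nearly a third of the total is recurring cost, which is easy to overlook at purchase time.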

Given the (potential) order of magnitude differences in performance and cost of complex storage systems for real-time broadcast applications, it is worthwhile and often necessary to invest a great deal of effort in their design and configuration.

There are at least three areas in which an engineering/consulting firm can be essential when designing such systems:

  • Designing and understanding the system requirements
  • System sizing and configuration
  • Predicting and evaluating the performance of the resulting storage system

Designing and integrating an RSL into a complex storage subsystem involves a multitude of issues and tradeoffs. These include configuration-related issues such as the number of levels in the storage subsystem, what devices should communicate with the RSL, device configuration, distribution (I/O channels and/or communications networks), data allocation, etc.

General considerations

The point at which RSL storage becomes necessary is an economic trade-off. Currently, it seems that an RSL is needed to manage more than a few hundred terabytes of data. Software from companies such as AVALON, SGL and others provides the illusion that the RSL is an extension of the file system. Since RSL data volumes and access latencies fall between those provided by on-line and off-line storage, RSL is often referred to as near-line storage.

When considering large storage systems, look at the following issues:

  • Capacity — The system must be able to handle current storage needs and the needs of expected future growth (usually a three- to five-year look ahead). It is almost impossible to plan a good storage strategy without detailed knowledge of the quantities of data involved now and in the foreseeable future.
  • Scalability — The system must be designed from the beginning to scale to larger data capacities without major upheavals. Outgrowing the system can cause very costly disruptions.
  • Cost — Select the least-costly approach that effectively meets your initial objectives. You must consider many cost issues, including the initial purchase cost of the hardware, the productivity cost related to down time, and ongoing hardware and software maintenance. The more complex a system is, the more attention it will require from administrators and operators, which can translate into hiring more staff.
  • Performance — The system must be able to deliver data as per specifications and design. Designing a system that must sustain high data rates while still delivering high overall throughput can be a challenge.
  • Reliability — All systems rely on components that will eventually break down. It is possible to design a system with enough redundancy to ensure that, even if individual components fail or malfunction, no interruptions will occur. Such high availability comes with a price, both in terms of the cost of the equipment and the complexity of the operation. In most cases, it is relatively simple to build a system that is available 99.9 percent of the time. Adding further reliability is complex and expensive.
  • Manageability — Once the system has been designed and implemented, it must be maintained. If your staff is not very familiar with large storage systems that require complex networks and configurations, aim for a system with the simplest operational concerns. As systems increase in complexity, it becomes increasingly important to be able to monitor their performance, pre-empt failures and manage media with as little effort and interaction as possible. In some cases, this may require additional staff and/or expertise.

Design and proof of concept

The development of a storage strategy involves planning for the quantity of storage and the level of performance required. The “build it and measure it” approach to system design is inappropriate because, after the system is implemented and its performance measured, you can’t make any significant design changes. There are no specific rules for planning a storage subsystem, but here are some general guidelines:

  • Measure the amount and significance of current data. It should not be difficult to convert current VTR tape needs to data needs. Look for ways to consolidate and simplify. Look for pockets of data that are not part of your current operations. Can they be integrated? Work closely with other departments to clarify and quantify the scope of the project.
  • Plan for growth, carefully. Once you have carefully calculated your data needs, you must predict as accurately as possible your future growth for the next few years. Make sure that you are aware of any special project or consolidations that your organization may be planning in the next few years, like massive archiving and digitizing projects. But don’t purchase too much storage capacity in advance if you don’t need it. You will probably be able to purchase higher capacity storage at less cost in the future.
  • Allocate excess capacity. How much excess capacity should your organization provide in the first year of a new storage system? To keep the system running smoothly, you should plan on 15 percent to 25 percent more capacity. Depending on your growth rates, you may want to purchase additional capacity for any anticipated growth for the second and possibly third year. The system may grow beyond its initial configuration, so make sure that the components needed for expansion are likely to be available in the future.
  • Assess return on investment. New, smaller and cheaper storage technologies constantly arise and offer greater benefits than the system that you purchase today. Keep realistic expectations on how long your storage system will last before it needs to be replaced. A five-year life span is typical for these systems before technology changes (new tape drives, robotics, storage, etc.).
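The growth and headroom guidelines above reduce to straightforward arithmetic. A minimal sketch, assuming compound annual growth and the 15 to 25 percent headroom suggested earlier (20 percent by default); the figures in the usage note are illustrative only:

```python
def plan_capacity(current_tb: float, annual_growth: float,
                  headroom: float = 0.20, years: int = 3) -> float:
    """Project the storage capacity (TB) to purchase: compound the
    current need by the expected annual growth rate over the planning
    horizon, then add excess-capacity headroom on top."""
    projected = current_tb * (1.0 + annual_growth) ** years
    return projected * (1.0 + headroom)
```

For example, 100 TB today growing 25 percent a year needs roughly 234 TB after three years with 20 percent headroom, more than double the starting point.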

When choosing an RSL solution, proper design and configuration can play a huge role in its performance and, more importantly, its cost-effectiveness. Poorly sized and designed systems result in wasted resources and money. You, the customer, along with consultants and integrators, must work together to define the number of clients, required throughput, system architecture, disk types and sizes, drive types and so on.

Once the parties involved have specified and agreed to a system, the next step is to run an analytical model of the system components and try to collect as much “real performance” data as possible. To accurately evaluate and predict the system’s performance, you need to estimate workflow and workload, which are a function of the particular application at hand. Such information is user-specific, and includes details such as archive and restore requests, deletes, defragmentation, the distribution of data access among different applications, and many more.

Risk analysis

When selecting an RSL or storage subsystem, carefully measure the risk of failure for each alternative. Start with a mental model of the system that initially seems to match the current or future workflow, and then work through design changes that both increase and decrease reliability. Weigh the cost implications of the design alternatives against their relative risk factors and the impact of downtime on the organization. And finally, keep in mind the following goals:

  • Preventing data loss
  • Scalability
  • Fast access to data without interruptions
  • Preparation for equipment failures
  • Use of cost-effective technologies

The success or failure of the system will be determined by how the media is managed throughout the system. It is important to assess how the possible failure of one component or subcomponent can affect your day-to-day operations. Be sure to consider the following factors:

  • Productivity losses — Many, if not most, of the different departments in your organization will not be able to carry out their normal activities. In a busy news environment, having constant access to archive material is the key to success.
  • Asset-recovery cost — When data is lost, recovering it requires an effort by the technical staff. Re-archiving irrecoverable data can be a massive undertaking.
  • Loss of revenue — If your organization depends on a storage environment to support daily programming, both long-form and short-form, you may not be able to generate revenue during the period of failure.

It is relatively easy and cost-effective to design a storage system that works 99 percent to 99.9 percent of the time. Eliminating the last few points of downtime possibilities can double or triple the cost of your storage system. The investment your organization makes toward enhancing reliability operates much like an insurance policy. The financial value of the organization’s operation guides the expense you can justify for reducing the likelihood of downtime and reducing the recovery time from failures.
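The availability figures above translate directly into annual downtime, which is the number to weigh against the insurance-style cost of extra redundancy:

```python
def downtime_hours_per_year(availability: float) -> float:
    """Expected annual downtime, in hours, for a given availability
    expressed as a fraction (e.g., 0.999 for 99.9 percent)."""
    return (1.0 - availability) * 365 * 24
```

A 99 percent system is down roughly 88 hours a year; a 99.9 percent system, about 8.8 hours. Eliminating most of that last day and a half of outage is what can double or triple the system’s cost.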

Pablo Esteve is a systems engineer and project manager for Thomson Grass Valley. Contact him at: pablo.esteve@thomson.net
