The British rock group Queen had a top-10 hit in 1989 with “I Want It All.” The song was built around a catchphrase used by guitarist Brian May’s second wife: “I want it all, and I want it now!”
Of course, that is what broadcast storage managers hear every day. It isn’t enough to have tape archives; news editors want instant access to every inch of footage ever shot, marketing wants to know how a show is trending on Twitter as it airs, and customers demand access to music and videos that fit their personal mood at that moment.
From a data storage perspective, this means that nothing is ever old or offline; everything must be ready at the push of a button. Welcome to the world of Big Data.
What is Big Data?
Big Data is a fuzzy term: It doesn’t apply to any specific amount or type of data. But, as companies have created petabytes and exabytes of unstructured and semistructured data, all this data needs to be corralled, brought under control, understood and analyzed. As a rule of thumb, if there is too much data to be efficiently loaded into a relational database, it is Big Data, and specialized tools are required to turn the raw digital data into intelligence that can be used for decision making.
Although much of the emphasis is on analytics, that is not the only aspect. For most industries, the storage and management of the data are also critical; for broadcasters, distribution plays a huge role as well, given the rise of on-demand programming.
Big Data tools
Managing Big Data requires the development of a new set of tools to organize and search the data. There are now hundreds of such tools, with more on the way. Many of these are open source, or are commercial implementations of open-source software. Among the most broadly used Big Data tools are:
- Hadoop (hadoop.apache.org/). This is an open-source, Java-based programming framework for distributed computing on large data sets. It can scale from one server to thousands, and it redistributes the work if one or more nodes fail. It works with Windows, Linux, BSD and OS X.
- Hive (hive.apache.org). This is a data warehouse system that “facilitates easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop-compatible file systems.” It is designed for batch queries and processing of large data sets and uses HiveQL, a language similar to SQL.
- MapReduce. Google developed MapReduce to manage the parallel processing workloads required to process massive amounts of data. The MapReduce framework consists of two parts: Map, which splits the input across the compute nodes so each node processes its share in parallel, and Reduce, which collects and merges the results from those nodes into a single result. Google researchers Jeffrey Dean and Sanjay Ghemawat published an introduction to the MapReduce framework at the Sixth Symposium on Operating Systems Design and Implementation (OSDI) in 2004, the year Google used it to replace its earlier indexing algorithms. The paper and presentation slides can be downloaded at research.google.com/archive/mapreduce.html.
- NoSQL databases. These are used to manage and analyze massive sets of unstructured data. NoSQL doesn’t mean “no SQL,” but “Not Only SQL.” Structured query language (SQL) can still be used as appropriate. NoSQL databases include Apache Cassandra (cassandra.apache.org), originally developed by Facebook; Amazon SimpleDB (aws.amazon.com/simpledb/); and MongoDB (www.mongodb.org/).
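To make the Map and Reduce steps described above more concrete, here is a minimal single-process sketch in Python that counts words across a few short text clips. It is an illustration only, not Hadoop's or Google's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample data are assumptions made for this example, and a real framework would run the map calls on separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the intermediate values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in one process (hypothetical driver)."""
    intermediate = []
    for doc in documents:
        # On a cluster, each split would be mapped on a different node.
        intermediate.extend(map_phase(doc))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for much larger data splits.
clips = ["perfect game in the seventh", "the game goes viral"]
counts = word_count(clips)
print(counts["game"])
```

The same three-stage shape applies when the input is log files or social media feeds rather than short strings; only the map and reduce functions change.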
Implementing Big Data storage
Although Big Data analytics is one of IT’s hottest topics these days, broadcast has been in the Big Data business for decades, having to manage media storage containing files that may be hundreds of GB or even several TB each. Big Data storage, then, takes on a different aspect than when an organization is just using Big Data for analysis. Some of the factors to consider include:
- Data deduplication/compression. A prominent feature of many storage systems is their ability to reduce the amount of disk space needed through deduplication or compression. These technologies, however, won’t have much impact on broadcasters’ storage needs. Audio and video files are already compressed as part of their encoding; further lossless compression saves little, and further lossy compression would reduce quality. In addition, there are unlikely to be as many copies of a single video file that could be replaced with a pointer as there are with e-mails or Word documents. When there are multiple copies of a video file, they are likely to be needed for backup or for streaming.
- Tiered architecture. Given the breadth of data types that make up Big Data, and the different uses it is put to, a tiered storage architecture is essential. SSDs or NAND Flash memory cards can be used for analytics, storage system metadata, indexes and editing; higher-capacity, lower-cost SAS and SATA drives can be used for other primary and secondary storage.
Although some types of businesses are dumping tape, it can still play a key role in broadcast for setting up active archives. Video files are massive, and it can be prohibitively expensive to keep the entire archive on disk. A May 2013 TCO analysis from The Clipper Group, Inc. (www.clipper.com/research/TCG2013009.pdf) found that archiving on disk is 26 times as expensive as archiving on tape over a nine-year period. A tape library is slower than disk, but each LTO-6 tape holds 2.5TB of uncompressed data, and a single Spectra Logic T950 tape library scales up to 10,200 slots, enough to hold more than 25PB of uncompressed data (62PB compressed). This allows fast, though not instantaneous, access to archives, which is good enough for archival footage.
To speed access to the videos stored in a tape library, or even on lower-tiered disks, the metadata and the head of each video can be stored on SSDs, so playback can begin immediately while the rest of the video loads from tape.
- Rapid setup of multiple copies. Broadcast storage systems need the capacity to rapidly add and remove content from the storage being used for streaming video, scaling the number of copies to meet demand. As Joe Inzerillo, senior vice president of content technology and CTO for Major League Baseball’s Advanced Media Group, describes it, “No one cares about a 0-to-3 baseball game, until in the 7th inning, when people realize the pitcher is throwing a perfect game. Then everyone cares, and your online hits go through the roof.”
- Common file system. Some editing or digital asset management systems use a proprietary storage format, which locks the company into a particular vendor and limits its ability to adopt a common storage framework for all company data. When industry-standard protocols such as NFS, CIFS and HTTP are used instead, the files can be accessed by both old and new systems, giving greater flexibility.
File system architecture has proven that it works for streaming online and on-air programming, allowing for multiple mounts for a single set of data, and it can increase throughput and capacity as needed.
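The tape-library figures quoted earlier for the tiered architecture can be checked with a few lines of arithmetic. The sketch below uses the slot count and per-tape capacity from the text; the decimal conversion (1PB = 1,000TB), the helper names and the 500TB example archive are assumptions made for illustration.

```python
import math

# Figures quoted in the article.
SLOTS = 10_200        # maximum slots in the tape library
TB_PER_TAPE = 2.5     # uncompressed capacity of one LTO-6 tape, in TB

# Assumption: decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000


def library_capacity_pb(slots, tb_per_tape):
    """Total uncompressed capacity of a fully populated library, in PB."""
    return slots * tb_per_tape / TB_PER_PB


def tapes_needed(archive_tb, tb_per_tape):
    """Whole tapes required to hold an archive of the given size (hypothetical helper)."""
    return math.ceil(archive_tb / tb_per_tape)


capacity_pb = library_capacity_pb(SLOTS, TB_PER_TAPE)
print(f"{capacity_pb:.1f} PB uncompressed")

# Hypothetical example: a 500TB video archive.
print(tapes_needed(500, TB_PER_TAPE), "tapes")
```

The result, 25.5PB, is consistent with the "more than 25PB" figure above.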
—Drew Robb is a freelance writer covering engineering and technology. He is author of the book “Server Management of Windows System,” published by CRC Press.