The software provides an open, efficient approach to content analysis.
All broadcasters appreciate the freedom to expand their equipment infrastructure with hardware and software of their own choosing. Even the tidiest and most coordinated broadcast equipment bays house devices added to meet a requirement not anticipated at the time of original system installation.
So what happens when a television channel decides that it would like capabilities unforeseen during the original specification of its asset management system? Obviously the original designer is contacted and asked whether supplementary software or hardware is available to provide these new features. In many instances, the extra module will already have been designed to meet requests from other customers and can therefore be added at short notice.
In mid-2009, Pharos, a developer of broadcast content management systems, began development of the Open Framework for the Analysis of Rich Media (OpenFARM) project with Kingston University under the umbrella of a Knowledge Transfer Partnership, a scheme partly funded by the British government to encourage collaboration between businesses and universities. The objective was not to design a product or toolset specific to Pharos, but one that would be open to any software designer, manufacturer or broadcaster that wanted to develop asset management plug-ins.
OpenFARM offers broadcasters a framework to add third-party software modules to their content management systems in essentially the same way that they select hardware for their central apparatus room. At its center is the construction of algorithms to identify, extract and interact with metadata in, for example, the video signal stream. Implementing the framework means that individuals or organizations wanting to develop such algorithms in the future can concentrate on their specific objective without having to build an entire operating framework.
An uncataloged media asset database typically contains hundreds of hours of content that can be extremely costly to view, identify and catalog manually. Software developed using the framework can automate the cataloging process, saving thousands of euros. The framework makes it possible to extract many kinds of information (metadata) from video and audio media assets, such as silent periods in an audio track, scene cuts, embedded text, logos and fades to black.
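As an illustration of the kind of metadata involved, the following is a minimal sketch of a silence detector. It is not OpenFARM code: the function name, the amplitude threshold and the minimum duration are assumptions chosen for the example, and the audio is assumed to be already decoded into a list of samples between -1.0 and 1.0.

```python
from typing import List, Tuple


def find_silences(samples: List[float], sample_rate: int,
                  threshold: float = 0.01,
                  min_duration: float = 0.5) -> List[Tuple[float, float]]:
    """Return (mark_in, mark_out) pairs, in seconds, for quiet runs.

    A run counts as silence when every sample stays below `threshold`
    in absolute value for at least `min_duration` seconds.
    (Hypothetical helper, not part of OpenFARM.)
    """
    silences = []
    start = None  # index where the current quiet run began, if any
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if start is None:
                start = i
        elif start is not None:
            # The quiet run just ended; keep it only if it was long enough.
            if (i - start) / sample_rate >= min_duration:
                silences.append((start / sample_rate, i / sample_rate))
            start = None
    # Close a quiet run that lasts until the end of the track.
    if start is not None and (len(samples) - start) / sample_rate >= min_duration:
        silences.append((start / sample_rate, len(samples) / sample_rate))
    return silences
```

Each result is a mark-in/mark-out pair, the same shape of metadata that a cataloging tool would attach to the asset.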
If a broadcast channel wants to incorporate video analysis, such as a scene cut detector or a face recognition feature, into its asset management system but does not want to spend hundreds of hours on development and integration, the framework can provide easy integration in just a few hours.
Radical changes in the workflow of a media management system traditionally imply retesting and redesigning the software, or even major changes to the overall system architecture. The framework avoids this by managing the video analysis components itself and delivering their results to the user or application only when queried, which prevents any proliferation of extra databases and third-party dependencies.
The framework allows easy integration of media analysis components into broadcast system software. It is open source and published under the MIT License, which allows it to be modified, extended and shipped with commercial applications.
The framework permits fast integration of advanced analysis components (such as scene cut detection, text extraction and speech detection) into any broadcast system software, regardless of the platform or programming language in which it was developed. Features such as a template pattern allow developers to design and develop analysis component plug-ins without being coupled to specific programming languages. Some of the framework's features are:
It is free and open source, written in Java, and can be distributed across multiple machines.
It provides real-time media analysis.
It is database independent. All indexing metadata is returned to the client application (the broadcast software) or cached until the client asks for it.
Concurrent analysis of media is coordinated by a single entry point (OpenFARM Manager) that is integrated with broadcast software.
It provides an internal soft GUI and a technology to generate client code for a specific broadcast application.
Multiple machines can be used to run the analysis components. Indexing tasks can be added to the schedule easily.
Load balancing is provided on the available processors for cost-effective and fast content analysis.
The analysis components can be written in practically any language (such as C, C++, C#, Java or Python) to run on any operating system, including Linux and Windows.
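Several of these features can be pictured together in a short sketch. The code below is not the OpenFARM API: the class names `AnalysisComponent`, `DummySceneCutDetector` and `Manager` are hypothetical, a local thread pool stands in for distribution across machines, and the detector returns placeholder values rather than decoding video. It shows a template pattern for plug-ins, a single entry point that schedules concurrent jobs, and results held in memory until the client asks for them.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import List


class AnalysisComponent(ABC):
    """Template: the base class fixes the workflow, subclasses supply the analysis."""

    name = "component"

    def run(self, media_path: str) -> List[dict]:
        # Fixed skeleton: analyze, then tag every result with its origin.
        results = self.analyze(media_path)
        return [dict(r, component=self.name, media=media_path) for r in results]

    @abstractmethod
    def analyze(self, media_path: str) -> List[dict]:
        """Return a list of metadata dictionaries for one media asset."""


class DummySceneCutDetector(AnalysisComponent):
    name = "scene_cut"

    def analyze(self, media_path: str) -> List[dict]:
        # Placeholder values; a real plug-in would decode the video here.
        return [{"time": 12.0}, {"time": 47.5}]


class Manager:
    """Single entry point: schedules jobs and caches results in memory."""

    def __init__(self, workers: int = 4):
        # Assumption: a thread pool stands in for multiple machines.
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._next_id = 0

    def submit(self, component: AnalysisComponent, media_path: str) -> int:
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = self._pool.submit(component.run, media_path)
        return job_id

    def fetch(self, job_id: int) -> List[dict]:
        # Waits for the job, hands the batch to the client, and drops it
        # from the cache, so no database is needed on the framework side.
        return self._jobs.pop(job_id).result()

    def shutdown(self) -> None:
        self._pool.shutdown()
```

A client would call `submit` for each asset and component, carry on with other work, and call `fetch` when it is ready to store the metadata in its own system.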
There are established methods for the identification of specific objects, categories, events and actions from video and audio data. The recognition of objects in still images, given a training image or set, has received much attention. In the context of content-based retrieval, the user may supply an image of the object with the intention of retrieving images of similar objects. The motion depicted in video data provides an additional dimension to use in content-based retrieval, enabling searches on actions and events.
There are many existing systems that embed content-based retrieval technology, working on either offline or online repositories. A key feature for commercial systems such as Picasa and Flickr is the detection and recognition of faces in photographic albums. MPEG-7 description schemes provide a standard means of representing metadata.
The OpenFARM architecture has three stages: analysis, management and deployment. Each of these stages encompasses a set of units responsible for a specific type of processing. The analysis stage is responsible for loading the analysis component plug-ins and scaling them across one or more platforms. The management stage is responsible for coordinating the analysis stage units and constitutes a single entry point between the user/application and the framework. The deployment stage is the key OpenFARM code to be implemented in the media management software. Also part of this stage is a soft GUI that provides a control center for the framework. Table 1 indicates the types of analysis that are expected to be supported. These may be written by third-party developers and plugged in to the broadcast system using the framework, which is designed to address the requirements listed in the table.
The model depicted in Figure 1 provides a suitable starting point to design a system satisfying the above requirements. The bottom layer (video analysis) encompasses the signal processing algorithms responsible for extracting the metadata (analysis components). In contrast to most existing initiatives, the framework transfers whole batches of analysis results to the client application. Thus, the client application has the scope to store and use this metadata to provide browse and query services.
Deployment in a live system
The technical challenge is to allow the framework's algorithms to be deployed in a live broadcast system. A stand-alone media analysis system would introduce substantial inefficiencies. Individual users have their own sets of media analysis requirements and, because there is currently no single “silver bullet” that can deliver all content analysis, an explicitly modular architecture is more appropriate. The specific requirements are:
To allow manufacturers of broadcast systems to incorporate a wide range of signal processing algorithms for rich media analysis.
To provide an efficient means for users of broadcast systems to search for content.
To provide a simple environment in which signal processing technology can be deployed across various broadcast systems, with differing time stamp, operating system, security and storage requirements.
To provide an architecture that avoids proliferation of databases.
To provide an architecture in which the media processing is performed with maximum efficiency.
To support near real-time performance.
To ensure that deployment can be scaled arbitrarily so that multiple processing components can deliver results concurrently to multiple users.
To ensure that implementation is robust and easy, and that it conforms to commercial standards.
To support proprietary and open-source analysis components.
To provide a facility to relate hierarchical metadata structures for use by related analysis components.
As part of its involvement in OpenFARM, Pharos created two example plug-ins. One detects video scene boundaries, and the other converts program credits into searchable text that is used to create metadata such as the names and titles of performers and production crew. Pharos also put an OpenFARM interface into its Mediator content management system so the data from each plug-in becomes an accessible resource.
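To give a sense of what a scene boundary plug-in has to compute, here is a minimal sketch based on comparing grayscale histograms of consecutive frames. This is a common textbook technique and not necessarily the method Pharos used. The function names, the bin count and the threshold are assumptions, and each frame is assumed to be a non-empty sequence of pixel values from 0 to 255.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized grayscale histogram of one frame (pixel values 0-255)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame)
    return [count / total for count in counts]


def find_scene_cuts(frames: Sequence[Sequence[int]], fps: float,
                    threshold: float = 0.5) -> List[float]:
    """Return the time stamps (seconds) of frames that start a new scene.

    A cut is reported when the L1 distance between the histograms of two
    consecutive frames exceeds `threshold` (the distance ranges from 0 to 2).
    """
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame)
        if previous is not None:
            distance = sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                cuts.append(index / fps)
        previous = current
    return cuts
```

Each cut is reported as a single time stamp. A hard threshold like this catches abrupt cuts but would miss gradual transitions such as dissolves, which need a different approach.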
The framework is freely available to any company that wants to develop plug-ins or use third-party plug-ins in its own systems. It provides a simple execution environment for analysis components in a robust and scalable system.
Filipe Martins is lead software engineer and Jeremy Blythe is chief software architect for OpenFARM.
| Analysis component | Description | Metadata elements |
| Scene cut detector | Finds transitions between scenes | Single time stamp |
| Silence detector | Finds periods of silence | Mark-in/out time stamp |
| Text detector | Finds on-screen text, including captions, subtitles, credits, in-scene text | Single time stamp, bounding box, text string, category (e.g. credit role, credit name) |
| Logo detector | Finds logos, indicating company and category of logo | Mark-in/out time stamp, bounding box, company, category |
| Face detector | Finds all faces in the video; designates them as male, female or not sure | Mark-in/out time stamp, bounding box, category |
| Person recognizer | Finds all people, labeling each with an identifier, either arbitrary or real name | Mark-in/out time stamp, bounding box, identifier, name |
| Speech detector | Detects speech in rich media content; gives an in-point and out-point for each person | Mark-in/out time stamp, identifier |
| Speech recognizer | Transcribes speech to text for each speaker; includes indication of category, e.g. commentary, interview, etc. | Mark-in/out time stamp, transcript of text, speech category |
| Music detector | Detects music | Mark-in/out time stamp |
| Music recognizer | Detects music | Mark-in/out time stamp, music identifier (artist, track etc) |
| Audio event detector | Detects categories of audio content, such as laughter, applause, explosions, etc. | Mark-in/out time stamp, event category |
| Scene classifier | Categorizes the content into one of several “setting” categories, such as inside, outside, etc. | Mark-in/out time stamp, scene category |
| Object category detector | Detects examples of generic categories of object, such as vehicle, animal, building, etc. | Mark-in/out time stamp, object category |
| Object detector | Detects examples of specific categories, such as Porsche 911, cheetah, Windsor Castle, etc. | Mark-in/out time stamp, object category |
| Video event detector | Detects categories of events depicted on video, such as goals, fights, kisses, etc. | Mark-in/out time stamp, event category |
Table 1. Examples of media analysis tasks that the proposed framework is designed to support
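The metadata elements in Table 1 share a small set of fields, which suggests that a client could handle them with one record type. The sketch below is an assumption about how such a record might look, not a format defined by OpenFARM: the class name `MetadataItem` and its field names are hypothetical, and the sample values are made up.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class MetadataItem:
    """One analysis result; unused fields stay None (hypothetical layout)."""

    component: str                     # e.g. "text_detector"
    mark_in: float                     # seconds; doubles as the single time stamp
    mark_out: Optional[float] = None   # None for single-time-stamp results
    bounding_box: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height
    category: Optional[str] = None
    text: Optional[str] = None
    identifier: Optional[str] = None

    def to_dict(self) -> dict:
        # Drop empty fields so each component emits only what it produces.
        return {key: value for key, value in asdict(self).items()
                if value is not None}


# Made-up example: a credit found by a text detector.
credit = MetadataItem("text_detector", 93.2,
                      bounding_box=(40, 600, 320, 40),
                      category="credit name", text="Jane Doe")
```

Calling `credit.to_dict()` yields only the five populated fields, while a silence result would carry just `component`, `mark_in` and `mark_out`.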