Audio and video compression


If there's not a way to describe, characterize and find content, does it really exist?

Those of you who have been reading my columns for a while know I have often pontificated on the wonders of audio/visual compression. Also you have read my views on content not really being useful unless it can be characterized and found. Over the years I have written about companies like Virage that have been working on systems to automate the classification and recognition of content.

I find myself smack dab in the middle of yet another hot area — speech and data recognition. So I look to my old friends on the MPEG committees once again for help with standardized data models that describe content.

The committee's current thrust is MPEG-7, scheduled for completion in July 2001.

MPEG-7, formally named “Multimedia Content Description Interface,” aims to create a standard for describing multimedia content that supports some degree of interpretation of the information's meaning and can be passed on to, or accessed by, a device or computer code. According to the official MPEG website, MPEG-7 is not aimed at any one application; it will provide standardized support to a broad range of applications.

MPEG-7 tries to solve the problem of searching and managing huge amounts of digital data. The question of identifying and managing content is not just restricted to database retrieval applications such as digital libraries but extends to areas like broadcast channel selection, multimedia editing and multimedia directory services.

While audio and visual information used to be consumed directly by humans, increasingly audio/visual information is created, exchanged, retrieved and re-used by computational systems. Image understanding (surveillance, intelligent vision, smart cameras, etc.), media conversion (speech to text, picture to speech, speech to picture, etc.), information retrieval (quickly and efficiently searching for various types of multimedia documents of interest to the user) and in-line filtering of content (receiving only those multimedia data items that satisfy the user's preferences) represent specific computer uses of this information.

The goal of the MPEG-7 standard is to develop forms of information representation that go beyond the compression-based (such as MPEG-1 and MPEG-2) or even object-based (such as MPEG-4) representations. Since the standard can be passed on to a device or computer code, content described in MPEG-7 could be referenced in many different ways. For example, a verbal reference to a news item, such as “Florida elections,” could bring up your latest newscast on the recount.

The MPEG-7 descriptions do not depend on coded representation of the material. The standard builds on MPEG-4, which provides the means to encode material as objects having certain relations in time (synchronization) and space (on the screen for video, or in the room for audio). If the material is encoded using MPEG-4, it will be possible to attach descriptions to elements (objects) within the scene, such as audio and visual objects.

The committee chose the Extensible Markup Language (XML) as the textual representation of content descriptions. The growing use of XML on websites to describe and exchange data will facilitate interoperability in the future.
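To make the idea concrete, here is a minimal sketch of what an XML-based content description might look like and how a search agent could consume it. The element names (`ContentDescription`, `VideoSegment`, `Title`, `DominantColor`) are illustrative assumptions, not the actual MPEG-7 schema:

```python
import xml.etree.ElementTree as ET

# Build a toy content description in the spirit of MPEG-7's XML-based
# approach. All element and attribute names here are hypothetical.
desc = ET.Element("ContentDescription")
segment = ET.SubElement(desc, "VideoSegment", id="seg01")
ET.SubElement(segment, "Title").text = "Florida elections recount"
ET.SubElement(segment, "DominantColor").text = "128 64 32"

xml_text = ET.tostring(desc, encoding="unicode")
print(xml_text)

# A search agent could later parse the description and match on keywords.
parsed = ET.fromstring(xml_text)
titles = [t.text for t in parsed.iter("Title")]
print(titles)
```

Because the description is ordinary XML, any off-the-shelf parser can read it, which is exactly the interoperability argument the committee was making.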

When MPEG-7 is fully implemented, it will make the Web as searchable for multimedia content as it is for text today. This applies especially to large content archives, enabling people to identify content easily. The same information used for content retrieval can be used by computational agents to select and filter personalized material.

On the consumer side, new applications based on MPEG-7 descriptions will allow fast and cost-effective use of the underlying data. My favorite would be an application that allows semi-automatic multimedia presentation and editing.

The information representation specified in the MPEG-7 standard provides the means to represent coded multimedia content description information. The entity that makes use of such a coded representation is generically referred to as a “terminal.”

As Figure 1 shows, the delivery layer encompasses mechanisms allowing synchronization, framing and multiplexing of MPEG-7 content. The transport/storage of data can occur on a variety of delivery systems. The delivery layer demuxes data so it can provide the compression layer with elementary streams. Elementary streams consist of consecutive individually accessible portions of data called access units. An access unit is the smallest data entity to which timing information can be attributed.
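The relationship between an elementary stream and its access units can be sketched in a few lines. This is a toy model under stated assumptions: the class and function names are mine, and a real delivery layer would of course deal with real multiplexed transport, not a list of tuples:

```python
from dataclasses import dataclass

@dataclass
class AccessUnit:
    timestamp_ms: int   # timing information attributed to this unit
    payload: bytes      # an individually accessible chunk of description data

def demux(stream):
    """Turn raw (timestamp, data) pairs from the delivery layer into
    access units for the compression layer to parse. Hypothetical API."""
    return [AccessUnit(t, p) for t, p in stream]

units = demux([(0, b"<update .../>"), (40, b"<delete .../>")])
print(len(units), units[0].timestamp_ms)
```

The key point the sketch captures is that the access unit is the smallest entity to which timing can be attributed: anything finer-grained has no timestamp of its own.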

At the compression layer, the flow of access units is parsed and the content description is reconstructed. Access units are structured as commands encapsulating the description information. Commands provide the dynamic aspects of the content. They allow a description to be delivered in a single chunk or to be fragmented in small pieces. They allow basic operations on the content such as updating a descriptor, deleting part of the description or adding new Description Definition Language (DDL) structure.
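The command-driven nature of the descriptions above can be illustrated with a small sketch. The command names and the dictionary-based description are my own simplification, not MPEG-7 syntax:

```python
# A description maintained incrementally by commands, as the compression
# layer might reconstruct it. Command vocabulary here is hypothetical.
description = {}

def apply_command(cmd, key, value=None):
    """Apply one update/add/delete command to the in-memory description."""
    if cmd in ("add", "update"):
        description[key] = value
    elif cmd == "delete":
        description.pop(key, None)

apply_command("add", "Title", "Florida elections")
apply_command("update", "Title", "Florida recount")
print(description)
apply_command("delete", "Title")
print(description)
```

Delivering a description as a sequence of such commands is what lets it arrive either in a single chunk or fragmented into small pieces.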

The main tools used to implement descriptions are Descriptors (Ds) and Description Schemes (DSs). Descriptors bind a feature to a set of values. Description schemes are models of the multimedia objects and of the universes that they represent, i.e., the data model of the description. They specify the types of descriptors that can be used in a given description and the relationships between those descriptors or between other description schemes.
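A toy data model makes the D/DS split concrete: a descriptor binds a feature to values, while a scheme constrains which descriptors may appear together. The class names and the `allowed_features` rule are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    feature: str     # the feature being described, e.g. "DominantColor"
    values: tuple    # the values bound to that feature

@dataclass
class DescriptionScheme:
    name: str
    allowed_features: set           # the descriptor types this DS permits
    descriptors: list = field(default_factory=list)

    def attach(self, d):
        # The scheme enforces which descriptors a description may use.
        if d.feature not in self.allowed_features:
            raise ValueError(f"{d.feature} not permitted by {self.name}")
        self.descriptors.append(d)

scene = DescriptionScheme("VideoScene", {"DominantColor", "Motion"})
scene.attach(Descriptor("DominantColor", (128, 64, 32)))
print(scene.descriptors)
```

Trying to attach a descriptor the scheme does not allow (say, a melody descriptor on a video scene) would raise an error, which is the constraining role a DS plays.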

The DDL forms a core part of the MPEG-7 standard. It provides the solid descriptive foundation by which users can create their own description schemes and descriptors. The DDL defines the syntactic rules to express and combine description schemes and descriptors.

The DDL has to be able to express spatial, temporal, structural and conceptual relationships between the elements of a DS, and between DSs. It must provide a rich model for links and references between one or more descriptions and the data they describe. In addition, it must be platform and application independent and human- and machine-readable.

A DDL Parser capable of validating description schemes (content and structure) and descriptor data types, both primitive (integer, text, date, time) and composite (histograms, enumerated types), is also required.

The purpose of a schema is to define a class of XML documents by applying particular constructs to constrain their structure: elements and their content, attributes and their values, cardinalities and data types. Schemas can be seen as providing additional constraints to DTDs or a superset of the capabilities of DTDs.
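What a schema adds beyond a DTD can be shown with a minimal structural check: not just which elements may appear, but their cardinality and the data types of their attributes. The document and rules below are hypothetical, and the hand-rolled `validate` function stands in for a real schema validator:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<Segment start="0" end="40"><Title>Recount</Title></Segment>'
)

def validate(elem):
    """Toy schema check: exactly one Title child (cardinality) and
    integer start/end attributes (data types)."""
    assert len(elem.findall("Title")) == 1
    int(elem.get("start"))
    int(elem.get("end"))
    return True

print(validate(doc))
```

A DTD could require the `Title` element but could not insist that `start` and `end` are integers; that is the kind of data-type constraint schemas bring.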

XML is an excellent choice to adopt as the basis for the DDL because of its potential widespread adoption and the availability of tools and parsers. As XML was not designed specifically for audio/visual content, certain specific MPEG-7 extensions are required.


The MPEG-7 descriptors describe the following types of information: low-level features such as color, texture, motion, audio energy and so forth; high-level features of semantic objects, events and abstract concepts; content management processes; and information about the storage media.

It is expected that most descriptors corresponding to low-level features will be extracted automatically, whereas human intervention will be required for producing the high-level descriptors.
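The automatic extraction of a low-level descriptor can be sketched in a few lines; a color histogram is a classic example. This toy version bins grayscale values of a tiny "image" and is purely illustrative:

```python
def gray_histogram(pixels, bins=4):
    """Bin 0..255 grayscale values into a fixed-size histogram,
    a stand-in for an automatically extracted low-level descriptor."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    return hist

image = [0, 10, 130, 200, 255, 64]   # six pixels of a toy image
print(gray_histogram(image))         # -> [2, 1, 1, 2]
```

A high-level descriptor such as "this segment shows a press conference," by contrast, has no comparable mechanical recipe, which is why human annotation remains in the loop.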

The DSs are categorized as pertaining specifically to the audio or visual domain, or generically to the description of multimedia. For example, the generic DSs cover immutable metadata related to the creation, production, usage and management of multimedia, as well as descriptions of the content itself at a number of levels, including signal structure, features, models and semantics. Typically, the multimedia DSs apply to all kinds of multimedia consisting of audio, visual and textual data, whereas the domain-specific descriptors, such as those for color, texture, shape and melody, refer specifically to the audio or visual domain. As with descriptors, the instantiation of DSs can in some cases rely on automatic tools, but in many cases will require human involvement or authoring tools.

In this new paradigm we are given a glimpse of a world in which all data/content is accessible by various devices using real-world descriptions and locators. With the constant influx of so much new content, if there is no standardized way to describe, characterize and find it, does the content really exist at all?

The majority of this article was taken from N3752, “Overview of the MPEG-7 Standard (Version 4.0).” For more on MPEG-7 go to

Steven M. Blumenfeld is currently the GM/CTO of AOL — Nullsoft, the creators of Winamp and SHOUTcast.