AI enables monetization of media archives

“We don’t even know what we’ve got in our archive…”

Comments like this are all too common in the industry as media organizations have woken up to find their organically grown, sprawling content archives missing a key bit of information: metadata.

Television broadcasters and other owners of large content libraries are facing this problem due to the sheer volume of media assets locked up on data tape, with incomplete or idiosyncratic information about exactly what has been stored. Without complete and accurate metadata, it’s difficult to make decisions about the worth of a given media asset, and that makes content libraries difficult to monetize.

How many times have you heard of a film restoration archivist (while looking for something else) “finding” a supposedly lost asset in the vaults, and thus making a restoration project more complete, unique, newsworthy, etc.? Many media companies know all too well that the ability to store data and the value of that data are often misaligned. Companies have long been looking for tools to help drive value from assets that were created and acquired at great expense but that have become “lost” with little hope of being found.


Long ago (in internet time), we saw the rise of “big data,” and watched it become “analytics,” which has become “deep learning” (a refinement of machine learning), which itself falls under the catchall term “artificial intelligence.”

But it’s all a variation on the same thing: performing algorithmic queries on deliberately acquired data about customers, products and services in order to yield actionable intelligence.

The media distribution business has embraced analytics. Successful media and entertainment distributors had to develop decision-making capabilities allowing them to respond rapidly to constant change, while also integrating rights, digital supply chain, web and social media data. This has helped distributors gain powerful insights about both their media assets and customers, which further informs the types of programming they are willing to invest in.

This use of analytics has opened a window into a critical link between customer preferences, monetization and what is in your media archive. A recent Nielsen study of subscription video-on-demand (SVOD) viewing habits found that the back catalog of content acquired by streaming services from television networks and studios drives 80 percent of the time spent viewing these services. “…Our research shows most of the viewing time is spent with catalog programming,” said Nielsen’s COO Steve Hasker. In short, the content in media archives is driving the majority of the new and rapidly growing ways consumers are viewing media.


Creating metadata has typically been a manual process, where an informed and knowledgeable person can “tag” (or assign metadata to) a media asset, and armed with an appropriate taxonomy, standardize the description of the asset so that search techniques can be used with confidence.
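The discipline this manual workflow enforces can be made concrete in a few lines. The sketch below shows tagging against a controlled vocabulary, so that every tag an archivist assigns is a standardized term that search can rely on; the taxonomy terms and asset fields are illustrative, not drawn from any real metadata standard.

```python
# A minimal sketch of manual tagging against a controlled taxonomy.
# Terms and field names are illustrative stand-ins.

TAXONOMY = {
    "genre": {"news", "sports", "drama", "documentary"},
    "format": {"interview", "highlight", "feature"},
}

def tag_asset(asset: dict, field: str, value: str) -> dict:
    """Attach a tag only if it belongs to the controlled vocabulary."""
    allowed = TAXONOMY.get(field)
    if allowed is None or value not in allowed:
        raise ValueError(f"'{value}' is not a valid term for '{field}'")
    asset.setdefault("tags", {}).setdefault(field, set()).add(value)
    return asset

asset = {"id": "TAPE-0042", "title": "1998 election night coverage"}
tag_asset(asset, "genre", "news")
```

Because every tag is drawn from a shared vocabulary, a later search for “news” is guaranteed to match anything an archivist intended as news, rather than ad hoc variants like “News/Current Affairs.”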

But this approach doesn’t scale. There aren’t enough qualified people, let alone enough infrastructure to let users watch and tag content interactively. Further, video and audio need to be watched and listened to in real time, and there are literally thousands of years’ worth of material which may be trash—or treasure. No one will know without evaluating the asset.

Every rights holder has had to migrate their data—their companies’ precious capital assets—from an older medium to a newer one, simply to preserve its ability to be read in the future. This is a labor- and time-intensive process, which doesn’t add any intrinsic value to the media assets, but has to be done regardless. What if you could migrate your data once, add value to it as part of the migration, and then never need to migrate to another tape format again?

If you could “automagically” add metadata to your content archive—through a combination of AI techniques that watch and listen to your library content and build up a user-referenceable database of people, places, things, even sentiments—you would have the ability to create a new type of programming, where archival material could serve as context for current-day narratives much more easily than such programming is created today.
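One hedged sketch of what such a user-referenceable database could look like: time-coded labels for people, places and sentiments, stored in an ordinary SQL index that editors can query by what appears on screen. The detection rows below are hard-coded stand-ins for the output of recognition and sentiment models, not real model calls.

```python
# A sketch: storing AI-generated, time-coded labels in a searchable
# index. The "detections" are illustrative stand-ins for model output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE labels (
    asset_id TEXT, start_sec REAL, end_sec REAL,
    label_type TEXT, value TEXT)""")

detections = [  # would come from facial/object/sentiment models
    ("TAPE-0042", 12.0, 18.5, "person", "anchor_desk_host"),
    ("TAPE-0042", 40.0, 55.0, "place", "city_hall"),
    ("TAPE-0042", 40.0, 55.0, "sentiment", "celebratory"),
]
conn.executemany("INSERT INTO labels VALUES (?,?,?,?,?)", detections)

# An editor assembling a current-day narrative can now ask the archive
# "where does city hall appear?" and jump straight to the timecode.
rows = conn.execute(
    "SELECT asset_id, start_sec FROM labels "
    "WHERE label_type='place' AND value='city_hall'").fetchall()
```

The point of the design is that the expensive part (watching and listening) is done once by machines, while the cheap part (querying timecoded labels) can be repeated by every production that follows.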


As multiple sources of data proliferate within an organization, new AI and machine learning techniques evolve to make cross-correlations visible to departments that previously did not have visibility into this data. This promotes companies’ adoption of a “platform” approach to metadata collection, instead of an application-specific approach in which on-set or production metadata might never be considered useful downstream in the digital supply chain or in distribution.

Applications that can analyze and correlate media assets with sources of user data can turn these static media assets into “data capital”—an asset class which will continue generating revenue throughout the life of the asset, much like real estate continuously generates income for its owners. Like real estate, the owners of data capital will have to invest in maintenance in order to keep generating revenue from that asset class. The ability to rerun an updated machine learning algorithm against an existing media asset library, correlated with more recent user data such as social media feeds or sensor data from location-based entertainment, may tease out previously overlooked narratives, or themes that can then inform new uses for the content.
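The “maintenance” of data capital described above amounts to re-running updated algorithms over the same library. A minimal sketch, under assumed names: each asset records the model version that last enriched it, so an updated pass only re-processes assets whose labels are stale. The `enrich` function is a placeholder for a real model.

```python
# A sketch of re-running an updated analysis pass over an existing
# library. All names and the "model" itself are illustrative.

MODEL_VERSION = 2

library = [
    {"id": "TAPE-0042", "analyzed_with": 1, "labels": ["election"]},
    {"id": "TAPE-0043", "analyzed_with": 2, "labels": ["parade"]},
]

def enrich(asset: dict) -> dict:
    """Placeholder for a real model; appends a version-tagged label."""
    asset["labels"].append(f"v{MODEL_VERSION}_theme")
    asset["analyzed_with"] = MODEL_VERSION
    return asset

# Only assets last analyzed by an older model need another pass.
stale = [a for a in library if a["analyzed_with"] < MODEL_VERSION]
for a in stale:
    enrich(a)
```

Tracking the analysis version per asset is what lets the archive improve incrementally: a better face recognizer or a new sentiment model touches only what it can improve, rather than forcing a full re-scan.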

However, the sheer size of media assets can make copying data from a passive archive like a tape library less agile than required for a quick reaction to market or celebrity news. A “data lake,” or globally scalable storage fabric consisting of scale-out NAS and geo-scale object storage systems, allows multiple workloads such as archiving, disaster recovery and collaboration to be executed without requiring multiple silos for each.

Now imagine you can weave an AI appliance into that data lake, which can be trained on your own data capital without having to migrate your data somewhere else for that purpose! This architecture will allow AI algorithms to perform facial recognition, object recognition, audio transcription, and even translation on media assets, in order to identify and track features which are useful to advertisers, piracy watchdogs, organizations tasked with identifying manipulated imagery, and others.

Toolsets have moved “up the stack” from raw computer-science algorithms to software frameworks and applications that run on scalable clustered appliances without requiring vast amounts of data to be migrated, ushering in a new era of revenue generation. This in turn allows content rights-holders to breathe new life into their existing pipeline of content creation, content management and content distribution.

Tom Burns is Field CTO for Media and Entertainment, Dell EMC. He can be reached on Twitter at @TVBurns.