Making Use of Useless Data

Dateline 2014: the “digital universe” was growing at a phenomenal 40 percent annually and was expected to keep doing so “on into the next decade.” At the time, that growth rate reflected conglomerate sets of data generated not only by people and enterprises, but also by what was then a relatively new term, the “Internet of Things (IoT).”

To a broadcast engineer, that abbreviation used to mean “inductive output tube” (IOT), an alternative to the klystron. Both are transmitting tubes used in high-power TV transmitters: the klystron in analog television, and the IOT a more cost-effective device that emerged full strength during the ATSC transition.

The modern-day IoT may have as broad an impact on society as the inductive output tube had on the digital TV broadcast marketplace. The Internet of Things has propelled storage demands and solutions (including the object store) into the next universe, changing the perspective and dimensions of “big data” forever.

COMPREHENDING THE ZETTABYTE ERA

When IDC conducted its study in 2014, it predicted the volume of unstructured data created and copied all over the world would reach 44 zettabytes (1 zettabyte = 10 to the 21st power bytes, or roughly 2 to the 70th), i.e., 44 trillion GB, annually, by 2020. For perspective, the amount of data created and stored in 2013, the year before that IDC prognostication, sat at a mere 4.4 trillion GB. If the prediction is correct, the amount of data will increase tenfold between 2013 and 2020, a pace comparable to the two-year doubling usually associated with Moore’s Law.

Ironically, according to that IDC report, the share of useful data (useful, that is, if tagged and analyzed) is far smaller and is rising much more slowly. In 2013, only 22 percent of the data accumulated in the digital universe was considered “useful”; that is, it was relevant because it was meaningfully tagged or categorized and was searchable and retrievable.

In that same April 2014 report, IDC predicted that by the year 2020 only 37 percent of the data collected would be useful by those same criteria.
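To put those percentages into absolute terms, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (4.4 and 44 zettabytes, 22 and 37 percent) and assumes a seven-year span from 2013 to 2020; the variable and function names are purely illustrative.

```python
# Figures quoted from the IDC study as cited in this column.
ZB_2013 = 4.4        # zettabytes created and copied in 2013
ZB_2020 = 44.0       # zettabytes predicted for 2020
USEFUL_2013 = 0.22   # share considered "useful" in 2013
USEFUL_2020 = 0.37   # share predicted to be "useful" in 2020


def implied_annual_growth(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: the tenfold increase is measured from 2013 to 2020.
years = 2020 - 2013
rate = implied_annual_growth(ZB_2013, ZB_2020, years)

# Absolute volumes of useful and not-useful data in each year.
useful_2013 = ZB_2013 * USEFUL_2013
useful_2020 = ZB_2020 * USEFUL_2020
not_useful_2013 = ZB_2013 - useful_2013
not_useful_2020 = ZB_2020 - useful_2020

print(f"Implied annual growth: {rate:.1%}")
print(f"Useful data:     {useful_2013:.2f} ZB -> {useful_2020:.2f} ZB")
print(f"Not-useful data: {not_useful_2013:.2f} ZB -> {not_useful_2020:.2f} ZB")
```

Under these assumptions the implied growth rate works out to roughly 39 percent a year, in line with the 40 percent figure, while the pile of data that is not useful grows from about 3.4 zettabytes to nearly 28.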

USELESS DATA RETENTION

So why do we continue to store data that isn’t useful? The simple answer: “Because we can.”

Irrespective of how, where, when or why we create this mass of data, we find that most companies, enterprises or individuals collect and save literally everything because, fundamentally, there isn’t the time to sort, catalog or even physically hit the delete key once the data is collected. On the personal level, think of how many VHS tapes or compact discs or DVDs you still have in boxes or on shelves in the basement or the attic.

Putting those collections into today’s perspective, all those memories are essentially just another set of data. If we digitized all those analog VHS tapes into compressed ones and zeros, we’d still have enormous sets of data that would likely remain unmanageable, ignored and probably lost in the digital quagmire of never-never land.

At least in a tape format there was a storage container (the wrapper), information about the content (the metadata) and an easy way to catalog the content: orderly arrangements on shelves or in boxes, a 3x5 card catalog or even a digital picture of the box.

EXPONENTIAL EXPANSION

Production companies, news organizations and broadcasters all generate enormous amounts of data. The volumes continue to expand exponentially and will likely end up in the “no-where’s-land” of the digital landfill. Today, this enterprise digital repository is an ambiguous, unknown depot that might be one of many ubiquitous “clouds”: some on premises, some in that atomic number 26 mountain place, some in a public cloud, and a lot more of it ending up in privately managed datacenters scattered around the globe.

For how long and for what purpose do organizations intend to keep that data? It’s relatively inexpensive to hide those bits in a cloud and nearly zero cost to keep them there, until you want to retrieve them. However, to get meaningful use out of those bits, you need to catalog them. Otherwise, you must pull everything down from the cloud, store it again (locally) and then search through it to find something usable.

For an enterprise of any size, this takes labor—which costs money. And that’s a resource that doesn’t grow automatically, like the data you and your friends and their friends are generating every second of the day.
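The cataloging point can be illustrated with a minimal sketch. It assumes a small local catalog, built here with Python’s standard sqlite3 module, that maps descriptive tags to the keys of objects parked in a cloud, so a search touches only the catalog and nothing is pulled down until a match is found. The table layout, object keys, titles and sizes are all hypothetical.

```python
import sqlite3


def build_catalog(records):
    """Create an in-memory catalog that maps tags to cloud object keys."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE assets (object_key TEXT PRIMARY KEY, title TEXT, size_bytes INTEGER)"
    )
    db.execute("CREATE TABLE tags (object_key TEXT, tag TEXT)")
    for rec in records:
        db.execute(
            "INSERT INTO assets VALUES (?, ?, ?)",
            (rec["key"], rec["title"], rec["size_bytes"]),
        )
        # Tags are stored lowercased so lookups are case-insensitive.
        db.executemany(
            "INSERT INTO tags VALUES (?, ?)",
            [(rec["key"], tag.lower()) for tag in rec["tags"]],
        )
    db.commit()
    return db


def find_by_tag(db, tag):
    """Return (object_key, title, size_bytes) rows for assets carrying `tag`."""
    rows = db.execute(
        "SELECT a.object_key, a.title, a.size_bytes "
        "FROM assets a JOIN tags t ON t.object_key = a.object_key "
        "WHERE t.tag = ? ORDER BY a.object_key",
        (tag.lower(),),
    )
    return rows.fetchall()


# Hypothetical sample records; a real catalog would be filled at ingest time.
records = [
    {"key": "archive/2014/clip-0001.mxf", "title": "Election night live shot",
     "size_bytes": 42_000_000_000, "tags": ["election", "live", "night"]},
    {"key": "archive/2014/clip-0002.mxf", "title": "City hall exterior",
     "size_bytes": 8_500_000_000, "tags": ["exterior", "city hall"]},
    {"key": "archive/2015/clip-0107.mxf", "title": "Election recap package",
     "size_bytes": 3_200_000_000, "tags": ["Election", "package"]},
]

db = build_catalog(records)
matches = find_by_tag(db, "election")
bytes_to_retrieve = sum(size for _key, _title, size in matches)
bytes_in_archive = sum(rec["size_bytes"] for rec in records)

for key, title, _size in matches:
    print(key, "-", title)
print(f"Retrieve {bytes_to_retrieve:,} of {bytes_in_archive:,} bytes")
```

With a catalog like this, only the matching objects need to come back from the cloud; without it, the whole archive would have to be retrieved and searched locally.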

INTELLIGENT DATA STORAGE

When you consider the couple of billion pieces of data generated daily “about” you, your friends and their friends, too, you can see the storage challenges that entities like Facebook, Google, Amazon and the other social media or shopping platforms have on their hands. The difference is that these companies have figured out how to intelligently collect the data, identifying each piece using artificial intelligence algorithms that are, incidentally, either developed in-house or acquired by buying another company with that expertise. Across each social group, they cross-reference every piece of their data and then store it in one or more of their “private” clouds, which are liberally dispersed data centers interconnected by networks based upon volumetric accessibility per region.

Their data is never stored just once. Instead, it is replicated multiple times for accessibility, protection and resiliency. How each organization diagnostically and dynamically protects that information and makes it nearly instantly retrievable is their secret sauce.

Yet today, some of the concepts and principles which social media companies have developed for their own applications are becoming available to individuals and organizations. The goal of these products is to start diminishing the “uselessness” of the data by applying intelligent metadata, so that more conventional search engine approaches can then be used to catalog and retrieve those assets. These new AI-based approaches now differentiate the future from the more traditional legacy media asset management methodologies.

STRUCTURING THE UNSTRUCTURED

What we’ve learned by collecting huge sets of information about known places around the world is now supporting machine learning techniques that create accurate metadata tagged not just to a single image, but to an entire generation of data sets grouped as objects. Such techniques may use the angle of a shadow to identify a time of day, which, when coupled to a geographic (GPS) location, gives more information about the season or the atmospheric conditions. People in images can now be related to their siblings or parents, based upon data sets generated from favorites or albums. Road signs, window lettering on buildings, and other distinguishing characteristics add to the databases about the actual surroundings where that image, and those of others, were collected. What was heretofore considered useless information is now branded and repurposed by machines which “look” for this data and then catalog it without any direct human intervention.

Using these new autonomous techniques, every time a new piece of content (still image, sound or video) enters a system equipped with these technologies, the system turns that previously “unstructured” data into “structured” data that is then cataloged not just as a single image, but as collections of data sets bound into a global storage platform.
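As a rough illustration of that unstructured-to-structured step, the Python sketch below enriches a new asset with derived metadata at ingest. It is a deliberately simple, rule-based stand-in and not a machine learning system: it assumes the capture time (local) and GPS position are already known, and it derives a season and a part of day from them. The AssetRecord type, the function names and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AssetRecord:
    """Structured record for one ingested piece of content (hypothetical schema)."""
    object_key: str
    captured_at: datetime   # assumed to be local time at the capture location
    latitude: float
    longitude: float
    tags: dict = field(default_factory=dict)


def season_for(month, latitude):
    """Meteorological season for a month, flipped south of the equator."""
    northern = ["winter", "winter", "spring", "spring", "spring", "summer",
                "summer", "summer", "autumn", "autumn", "autumn", "winter"]
    season = northern[month - 1]
    if latitude < 0:
        flip = {"winter": "summer", "summer": "winter",
                "spring": "autumn", "autumn": "spring"}
        season = flip[season]
    return season


def part_of_day(hour):
    """Coarse time-of-day bucket; the boundaries are arbitrary assumptions."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def enrich(record):
    """Attach derived, searchable tags to a record as it enters the system."""
    record.tags["hemisphere"] = "southern" if record.latitude < 0 else "northern"
    record.tags["season"] = season_for(record.captured_at.month, record.latitude)
    record.tags["part_of_day"] = part_of_day(record.captured_at.hour)
    return record


# Hypothetical asset: a still captured in Sydney on a July afternoon.
asset = enrich(AssetRecord(
    object_key="stills/2014/img-0042.jpg",
    captured_at=datetime(2014, 7, 4, 15, 30),
    latitude=-33.87,
    longitude=151.21,
))
print(asset.tags)
```

In the systems described above, the tags would come from image analysis rather than fixed rules, but the result has the same shape: a structured record that can be cataloged and searched.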

These are the roots of where we’re headed as the future of storage becomes an indirect, unsuspecting model that makes potentially useless data valuable again, for all.

Karl Paulsen is CTO at Diversified (www.diversifiedus.com) and a SMPTE Fellow. Read more about this and other storage topics in his book “Moving Media Storage Technologies.” Contact Karl at kpaulsen@diversifiedus.com.
