Captioning and subtitling workflows should take a hybrid approach

Regular captioning services for deaf and hard-of-hearing viewers began some 30 years ago. At that time, subtitling was a time-consuming and laborious process, and those of us in the vanguard often said that “in five years’ time” technology would revolutionize it and deliver huge strides in automation. Production time ratios (caption preparation time to program running time) have indeed fallen from 40:1 then to somewhere between 5:1 and 10:1 now, depending on the level of difficulty and the quality required. Nevertheless, captioning is still regarded as a cost area that broadcasters would like to reduce, particularly as legislation mandating accessibility starts to bite and localization becomes a priority.

In late 2009, Google-owned YouTube launched its auto-caps service, an automated caption generation system based on speech recognition direct from media files. This caused a predictable flurry in the world of subtitling as people wondered just how good its automatic speech recognition technology would prove to be. In Google’s own words, “The captions will not always be perfect.” In practice, they vary from quite impressive to truly awful, and subtitlers understand only too well the reasons why.

Shortcomings of automated captions

Google has also launched an auto-timing system, which matches a supplied script to the soundtrack to improve caption quality. Again, where the soundtrack contains relatively clear speech, captioners know how useful this kind of automation tool can be. But have a band of flutes strike up, several people speak at once or an intrusive music and effects track cut in, and the limitations soon become apparent.

While speech and language tools have advanced significantly in the last 10 years (and Google should be applauded for aptly demonstrating that point), technology has not yet advanced to the point of complete automation. Indeed, it is still some way from it. The auto-caps service highlights the challenges inherent in unleashing automated speech processing on real-world problems. This article examines the reality for advanced caption production technologies and the relationship between automatic generation and human skills in producing high-quality, broadcast-ready content while minimizing cost.

In any semi-automated process, benefit starts to flow when the tipping point is reached, i.e., when it’s quicker to take a machine-generated version and fix it rather than to do the job manually. Transcription, timing and translation are the three areas where modern speech and language technology is being applied in an effort to deliver productivity while retaining quality. The aim is to use machine assistance for as much of the task as possible, but success lies in integrating this with the necessary process of fixing the difficult material. This means not only recognizing where automation has failed, and directing human attention to the intervention points, but also containing the effects of a problem so one difficult section does not throw the rest of the processing off course.
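To make the idea of intervention points concrete, the sketch below shows one way word-level confidence scores from a recognizer could be turned into a list of “hotspots” for a reviewer. It is illustrative only, not any vendor’s actual tool; the Word structure, the 0.75 threshold and the one-second merging rule are all assumptions.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Word:
        text: str
        start: float       # seconds
        end: float
        confidence: float  # 0.0-1.0, as reported by the recognizer

    def find_hotspots(words: List[Word], threshold: float = 0.75,
                      merge_gap: float = 1.0) -> List[Tuple[float, float]]:
        """Return (start, end) spans of low-confidence recognition, merging
        spans closer than merge_gap seconds so one difficult passage is
        reviewed as a single unit instead of derailing everything after it."""
        spans: List[Tuple[float, float]] = []
        for w in words:
            if w.confidence >= threshold:
                continue
            if spans and w.start - spans[-1][1] < merge_gap:
                spans[-1] = (spans[-1][0], w.end)  # extend the previous hotspot
            else:
                spans.append((w.start, w.end))
        return spans

A reviewer can then step through the returned spans and accept the rest of the machine output untouched, which is the essence of directing human attention rather than replacing it.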

Handled properly, the result is a shifting of the human skills focus from “caption production” toward “directed QA.” By definition, the areas in which intervention is needed will tend to be those that are more difficult; therefore, it would be inappropriate to regard this as a deskilling process. Instead, it can be viewed as maximizing the benefit of the human skill and applying it more efficiently. The transformation can be exemplified by describing the possibilities for an assisted workflow in a little more detail for two cases: a program where a reasonable post-production script can be obtained and one where it cannot.

Production scripts are clearly useful, but they come in many different formats and layouts, and it can take considerable time and effort to strip away the extraneous information and leave the core dialog and perhaps speaker labels and timing data. Tools now exist to dramatically speed up this process by allowing the user to provide some layout “hints” in just a few seconds, which an AI system can use to filter out unwanted material and leave the useful payload. From this, by applying automated segmentation and (if required) coloring processes, a first-pass (generally untimed) caption file emerges.
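As a rough illustration of the hint-driven approach, a handful of regular expressions can stand in for the layout hints: material matching the drop patterns is discarded, speaker labels are captured, and the remaining dialog is wrapped into untimed caption blocks. The patterns, the 37-character row width and the two-row limit are assumptions for the sketch, not a description of any particular product.

    import re
    import textwrap

    # Hypothetical layout "hints": lines to drop and a speaker-label pattern.
    DROP_HINTS = [r"^\s*(INT\.|EXT\.)", r"^\s*SHOT \d+", r"^\s*\(.*\)\s*$"]
    SPEAKER_HINT = re.compile(r"^\s*([A-Z][A-Z ]+):\s*(.*)$")  # "NAME: dialog"

    def script_to_captions(lines, max_chars=37, max_rows=2):
        """Filter a production script down to dialog and return untimed captions."""
        captions, speaker = [], None
        for line in lines:
            if any(re.match(p, line) for p in DROP_HINTS):
                continue                               # shot numbers, scene/stage directions
            m = SPEAKER_HINT.match(line)
            if m:
                speaker, text = m.group(1).strip(), m.group(2)
            else:
                text = line.strip()
            if not text:
                continue
            rows = textwrap.wrap(text, width=max_chars)
            for i in range(0, len(rows), max_rows):    # one caption per max_rows rows
                captions.append({"speaker": speaker, "rows": rows[i:i + max_rows]})
        return captions

From here, speaker coloring can be keyed to the captured labels before any timing is attempted.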

Where no such script is available, the user may elect to respeak the dialog using a trained speech recognizer, rekey it or use an auto-caps-style approach based on automatic speech recognition (ASR). The latter is likely to yield mixed results, working fairly well for clearly spoken documentary-style material and more poorly for drama, cartoons or films. At least for now, this technique is likely to be reserved for genres known to perform reasonably well.

Combining ASR and scripts

A new hybrid technique is now available that combines state-of-the-art ASR and additional post-processing with a script to create a well-timed and textually accurate subtitle file, delivering impressive first-pass results. With hotspot analysis to identify the areas that still need manual intervention, the attention of the user can be focused quickly on the QA priorities. Given that the preprocessing can form part of a more elaborate ingest system, the benefits to productivity are obvious.
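The core of such a hybrid step can be pictured as a word-level alignment between the script and the ASR output. The sketch below is a simplified stand-in using a generic sequence matcher, not the product’s actual algorithm, and it assumes ASR output arrives as (word, start, end) tuples: matched script words inherit the ASR timings, and any unmatched run is flagged as a hotspot for manual timing.

    from difflib import SequenceMatcher

    def align_script_to_asr(script_words, asr_words):
        """script_words: list of str. asr_words: list of (word, start, end) tuples."""
        norm = lambda w: w.lower().strip(".,!?\"'")
        a = [norm(w) for w in script_words]
        b = [norm(w) for (w, _, _) in asr_words]
        timed, hotspots = [None] * len(script_words), []
        for block in SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks():
            for k in range(block.size):
                _, start, end = asr_words[block.b + k]
                timed[block.a + k] = (script_words[block.a + k], start, end)
        # Any untimed run of script words becomes a hotspot needing manual timing.
        i = 0
        while i < len(timed):
            if timed[i] is None:
                j = i
                while j < len(timed) and timed[j] is None:
                    j += 1
                hotspots.append((i, j - 1))
                i = j
            else:
                i += 1
        return timed, hotspots

A production system would also have to cope with recognizer substitutions and punctuation, but the division of labor is the same: the machine times what it can and flags what it cannot.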

Other benefits come from managing the metadata associated with different programs and genres. For example, spelling lists can be compiled for the names of people and places, and an integrated reference database can be built up that serves a team of captioners with accumulated knowledge. If a captioner has worked on the pilot of a series and created a spelling supplement while checking it, that list needs to be shared with the next person who works on the series, and it then grows with their contributions, and so on. There are three benefits: the effort required to pull together all of the reference information needed for a job is minimized; as users adopt the new technology tools, the recognizer’s vocabulary can be seeded with those names and terms; and the output becomes more consistent.
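A minimal sketch of the shared-reference idea might look like the following, with one vocabulary file per series into which each captioner’s spelling supplement is merged and which can later be exported to seed a recognizer. The file layout and function names are illustrative assumptions.

    import json
    from pathlib import Path

    VOCAB_DIR = Path("series_vocab")

    def load_vocab(series_id: str) -> set:
        """Read the accumulated spelling list for a series, if one exists."""
        path = VOCAB_DIR / f"{series_id}.json"
        return set(json.loads(path.read_text())) if path.exists() else set()

    def add_terms(series_id: str, new_terms) -> set:
        """Merge a captioner's spelling supplement into the shared list."""
        VOCAB_DIR.mkdir(exist_ok=True)
        vocab = load_vocab(series_id) | set(new_terms)
        (VOCAB_DIR / f"{series_id}.json").write_text(
            json.dumps(sorted(vocab), ensure_ascii=False, indent=2))
        return vocab

    # e.g. the pilot's captioner contributes character and place names,
    # and later episodes (and the ASR vocabulary) reuse the same list:
    # add_terms("example_series", ["Llanfairpwll", "Okonkwo"])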

Translation is the next process with the potential for automation, and here, too, automated tools benefit from access to a back catalog of same-genre material and known good translation pairs. This again underlines the value of an archive of previously subtitled material that can be used to build automation expertise and accuracy.
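One common way to exploit such a back catalog is a translation-memory lookup: reuse an archived human translation when a new source line is close enough to one already translated, and fall back to machine translation plus human review otherwise. The sketch below is only an assumption about how such a lookup might work, with an invented two-entry archive and an arbitrary 0.85 similarity threshold.

    from difflib import SequenceMatcher

    # Illustrative archive of previously subtitled (source, target) pairs.
    archive = [
        ("Good evening and welcome.", "Guten Abend und willkommen."),
        ("See you next week.", "Bis nächste Woche."),
    ]

    def tm_lookup(source_line, memory=archive, threshold=0.85):
        """Return the best archived translation if it is similar enough,
        otherwise None so the line falls through to machine translation
        and human review."""
        best, best_score = None, 0.0
        for src, tgt in memory:
            score = SequenceMatcher(None, source_line.lower(), src.lower()).ratio()
            if score > best_score:
                best, best_score = tgt, score
        return best if best_score >= threshold else None

    print(tm_lookup("Good evening and welcome."))  # exact hit from the archive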

All these examples show how a caption and subtitle production system can be automated where desirable, while still guiding the human expert to the areas that require full manual control. Building a shared knowledge base is a key further benefit. All of these tools increase productivity without sacrificing quality and without devaluing the role and skills of the captioner. By deploying such a state-of-the-art closed-captioning system, broadcasters and subtitling houses can ensure they mingle the science of automation with the art of human intervention to deliver the economies required.