SEATTLE—Researchers at the University of Washington have created video of President Obama saying something he did not say. Using the former president as a subject, Supasorn Suwajanakorn, Steven M. Seitz and Ira Kemelmacher-Shlizerman synthesized a “high-quality video of him speaking with accurate lip sync, composited into a target video clip.”
They described the process as follows in their article, “Synthesizing Obama: Learning Lip Sync from Audio,” published this month in ACM Transactions on Graphics:
“Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high-quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.”
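The first stage the quote describes — a recurrent network consuming audio features and emitting one mouth shape per time instant — can be sketched as a toy forward pass. The dimensions, feature types, and random weights below are all hypothetical stand-ins; the paper's actual trained model, and the texture-synthesis and compositing stages that follow it, are far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 audio features per frame (e.g. MFCC-style
# coefficients), 20 mouth-shape coefficients (e.g. weights of a PCA
# basis over lip landmarks). Neither number is from the paper.
AUDIO_DIM, HIDDEN_DIM, MOUTH_DIM = 13, 64, 20

# Randomly initialised weights stand in for trained parameters.
W_xh = rng.normal(scale=0.1, size=(HIDDEN_DIM, AUDIO_DIM))
W_hh = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
W_hy = rng.normal(scale=0.1, size=(MOUTH_DIM, HIDDEN_DIM))
b_h = np.zeros(HIDDEN_DIM)
b_y = np.zeros(MOUTH_DIM)

def audio_to_mouth_shapes(audio_frames):
    """Run a simple recurrent net over a sequence of per-frame audio
    feature vectors, emitting one mouth-shape vector per frame."""
    h = np.zeros(HIDDEN_DIM)              # recurrent hidden state
    shapes = []
    for x in audio_frames:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # update state from audio
        shapes.append(W_hy @ h + b_y)            # mouth shape at this instant
    return np.stack(shapes)

# 100 frames of placeholder audio features -> 100 mouth shapes.
audio = rng.normal(size=(100, AUDIO_DIM))
mouth = audio_to_mouth_shapes(audio)
print(mouth.shape)  # (100, 20)
```

The recurrence is the point of the design: because each hidden state depends on the previous one, the predicted mouth shape can reflect coarticulation, not just the current audio frame.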
The team said they selected Obama for several reasons, including an “abundance” of publicly available high-definition video footage from his weekly addresses, comprising 17 hours and 2 million frames over eight years; the consistency of lighting and other production variables; and the former president’s typically “serious and direct tone.”
Creating a realistic lip sync has proven difficult, the researchers wrote, “due in part to the technical challenge of mapping from a one-dimensional signal to a 3D time-varying image, but also due to the fact that humans are extremely attuned to subtle details in the mouth region; many previous attempts at simulating talking heads have produced results that look uncanny.”
Implications for actual fake news notwithstanding, the researchers said the technology could have beneficial applications, such as reducing video transmission bandwidth, synthesizing video that allows lip reading from audio alone, and animating avatars for special effects and games.
The researchers also prepared a presentation for SIGGRAPH 2017.