SMPTE 2019: Neural Networks Hold Promise for VFX Auto-Rotoscoping

LOS ANGELES—AI and machine learning are helping the manufacturing and transportation sectors reach new levels of performance. But can some of the same tools—specifically those that power computer vision—spare VFX artists the drudgery and long hours of rotoscoping?

Oscar Estrada, a recent graduate of the Rochester Institute of Technology’s Motion Picture Science program, was determined to find out.

Estrada presented his research Oct. 21, the opening day of the SMPTE 2019 Annual Technical Conference and Exhibition at the Westin Bonaventure Hotel in downtown Los Angeles. In his presentation, “Rotoscope Automation with Deep Learning,” Estrada laid out his thesis research into whether a convolutional neural network could extract a person or persons from a video clip without human intervention.

Convolutional neural networks are better suited to rotoscoping than other neural network architectures, especially when large images like 4K frames are in play, he said.

Arranged in “a more image friendly structure,” convolutional neural networks connect each virtual neuron in a layer to a local region of the image, he explained. Multiple filters, each analyzing the image to extract different information, can be applied.
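As a rough illustration of that structure—a minimal sketch in Keras, assuming a TensorFlow workflow the presentation did not specify—each output value of a convolutional layer depends only on a small, kernel-sized patch of the input, and each filter produces its own response map:

```python
# Minimal sketch, not Estrada's code: one convolutional layer.
import tensorflow as tf

layer = tf.keras.layers.Conv2D(
    filters=32,          # 32 filters, each extracting different information
    kernel_size=(3, 3),  # each virtual neuron sees only a local 3x3 patch
    padding="same",
    activation="relu",
)

frame = tf.random.uniform((1, 1080, 1920, 3))  # one 1080p RGB frame
features = layer(frame)
print(features.shape)  # (1, 1080, 1920, 32): one response map per filter
```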

Estrada, in consultation with his advisor and others on his research team, chose Xception, an open-source convolutional network architecture developed by Google that has performed well on image-classification challenges.

For Estrada’s research, Xception was set up to look at images at different resolutions “because it has been proven that lower resolutions and low-frequency content can provide semantic meaning, which is very useful for classification,” he said. Analyzing at high frequencies provided spatial information that enabled Xception to identify “precisely where…[a] person… [was] located within the image and provide[d] accurate segmentation,” said Estrada.
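Keras ships a pre-built Xception, which makes the general shape of such a setup easy to sketch. The following is an illustration under stated assumptions, not Estrada’s actual architecture: it keeps only the lowest-resolution (semantic) encoder features and upsamples them into a person mask, omitting the high-frequency spatial branch he described:

```python
# Simplified sketch: Xception as a segmentation backbone.
import tensorflow as tf

# Encoder pretrained on ImageNet, classifier head removed.
encoder = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(512, 512, 3)
)

# Low-resolution features carry the semantic signal ("a person is here").
x = encoder.output                                         # (None, 16, 16, 2048)
x = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel person score
mask = tf.keras.layers.UpSampling2D(32, interpolation="bilinear")(x)

model = tf.keras.Model(encoder.input, mask)                # (None, 512, 512, 1)
```

A production design along the lines Estrada described would also fuse in higher-resolution encoder features to recover edge detail, rather than relying on upsampling alone.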

To train his implementation of Xception, Estrada used ImageNet, an image database of 14 million images, and COCO, a dataset from which some 46,000 images of people were drawn.
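Wiring those two datasets together is a standard transfer-learning recipe: the ImageNet weights initialize the encoder, and frame/matte pairs built from the COCO person images drive the fine-tuning. Continuing the sketch above—with the data loader left as a hypothetical placeholder, since the exact training pipeline was not detailed:

```python
# Continuing the sketch above; `load_coco_person_masks` is hypothetical.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="binary_crossentropy",  # per-pixel person-vs-background decision
)

# Any pipeline yielding (frame, mask) pairs at 512x512 would do here:
# train_ds = load_coco_person_masks(image_size=(512, 512), batch_size=8)
# model.fit(train_ds, epochs=20)

# Smoke test on random tensors to confirm the shapes line up.
x = tf.random.uniform((2, 512, 512, 3))
y = tf.cast(tf.random.uniform((2, 512, 512, 1)) > 0.5, tf.float32)
model.fit(x, y, epochs=1, batch_size=2)
```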

Estrada put his convolutional neural network up against Roto Brush, the industry-standard rotoscoping tool in Adobe After Effects, and against manually created ground-truth mattes. Students at the Rochester Institute of Technology were recruited to rotoscope the test scenes by hand, producing the ground truth used for comparison. Images for the test were recorded in ProRes at 1080p, 24 frames per second, using a Sony FS5 Mark II and an Arri Alexa Mini.

During his presentation, Estrada showed some of the shots used for his test while laying out his results. They included a wide indoor shot of a person walking across the frame; a medium shot in which the subject was asked to rotate one way and then the other; an overexposed outdoor scene with two people with different skin colors; and a clip of someone walking down a street.

Overall, the convolutional neural network did “pretty well” at outlining people in the various shots.

Results varied from test to test and tool to tool. For example, the shot of the person walking through the frame proved challenging for the After Effects Roto Brush tool. Estrada chose the frame on which the tool performed at its peak and propagated the matte from it. While the Adobe tool had some “propagating issues,” the neural network “performed pretty consistently” but had “issues in dark areas,” such as the subject’s shoes and parts of the jacket worn.

In the case of the shot with rotation, After Effects was more consistent and performed better throughout, while the neural network again struggled in dark areas of the subject’s jacket, he said.

Estrada set up the overexposed shot specifically to see how the convolutional neural network would respond to an outdoor scene and to people with different skin colors.

“The network correctly identified the people, and it performed very well with a matte around them,” said Estrada. However, trees and a portion of a parking lot in the background of the scene proved troublesome for the network. These types of anomalies could easily be removed manually later, he said.

The Adobe After Effects tool performed well in this test, he added.

Similarly, in the shot of the person walking down the street away from the camera, Estrada’s convolutional neural network performed well. However, it once again had trouble with some dark areas and with the subject’s hands, he said.

Estrada’s reports of small deficiencies in the convolutional neural network should not obscure the bigger picture, however. Overall, the network did a good job of automatically separating subjects from the background.

“It’s worth noting that it took the students [rotoscoping by hand] between two and 10 hours … to produce the matte,” he said. “The network did it in less than 10 seconds.”
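Taken at face value, those figures imply a speedup of more than 700x: two to 10 hours is 7,200 to 36,000 seconds of manual work, against fewer than 10 seconds for the network.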

This huge time savings “could be utilized for the artist to refine the network output rather than starting from scratch and spending those two to 10 hours per second of video,” he said.

In conclusion, Estrada told the audience gathered for his presentation that his convolutional neural network performed to his expectations and that this approach to rotoscoping holds promise for the future.

Editor’s note: Oscar Estrada is the recipient of SMPTE’s 2019 Student Paper Award.

Phil Kurz

Phil Kurz is a contributing editor to TV Tech. He has written about TV and video technology for more than 30 years and served as editor of three leading industry magazines. He earned a Bachelor of Journalism and a master’s degree in journalism from the University of Missouri-Columbia School of Journalism.