NWT: Towards natural audio-to-video generation
with representation learning


Preprint: arXiv

NWT[1] is an end-to-end model that synthesizes video from audio. We trained our model on a dataset of John Oliver speaking, mostly facing the camera at his desk on Last Week Tonight. John Oliver is a unique subject with a large variety of gestures and expressions. The dataset covers several years of his show, with noticeable variation in clothes, seating position, lighting, camera, hair, and age. NWT is capable of generating natural-looking videos and can control some labelled and unlabelled sources of variation.

VQ-VAE1 shows that free-form video generation is possible; we extend these findings to show that we can condition such a video generator on external inputs.

Overview

Like VQ-VAE, NWT is made of two models: a video autoencoder to create a quantized latent representation, and an autoregressive prior model to generate new videos. We frame the autoregressive model as an encoder-decoder model that converts an audio sequence to a sequence of quantized latent video frames. We also design NWT to be able to control labelled and unlabelled attributes of the videos.
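
To make the two-stage structure concrete, here is a minimal sketch of the generation pipeline at inference time. Everything in it is an illustrative assumption: the module names (ToyVideoCodec, ToyAudioPrior), the shapes, and the non-autoregressive prediction head are toy stand-ins, not the actual NWT architectures described in the paper (in particular, the real prior is autoregressive).

```python
# A minimal, hypothetical sketch of the two-stage pipeline at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoCodec(nn.Module):
    """Stage 1 stand-in: maps frames to discrete codes and back (not NWT's architecture)."""
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, kernel_size=8, stride=8)           # per-frame encoder
        self.codebook = nn.Embedding(n_codes, dim)                      # discrete latent vocabulary
        self.dec = nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)  # per-frame decoder

    def encode(self, frames):                      # frames: (T, 3, H, W)
        z = self.enc(frames)                       # (T, dim, h, w)
        z = z.permute(0, 2, 3, 1).flatten(0, 2)    # (T*h*w, dim)
        d = torch.cdist(z, self.codebook.weight)   # distance to every code
        return d.argmin(dim=-1)                    # (T*h*w,) code indices

    def decode(self, codes, t, h, w):              # codes: (T*h*w,)
        z = self.codebook(codes).view(t, h, w, -1).permute(0, 3, 1, 2)
        return self.dec(z)                         # (T, 3, H, W) frames

class ToyAudioPrior(nn.Module):
    """Stage 2 stand-in: predicts one code per latent position from audio features."""
    def __init__(self, n_codes=512, n_mels=80, dim=128):
        super().__init__()
        self.audio_enc = nn.GRU(n_mels, dim, batch_first=True)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, mels, n_positions):          # mels: (1, T_audio, n_mels)
        ctx, _ = self.audio_enc(mels)              # (1, T_audio, dim)
        # Crude alignment: resample the audio context to one vector per latent position.
        ctx = F.interpolate(ctx.transpose(1, 2), size=n_positions).transpose(1, 2)
        return self.head(ctx).argmax(dim=-1).squeeze(0)  # predicted code indices

# Inference glue: audio in, video frames out.
codec, prior = ToyVideoCodec(), ToyAudioPrior()
mels = torch.randn(1, 200, 80)                     # fake log-mel spectrogram
t, h, w = 16, 4, 4                                 # latent grid for 16 output frames
codes = prior(mels, n_positions=t * h * w)
video = codec.decode(codes, t, h, w)               # (16, 3, 32, 32) generated frames
print(video.shape)
```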

Results

NWT is capable of generating natural videos with a significant expressive range. The videos pair well with the audio input. We can also control a number of aspects of the generation process, which we demonstrate in the next section.
All generated clips shown are raw model outputs with no post-processing or special tricks. None of the audio inputs were seen by the model during training.

Random samples

We select random audio inputs from the test set and generate the corresponding video clips. The model's choices correlate tightly with the intonation, tone of voice, and content of the speech. The model learns nearly perfect lip synchronization, as well as expressive behaviours such as head movement, facial expressions, body language, and gestures.

Episode Control

Given the diversity of the data, we designed our model to be able to generate visual characteristics matching a specified episode, independently of the audio.

Audio-to-Video

In addition to the obvious visual changes in clothing, hair, age, etc., NWT also matches John Oliver's general behaviour to the specified episode. This reflects the imbalance of the dataset, where some episodes feature much more dynamic behaviour than others.
[Video samples: ol-1ZAPwfrtAFY, ol-DnpO_RTSNmQ, ol-hkZir1L7fSY, ol-Ylomy1Aw9Hk]

Video-to-video

We can also control the episode at the video-to-video model level independently of the content, i.e. changing John Oliver's appearance and the episode's colour characteristics without affecting anything else in the video (his behaviour, camera position, background, etc.).
[Video samples: ol--YkLPxQp_y0, ol-0UjpmT5noto, ol-eAFnby2184o, ol-nh0ac5HUpDU]

Mid-sample quick change

To experiment with the limits of audio-to-video episode control, we vary the episode identifier several times while generating one clip. Even though this was not an explicit task during training and cannot happen in reality, NWT maintains lip sync, motion, and location coherence.

Style control

We describe style as any feature of the video that is weakly correlated with the audio. This includes variations in camera position and post-production overlays, which are very difficult to label in video datasets. Although episode control could be perceived as part of style, we presented it separately because of its interpretability. In this work episodes were labelled while all other aspects of style were not; however, we suspect the model could learn episode-related style without annotation.

Single dimension control

We manually change one dimension of our style representation inside the model and observe that NWT has learned human-interpretable features, such as camera position (a code sketch of this kind of manipulation follows the examples below):

Vertical Camera Angle: Up -> Down

Horizontal Camera Angle: Left -> Right

Camera Zoom: Zoomed In -> Zoomed Out

Combination: Top Left, Zoomed In -> Bottom Right, Zoomed Out
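
As a rough illustration of this kind of single-dimension sweep, the sketch below perturbs one dimension of a style vector before decoding. The style vector size, the chosen dimension index, and the decode_with_style call mentioned in the final comment are all hypothetical; which dimension maps to which visual trait has to be found by inspecting the trained model.

```python
import torch

def sweep_style_dimension(style, dim_index, values):
    """Return one style vector per value, each differing only in a single dimension.

    `style` is assumed to be a 1-D tensor holding a clip's learned style
    representation; which dimension corresponds to which visual trait
    (e.g. vertical camera angle) has to be found by inspecting the model.
    """
    variants = []
    for v in values:
        s = style.clone()
        s[dim_index] = v             # overwrite a single learned dimension
        variants.append(s)
    return torch.stack(variants)     # (len(values), style_dim)

# Hypothetical usage: sweep a "camera angle" dimension while keeping the audio fixed.
style = torch.zeros(32)              # placeholder style vector of assumed size 32
sweep = sweep_style_dimension(style, dim_index=7, values=[-2.0, -1.0, 0.0, 1.0, 2.0])
print(sweep.shape)                   # torch.Size([5, 32])
# Each row would then be fed to the decoder together with the audio-derived codes,
# e.g. video = decode_with_style(codes, style=sweep[i])  # decode_with_style is hypothetical
```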

Copy style from reference

NWT can copy style from reference videos independently of the audio. The model is capable of copying abstract traits from videos, even if they are from a different episode. These traits are often applied relative to a "default" of the episode, for example moving the camera some amount in a particular direction relative to the episode's standard position.

In the samples below, the first row shows the videos used as the source of the copied style.

Random style sampling

For completeness, we sample randomly multiple times from the learned style representation for a few audio clips. This gives a sense of the range of traits captured by the style representation.

Ground truth comparisons

To show the difference between generated and real video, we generate video for each audio clip twice and display the results alongside the corresponding real video clip. NWT frequently produces gestures that differ from those in the original videos but remain plausible.

Inference 1 | Inference 2 | Ground Truth

Video compression and reconstruction

A by-product of the video-to-video portion of NWT is its ability to compress and reconstruct video within the domain it was trained on. We demonstrate its efficiency by comparing the ground truth, general-purpose H.264 compression, and our model's reconstructions. Even though our models compress roughly 2x and 4x more than H.264, the reconstructions remain perceptually comparable.

We put the compression rate relative to uncompressed data in parentheses.
Ground Truth (1x) | H.264 (200.1x) | NWT (396.1x) | NWT (803.2x)
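
As a reference for how such rates can be computed, the sketch below divides the raw 24-bit RGB bitrate by the size of a compressed representation. The resolution, frame rate, and file size in the example are made up for illustration and are not the actual clip settings or file sizes behind the numbers above.

```python
def compression_ratio(width, height, fps, duration_s, compressed_bytes):
    """Ratio of raw 24-bit RGB video size to the size of a compressed representation."""
    raw_bytes = width * height * 3 * fps * duration_s   # 3 bytes per RGB pixel
    return raw_bytes / compressed_bytes

# Illustrative numbers only: a 10-second 256x256 clip at 30 fps, compressed down to 300 kB.
print(round(compression_ratio(256, 256, 30, 10, 300_000), 1))   # -> 196.6
```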


Our approach

As mentioned above, our approach consists of two separately trained models. The first is a video variational autoencoder that learns, without supervision, to compress and reconstruct video clips; we constrain its latent space to be a combination of categorical distributions. The second is an encoder-decoder model that takes an audio sequence and predicts the discrete latent video representations.

Our discretization is performed with a new gradient approximation technique based on attention mechanisms2,3. We also take inspiration from the controllable expressiveness of text-to-speech models4, and train our models to control annotated and unannotated features of the videos. More details are available in the paper.
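
As a rough sketch of what an attention-based discretization can look like, the toy quantizer below scores encoder outputs against a learned codebook, commits to a single code in the forward pass, and lets gradients flow through the soft attention weights (a straight-through-style approximation). This is only our illustration of the general idea under those assumptions; the exact technique and its gradient approximation are described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionQuantizer(nn.Module):
    """Illustrative attention-style quantizer (not the paper's exact method).

    The encoder output acts as a query over a learned codebook. The forward
    pass commits to the single best code (a categorical latent), while the
    backward pass uses the soft attention distribution as a gradient
    approximation (straight-through style).
    """
    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                            # z: (..., dim)
        logits = z @ self.codebook.weight.t()        # attention scores over codes
        soft = F.softmax(logits, dim=-1)             # differentiable weights
        indices = soft.argmax(dim=-1)                # discrete code choice
        hard = F.one_hot(indices, soft.shape[-1]).type_as(soft)
        # Forward uses the hard one-hot selection; gradients flow through `soft`.
        weights = hard + soft - soft.detach()
        quantized = weights @ self.codebook.weight   # (..., dim)
        return quantized, indices

# Quick check on random features.
quantizer = AttentionQuantizer()
z = torch.randn(2, 16, 64, requires_grad=True)
q, idx = quantizer(z)
q.sum().backward()                                   # gradients reach z via the soft weights
print(q.shape, idx.shape, z.grad.shape)
```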

Related work

There are a number of existing approaches to generating video of a talking person from speech input. In general, previous approaches make use of either an engineered intermediate representation such as pose keypoints, or real reference frames as part of the synthesis input, or both.

Speech2vid5 was the first work to use raw audio to generate a talking head, modifying frames from a reference image or video. Synthesizing Obama6 demonstrated substantially better perceptual quality, trained on a single subject: it uses a network to predict lip shape from audio, synthesizes mouth textures from that shape, and composites them into video frames using reference footage. Neural Voice Puppetry aims to be quickly adaptable to multiple subjects with a few minutes of additional video data7. It accomplishes this by using a non-subject-specific 3D face model as an intermediate representation, followed by a rendering network that can be quickly tuned to produce subject-specific video. Wav2Lip8 puts emphasis on being content-neutral; using reference frames, it can produce compelling lip-sync results on unseen video, which also allows it to include hand and body movement. Speech2Video9 (not to be confused with the similarly named Speech2vid mentioned above) uses pose as its intermediate representation and a labelled pose dictionary for each subject.

Engineered intermediate representations have the problem of constraining the output space. The video rendering stage must make many assumptions from intermediate representations with low information content, in order to construct expressive motion. Small differences in facial expression can communicate highly significant expressive distinctions10, resulting in important expressive variations being represented with very little change in most intermediate representations of a face. As a result, even if those differences are actually highly correlated with learned features from the audio input, the video rendering stage must accomplish the difficult task of detecting them from a weak signal.

NWT is distinguished by its end-to-end approach that does not require domain knowledge. This would allow it to be generalized to many other audio-to-video tasks in the future.

Footnotes:

[1] NWT is an acronym of Next Week Tonight and is pronounced "newt".


References:

1. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning, 2018.

2. Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation, 2015.

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2017.

4. Hsu, W.-N., Zhang, Y., Weiss, R., Zen, H., Wu, Y., Cao, Y., and Wang, Y. Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rygkk305YQ.

5. Chung, J. S., Jamaludin, A., and Zisserman, A. You said that? In British Machine Vision Conference, 2017.

6. Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph., 36(4), July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073640. URL https://doi.org/10.1145/3072959.3073640.

7. Thies, J., Elgharib, M. A., Tewari, A., Theobalt, C., and Nießner, M. Neural voice puppetry: Audio-driven facial reenactment. In ECCV, 2020.

8. Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., and Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. Proceedings of the 28th ACM International Conference on Multimedia, Oct 2020. doi: 10.1145/3394171.3413532. URL http://dx.doi.org/10.1145/3394171.3413532.

9. Liao, M., Zhang, S., Wang, P., Zhu, H., Zuo, X., and Yang, R. Speech2video synthesis with 3d skeleton regularization and expressive body poses, 2020.

10. Olszanowski, M., Pochwatko, G., Kuklinski, K., Scibor-Rylski, M., Lewinski, P., and Ohme, R. K. Warsaw set of emotional facial expression pictures: a validation study of facial display photographs. Frontiers in Psychology, 5:1516, 2015. ISSN 1664-1078. doi: 10.3389/fpsyg.2014.01516. URL https://www.frontiersin.org/article/10.3389/fpsyg.2014.01516.


Authors:

Rayhane Mama, Marc S. Tyndel, Hashiam Kadhim, Cole Clifford, Ragavan Thurairatnam


Acknowledgments:

The authors would like to thank Alex Krizhevsky for his mentorship and insightful discussions. The authors also thank Alyssa Kuhnert, Aydin Polat, Joe Palermo, Pippin Lee, Stephen Piron, and Vince Wong for feedback and support.


License:

This work was done purely for research purposes. We do not own the rights to the original Last Week Tonight videos.
The dataset was created using content downloaded from the official HBO Last Week Tonight YouTube channel.
The original videos used in this project fall under the standard YouTube license.