NWT is an end-to-end model that synthesizes video from audio.
We trained our model on a dataset of John Oliver speaking, mostly looking at a camera
at his desk on Last Week Tonight. John Oliver is a unique subject
with a large variety of gestures and expressions.
The dataset covers several years of his show, with noticeable variation
in clothes, seating position, lighting, camera, hair, and age.
NWT is capable of generating natural-looking videos, and
can control some labelled and unlabelled sources of variation.
VQ-VAE1 shows that free-form video generation is
possible; we extend these
findings to show that such a video generator can be conditioned on external inputs.
Like VQ-VAE, NWT is made of two models: a video autoencoder to create a
quantized latent representation, and an autoregressive prior model to
generate new videos. We frame the autoregressive model as an encoder-decoder
model that converts an audio sequence into a sequence of quantized latent representations.
We also design NWT to be able to control labelled and unlabelled attributes
of the videos.
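The two-stage design above can be sketched roughly as follows. This is a minimal illustration only: the shapes, function names, and the nearest-neighbour lookup standing in for the learned encoder and decoder are our own stand-ins, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: a video autoencoder with a quantized (categorical) latent space.
# Hypothetical sizes purely for illustration.
CODEBOOK_SIZE, LATENT_DIM = 256, 8
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def encode_video(features):
    """Map each frame's feature vector to its nearest codebook entry."""
    # features: (T, LATENT_DIM) -- stand-in for a learned encoder's output
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # (T,) discrete codes

def decode_codes(codes):
    """Look the codes back up; a learned decoder would render frames."""
    return codebook[codes]               # (T, LATENT_DIM)

# Stage 2 (not shown): an autoregressive encoder-decoder maps audio
# features to the same discrete codes, so generation is
# audio -> codes -> decoded video.
features = rng.normal(size=(16, LATENT_DIM))
codes = encode_video(features)
recon = decode_codes(codes)
```

The key point is that the prior model never predicts pixels; it predicts indices into the autoencoder's discrete latent space, and the decoder turns those indices into video.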
NWT is capable of generating natural videos that pair well with the audio input.
We can also control a number of aspects of the generation process, which we
demonstrate in the next section.
All generated clips shown are model outputs with no post-processing or
special tricks. None of these clips were seen by the model during training.
We select random audio inputs from the test set and generate corresponding video clips.
We show that the model's choices correlate tightly with the intonation, tone of voice, and
speech content. The model learns nearly perfect lip synchronization, as well as expressive
behaviours such as head movement, facial expressions, body language, and gestures.
In addition to the obvious visual change in clothing, hair, age, etc., NWT also
matches John Oliver's general behaviour to the episode specified. This is due to
the imbalance of the dataset, where some episodes feature much more dynamic behaviour than others.
We can also control the episode at the video-to-video model level independently of
content, i.e. changing John Oliver's
appearance and episode colour features without impacting anything else in the video
(his behaviour, camera position, background, etc.).
To experiment with the limits of audio-to-video episode control, we vary
the episode identifier several times while generating one clip.
Even though this was not an explicit task during training and cannot happen in reality,
NWT maintains lipsync, motion, and location coherency.
We describe style as any features of the video that are only weakly correlated with the audio.
This includes variations in camera position and post-production overlays which are very
difficult to label in video datasets.
Although episode control could be perceived as part of the style, we presented it separately
due to its interpretability. In this work episodes were labelled while all other aspects of style
were not; however, we suspect the model could learn episode-related style without annotation.
NWT can copy style from reference videos independently from the audio.
The model is capable of copying abstract traits from videos, even if they are
from a different episode. These traits are often modified relative to a "default"
of the episode, for example moving the camera some amount in a particular direction
relative to the episode's standard position.
In the following samples, the first row shows the videos used as the style source.
To show the difference between the model's generated videos and the ground truth,
we generate video for each audio clip twice and show the results alongside the corresponding real video clip.
NWT frequently produces gestures that differ from those in the original videos but remain plausible.
A by-product of the video-to-video portion of NWT is its ability to compress and reconstruct video
within the domain it was trained on.
We demonstrate the efficiency of this model by comparing ground truth compressed with
general-purpose H.264 against our model's reconstructions.
Although our models achieve both 2x and 4x higher compression than H.264, the
reconstructions remain perceptually comparable.
We put the compression rate relative to uncompressed data in parentheses.
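A compression rate relative to raw video can be computed as the ratio of uncompressed 24-bit RGB frame data to the compressed size. A small sketch; the resolution, frame rate, and byte counts below are placeholders, not figures from our experiments:

```python
def compression_rate(width, height, fps, duration_s, compressed_bytes):
    """Ratio of raw 24-bit RGB video size to compressed size."""
    raw_bytes = width * height * 3 * fps * duration_s  # 3 bytes per pixel
    return raw_bytes / compressed_bytes

# Hypothetical numbers purely for illustration: a 10 s, 720p, 30 fps clip
# compressed down to 2 MB.
rate = compression_rate(1280, 720, 30, 10, 2_000_000)
```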
As mentioned above, our approach consists of two separately trained models.
The first is a video variational autoencoder that compresses and reconstructs
video clips without supervision. We constrain the latent space to be a combination of categorical distributions.
The second is an encoder-decoder model that takes an audio sequence and predicts the discrete
latent representation of the corresponding video.
Our discretization relies on a new gradient
approximation technique based on attention mechanisms2,3.
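As a rough illustration of how attention can double as a quantizer: queries attend over a codebook, the soft attention output provides a differentiable path (which an autodiff framework would use for gradients), while the hardest-attended entry gives the discrete code. The function names and this particular soft/hard pairing are our sketch, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_quantize(queries, codebook):
    """Quantize by attending over a codebook of candidate codes.

    Returns the discrete selection (hard), the soft attention output
    that gradients could flow through, and the chosen indices.
    """
    scores = queries @ codebook.T               # (N, K) similarity logits
    weights = softmax(scores)                   # soft assignment over codes
    soft = weights @ codebook                   # differentiable path
    idx = scores.argmax(axis=-1)                # discrete selection
    hard = codebook[idx]                        # quantized output
    return hard, soft, idx

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))   # 16 codes of dimension 4
queries = rng.normal(size=(5, 4))     # 5 vectors to quantize
hard, soft, idx = attention_quantize(queries, codebook)
```

In a trained model the backward pass would use the soft attention weights as the gradient approximation for the discrete forward choice.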
We also take inspiration from the controllable expressiveness of text-to-speech models
4, and train
our models to be able to control annotated and unannotated features of the videos.
More details are available in the paper.
There are a number of
existing approaches to generating video of a talking person
from speech input.
In general, previous approaches make use of either an intermediate
representation such as pose keypoints, or real reference frames as
part of the synthesis
input, or both.
Speech2vid5 was the first work to use raw audio data
to generate a talking head,
modifying frames from a reference image or video.
Synthesizing Obama6 demonstrated
substantially better perceptual quality, trained on a single subject.
It uses a network to predict lip shape, which is used to synthesize mouth shapes,
which are recomposed into video frames using reference frames.
Neural Voice Puppetry aims to be quickly adaptable to multiple subjects with
a few minutes of additional video data7.
It accomplishes this by using a non-subject specific 3D face model as an
intermediate representation, followed by a rendering network that can be quickly
tuned to produce subject-specific video. Wav2Lip8 puts emphasis
on being content-neutral; using reference frames it can produce compelling
lipsync results on unseen video. This also allows
it to include hand movement and body movement. Speech2Video9
(not to be confused with the similarly named Speech2vid mentioned above) uses
pose as its intermediate representation and a labelled pose dictionary.
Engineered intermediate representations have the problem of constraining
the output space.
The video rendering stage must make many assumptions from
intermediate representations with low information content, in order to construct expressive
motion. Small differences in facial expression can communicate highly significant
expressive distinctions10, resulting in important
expressive variations being represented with very little change in most
intermediate representations of a face. As a result, even if those differences
are actually highly correlated with learned features from the audio input,
the video rendering stage must accomplish the difficult task of detecting
them from a weak signal.
NWT is distinguished by its end-to-end approach that does not require
domain knowledge. This would allow it to be generalized to many other
subjects in the future.
NWT is an acronym for Next Week Tonight and is pronounced "newt". ↩
1. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural
discrete representation learning, 2018.↩
2. Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches
to attention-based neural machine translation, 2015.↩
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. Attention is all you need, 2017.↩
4. Hsu, W.-N., Zhang, Y., Weiss, R., Zen, H., Wu, Y., Cao, Y., and Wang, Y. Hierarchical
generative modeling for controllable speech synthesis. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=rygkk305YQ. ↩
5. Chung, J. S., Jamaludin, A., and Zisserman, A. You said that? In British Machine Vision
Conference, 2017. ↩
6. Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning
lip sync from audio. ACM Trans. Graph., 36(4), July 2017. ISSN 0730-0301. doi: 10.1145/
3072959.3073640. URL https://doi.org/10.1145/3072959.3073640.↩
7. Thies, J., Elgharib, M. A., Tewari, A., Theobalt, C., and Nießner, M. Neural voice puppetry:
Audio-driven facial reenactment. In ECCV, 2020.↩
8. Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., and Jawahar, C. A lip sync expert is all
you need for speech to lip generation in the wild. Proceedings of the 28th ACM International
Conference on Multimedia, Oct 2020. doi: 10.1145/3394171.3413532.↩
9. Liao, M., Zhang, S., Wang, P., Zhu, H., Zuo, X., and Yang, R. Speech2video synthesis with 3d
skeleton regularization and expressive body poses, 2020.↩
10. Olszanowski, M., Pochwatko, G., Kuklinski, K., Scibor-Rylski, M., Lewinski, P., and Ohme,
R. K. Warsaw set of emotional facial expression pictures: a validation study of facial display
photographs. Frontiers in Psychology, 5:1516, 2015. ISSN 1664-1078. doi: 10.3389/fpsyg.2014.
01516. URL https://www.frontiersin.org/article/10.3389/fpsyg.2014.01516.↩
Rayhane Mama, Marc S. Tyndel, Hashiam Kadhim, Cole Clifford, Ragavan
The authors would like to thank Alex Krizhevsky for his mentorship
and insightful discussions. The authors also thank Alyssa Kuhnert, Aydin Polat, Joe Palermo,
Pippin Lee, Stephen Piron, and Vince Wong for feedback and support.
This work is done purely for research purposes. We do not own the rights to the original Last Week Tonight videos.
The dataset was created using content downloaded from the official HBO Last Week
Tonight YouTube channel.
The original videos used in this project fall under the YouTube standard license.