NWT: Towards natural audio-to-video generation with representation learning
Note:
This page contains samples of NWT's results, upsampled with an external super-resolution model for convenience.
The videos on this page may therefore show small differences from NWT's raw outputs, which can be seen here.
The super-resolution model used on this page can be found in this GitHub repository.
Abstract: In this work we introduce NWT, an expressive speech-to-video model. Unlike approaches that use domain-specific intermediate representations such as pose
keypoints, NWT learns its own latent representations, with minimal assumptions about the audio and video content. To this end, we propose a novel discrete variational
autoencoder with adversarial loss, dVAE-Adv, which learns a new discrete latent representation we call Memcodes. Memcodes are straightforward to implement, require no additional
loss terms, are stable to train compared with other approaches, and show evidence of interpretability. To predict on the Memcode space, we use an autoregressive encoder-decoder
model conditioned on audio. Additionally, our model can control latent attributes in the generated video that are not annotated in the data. We train NWT on clips from
HBO’s Last Week Tonight with John Oliver. NWT consistently scores above other approaches in Mean Opinion Score (MOS) on tests of overall video naturalness, facial
naturalness and expressiveness, and lipsync quality. This work sets a strong baseline for generalized audio-to-video synthesis. Samples are available at
https://next-week-tonight.github.io/NWT/.
Note: To best enjoy the samples, we strongly recommend wearing headphones and making sure your screen has good brightness. Some Bluetooth headphones
cause the audio on this page to play with a delay; we recommend choosing an audio and visual setup that plays
the following target (not generated) video perfectly, with flawless lip sync and good visibility.
All of the samples below are generated from test-set audio that the model never saw during training.
We present random samples generated by the MAR model. We provide the model with only the input audio and the episode index, and we randomly sample from the style latent.
We present random samples generated by the FAR model. We provide the model with only the input audio and the episode index, and we randomly sample from the style latent.
We select random audio samples from the test set and generate the video for each clip several times while changing the episode index given to the model.
In each regeneration experiment, we reuse the same style vector, randomly sampled from the style latent.
We use the FAR model to generate these samples.
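As an illustration, a minimal sketch of this regeneration procedure is given below, assuming a hypothetical sampling interface; `generate_video` and `style_dim` are placeholder names, not part of the released model.

```python
import torch

def regenerate_across_episodes(far_model, audio, episode_ids):
    # One style vector, sampled once from the standard normal prior and reused
    # for every generation, so that only the episode conditioning changes.
    style = torch.randn(far_model.style_dim)
    return [
        far_model.generate_video(audio, episode_id=ep, style=style)
        for ep in episode_ids
    ]
```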
We feed real LWT videos to the dVAE-Adv's encoder, modify the dVAE-Adv's episode embedding, and then decode the video, which changes the "color filters" of the video. In contrast to audio-to-video episode control, this preserves all video details, including movements and background position.
The y-axis denotes the episode id of the original input video, and the x-axis denotes the episode id we convert the video to.
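A minimal sketch of this episode-swap procedure is shown below; the `encode`/`decode` methods and the `episode_id` argument are hypothetical names standing in for the dVAE-Adv's actual interface.

```python
import torch

@torch.no_grad()
def recolor_to_episode(dvae, video, target_episode_id):
    # video: (T, C, H, W) tensor of frames from a real LWT clip.
    # Encode to Memcodes, then decode while conditioning on a different
    # episode embedding; motion and background layout are preserved because
    # the Memcodes themselves are left untouched.
    memcodes = dvae.encode(video)
    return dvae.decode(memcodes, episode_id=target_episode_id)
```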
We select random audio samples from the test set and generate the video for each clip while modifying the episode index 4 times during generation. The model is
tasked with figuring out the transitions between episodes and continuing the generation on its own.
The style vector is randomly sampled from the style latent.
We use the FAR model to generate these samples.
We select random audio samples from the test set and generate the video for each clip multiple times while changing a single dimension (or a few, when specified) of the latent
Gaussian attribute representation of style.
Some variates end up being easily interpretable, such as dimensions 12, 13, and 15, while others are more ambiguous, like dimension 5.
We use the episode id of the audio clip, independently of the style vector variation.
We use the FAR model to generate these samples.
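The sketch below illustrates this single-dimension traversal under the same hypothetical sampling interface as above (`generate_video` and `style_dim` are assumed names).

```python
import torch

def traverse_style_dimension(far_model, audio, episode_id, dim, values):
    # Base style drawn once from the N(0, I) prior; only coordinate `dim` is swept.
    style = torch.randn(far_model.style_dim)
    videos = []
    for v in values:
        style_v = style.clone()
        style_v[dim] = v  # e.g. dim 12, 13 or 15 for the more interpretable variates
        videos.append(far_model.generate_video(audio, episode_id=episode_id, style=style_v))
    return videos
```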
We select random audio samples from the test set and generate the video for each clip multiple times while changing the reference video used to produce the style input.
The top row is the reference video input to the model, encoded and decoded by the dVAE-Adv.
We use the episode id of the audio clip, independently of the reference video.
We use reference samples from both the same episode and different episodes to show that style is transferred relative to the generated episode.
We use the FAR model to generate these samples.
We select random audio samples from the test set and generate the video for each clip multiple times while randomly sampling the style vector from the standard normal prior.
We use the episode id of the audio clip, independently of the style vector.
We use the FAR model to generate these samples.
We present random samples generated by both the MAR model and the FAR model. We provide each model with only the input audio and the episode index, and we randomly sample
from the style latent.
We also show the ground-truth videos as a comparison between what the model generates and what John Oliver actually did on his show.
We present random samples compressed and reconstructed by the dVAE-Adv. We provide the model with the episode id of the input video.
We feed the same video to both our 16x14 model (Mem16) and our 8x7 model (Mem8), and we show each model's reconstruction.
We also show the ground-truth video before any compression, as well as the video compressed and restored with h264.
Samples 1-6, each shown as: Ground Truth, h264, Mem16, Mem8.
The dataset was created using content downloaded from the official Last Week Tonight YouTube channel.
The original videos used in this project fall under the YouTube Standard License.