NWT: Towards natural audio-to-video generation with representation learning

Note: This page contains the unaltered outputs of NWT models.
For convenience, we provide a higher resolution version of the samples from this page here.
The upscaling is performed with a super-resolution neural network, so the upscaled videos may differ from our models' actual outputs in minor details.
If you want to see NWT's samples without any modification, we recommend watching the videos on this page.
 
Abstract: In this work we introduce NWT, an expressive speech-to-video model. Unlike approaches that use domain-specific intermediate representations such as pose keypoints, NWT learns its own latent representations, with minimal assumptions about the audio and video content. To this end, we propose a novel discrete variational autoencoder with adversarial loss, dVAE-Adv, which learns a new discrete latent representation we call Memcodes. Memcodes are straightforward to implement, require no additional loss terms, are stable to train compared with other approaches, and show evidence of interpretability. To predict on the Memcode space, we use an autoregressive encoder-decoder model conditioned on audio. Additionally, our model can control latent attributes in the generated video that are not annotated in the data. We train NWT on clips from HBO’s Last Week Tonight with John Oliver. NWT consistently scores above other approaches in Mean Opinion Score (MOS) on tests of overall video naturalness, facial naturalness and expressiveness, and lipsync quality. This work sets a strong baseline for generalized audio-to-video synthesis. Samples are available at https://next-week-tonight.github.io/NWT/.
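As a rough illustration of the pipeline described in the abstract, here is a minimal sketch of the intended inference flow (audio → audio-conditioned autoregressive prior over Memcodes → dVAE-Adv decoder → video frames). The interfaces below are hypothetical placeholders written for this page, not NWT's actual code.

```python
# Hypothetical sketch of NWT inference. `prior` and `decoder` stand in for the
# audio-conditioned autoregressive model over Memcodes and the dVAE-Adv decoder;
# their method names and signatures are assumptions for illustration only.
import torch

@torch.no_grad()
def generate_video(prior, decoder, audio, episode_id, style=None):
    if style is None:
        # Sample a style vector from the standard normal prior.
        style = torch.randn(1, prior.style_dim)  # `style_dim` is an assumed attribute
    # Autoregressively sample a grid of discrete Memcodes per frame,
    # conditioned on the audio, the episode index, and the style vector.
    memcodes = prior.sample(audio, episode_id=episode_id, style=style)
    # Decode the Memcode grids into RGB frames with the dVAE-Adv decoder.
    frames = decoder(memcodes, episode_id=episode_id)
    return frames  # e.g. a tensor of shape (num_frames, 3, H, W)
```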
 
Note: To best enjoy the samples, we strongly recommend wearing headphones and ensuring your screen has good brightness. Some Bluetooth headphones introduce an audio delay on this page; we recommend picking an audio and visual setup that plays the following target (not generated) video perfectly, with flawless lip sync and good visibility.


All of the samples below are generated from test-set audio that was not seen by the model during training.

Contents

S1. NWT's random samples (section 3.3)
S2. Episode control (section 3.4)
S3. Style control (section 3.5)
S4. FAR vs MAR vs Ground Truth parallel comparison (section 3.3)
S5. Video compression and reconstruction (section 3.6)

S1. NWT's random samples (section 3.3)

S1.1 Random MAR samples

We present random samples generated by the MAR model. We only provide the model with the input audio and the episode index, and we randomly sample from the style latent variable.
         

 

S1.2 Random FAR samples

We present random samples generated by the FAR model. We only provide the model with the input audio and the episode index, and we randomly sample from the style latent variable.
         

 

S2. Episode control (section 3.4)

S2.1 Audio to Video (Bottom Level)

We select random audio samples from the test set and generate the video for each audio clip several times, changing the episode index given to the model each time.
In each regeneration, we reuse the same style vector, randomly sampled from the style latent.
We use the FAR model to generate these samples.
                               
Episode = ol-1ZAPwfrtAFY | ol-DnpO_RTSNmQ | ol-hkZir1L7fSY | ol-Ylomy1Aw9Hk
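A minimal sketch of the regeneration experiment above, assuming a hypothetical PyTorch-style interface (the method and attribute names are not NWT's actual API): the audio and a single randomly sampled style vector are held fixed, and only the episode index changes.

```python
# Hypothetical sketch: regenerate the same audio clip under each episode index,
# reusing one fixed style vector so only the episode conditioning differs.
import torch

EPISODE_IDS = ["ol-1ZAPwfrtAFY", "ol-DnpO_RTSNmQ", "ol-hkZir1L7fSY", "ol-Ylomy1Aw9Hk"]

@torch.no_grad()
def episode_sweep(far_prior, decoder, audio, episode_to_index):
    # One style vector, sampled once and reused for every regeneration.
    style = torch.randn(1, far_prior.style_dim)  # `style_dim` is an assumed attribute
    videos = {}
    for name in EPISODE_IDS:
        ep = episode_to_index[name]  # assumed mapping from YouTube video id to embedding index
        memcodes = far_prior.sample(audio, episode_id=ep, style=style)
        videos[name] = decoder(memcodes, episode_id=ep)
    return videos
```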

S2.2 Video to Video (Top Level)

We feed real LWT videos to the dVAE-Adv encoder, modify the episode embedding of the dVAE-Adv model, and then decode the video; this changes the episode-dependent "color filters" of the video. In contrast to Audio to Video episode control, it preserves all video details, including movements and background position.
The y-axis denotes the episode ID of the original input video, and the x-axis denotes the episode ID we convert the video to.
                               
Episode = ol--YkLPxQp_y0 | ol-0UjpmT5noto | ol-eAFnby2184o | ol-nh0ac5HUpDU
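A minimal sketch of this top-level manipulation, assuming a hypothetical encode/decode interface on the dVAE-Adv (not its actual API): because the Memcodes are left untouched and only the episode embedding is swapped at decode time, motion and background layout are preserved while the episode-dependent color grading changes.

```python
# Hypothetical sketch of Video to Video episode control with the dVAE-Adv.
import torch

@torch.no_grad()
def recolor_to_episode(dvae_adv, video, source_episode, target_episode):
    # Encode the real frames into Memcodes under the source episode's embedding.
    memcodes = dvae_adv.encode(video, episode_id=source_episode)
    # Decode the same Memcodes while swapping in the target episode's embedding.
    return dvae_adv.decode(memcodes, episode_id=target_episode)
```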

S2.3 Mid-sample quick change

We select random audio samples from the test set and generate the video for each audio clip while changing the episode index four times during generation. The model must figure out the transitions between episodes and continue the generation on its own.
The style vector is randomly sampled from the style latent.
We use the FAR model to generate these samples.
       
Episode Sequence 1
Episode Sequence 2
Episode Sequence 3
Episode Sequence 4
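A minimal sketch of the mid-sample episode change, assuming a hypothetical incremental sampling interface for the FAR model (init_state / step / num_frames are illustrative names, not NWT's actual API): the episode index is swapped at scheduled frames while sampling is still in progress, so the model has to produce the transitions itself.

```python
# Hypothetical sketch: change the episode index several times during generation.
import torch

@torch.no_grad()
def generate_with_episode_schedule(far_prior, decoder, audio, schedule, style):
    """schedule: list of (start_frame, episode_id) pairs in increasing order,
    with the first entry starting at frame 0."""
    frames = []
    state = far_prior.init_state(audio, style=style)  # assumed incremental-sampling state
    for t in range(far_prior.num_frames(audio)):      # assumed helper: frame count for this audio
        # Episode index active at frame t (last scheduled change at or before t).
        episode_id = next(ep for start, ep in reversed(schedule) if start <= t)
        # Sample the Memcodes for frame t and decode them immediately.
        codes_t, state = far_prior.step(state, episode_id=episode_id)
        frames.append(decoder(codes_t, episode_id=episode_id))
    return torch.cat(frames, dim=0)  # assuming each decoded frame has shape (1, 3, H, W)
```

An incremental interface is assumed here only because the conditioning must change partway through sampling; a single batched call could not express the mid-sample switch.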

 

S3. Style control (section 3.5)

S3.1 Latent variate control

We select random audio samples from the test set and generate the video for each audio clip multiple times while changing a single dimension (or a few, where specified) of the Gaussian latent attribute representation of style.
Some variates turn out to be easily interpretable, such as dimensions 12, 13, and 15, while others are more ambiguous, such as dimension 5.
We use the episode ID of the audio clip, independently of the style vector variation.
We use the FAR model to generate these samples.
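A minimal sketch of the single-dimension sweep shown below, under a hypothetical interface (not NWT's actual code): one variate of the Gaussian style latent is set to μ - 2σ, μ, and μ + 2σ while the remaining variates stay fixed; with a standard normal prior this corresponds to the values -2, 0, and 2.

```python
# Hypothetical sketch: sweep one dimension of the style latent across mu - 2*sigma,
# mu, and mu + 2*sigma (i.e. -2, 0, 2 under a standard normal prior).
import torch

@torch.no_grad()
def sweep_style_dimension(far_prior, decoder, audio, episode_id, dim, base_style=None):
    if base_style is None:
        base_style = torch.randn(1, far_prior.style_dim)  # `style_dim` is an assumed attribute
    videos = []
    for value in (-2.0, 0.0, 2.0):
        style = base_style.clone()
        style[0, dim] = value  # only the swept variate changes; the rest stay fixed
        memcodes = far_prior.sample(audio, episode_id=episode_id, style=style)
        videos.append(decoder(memcodes, episode_id=episode_id))
    return videos  # one video per swept value
```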

Dimension 12 (Vertical Camera Angle: Up -> Down)

                 
Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimension 13 (Horizontal Camera Angle: Left -> Right)

                 
Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimension 15 (Camera Zoom: Zoomed In -> Zoomed Out)

                 
Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimension 5

                 
Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

Dimensions 12 + 13 + 15 (Camera Angle and Zoom: Top-Left Zoomed In -> Bottom-Right Zoomed Out)

                 
Val = μ - 2σ | μ | μ + 2σ
Sample 1
Sample 2

S3.2 Copy style from reference

We select random audio samples from the test set and generate the video for each audio clip multiple times while changing the reference video used to produce the style input.
The top row is the reference video input of the model, encoded and decoded by the dVAE-Adv.
We use the episode ID of the audio clip, independently of the reference video.
We use reference samples from both the same episode and different episodes to show that style is transferred relative to the generated episode.
We use the FAR model to generate these samples.
                             
Sample 1
Sample 2
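A minimal sketch of style copying from a reference clip, assuming a hypothetical style encoder that maps a video to a single style vector (an illustrative name, not NWT's actual API): the inferred style conditions generation on new audio, while the episode ID still comes from the audio clip.

```python
# Hypothetical sketch: infer a style vector from a reference video and reuse it
# to condition generation on a different audio clip.
import torch

@torch.no_grad()
def generate_with_reference_style(far_prior, decoder, style_encoder, audio, episode_id, reference_video):
    # Assumed: the style encoder returns one style vector of shape (1, style_dim).
    style = style_encoder(reference_video)
    memcodes = far_prior.sample(audio, episode_id=episode_id, style=style)
    return decoder(memcodes, episode_id=episode_id)
```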

S3.3 Random style sampling

We select random audio samples from the test set and we generate the video for each audio multiple times while randomly sampling the style vector from the standard normal prior.
We use the episode ID of the audio clip, independently of the style vector variation.
We use the FAR model to generate these samples.
                 
Sample 1
Sample 2
Sample 3

 

S4. FAR vs MAR vs Ground Truth parallel comparison (section 3.3)

We present random samples generated by both the MAR model and the FAR model. We only provide each model with the input audio and the episode index, and we randomly sample from the style latent variable.
We also show the ground truth videos to compare what the model does with what the real John Oliver did in his show.
                                             
Model = FAR | MAR | Ground Truth
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
Sample 10

 

S5. Video compression and reconstruction (section 3.6)

We present random samples compressed and reconstructed by the dVAE-Adv. We provide the model with the episode ID of the input video.
We feed the same video to both our 16x14 model (Mem16) and our 8x7 model (Mem8), and we show each model's reconstruction.
We also show the ground truth video before any compression, as well as the video compressed and restored with h264.
                                                   
Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6
Ground Truth
h264
Mem16
Mem8
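A minimal sketch of this comparison, assuming a hypothetical encode/decode interface shared by both models (not their actual API). The per-frame code counts follow directly from the grid sizes: 16 × 14 = 224 Memcodes per frame for Mem16 versus 8 × 7 = 56 for Mem8.

```python
# Hypothetical sketch: reconstruct the same clip with the 16x14 (Mem16) and
# 8x7 (Mem8) dVAE-Adv models and report the Memcodes used per frame.
import torch

@torch.no_grad()
def compare_reconstructions(mem16, mem8, video, episode_id):
    reconstructions = {}
    for name, model, (grid_h, grid_w) in (("Mem16", mem16, (16, 14)), ("Mem8", mem8, (8, 7))):
        codes = model.encode(video, episode_id=episode_id)      # assumed: (T, grid_h, grid_w) indices
        reconstructions[name] = model.decode(codes, episode_id=episode_id)
        print(f"{name}: {grid_h * grid_w} Memcodes per frame")  # 224 for Mem16, 56 for Mem8
    return reconstructions
```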


The dataset was created using content downloaded from the official Last Week Tonight YouTube channel.
The original videos used in this project fall under the YouTube Standard License.