NWT: Towards natural audio-to-video generation with representation learning
Note:
This page contains samples of NWT's results, upsampled with an external super-resolution model for convenience.
The videos on this page may therefore show small differences from NWT's raw outputs, which can be seen here.
The super-resolution model used on this page can be found in this GitHub repository.
Abstract: In this work we introduce NWT, an expressive speech-to-video model. Unlike approaches that use domain-specific intermediate representations such as pose
keypoints, NWT learns its own latent representations, with minimal assumptions about the audio and video content. To this end, we propose a novel discrete variational
autoencoder with adversarial loss, dVAE-Adv, which learns a new discrete latent representation we call Memcodes. Memcodes are straightforward to implement, require no additional
loss terms, are stable to train compared with other approaches, and show evidence of interpretability. To predict on the Memcode space, we use an autoregressive encoder-decoder
model conditioned on audio. Additionally, our model can control latent attributes in the generated video that are not annotated in the data. We train NWT on clips from
HBO’s Last Week Tonight with John Oliver. NWT consistently scores above other approaches in Mean Opinion Score (MOS) on tests of overall video naturalness, facial
naturalness and expressiveness, and lipsync quality. This work sets a strong baseline for generalized audio-to-video synthesis. Samples are available at
https://next-week-tonight.github.io/NWT/.
Note: To best enjoy the samples, we strongly recommend wearing headphones and making sure your screen has good brightness. Some Bluetooth headphones
cause the audio on this page to play with a delay; we recommend choosing an audio and visual setup that plays
the following target (not generated) video perfectly, with flawless lip sync and good visibility.
All of the samples below are generated from test-set audio that the model never saw during training.
We present random samples generated by the MAR model. We provide the model with only the input audio and the episode index, and we randomly sample from the style latent.
We present random samples generated by the FAR model. We provide the model with only the input audio and the episode index, and we randomly sample from the style latent.
We select random audio samples from the test set and generate the video for each clip several times while changing the episode index given to the model.
In each regeneration experiment, we reuse the same style vector, randomly sampled from the style latent.
We use the FAR model to generate these samples.
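As an illustration, a minimal sketch of this regeneration procedure is given below, assuming a hypothetical sampling interface; `generate_video` and `style_dim` are placeholder names, not part of the released model.

```python
import torch

def regenerate_across_episodes(far_model, audio, episode_ids):
    # One style vector, sampled once from the standard normal prior and reused
    # for every generation, so that only the episode conditioning changes.
    style = torch.randn(far_model.style_dim)
    return [
        far_model.generate_video(audio, episode_id=ep, style=style)
        for ep in episode_ids
    ]
```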
We feed real LWT videos to the dVAE-Adv's encoder, modify the dVAE-Adv's episode embedding, and then decode the video, which changes the "color filters" of the video. In contrast to audio-to-video episode control, this preserves all video details, including movements and background position.
The y-axis denotes the episode id of the original input video, and the x-axis denotes the episode id we convert the video to.
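A minimal sketch of this episode-swap procedure is shown below; the `encode`/`decode` methods and the `episode_id` argument are hypothetical names standing in for the dVAE-Adv's actual interface.

```python
import torch

@torch.no_grad()
def recolor_to_episode(dvae, video, target_episode_id):
    # video: (T, C, H, W) tensor of frames from a real LWT clip.
    # Encode to Memcodes, then decode while conditioning on a different
    # episode embedding; motion and background layout are preserved because
    # the Memcodes themselves are left untouched.
    memcodes = dvae.encode(video)
    return dvae.decode(memcodes, episode_id=target_episode_id)
```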
We select random audio samples from the test set and generate the video for each clip while modifying the episode index 4 times during generation. The model is
tasked with figuring out the transitions between episodes and continuing the generation on its own.
The style vector is randomly sampled from the style latent.
We use the FAR model to generate these samples.
We select random audio samples from the test set and generate the video for each clip multiple times while changing a single dimension (or a few, when specified) of the latent
Gaussian attribute representation of style.
Some variates end up being easily interpretable, such as dimensions 12, 13, and 15, while others are more ambiguous, like dimension 5.
We use the episode id of the audio clip, independently of the style vector variation.
We use the FAR model to generate these samples.
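The sketch below illustrates this single-dimension traversal under the same hypothetical sampling interface as above (`generate_video` and `style_dim` are assumed names).

```python
import torch

def traverse_style_dimension(far_model, audio, episode_id, dim, values):
    # Base style drawn once from the N(0, I) prior; only coordinate `dim` is swept.
    style = torch.randn(far_model.style_dim)
    videos = []
    for v in values:
        style_v = style.clone()
        style_v[dim] = v  # e.g. dim 12, 13 or 15 for the more interpretable variates
        videos.append(far_model.generate_video(audio, episode_id=episode_id, style=style_v))
    return videos
```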
We select random audio samples from the test set and generate the video for each clip multiple times while changing the reference video used to produce the style input.
The top row is the reference video input to the model, encoded and decoded by the dVAE-Adv.
We use the episode id of the audio clip, independently of the reference video.
We use reference samples from both the same episode and different episodes to show that style is transferred relative to the generated episode.
We use the FAR model to generate these samples.
We select random audio samples from the test set and generate the video for each clip multiple times while randomly sampling the style vector from the standard normal prior.
We use the episode id of the audio clip, independently of the style vector.
We use the FAR model to generate these samples.
We present random samples generated by both the MAR model and the FAR model. We provide each model with only the input audio and the episode index, and we randomly sample
from the style latent.
We also show the ground-truth videos as a comparison between what the model generates and what John Oliver actually did on his show.
We present random samples compressed and reconstructed by the dVAE-Adv. We provide the model with the episode id of the input video.
We feed the same video to both our 16x14 model (Mem16) and our 8x7 model (Mem8), and we show each model's reconstruction.
We also show the ground-truth video before any compression, as well as the video compressed and restored with h264.
Samples 1-6, each shown as: Ground Truth, h264, Mem16, Mem8.
The dataset was created using content downloaded from the official Last Week Tonight YouTube channel.
The original videos used in this project fall under the YouTube Standard License.