Contrastive Diffusion for Cross-Modal and Conditional Generation
Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan


We provide comparisons between the music samples generated from the same input video using different methods.
The first row presents the input video with its original high-quality ground-truth audio track (left) and the music reconstructed by the JukeBox top-level model (right). Note that high-quality music reconstruction with deep learning remains a challenging research problem in its own right. As the example below shows, the JukeBox top-level model (hop length 128) produces a noisy reconstruction with low overall quality and low fidelity to the original. Reconstructing and generating cleaner, higher-quality audio with the bottom-level JukeBox model (hop length 8) requires significantly more computation, e.g., 3 hours for a 20-second music sample. In contrast, synthesizing this 4-second sample takes roughly 5 seconds on the same hardware.
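To give a rough sense of why the smaller hop length is so much more expensive, the sketch below counts how many discrete codes each JukeBox level needs for the same clip, assuming the hop length is the number of raw audio samples summarized by one code. The sample rate of 22,050 Hz and the helper name num_codes are assumptions made for illustration only; they are not stated on this page. The code count is only a proxy for cost and does not reproduce the timings quoted above.

```python
import math


def num_codes(duration_s: float, sample_rate: int, hop_length: int) -> int:
    """Number of discrete codes needed when each code covers `hop_length` samples."""
    return math.ceil(duration_s * sample_rate / hop_length)


SAMPLE_RATE = 22050  # assumed for illustration; not given in the text
DURATION_S = 20.0    # the 20-second example mentioned above

top_level = num_codes(DURATION_S, SAMPLE_RATE, hop_length=128)
bottom_level = num_codes(DURATION_S, SAMPLE_RATE, hop_length=8)
ratio = bottom_level / top_level

print(f"top level (hop 128):  {top_level} codes")
print(f"bottom level (hop 8): {bottom_level} codes")
print(f"bottom/top ratio:     {ratio:.1f}x")
```

Under these assumptions the bottom level must model roughly 16 times as many codes as the top level for the same clip, whatever sample rate is chosen, since the ratio depends only on the two hop lengths (128 / 8).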
The second row presents music samples generated by the existing MIDI-based methods Foley (left) and DANCE2MUSIC (right). Their pre-defined standard music synthesizers do not introduce raw audio noise, but are usually limited to simple, mono-instrumental sounds, which are often a poor fit for complex dance videos.
The third and fourth rows present music samples generated by the existing VQ-based music generation method D2M-GAN (left) and by our contrastive diffusion approach (right). As shown, our method can synthesize longer music sequences with better correspondence to the input.

Left: GT audio from original video (genre: pop). Right: music reconstructed via the JukeBox top-level model.

Music samples generated using existing MIDI-based methods.

Left: music samples generated via the existing VQ-based method D2M-GAN. Right: music samples from our contrastive diffusion model.

Generated Samples for AIST++ and TikTok Datasets

Although our main experiments use 2-second music samples, our proposed contrastive diffusion model can synthesize longer music sequences with reasonable coherence and rhythm, as seen in the AIST++ examples below. We also provide additional examples from the TikTok dataset.

Preliminary Music Editing Results

Here we present some preliminary results for music editing, in which we replace the original paired motion input with motion from a different dance-music genre.

Changing dance-music genre from Breakdancing to Krumping.

Changing dance-music genre from LA-style Hip-Hop to Pop.