MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

Prerit Gupta, Jason Alexander Fotso-Puepi, Zhengyuan Li, Jay Mehta, Aniket Bera

Purdue University, West Lafayette
ICCV 2025

Paper | arXiv | Data (to be released) | Code (to be released)

We present the Multimodal DuetDance (MDD) dataset, which pairs duet dance motions with music and multi-faceted text annotations that capture a diverse movement vocabulary covering spatial relationships, body movement, and rhythm. Our dataset supports two downstream tasks: Text-to-Duet (Text2Duet) and Text-to-Dance Accompaniment (Text2DanceAcc). The figure displays an annotated sample from the Jive genre.

This figure shows two samples with annotated descriptions from the MDD dataset. The first sample is from Paso Doble, where the dancers perform a Cape lift followed by an Open Spin with a left-side turn. The second sample is from Salsa, where the leader initiates a right turn, followed by a right-side turn in Hammerlock position. On the right, we show per-genre statistics for the percentage durations of motion and annotations, and for beats per minute (BPM).

Abstract

We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and annotated with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance synthesis. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and the follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner.
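The two tasks differ only in which modalities are given as conditions and which are generated. The sketch below makes that split explicit. It is illustrative only: the dataset and code are not yet released, so `DuetSample`, `TaskExample`, `make_text2duet`, `make_text2danceacc`, and the list-of-floats feature layout are all hypothetical names and assumptions, not the authors' data format or API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed representation: one frame is a flat list of floats,
# and a sequence is a list of frames. The real MDD layout may differ.
Frame = List[float]
Sequence = List[Frame]


@dataclass
class DuetSample:
    """Hypothetical container for one annotated MDD clip."""
    text: str           # natural language description of the clip
    music: Sequence     # music features
    leader: Sequence    # leader motion
    follower: Sequence  # follower motion

    def __post_init__(self) -> None:
        # Assumption: both dancers are captured together, so their
        # motion sequences have the same number of frames.
        if len(self.leader) != len(self.follower):
            raise ValueError("leader and follower must have equal length")


@dataclass
class TaskExample:
    """What a model is given (conditions) and what it must generate (targets)."""
    conditions: Dict[str, object]
    targets: Dict[str, object]


def make_text2duet(sample: DuetSample) -> TaskExample:
    # Text-to-Duet: music + text in, both dancers' motions out.
    return TaskExample(
        conditions={"text": sample.text, "music": sample.music},
        targets={"leader": sample.leader, "follower": sample.follower},
    )


def make_text2danceacc(sample: DuetSample) -> TaskExample:
    # Text-to-Dance Accompaniment: music + text + leader in, follower out.
    return TaskExample(
        conditions={
            "text": sample.text,
            "music": sample.music,
            "leader": sample.leader,
        },
        targets={"follower": sample.follower},
    )
```

Under these assumptions, the same annotated clip yields a training example for either task; the only change is that the leader's motion moves from the targets to the conditions.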

Video Presentation

Ballroom

Latin

Social

BibTeX

@InProceedings{Gupta_2025_ICCV,
    author    = {Gupta, Prerit and Fotso-Puepi, Jason Alexander and Li, Zhengyuan and Mehta, Jay and Bera, Aniket},
    title     = {MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {13932-13941}
}