What is UnAV-100?

We explore the problem of dense-localizing audio-visual events: recognizing and localizing all audio-visual events occurring in an untrimmed video.

We contribute UnAV-100, the first audio-visual dataset based on untrimmed videos. Unlike the earlier AVE dataset, UnAV-100 consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 different event categories. Each video often contains multiple audio-visual events, which may be very short or very long and may occur concurrently, as in real-life audio-visual scenes. We believe that UnAV-100, with its realistic complexity, can promote the exploration of comprehensive audio-visual video understanding.

Statistics

UnAV-100 contains audio-visual events spanning a wide range of domains, including human activities, music performances, and sounds from animals, vehicles, tools, and nature. Every audio-visual event occurring in a video is annotated with its event category and accurate temporal boundaries.

Distribution of Audio-Visual Events

Distribution of video and event duration

Event number in videos

Comparison with related audio-visual datasets

Downloads

To download the raw videos, we provide a CSV file. Each line of the file has the following columns:

# YouTube ID, start second, end second, train/validation/test split
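As a rough illustration of how the column layout above might be read, here is a minimal Python sketch. It assumes the fields are comma-separated, that lines beginning with `#` are comments, and that the split column holds a plain text label; the exact label strings used in the released file are not stated here. The names `ClipEntry`, `parse_clip_rows`, and `youtube_url` are hypothetical helpers, and the sample rows are made up for the example rather than taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipEntry:
    """One row of the download list (hypothetical container, not part of the release)."""
    youtube_id: str
    start_sec: float
    end_sec: float
    split: str


def parse_clip_rows(lines):
    """Parse rows of the form: YouTube ID, start second, end second, split."""
    entries = []
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines and lines starting with '#' (assumed to be comments).
        if not row or row[0].lstrip().startswith("#"):
            continue
        if len(row) < 4:
            raise ValueError(f"expected 4 columns, got {len(row)}: {row!r}")
        video_id, start, end, split = (field.strip() for field in row[:4])
        entries.append(ClipEntry(video_id, float(start), float(end), split))
    return entries


def youtube_url(entry):
    """Build the standard watch URL for an entry's YouTube ID."""
    return f"https://www.youtube.com/watch?v={entry.youtube_id}"


# Made-up rows for illustration only; these are not real dataset entries.
SAMPLE = """\
# YouTube ID, start second, end second, train/validation/test split
abc123XYZ_0, 0, 42, train
def456UVW_1, 10, 35.5, test
"""

clips = parse_clip_rows(io.StringIO(SAMPLE))
for clip in clips:
    print(youtube_url(clip), clip.start_sec, clip.end_sec, clip.split)
```

To read the real file, the same function can be given an open file handle instead of the in-memory sample; the resulting entries can then be passed to whichever download tool you prefer.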

Dataset

Code

Annotations and features (audio and visual) are also available at Baidu Drive (pwd: 6c48)

Models trained on UnAV-100 and evaluation scripts

LICENCE

The UnAV-100 dataset is available to download for commercial and research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. Please contact the authors if you have any queries regarding the dataset.

Publication

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng

CVPR 2023

PDF   |   arXiv   |   BibTeX
@inproceedings{geng2023dense,
    title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
    author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={22942--22951},
    year={2023}
}