
We explore the problem of dense-localizing audio-visual events: recognizing and localizing all audio-visual events occurring in an untrimmed video.
We contribute UnAV-100, the first audio-visual dataset based on untrimmed videos. Unlike the previous AVE dataset, UnAV-100 consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 different event categories. Each video often contains multiple audio-visual events, which may be very short or very long and may occur concurrently, as in real-life audio-visual scenes. We believe that UnAV-100, with its realistic complexity, can promote the exploration of comprehensive audio-visual video understanding.
UnAV-100 contains audio-visual events spanning a wide range of domains, including human activities, music performances, animals/vehicles/tools/natural sounds, etc. All audio-visual events occurring in each video are annotated with corresponding event categories and accurate temporal boundaries.
Distribution of Audio-Visual Events

Distribution of video and event duration

Event number in videos
Comparison with related audio-visual datasets


To download the raw videos, we provide a CSV file. Each line in the CSV file has the following columns:
# YouTube ID, start second, end second, train/validation/test split
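As an illustration of how the column layout above might be consumed, here is a minimal Python sketch that parses such rows into typed records and builds a standard YouTube watch URL for each clip. The names `Clip` and `read_clips` are hypothetical helpers, not part of any released tooling. The sketch assumes that fields are comma-separated, that lines starting with `#` are comments, and that start and end times are numeric seconds. The actual file may differ, for example in its header or its split labels.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Clip(NamedTuple):
    """One row of the download list (hypothetical record type)."""
    youtube_id: str
    start: float  # start second
    end: float    # end second
    split: str    # train / validation / test label, kept as written in the file

    @property
    def url(self) -> str:
        # Standard YouTube watch URL built from the video ID.
        return f"https://www.youtube.com/watch?v={self.youtube_id}"


def read_clips(lines: Iterable[str]) -> Iterator[Clip]:
    """Yield a Clip per data row, skipping blank lines and '#' comment lines."""
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row[0].lstrip().startswith("#"):
            continue
        youtube_id, start, end, split = (field.strip() for field in row[:4])
        yield Clip(youtube_id, float(start), float(end), split)
```

To read from disk, pass an open file object, for example `read_clips(open(path, newline=""))`, where `path` points to the downloaded CSV. The resulting `url`, `start`, and `end` values can then be handed to whichever video downloader you prefer.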
Annotations and features (audio and visual) are also available at Baidu Drive (pwd: 6c48)
Models trained on UnAV-100 and evaluation scripts

The UnAV-100 dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. Please contact the authors if you have any queries regarding the dataset.
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
CVPR 2023
@inproceedings{geng2023dense,
title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={22942--22951},
year={2023}
}