We explore the problem of dense-localizing audio-visual events: recognizing and localizing all audio-visual events occurring in an untrimmed video.
We contribute UnAV-100, the first audio-visual dataset built on untrimmed videos. Unlike the earlier AVE dataset, UnAV-100 consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video often contains multiple audio-visual events, which may be very short or long and may occur concurrently, as in real-life audio-visual scenes. We believe that UnAV-100, with its realistic complexity, can promote the exploration of comprehensive audio-visual video understanding.
UnAV-100 contains audio-visual events spanning a wide range of domains, including human activities, music performances, and sounds from animals, vehicles, tools, and nature. Every audio-visual event occurring in a video is annotated with its event category and accurate temporal boundaries.
Distribution of Audio-Visual Events
Distribution of video and event duration
Event number in videos
Comparison with related audio-visual datasets
To download raw videos, we provide a csv file. Each line in the csv file has columns defined by:
# YouTube ID, start second, end second, train/validation/test split
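As a rough illustration of reading this file, the sketch below parses rows with the four columns listed above. It assumes the file has no header row other than an optional `#` comment line, that fields are comma-separated, and that start and end are numeric seconds. The names `ClipRecord`, `parse_clip_rows`, and `watch_url` are hypothetical, and the sample IDs and split labels are placeholders rather than real entries from the dataset.

```python
import csv
import io
from typing import List, NamedTuple


class ClipRecord(NamedTuple):
    """One row of the download list (hypothetical container)."""
    youtube_id: str
    start_sec: float
    end_sec: float
    split: str


def parse_clip_rows(text: str) -> List[ClipRecord]:
    """Parse CSV text with columns: YouTube ID, start, end, split."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines and '#' comment lines such as a column description.
        if not row or row[0].lstrip().startswith("#"):
            continue
        # Assumption: exactly the four documented columns come first.
        youtube_id, start, end, split = (field.strip() for field in row[:4])
        records.append(ClipRecord(youtube_id, float(start), float(end), split))
    return records


def watch_url(record: ClipRecord) -> str:
    """Build the standard YouTube watch URL for a record."""
    return f"https://www.youtube.com/watch?v={record.youtube_id}"


# Placeholder rows for illustration only; these are not real dataset entries.
sample = """# YouTube ID, start second, end second, train/validation/test split
abcdefghijk, 10, 40, train
lmnopqrstuv, 0, 25.5, test
"""

records = parse_clip_rows(sample)
for rec in records:
    print(rec.split, watch_url(rec), rec.start_sec, rec.end_sec)
```

To process the real file, its contents can be read into a string and passed to `parse_clip_rows`. A row with fewer than four fields raises a `ValueError`, which makes malformed lines visible instead of silently skipping them.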
Annotations and features (audio and visual) are also available at Baidu Drive (pwd: 6c48)
Models trained on UnAV-100 and evaluation scripts
The UnAV-100 dataset is available to download for commercial and research purposes under a Creative Commons Attribution 4.0 International License. Copyright of each video remains with its original owner. Please contact the authors if you have any queries regarding the dataset.
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
CVPR 2023
@inproceedings{geng2023dense,
  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22942--22951},
  year={2023}
}