Visual events strongly influence human cognition, yet their intricate structure, multi-level semantics, and dynamic evolution make them difficult for AI to understand. To address this, we propose the video event understanding task, composed of four progressive sub-tasks: video event boundary prediction, video event prediction, video relation classification, and video event inductive reasoning. Together, these sub-tasks aim to enhance high-level comprehension of event scenes, bridge the gap in event understanding and reasoning between CV and NLP, and explore AI's leap from perception to cognition. To our knowledge, our work is the first to support extracting highly summarized events and analyzing their long-term evolution.
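The sketch below illustrates how the four progressive sub-tasks could compose into one pipeline: detected boundaries feed event prediction, whose outputs feed relation classification and, finally, inductive reasoning. Every function body is a dummy placeholder, and all names and signatures are illustrative assumptions rather than the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    start: float       # event boundary start (seconds)
    end: float         # event boundary end (seconds)
    summary: str       # high-level semantic description

def predict_boundaries(video: str) -> List[Tuple[float, float]]:
    """Sub-task 1: video event boundary prediction (dummy output)."""
    return [(0.0, 12.5), (12.5, 30.0)]

def describe_event(video: str, span: Tuple[float, float]) -> Event:
    """Sub-task 2: video event prediction for one segment (dummy output)."""
    return Event(span[0], span[1], "placeholder event summary")

def classify_relations(events: List[Event]) -> List[Tuple[int, int, str]]:
    """Sub-task 3: video relation classification over event pairs (dummy output)."""
    return [(i, i + 1, "temporal") for i in range(len(events) - 1)]

def induce_future(events: List[Event], relations: List[Tuple[int, int, str]]) -> str:
    """Sub-task 4: video event inductive reasoning over the chain (dummy output)."""
    return "predicted continuation of the event chain"

if __name__ == "__main__":
    video = "example.mp4"                                   # hypothetical input
    spans = predict_boundaries(video)                       # sub-task 1
    events = [describe_event(video, s) for s in spans]      # sub-task 2
    relations = classify_relations(events)                  # sub-task 3
    print(induce_future(events, relations))                 # sub-task 4
```

The point of the sketch is the progressive dependency: each sub-task consumes the previous one's output, which is why the four are ordered as they are.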
To facilitate this task, we introduce VidEvent, a large-scale dataset of over 1,000 meticulously annotated film-explanation videos. It contains more than 23,000 high-level semantic events and over 17,000 event relations capturing accurate evolutionary logic. A rigorous annotation process ensures high data quality and reliability.
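As a concrete illustration, a single annotation record pairing events with their relations might look like the following. The field names and values here are assumptions made for exposition, not the dataset's published schema.

```python
import json

# Hypothetical VidEvent-style record: a list of high-level semantic events
# (with temporal boundaries and summaries) plus typed relations between them.
record = json.loads("""
{
  "video_id": "film_0001",
  "events": [
    {"id": 0, "start": 3.2,  "end": 18.7, "summary": "the hero discovers the letter"},
    {"id": 1, "start": 18.7, "end": 41.0, "summary": "the hero confronts the rival"}
  ],
  "relations": [
    {"head": 0, "tail": 1, "type": "causal"}
  ]
}
""")

# Walk the relation graph to trace the annotated evolutionary logic.
for rel in record["relations"]:
    head = record["events"][rel["head"]]["summary"]
    tail = record["events"][rel["tail"]]["summary"]
    print(f"{head} --{rel['type']}--> {tail}")
```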
We also provide baseline models and evaluation metrics that establish a comprehensive benchmark, enabling fair comparison and driving future improvement. Our analysis of VidEvent and the baseline models highlights the dataset's potential for advancing video event understanding and encourages the exploration of more innovative algorithms and models.
Our dataset is showcased on the VidEvent website, and our paper, "VidEvent: A Large Dataset for Understanding Dynamic Evolution of Events in Videos," has been accepted at AAAI 2025.
