Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts.
However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities; they neglect to assess reasoning over complex, real-world videos and to measure the robustness of these models to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 11 recent models, including both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available.
(Left) Frequency distribution of question types. (Right) Illustration of the most frequent keywords in the answer set of the CVRR-ES benchmark.
Overview of the CVRR-ES dataset. The CVRR-ES benchmark consists of 2,400 open-ended question-answer (QA) pairs spanning 214 unique videos for evaluating Video-LMMs. The benchmark assesses their robustness to user text queries and their reasoning capabilities over a variety of complex and contextual videos covering 11 diverse evaluation dimensions, which appear as the row categories of the results table below.
We note that the majority of Video-LMMs are trained using only positive examples and video-conversation templates that are primarily limited to tasks such as video captioning and video question answering. This leads to highly over-affirmative behavior and a lack of self-rectification abilities against noisy (e.g., confusing) user questions. Additionally, these templates place minimal focus on enhancing reasoning and robustness through reasoning-based instruction-tuning pairs, resulting in weak performance of such models on the robustness and reasoning QA evaluations of the CVRR-ES benchmark.
To address these challenges, we introduce a prompting technique for Video-LMMs called Dual-Step Contextual Prompting (DSCP), which steers the Video-LMM's focus towards enhanced reasoning while simultaneously encouraging the model to provide robust and grounded answers. DSCP is a two-step prompting method that 1) uses principled prompt instructions (above figure, shown in blue) to ensure that the model comprehends the video while reasoning over crucial aspects of complex video understanding, such as contextual information and the complex relationships between objects and motions, and 2) encourages robustness by generating the response to the question while conditioning on both the video and the context retrieved in the first step (above figure, shown in green).
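To make the two-step flow concrete, below is a minimal sketch of how DSCP could be wrapped around a generic Video-LMM interface. The `VideoLMM` protocol, function names, and prompt wording are illustrative assumptions and do not reproduce the exact principled instructions used in the paper.

```python
# Minimal sketch of the Dual-Step Contextual Prompting (DSCP) flow.
# The VideoLMM interface and the exact prompt wording below are
# illustrative assumptions, not the paper's principled instructions.

from typing import Protocol


class VideoLMM(Protocol):
    def generate(self, video_path: str, prompt: str) -> str:
        """Run the Video-LMM on a video with a text prompt."""
        ...


# Step 1: instructions that steer the model towards reasoning over
# contextual information and object/motion relationships in the video.
STEP1_PROMPT = (
    "Carefully watch the video and describe what is happening. "
    "Pay attention to the overall context, the temporal order of events, "
    "and the relationships between objects, actions, and motions."
)

# Step 2: answer the user question while conditioning on both the video
# and the context extracted in step 1, staying grounded in the video.
STEP2_TEMPLATE = (
    "Video context: {context}\n\n"
    "Using only information grounded in the video and the context above, "
    "answer the following question. If the question refers to something "
    "that does not occur in the video, say so explicitly.\n"
    "Question: {question}"
)


def dscp_answer(model: VideoLMM, video_path: str, question: str) -> str:
    # Step 1: extract reasoning-focused context from the video.
    context = model.generate(video_path, STEP1_PROMPT)
    # Step 2: answer the question conditioned on video + extracted context.
    return model.generate(
        video_path, STEP2_TEMPLATE.format(context=context, question=question)
    )
```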
The DSCP technique effectively enhances the performance of Video-LMMs on the CVRR-ES benchmark. Below, we show some qualitative examples of the proposed DSCP method. Please refer to the main paper for more detailed information.
In the table below, we present the evaluation results of Video-LMMs on the 11 dimension categories of the CVRR-ES benchmark. For each QA pair of the CVRR-ES benchmark, we provide the Video-LMM with the question alongside the corresponding video, and the model generates its prediction in an auto-regressive manner. Each QA pair is processed without maintaining chat history. Finally, the predictions are assessed by a Judge LLM, which determines whether each prediction is correct or incorrect based on the ground-truth answer.
| Benchmark Category | Video-LLaMA-2 | VideoChat | Video-ChatGPT | Video-LLaVA | MovieChat | LLaMA-VID | TimeChat | Gemini-V Pro | Gemini-V Flash | GPT4V | GPT-4o | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Multiple Actions in single video | 16.98 | 23.90 | 27.67 | 15.72 | 12.58 | 17.92 | 28.30 | 43.08 | 44.65 | 57.55 | 62.89 | 93.40 |
| Fine-grained action understanding | 29.57 | 33.48 | 26.96 | 25.22 | 23.48 | 26.09 | 39.13 | 51.61 | 64.78 | 77.39 | 80.43 | 95.65 |
| Partial actions | 24.76 | 33.01 | 22.82 | 13.59 | 21.36 | 14.56 | 49.51 | 67.48 | 62.14 | 73.79 | 77.67 | 98.54 |
| Time order understanding | 16.45 | 31.58 | 27.63 | 21.05 | 16.45 | 19.74 | 34.21 | 45.39 | 55.26 | 57.89 | 71.05 | 97.37 |
| Non-existent actions with existent scene | 10.14 | 15.22 | 23.19 | 5.07 | 5.07 | 2.90 | 23.19 | 57.25 | 60.14 | 71.01 | 83.33 | 97.10 |
| Non-existent actions with non-existent scene | 13.19 | 14.58 | 17.36 | 3.47 | 11.81 | 6.94 | 13.89 | 49.64 | 56.30 | 75.00 | 70.14 | 100.00 |
| Continuity and Object instance Count | 28.25 | 24.29 | 28.41 | 21.47 | 19.77 | 24.86 | 34.46 | 36.16 | 43.50 | 62.71 | 62.71 | 96.49 |
| Unusual and Physically Anomalous activities | 18.95 | 18.42 | 18.95 | 15.79 | 17.89 | 16.32 | 27.37 | 60.00 | 60.53 | 74.74 | 78.42 | 96.84 |
| Interpretation of social context | 25.00 | 31.07 | 32.50 | 18.93 | 17.14 | 13.93 | 39.29 | 64.29 | 69.64 | 79.64 | 83.57 | 97.51 |
| Understanding of emotional context | 21.92 | 23.63 | 21.23 | 15.07 | 13.70 | 14.73 | 27.40 | 47.26 | 52.74 | 66.44 | 70.89 | 95.55 |
| Interpretation of visual context | 32.60 | 34.43 | 27.84 | 19.78 | 21.25 | 23.08 | 45.05 | 63.00 | 57.51 | 82.42 | 84.25 | 94.87 |
| Average | 21.62 | 25.78 | 24.96 | 15.92 | 16.41 | 16.46 | 32.89 | 53.20 | 57.51 | 70.78 | 75.03 | 96.67 |
We present results for both open-source and closed-source models, alongside human evaluation results, which serve as the upper bound on the benchmark.
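For reference, the evaluation protocol described above can be sketched as a simple scoring loop. This is a minimal sketch assuming a `model.generate(video_path, question)` interface and a `judge_llm` callable (both hypothetical names); the actual Judge LLM prompt used in the paper may differ.

```python
# Minimal sketch of the CVRR-ES evaluation protocol: each QA pair is
# scored independently (no chat history), and a Judge LLM marks the
# prediction as correct or incorrect against the ground-truth answer.
# The model and judge_llm interfaces are illustrative assumptions.

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Ground-truth answer: {answer}\n"
    "Predicted answer: {prediction}\n"
    "Reply with 'correct' if the prediction matches the ground truth in "
    "meaning, otherwise reply with 'incorrect'."
)


def evaluate(model, judge_llm, qa_pairs):
    """Return benchmark accuracy (%) over a list of QA pairs.

    Each item in `qa_pairs` is assumed to be a dict with the keys
    'video_path', 'question', and 'answer'.
    """
    num_correct = 0
    for qa in qa_pairs:
        # Fresh query per QA pair: no chat history is carried over.
        prediction = model.generate(qa["video_path"], qa["question"])
        verdict = judge_llm(JUDGE_TEMPLATE.format(
            question=qa["question"],
            answer=qa["answer"],
            prediction=prediction,
        ))
        num_correct += verdict.strip().lower().startswith("correct")
    return 100.0 * num_correct / len(qa_pairs)
```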
We next integrate the DSCP technique with Video-LMMs and present results on the CVRR-ES benchmark in the Figure below. The results indicate that DSCP improves model performance compared with standard prompting (i.e., using only the question itself). Gains from the DSCP technique are shown in green.
We study the contribution of each step of DSCP and compare it with the chain-of-thought (CoT) prompting method. The results for five open-source Video-LMMs are shown in the table below.
| Prompting Method | VideoChat | Video-LLaVA | MovieChat | LLaMA-VID | TimeChat |
|---|---|---|---|---|---|
| Standard prompting | 25.78 | 15.92 | 16.41 | 16.46 | 32.89 |
| Chain of Thought (CoT) | 22.44 | 25.87 | 15.89 | 29.68 | 39.57 |
| DSCP (Stage 1) | 38.07 | 32.12 | 28.05 | 25.13 | 33.04 |
| DSCP (Both stages) | 47.92 | 37.93 | 35.87 | 46.85 | 39.45 |
Based on the results of Video-LMMs on CVRR-ES, we draw key findings and show qualitative results. These insights can serve as valuable guidance for developing the next generation of Video-LMMs, aiming to make them more robust and reliable when deployed in real-world applications and interacting with humans in the wild.
1) Models excelling at standard VQA benchmarks struggle on the CVRR-ES benchmark. The latest open-source Video-LMMs such as Video-LLaVA, MovieChat, and LLaMA-VID, which are state-of-the-art on standard VQA benchmarks, perform less effectively on the CVRR-ES benchmark. This suggests that current VQA benchmarks, like ActivityNet-QA and MSRVTT, do not adequately correlate with the complex video reasoning and robustness scenarios highlighted in our benchmark. Consequently, this also indicates that most newer Video-LMMs are heavily trained to excel on general video comprehension benchmarks at the expense of their generalization, reasoning, and robustness capabilities.
2) Over-affirmative behavior of open-source Video-LMMs. Another important observation about open-source models is their tendency to produce excessively positive and affirmative responses. As shown in the Figure below, open-source Video-LMMs consistently respond with "Yes" when faced with simple reasoning questions as well as with confusing questions that describe non-existent actions and objects. This highlights the vulnerability of these models when interacting with users in real-world scenarios.
3) Tendency towards activity completion. Most open-source Video-LMMs show weak performance on the partial actions evaluation dimension of CVRR-ES, which contains videos focusing on incomplete or atomic actions. In the Figure below, it can be observed that most open-source models tend to complete actions, even when only part of the action is shown in the video. To improve the performance of Video-LMMs, it is crucial to incorporate diverse action types during training, including partial and incomplete actions.
4) Weak generalization to extreme OOD videos. With the exception of GPT4V and Gemini, Video-LMMs struggle with the unusual and physically anomalous activities dimension, indicating weak generalization to out-of-distribution (OOD) videos containing the coexistence of unusual objects and activities that are extremely rare in typical videos.
5) Limited understanding of temporal order in complex videos. The CVRR-ES benchmark results show that Video-LMMs perform relatively better on the fine-grained action understanding dimension than on the time order understanding dimension. We present failure cases related to the time order dimension in the Figure below. The majority of open-source Video-LMMs struggle to comprehend the correct temporal order of actions within a video.
6) Video-LMMs struggle to understand emotional and social context. The lower performance of Video-LMMs on the social and emotional contextual dimensions of CVRR-ES highlights their limited understanding of scenes based on contextual cues. For instance, as shown in the Figure below (bottom row), GPT-4V struggles to comprehend a scene where a worker is attempting to prevent shoes from getting wet in the rain by moving them under the shade.
Given the expanding role of Video-LMMs in practical world-centric applications, it is vital to ensure that these models perform robustly and exhibit human-like reasoning and interaction capabilities across various complex and real-world contexts. In this work, we present the CVRR-ES benchmark for Video-LMMs, aiming to evaluate Video-LMMs on these very fronts. Through extensive evaluations, we find that Video-LMMs, especially open-source ones, exhibit limited robustness and reasoning capabilities over complex videos involving real-world contexts. Based on our analysis, we formulate a training-free prompting technique that effectively improves the performance of Video-LMMs across various evaluation dimensions of the CVRR-ES benchmark. Furthermore, we analyze and investigate the failure cases of Video-LMMs on the CVRR-ES benchmark and deduce several important findings. We hope that the CVRR-ES benchmark, accompanied by our extensive analysis, will contribute towards building the next generation of advanced world-centric video understanding models.
For additional details about CVRR-Evaluation suite and experimental results, please refer to our main paper. Thank you!
@article{Khattak2024cvrres,
title={How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs},
author={Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Hassan, Jameel and Naseer, Muzammal and Tombari, Federico and Khan, Fahad Shahbaz and Khan, Salman},
journal={arXiv:2405.03690},
year={2024}
}