How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

1Mohamed bin Zayed University of AI, 2ETH Zurich, 3Google, 4TU Munich, 5Linköping University, 6Australian National University

Motivated by the rapidly expanding real-world applications of Video Large Multi-modal Models (Video-LMMs) and the lack of benchmarks for complex video understanding, we present a new evaluation benchmark: the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for Video-LMMs. CVRR-ES comprehensively evaluates recent Video-LMMs on their reasoning capabilities over complex, real-world videos and on their robustness to user prompts posed as text queries.



Left: Our Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) comprises 11 diverse video evaluation dimensions encompassing a variety of complex and real-world contexts for evaluating Video Large Multi-modal Models (Video-LMMs). Right: Overall performance of recent Video-LMMs on the CVRR-ES benchmark. Results for each Video-LMM are averaged across 11 video dimensions shown on the left.

We observe that most Video-LMMs struggle to reason over complex videos (rows 1-3) and exhibit weak robustness and rectification capabilities when prompted to generate answers for user questions that can sometimes be confusing (row 4). The QA pairs in our CVRR-Evaluation Suite assess the performance of Video-LMMs beyond general video comprehension.

Abstract

Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect to assess their reasoning over complex, real-world videos and their robustness to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available.

CVRR-ES assesses the reasoning and robustness of Video-LMMs on complex videos in real-world contexts.

Main contributions:
  1. Complex Video Reasoning and Robustness Benchmark: We present CVRR-ES, an open-ended Video Question Answering benchmark designed to assess the reasoning and robustness capabilities of Video-LMMs across 11 diverse world-centric complex video dimensions.
  2. Comprehensive Evaluation: We evaluate recent Video-LMMs on the CVRR-ES benchmark and find that most models exhibit weak performance, highlighting their limited reasoning in complex videos and lack of robustness towards user text queries.
  3. Key Analysis: We conduct extensive analysis and formulate important conclusions about Video-LMMs based on their failure cases and performance on the CVRR-ES benchmark. Our findings provide valuable insights for building the next generation of human-centric AI systems with improved robustness and reasoning capabilities.
  4. Dual-Step Contextual Prompting Technique: To improve Video-LMMs' reasoning and robustness abilities, we formulate a model-agnostic, training-free prompting technique that effectively enhances their performance on the CVRR-ES benchmark.

CVRR-Evaluation Suite Overview

(Left) Frequency distribution of question types. (Right) Illustration of the most frequent keywords in the answer set of the CVRR-ES benchmark.

Overview of the CVRR-ES dataset. The CVRR-ES benchmark consists of 2,400 open-ended question-answer (QA) pairs spanning 214 unique videos for evaluating Video-LMMs. The benchmark assesses their robustness to user text queries and their reasoning capabilities over a variety of complex and contextual videos covering 11 diverse evaluation dimensions, listed below:

  1. Multiple actions in a single video.
  2. Fine-grained action understanding.
  3. Partial actions.
  4. Time order understanding.
  5. Non-existent actions with existent scene depictions.
  6. Non-existent actions with non-existent scene depictions.
  7. Continuity and object instance count.
  8. Unusual and physically anomalous activities.
  9. Interpretation of social context.
  10. Understanding of emotional context.
  11. Interpretation of visual context.

Across these evaluation dimensions, we curate high-quality question-answer pairs that probe both reasoning and robustness (through the lens of complex and confusing user queries). Please refer to the main paper for detailed definitions of the evaluation dimensions.
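
To make the setup concrete, below is a minimal sketch of how one evaluation entry could be represented in code. The field names and the annotation file layout are illustrative assumptions, not the released schema; please refer to our repository for the actual format.

```python
import json
from dataclasses import dataclass

@dataclass
class CVRRExample:
    video_path: str   # one of the 214 unique videos
    dimension: str    # one of the 11 evaluation dimensions listed above
    question: str     # open-ended user query (possibly confusing or misleading)
    answer: str       # ground-truth answer used by the Judge LLM

def load_benchmark(annotation_file: str) -> list[CVRRExample]:
    """Load the open-ended QA pairs; each pair is evaluated independently."""
    with open(annotation_file) as f:
        records = json.load(f)
    return [CVRRExample(**r) for r in records]
```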

Dual-Step Contextual Prompting technique for Video-LMMs

We note that the majority of Video-LMMs are trained using only positive examples and video-conversational templates that are primarily limited to tasks such as video captioning and video question answering. This leads to highly over-affirmative behavior and a lack of self-rectification ability when faced with noisy (e.g., confusing) user questions. Additionally, these templates place minimal focus on enhancing reasoning and robustness through reasoning-based instruction-tuning pairs, resulting in weak performance of such models on the robustness and reasoning QA evaluations in the CVRR-ES benchmark.

To address these challenges, we introduce a prompting technique for Video-LMMs called Dual-Step Contextual Prompting (DSCP), which steers the Video-LMM's focus toward enhanced reasoning while simultaneously encouraging robust, grounded answers. DSCP is a two-step prompting method that 1) uses principled prompt instructions (shown in blue in the figure above) to ensure the model comprehends the video while reasoning over crucial aspects of complex video understanding, such as contextual information and the relationships between objects and motions, and 2) encourages robustness by generating the response to the question while conditioning on both the video and the context retrieved in the first step (shown in green in the figure above).
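
A minimal sketch of the two-step flow is given below. The prompt wording and the `video_lmm.generate(...)` interface are placeholders standing in for the actual instructions shown in the figure; the exact prompts are provided in the main paper.

```python
# Hypothetical step-1 instructions (paraphrased); the actual principled prompt
# is given in the figure above (blue box).
REASONING_PROMPT = (
    "Carefully describe the video: the scene, the objects and people present, "
    "their interactions and motions, and the order in which events occur."
)

def dscp_answer(video_lmm, video, question):
    # Step 1: elicit grounded contextual reasoning about the video itself,
    # independent of the (possibly confusing) user question.
    context = video_lmm.generate(video=video, prompt=REASONING_PROMPT)

    # Step 2: answer the question conditioned on both the video and the
    # step-1 context, with an explicit instruction to stay grounded and to
    # push back on premises the video does not support (green box).
    grounded_prompt = (
        f"Video context: {context}\n"
        f"Question: {question}\n"
        "Answer using only what is visible in the video; if the question "
        "assumes something that does not occur, say so."
    )
    return video_lmm.generate(video=video, prompt=grounded_prompt)
```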

The DSCP technique effectively enhances the performance of Video-LMMs on the CVRR-ES benchmark. Below we show some qualitative examples of the proposed DSCP method; please refer to the main paper for more details.

Experimental results on CVRR-Evaluation Suite

Performance of Video-LMMs on CVRR-ES

In the table below, we present the evaluation results of Video-LMMs on the 11 dimension categories of the CVRR-ES benchmark. For each QA pair, we provide the Video-LMM with the question alongside the corresponding video, and the model generates its prediction auto-regressively. Each QA pair is processed without maintaining chat history. Finally, the predictions are assessed by a Judge LLM, which determines whether each prediction is correct or incorrect based on the ground-truth answer; we report accuracy (%).
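
A hedged sketch of this scoring loop is shown below; the judge prompt wording and the `judge_llm` callable are illustrative assumptions, while the correct/incorrect decision against the ground-truth answer follows the protocol described above.

```python
def judge_prediction(judge_llm, question, ground_truth, prediction) -> bool:
    """Ask the Judge LLM whether a prediction matches the ground-truth answer."""
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {ground_truth}\n"
        f"Predicted answer: {prediction}\n"
        "Is the predicted answer correct? Reply with yes or no."
    )
    return judge_llm(prompt).strip().lower().startswith("yes")

def benchmark_accuracy(judge_llm, qa_pairs, predictions) -> float:
    """qa_pairs: list of (question, ground_truth) tuples; predictions: model outputs."""
    correct = sum(
        judge_prediction(judge_llm, q, gt, pred)
        for (q, gt), pred in zip(qa_pairs, predictions)
    )
    return 100.0 * correct / len(qa_pairs)
```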

| Benchmark Category | Video-LLaMA-2 | VideoChat | Video-ChatGPT | Video-LLaVA | MovieChat | LLaMA-VID | TimeChat | Gemini-V Pro | Gemini-V Flash | GPT4V | GPT-4o | Human |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Multiple actions in a single video | 16.98 | 23.90 | 27.67 | 15.72 | 12.58 | 17.92 | 28.30 | 43.08 | 44.65 | 57.55 | 62.89 | 93.40 |
| Fine-grained action understanding | 29.57 | 33.48 | 26.96 | 25.22 | 23.48 | 26.09 | 39.13 | 51.61 | 64.78 | 77.39 | 80.43 | 95.65 |
| Partial actions | 24.76 | 33.01 | 22.82 | 13.59 | 21.36 | 14.56 | 49.51 | 67.48 | 62.14 | 73.79 | 77.67 | 98.54 |
| Time order understanding | 16.45 | 31.58 | 27.63 | 21.05 | 16.45 | 19.74 | 34.21 | 45.39 | 55.26 | 57.89 | 71.05 | 97.37 |
| Non-existent actions with existent scene depictions | 10.14 | 15.22 | 23.19 | 5.07 | 5.07 | 2.90 | 23.19 | 57.25 | 60.14 | 71.01 | 83.33 | 97.10 |
| Non-existent actions with non-existent scene depictions | 13.19 | 14.58 | 17.36 | 3.47 | 11.81 | 6.94 | 13.89 | 49.64 | 56.30 | 75.00 | 70.14 | 100.00 |
| Continuity and object instance count | 28.25 | 24.29 | 28.41 | 21.47 | 19.77 | 24.86 | 34.46 | 36.16 | 43.50 | 62.71 | 62.71 | 96.49 |
| Unusual and physically anomalous activities | 18.95 | 18.42 | 18.95 | 15.79 | 17.89 | 16.32 | 27.37 | 60.00 | 60.53 | 74.74 | 78.42 | 96.84 |
| Interpretation of social context | 25.00 | 31.07 | 32.50 | 18.93 | 17.14 | 13.93 | 39.29 | 64.29 | 69.64 | 79.64 | 83.57 | 97.51 |
| Understanding of emotional context | 21.92 | 23.63 | 21.23 | 15.07 | 13.70 | 14.73 | 27.40 | 47.26 | 52.74 | 66.44 | 70.89 | 95.55 |
| Interpretation of visual context | 32.60 | 34.43 | 27.84 | 19.78 | 21.25 | 23.08 | 45.05 | 63.00 | 57.51 | 82.42 | 84.25 | 94.87 |
| Average | 21.62 | 25.78 | 24.96 | 15.92 | 16.41 | 16.46 | 32.89 | 53.20 | 57.51 | 70.78 | 75.03 | 96.67 |

We present results for both open-source and closed-source models, alongside human evaluation results, which serve as the upper bound on the benchmark.
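
The Average row is the unweighted mean of the 11 per-dimension accuracies (matching the per-model averages shown in the figure at the top of the page). As a quick worked example for Video-LLaMA-2:

```python
# Per-dimension accuracies for Video-LLaMA-2, copied from the table above.
video_llama2_scores = [16.98, 29.57, 24.76, 16.45, 10.14, 13.19,
                       28.25, 18.95, 25.00, 21.92, 32.60]
average = sum(video_llama2_scores) / len(video_llama2_scores)
print(f"{average:.2f}")  # 21.62, matching the Average row
```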


Effectiveness of DSCP method for improving Video-LMMs performance

We next integrate the DSCP technique with Video-LMMs and present results on the CVRR-ES benchmark in the Figure below. The results indicate that DSCP improves performance compared with standard prompting (i.e., using only the question itself). Gains from the DSCP technique are shown in green.

Different prompting techniques

We study the contribution of each step of DSCP and compare it with the chain-of-thought (CoT) prompting method. Results for five open-source Video-LMMs are shown in the Table below.

| Prompting Method | VideoChat | Video-LLaVA | MovieChat | LLaMA-VID | TimeChat |
|---|---|---|---|---|---|
| Standard prompting | 25.78 | 15.92 | 16.41 | 16.46 | 32.89 |
| Chain of Thought (CoT) | 22.44 | 25.87 | 15.89 | 29.68 | 39.57 |
| DSCP (Stage 1) | 38.07 | 32.12 | 28.05 | 25.13 | 33.04 |
| DSCP (Both stages) | 47.92 | 37.93 | 35.87 | 46.85 | 39.45 |

Main findings and Qualitative Results

Based on the results of Video-LMMs on CVRR-ES, we draw key findings and show qualitative results. These insights can serve as valuable guidance for developing the next generation of Video-LMMs, aiming to make them more robust and reliable when deployed in real-world applications and interacting with humans in the wild.

1) Models excelling at standard VQA benchmarks struggle on the CVRR-ES benchmark. The latest open-source Video-LMMs such as Video-LLaVA, MovieChat, and LLaMA-VID, which are state-of-the-art on standard VQA benchmarks, perform less effectively on the CVRR-ES benchmark. This suggests that current VQA benchmarks, such as ActivityNet-QA and MSRVTT-QA, do not adequately capture the complex video reasoning and robustness scenarios highlighted in our benchmark. Consequently, it also indicates that most newer Video-LMMs are heavily trained to excel on general video comprehension benchmarks at the expense of their generalization, reasoning, and robustness capabilities.

2) Over-affirmative behavior of open-source Video-LMMs. Another important observation about open-source models is their tendency to produce excessively positive and affirmative responses. As shown in the Figure below, open-source Video-LMMs consistently respond with "Yes" when faced with simple reasoning questions as well as with confusing questions that describe non-existent actions and objects. This highlights the vulnerability of these models when interacting with users in real-world scenarios.


3) Tendency towards activity completion. Most open-source Video-LMMs have shown weak performance on the evaluation dimension of partial actions in CVRR-ES, which contains videos focusing on incomplete or atomic actions. In Figure below, it can be observed that most open-source models tend to complete actions, even when only part of the action is provided in the video. To improve the performance of Video-LMMs, it is crucial to incorporate diverse action types during training, including partial and incomplete actions.


4) Weak generalization to extreme OOD videos. With the exception of GPT4V and Gemini, Video-LMMs struggle with the dimension of unusual and physically anomalous activities, indicating weak generalization to out-of-distribution (OOD) videos in which unusual objects and activities co-occur in ways that are extremely rare in typical videos.


5) Limited understanding of temporal order in complex videos. The CVRR-ES benchmark results show that Video-LMMs perform relatively better on the fine-grained action dimension than on the time order understanding dimension. We present failure cases related to the time order dimension in the Figure below. The majority of open-source Video-LMMs struggle to comprehend the correct temporal order of actions within a video.


6) Video-LMMs struggle to understand emotional and social context. The lower performance of Video-LMMs on the social and emotional contextual dimensions of CVRR-ES highlights their limited ability to understand scenes based on contextual cues. For instance, as shown in the Figure below (bottom row), GPT-4V struggles to comprehend a scene where a worker attempts to prevent shoes from getting wet in the rain by moving them under the shade.

Conclusion

Given the expanding role of Video-LMMs in practical world-centric applications, it is vital to ensure that these models perform robustly and exhibit human-like reasoning and interaction capabilities across various complex and real-world contexts. In this work, we present the CVRR-ES benchmark for Video-LMMs, aiming to evaluate Video-LMMs on these very fronts. Through extensive evaluations, we find that Video-LMMs, especially open-source ones, exhibit limited robustness and reasoning capabilities over complex videos involving real-world contexts. Based on our analysis, we formulate a training-free prompting technique that effectively improves the performance of Video-LMMs across various evaluation dimensions of the CVRR-ES benchmark. Furthermore, we analyze and investigate the failure cases of Video-LMMs on the CVRR-ES benchmark and deduce several important findings. We hope that the CVRR-ES benchmark, accompanied by our extensive analysis, will contribute towards building the next generation of advanced world-centric video understanding models.


For additional details about CVRR-Evaluation suite and experimental results, please refer to our main paper. Thank you!

BibTeX

@article{Khattak2024cvrres,
    title={How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs},
    author={Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Hassan, Jameel and Naseer, Muzammal and Tombari, Federico and Khan, Fahad Shahbaz and Khan, Salman},
    journal={arXiv preprint arXiv:2405.03690},
    year={2024}
}