LongShOT: A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Mohammed Irfan Kurpath1*, Jaseel Muhammad Kaithakkodan1*, Jinxing Zhou1, Sahal Shaji Mullappilly1, Mohammad Almansoori1, Noor Ahsan1, Beknur Kalmakhanbet1, Sambal Shikhar1, Rishabh Lalla1, Jean Lahoud1, Mariette Awad2, Fahad Shahbaz Khan1,3, Salman Khan1, Rao Muhammad Anwer1, Hisham Cholakkal1

*Equal Contribution

1Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)    2American University of Beirut    3Linkoping University

📝 Abstract

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; even those that incorporate open-ended questions and advanced metrics mostly report a single accuracy score, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable, traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility, and every sample is human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs.

💬 3,092 Q&A Samples
📋 16 Task Categories
🎬 45 min Avg. Video Duration
100% Human Verified
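
The abstract notes that every item pairs an open-ended question with a reference answer and a graded rubric. As a rough illustration only, the sketch below shows one way such an item and a rubric-based scorer could be represented in Python; all field and function names (RubricCriterion, BenchmarkItem, rubric_score, weight, etc.) are hypothetical and are not taken from the released dataset format.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    # One gradable aspect of the answer, e.g. "identifies the speaker's stated intent".
    criterion: str
    weight: float  # relative contribution to the item score

@dataclass
class BenchmarkItem:
    # A single LongShOT-style Q&A item (illustrative schema, not the official format).
    video_id: str
    task: str                      # e.g. "Causal Reasoning"
    question: str                  # open-ended, intent-driven question
    reference_answer: str          # free-form reference answer
    rubric: list[RubricCriterion] = field(default_factory=list)

def rubric_score(item: BenchmarkItem, satisfied: dict[str, bool]) -> float:
    """Weighted percentage of rubric criteria judged as satisfied.

    `satisfied` maps each criterion text to a judge's binary verdict;
    in practice the verdict could be graded rather than binary.
    """
    total = sum(c.weight for c in item.rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in item.rubric if satisfied.get(c.criterion, False))
    return 100.0 * earned / total
```

Under this kind of scheme, a judge (model or human) fills in a verdict per criterion, so the final percentage stays traceable to individual rubric items rather than collapsing into a single opaque score.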

⚙️ Pipeline

LongShOTBench Construction Pipeline
Figure 1: Construction pipeline of LongShOTBench. The pipeline starts from raw video data, from which speech, visual, and ambient-audio cues are extracted. These are passed to multimodal processing to generate segment-wise aligned and fused metadata. Only the distilled information flows into question design, where scenarios and question types are mapped, followed by the generation of questions and conversational answers. Next, verifiable rubrics are created to assess correctness and difficulty. Finally, the core dataset of Q&A pairs and tailored evaluation rubrics is manually reviewed and corrected by human validators, yielding a clean, reliable benchmark.
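
To make the stages of Figure 1 concrete, here is a minimal, hypothetical orchestration of the pipeline in Python. The stage functions and their trivial bodies (extract_cues, fuse_metadata, design_questions, attach_rubrics, human_review) are placeholders for illustration and do not correspond to the released pipeline code.

```python
# Illustrative sketch of the Figure 1 stages; stage bodies are placeholders.

def extract_cues(video: dict) -> dict:
    # Stage 1: pull speech transcripts, visual descriptions, and ambient-audio
    # cues out of the raw video (here simply passed through).
    return {"video_id": video["id"], "speech": video.get("speech", []),
            "visual": video.get("visual", []), "audio": video.get("audio", [])}

def fuse_metadata(cues: dict) -> dict:
    # Stage 2: align cues per segment and fuse them into distilled metadata.
    return {"video_id": cues["video_id"],
            "segments": list(zip(cues["speech"], cues["visual"], cues["audio"]))}

def design_questions(metadata: dict) -> list[dict]:
    # Stage 3: map scenarios/question types and draft open-ended Q&A pairs.
    return [{"video_id": metadata["video_id"], "segment": i,
             "question": "...", "answer": "..."}
            for i, _ in enumerate(metadata["segments"])]

def attach_rubrics(qa_pairs: list[dict]) -> list[dict]:
    # Stage 4: attach verifiable rubrics grading correctness and difficulty.
    return [{**qa, "rubric": [{"criterion": "...", "weight": 1.0}]} for qa in qa_pairs]

def human_review(items: list[dict]) -> list[dict]:
    # Stage 5: human validators review and correct every item before release.
    return items

def build_benchmark(raw_videos: list[dict]) -> list[dict]:
    items = []
    for video in raw_videos:
        cues = extract_cues(video)
        metadata = fuse_metadata(cues)
        qa_pairs = design_questions(metadata)
        items.extend(attach_rubrics(qa_pairs))
    return human_review(items)
```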

📊 Benchmark Comparison

Table 1: Comparison with Existing Benchmarks

LongShOTBench is the only benchmark that combines all three modalities with intent-driven Q&A, tool usage, and custom rubrics for interpretable evaluation.

Benchmark Visual Audio Speech Open-Ended Q&A Multi-Turn Q&A Intent-Driven Q&A Tool-Usage Custom-Rubrics
MV-Bench
EgoSchema
LongVideoBench
Moviechat
MLVU
SVBench
LVBench
LvBench *
Video-Holmes
InfiniBench *
Video-MME
LongVALE
TriSense-2M
DailyOmni
LongShOTBench (Ours)

*Subtitle-aided

🎯 Results

Table 2: Performance on LongShOTBench (%)

Gemini: Gemini-2.5-Flash  |  LLaVA-OV: LLaVA-OneVision-Qwen2-7B-ov  |  LLaVA-NV: LLaVA-NeXT-Video-7B-hf  |  Qwen-VL: Qwen2.5-VL-7B-Instruct  |  InternVL: InternVL3.5-8B  |  Qwen-Omni: Qwen2.5-Omni-7B  |  Qwen3-VL: Qwen3-VL-8B-Instruct

Task Gemini LLaVA-OV LLaVA-NV Qwen-VL InternVL Qwen-Omni Qwen3-VL LongShOTAgent (Ours)
Core Perception Tasks
Entity Recognition 43.62 8.12 11.16 20.54 19.95 17.03 27.30 42.84
Event Understanding 41.84 6.66 9.60 14.21 14.47 13.95 22.19 35.41
Temporal Understanding 41.23 7.50 10.34 14.11 15.88 14.17 23.08 31.35
Audio Understanding 37.46 6.08 9.53 9.07 12.36 16.20 26.22 35.51
Avg. 41.04 7.09 10.16 14.48 15.66 15.34 24.70 36.28
Reasoning Tasks
Causal Reasoning 68.41 9.01 14.43 24.76 23.98 23.73 32.58 54.26
Quantitative Reasoning 49.56 1.79 2.92 13.34 14.64 12.92 20.42 45.16
Compositional Reasoning 57.37 11.70 14.60 19.56 22.14 19.53 33.13 45.52
Comparative Analysis 71.24 9.73 13.61 20.72 20.05 17.93 30.53 52.87
Avg. 61.65 8.06 11.39 19.59 20.20 18.53 29.16 49.45
Information Tasks
Information Retrieval 61.02 9.14 13.42 18.78 21.41 18.60 29.58 48.87
Summarization 58.86 12.92 18.22 18.61 22.72 19.16 28.84 60.17
Instruction Extraction 46.53 8.33 9.90 14.62 15.91 13.04 23.50 38.47
Sentiment Analysis 52.18 5.31 7.86 13.10 14.03 15.23 22.59 33.70
Avg. 54.65 8.92 12.35 16.28 18.52 16.51 26.13 45.30
Multimodal Tasks
Multimodal Synthesis 55.38 8.92 11.34 19.14 18.99 16.59 27.95 44.15
Cross Modal Verification 50.45 4.60 10.91 10.81 12.31 11.21 16.82 37.79
Audio Visual Alignment 50.89 9.33 15.19 21.05 24.23 22.41 29.58 44.26
Motion Analysis 61.22 16.98 47.17 40.57 54.72 42.45 71.70 64.15
Avg. 54.49 9.96 13.82 22.89 27.56 23.17 36.51 47.59
Overall 52.95 8.51 13.76 18.39 20.49 18.39 29.12 44.66
Agentic Tasks 40.27 - - - - - - 38.25
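
The per-category Avg. rows and the Overall row in Table 2 are consistent with unweighted means of the task scores above them (Agentic Tasks is reported separately and excluded). A small sketch to recompute these figures for the Gemini-2.5-Flash column; the scores are copied from Table 2, and the averaging scheme is an assumption inferred from the numbers rather than taken from released evaluation code.

```python
# Recompute the Avg. and Overall rows of Table 2 for the Gemini-2.5-Flash
# column, assuming simple unweighted means of the per-task scores.

gemini = {
    "Core Perception Tasks": [43.62, 41.84, 41.23, 37.46],
    "Reasoning Tasks":       [68.41, 49.56, 57.37, 71.24],
    "Information Tasks":     [61.02, 58.86, 46.53, 52.18],
    "Multimodal Tasks":      [55.38, 50.45, 50.89, 61.22],
}

for category, scores in gemini.items():
    # Expected: 41.04, 61.65, 54.65, 54.49 (the Avg. rows, up to rounding).
    print(f"{category}: {sum(scores) / len(scores):.2f}")

all_scores = [s for scores in gemini.values() for s in scores]
# Expected: 52.95 (the Overall row), i.e. the mean over all 16 tasks.
print(f"Overall: {sum(all_scores) / len(all_scores):.2f}")
```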