📝 Abstract
Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; even those that include open-ended questions still report a single accuracy score, which obscures failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric, making evaluation interpretable and traceable. LongShOTBench is produced by a scalable generation pipeline, and every sample is human-verified and corrected, ensuring coverage and reproducibility. We also present LongShOTAgent, an agentic system that analyzes long videos through preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs.
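Each item's graded rubric decomposes the reference answer into weighted criteria, which is what makes scores traceable to specific failure modes. The sketch below shows one way such rubric scoring could be computed; the field names, weights, and `judge` callable are illustrative assumptions, not LongShOTBench's released format.

```python
# Minimal sketch of weighted rubric scoring. Field names, weights, and
# the `judge` callable are assumptions for illustration only; the actual
# LongShOTBench schema and grader may differ.
from typing import Callable

def rubric_score(answer: str,
                 rubric: list[dict],
                 judge: Callable[[str, str], bool]) -> float:
    """Percentage of rubric weight satisfied by a model answer.

    `judge(answer, criterion)` decides a single criterion, e.g. an
    LLM-as-judge call or a human annotator's verdict.
    """
    total = sum(c["weight"] for c in rubric)
    earned = sum(c["weight"] for c in rubric if judge(answer, c["criterion"]))
    return 100.0 * earned / total

# Toy usage with a naive substring judge (real grading would use an LLM).
rubric = [
    {"criterion": "chef", "weight": 2.0},   # mentions the chef heard off-screen
    {"criterion": "12:40", "weight": 1.0},  # anchors the claim to a timestamp
]
naive_judge = lambda ans, crit: crit.lower() in ans.lower()
print(rubric_score("At 12:40 the chef explains the recipe.", rubric, naive_judge))
# -> 100.0
```

Because partial credit is weighted per criterion, a wrong answer still reveals which capability failed (e.g. audio grounding vs. temporal localization) instead of collapsing into a single 0/1 accuracy.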
⚙️ Pipeline
📊 Benchmark Comparison
Table 1: Comparison with Existing Benchmarks
LongShOTBench is the only benchmark that combines all three modalities with intent-driven Q&A, tool usage, and custom rubrics for interpretable evaluation (an illustrative item schema is sketched after the table).
| Benchmark | Visual | Audio | Speech | Open-Ended Q&A | Multi-Turn Q&A | Intent-Driven Q&A | Tool-Usage | Custom-Rubrics |
|---|---|---|---|---|---|---|---|---|
| MVBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| EgoSchema | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongVideoBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| MovieChat | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ |
| MLVU | ✓ | ✕ | ✕ | ✓ | ✕ | ✕ | ✕ | ✕ |
| SVBench | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ |
| LVBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LvBench | ✓ | ✕ | ✓* | ✓ | ✕ | ✕ | ✕ | ✕ |
| Video-Holmes | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| InfiniBench | ✓ | ✕ | ✓* | ✓ | ✕ | ✕ | ✕ | ✕ |
| Video-MME | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongVALE | ✓ | ✓ | ✓ | ✕ | ✓ | ✕ | ✕ | ✕ |
| TriSense-2M | ✓ | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ |
| DailyOmni | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongShOTBench (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
*Subtitle-aided
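To make the table's columns concrete, here is what a single LongShOTBench-style item could look like, written as a Python dict. Every field name and value below is a hypothetical illustration of the properties claimed above (three modalities, intent-driven multi-turn Q&A, tool usage, custom rubric), not the benchmark's released schema.

```python
# Hypothetical item illustrating the Table 1 columns; all field names
# are assumptions for illustration, not the released LongShOTBench schema.
item = {
    "video_id": "example_0001",
    "modalities": ["visual", "audio", "speech"],            # all three modalities
    "intent": "verify a narrator's claim against the footage",
    "dialogue": [                                           # open-ended, multi-turn
        {"turn": 1, "question": "What does the narrator claim at 05:12?"},
        {"turn": 2, "question": "Does the on-screen footage support that claim?"},
    ],
    "tools": ["frame_search", "transcript_lookup"],         # agentic tool usage
    "reference_answer": "The narrator claims X, but the footage shows Y.",
    "rubric": [                                             # custom rubric
        {"criterion": "Quotes or paraphrases the 05:12 claim", "weight": 1.0},
        {"criterion": "Cites the contradicting visual evidence", "weight": 2.0},
    ],
}
```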
🎯 Results
Table 2: Performance on LongShOTBench (%)
Gemini: Gemini-2.5-Flash | LLaVA-OV: LLaVA-OneVision-Qwen2-7B-ov | LLaVA-NV: LLaVA-NeXT-Video-7B-hf | Qwen-VL: Qwen2.5-VL-7B-Instruct | InternVL: InternVL3.5-8B | Qwen-Omni: Qwen2.5-Omni-7B | Qwen3-VL: Qwen3-VL-8B-Instruct
| Task | Gemini | LLaVA-OV | LLaVA-NV | Qwen-VL | InternVL | Qwen-Omni | Qwen3-VL | LongShOTAgent (Ours) |
|---|---|---|---|---|---|---|---|---|
| Core Perception Tasks | | | | | | | | |
| Entity Recognition | 43.62 | 8.12 | 11.16 | 20.54 | 19.95 | 17.03 | 27.30 | 42.84 |
| Event Understanding | 41.84 | 6.66 | 9.60 | 14.21 | 14.47 | 13.95 | 22.19 | 35.41 |
| Temporal Understanding | 41.23 | 7.50 | 10.34 | 14.11 | 15.88 | 14.17 | 23.08 | 31.35 |
| Audio Understanding | 37.46 | 6.08 | 9.53 | 9.07 | 12.36 | 16.20 | 26.22 | 35.51 |
| Avg. | 41.04 | 7.09 | 10.16 | 14.48 | 15.66 | 15.34 | 24.70 | 36.28 |
| Reasoning Tasks | | | | | | | | |
| Causal Reasoning | 68.41 | 9.01 | 14.43 | 24.76 | 23.98 | 23.73 | 32.58 | 54.26 |
| Quantitative Reasoning | 49.56 | 1.79 | 2.92 | 13.34 | 14.64 | 12.92 | 20.42 | 45.16 |
| Compositional Reasoning | 57.37 | 11.70 | 14.60 | 19.56 | 22.14 | 19.53 | 33.13 | 45.52 |
| Comparative Analysis | 71.24 | 9.73 | 13.61 | 20.72 | 20.05 | 17.93 | 30.53 | 52.87 |
| Avg. | 61.65 | 8.06 | 11.39 | 19.59 | 20.20 | 18.53 | 29.16 | 49.45 |
| Information Tasks | | | | | | | | |
| Information Retrieval | 61.02 | 9.14 | 13.42 | 18.78 | 21.41 | 18.60 | 29.58 | 48.87 |
| Summarization | 58.86 | 12.92 | 18.22 | 18.61 | 22.72 | 19.16 | 28.84 | 60.17 |
| Instruction Extraction | 46.53 | 8.33 | 9.90 | 14.62 | 15.91 | 13.04 | 23.50 | 38.47 |
| Sentiment Analysis | 52.18 | 5.31 | 7.86 | 13.10 | 14.03 | 15.23 | 22.59 | 33.70 |
| Avg. | 54.65 | 8.92 | 12.35 | 16.28 | 18.52 | 16.51 | 26.13 | 45.30 |
| Multimodal Tasks | | | | | | | | |
| Multimodal Synthesis | 55.38 | 8.92 | 11.34 | 19.14 | 18.99 | 16.59 | 27.95 | 44.15 |
| Cross Modal Verification | 50.45 | 4.60 | 10.91 | 10.81 | 12.31 | 11.21 | 16.82 | 37.79 |
| Audio Visual Alignment | 50.89 | 9.33 | 15.19 | 21.05 | 24.23 | 22.41 | 29.58 | 44.26 |
| Motion Analysis | 61.22 | 16.98 | 47.17 | 40.57 | 54.72 | 42.45 | 71.70 | 64.15 |
| Avg. | 54.49 | 9.96 | 13.82 | 22.89 | 27.56 | 23.17 | 36.51 | 47.59 |
| Overall | 52.95 | 8.51 | 13.76 | 18.39 | 20.49 | 18.39 | 29.12 | 44.66 |
| Agentic Tasks | 40.27 | - | - | - | - | - | - | 38.25 |
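As a consistency check on Table 2, the Avg. rows appear to be unweighted means of their four tasks, and Overall appears to be the mean of the 16 non-agentic tasks; that aggregation rule is our inference from the numbers, not something the table states. The snippet below reproduces the Gemini column's aggregates under that assumption (half-up rounding recovers every printed figure).

```python
# Reproduce Table 2's Gemini aggregates, assuming unweighted means.
# Decimal with half-up rounding matches the table's two-decimal figures.
from decimal import Decimal, ROUND_HALF_UP

gemini = {
    "Core Perception": ["43.62", "41.84", "41.23", "37.46"],
    "Reasoning":       ["68.41", "49.56", "57.37", "71.24"],
    "Information":     ["61.02", "58.86", "46.53", "52.18"],
    "Multimodal":      ["55.38", "50.45", "50.89", "61.22"],
}

def mean2(scores: list[str]) -> Decimal:
    """Unweighted mean, rounded half-up to two decimals as in Table 2."""
    m = sum(Decimal(s) for s in scores) / len(scores)
    return m.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

for group, scores in gemini.items():
    print(group, mean2(scores))   # 41.04, 61.65, 54.65, 54.49
print("Overall", mean2([s for g in gemini.values() for s in g]))  # 52.95
```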