📝 Abstract
Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; even those that include open-ended questions still report a single accuracy score, which obscures failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric, making evaluation interpretable and traceable. LongShOTBench is produced by a scalable generation pipeline, and every sample is human-verified and corrected, ensuring coverage and reproducibility. We also present LongShOTAgent, an agentic system that analyzes long videos through preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs.
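Each item's graded rubric decomposes the reference answer into weighted criteria, which is what makes scores traceable to specific failure modes. The sketch below shows one way such rubric scoring could be computed; the field names, weights, and `judge` callable are illustrative assumptions, not LongShOTBench's released format.

```python
# Minimal sketch of weighted rubric scoring. Field names, weights, and
# the `judge` callable are assumptions for illustration only; the actual
# LongShOTBench schema and grader may differ.
from typing import Callable

def rubric_score(answer: str,
                 rubric: list[dict],
                 judge: Callable[[str, str], bool]) -> float:
    """Percentage of rubric weight satisfied by a model answer.

    `judge(answer, criterion)` decides a single criterion, e.g. an
    LLM-as-judge call or a human annotator's verdict.
    """
    total = sum(c["weight"] for c in rubric)
    earned = sum(c["weight"] for c in rubric if judge(answer, c["criterion"]))
    return 100.0 * earned / total

# Toy usage with a naive substring judge (real grading would use an LLM).
rubric = [
    {"criterion": "chef", "weight": 2.0},   # mentions the chef heard off-screen
    {"criterion": "12:40", "weight": 1.0},  # anchors the claim to a timestamp
]
naive_judge = lambda ans, crit: crit.lower() in ans.lower()
print(rubric_score("At 12:40 the chef explains the recipe.", rubric, naive_judge))
# -> 100.0
```

Because partial credit is weighted per criterion, a wrong answer still reveals which capability failed (e.g. audio grounding vs. temporal localization) instead of collapsing into a single 0/1 accuracy.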
⚙️ Pipeline
📊 Benchmark Comparison
Table 1: Comparison with Existing Benchmarks
LongShOTBench is the only benchmark that combines all three modalities with intent-driven Q&A, tool usage, and custom rubrics for interpretable evaluation (an illustrative item schema is sketched after the table).
| Benchmark | Visual | Audio | Speech | Open-Ended Q&A | Multi-Turn Q&A | Intent-Driven Q&A | Tool-Usage | Custom-Rubrics |
|---|---|---|---|---|---|---|---|---|
| MVBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| EgoSchema | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongVideoBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| MovieChat | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ |
| MLVU | ✓ | ✕ | ✕ | ✓ | ✕ | ✕ | ✕ | ✕ |
| SVBench | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ | ✕ |
| LVBench | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LvBench | ✓ | ✕ | ✓* | ✓ | ✕ | ✕ | ✕ | ✕ |
| Video-Holmes | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
| InfiniBench | ✓ | ✕ | ✓* | ✓ | ✕ | ✕ | ✕ | ✕ |
| Video-MME | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongVALE | ✓ | ✓ | ✓ | ✕ | ✓ | ✕ | ✕ | ✕ |
| TriSense-2M | ✓ | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ |
| DailyOmni | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ |
| LongShOTBench (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
*Subtitle-aided
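To make the table's columns concrete, here is what a single LongShOTBench-style item could look like, written as a Python dict. Every field name and value below is a hypothetical illustration of the properties claimed above (three modalities, intent-driven multi-turn Q&A, tool usage, custom rubric), not the benchmark's released schema.

```python
# Hypothetical item illustrating the Table 1 columns; all field names
# are assumptions for illustration, not the released LongShOTBench schema.
item = {
    "video_id": "example_0001",
    "modalities": ["visual", "audio", "speech"],            # all three modalities
    "intent": "verify a narrator's claim against the footage",
    "dialogue": [                                           # open-ended, multi-turn
        {"turn": 1, "question": "What does the narrator claim at 05:12?"},
        {"turn": 2, "question": "Does the on-screen footage support that claim?"},
    ],
    "tools": ["frame_search", "transcript_lookup"],         # agentic tool usage
    "reference_answer": "The narrator claims X, but the footage shows Y.",
    "rubric": [                                             # custom rubric
        {"criterion": "Quotes or paraphrases the 05:12 claim", "weight": 1.0},
        {"criterion": "Cites the contradicting visual evidence", "weight": 2.0},
    ],
}
```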
🎯 Results
Table 2: Performance on LongShOTBench (%)
Gemini: Gemini-2.5-Flash | LLaVA-OV: LLaVA-OneVision-Qwen2-7B-ov | LLaVA-NV: LLaVA-NeXT-Video-7B-hf | Qwen-VL: Qwen2.5-VL-7B-Instruct | InternVL: InternVL3.5-8B | Qwen-Omni: Qwen2.5-Omni-7B | Qwen3-VL: Qwen3-VL-8B-Instruct
| Task | Gemini | LLaVA-OV | LLaVA-NV | Qwen-VL | InternVL | Qwen-Omni | Qwen3-VL | LongShOTAgent (Ours) |
|---|---|---|---|---|---|---|---|---|
| Core Perception Tasks | | | | | | | | |
| Entity Recognition | 43.62 | 8.12 | 11.16 | 20.54 | 19.95 | 17.03 | 27.30 | 42.84 |
| Event Understanding | 41.84 | 6.66 | 9.60 | 14.21 | 14.47 | 13.95 | 22.19 | 35.41 |
| Temporal Understanding | 41.23 | 7.50 | 10.34 | 14.11 | 15.88 | 14.17 | 23.08 | 31.35 |
| Audio Understanding | 37.46 | 6.08 | 9.53 | 9.07 | 12.36 | 16.20 | 26.22 | 35.51 |
| Avg. | 41.04 | 7.09 | 10.16 | 14.48 | 15.66 | 15.34 | 24.70 | 36.28 |
| Reasoning Tasks | | | | | | | | |
| Causal Reasoning | 68.41 | 9.01 | 14.43 | 24.76 | 23.98 | 23.73 | 32.58 | 54.26 |
| Quantitative Reasoning | 49.56 | 1.79 | 2.92 | 13.34 | 14.64 | 12.92 | 20.42 | 45.16 |
| Compositional Reasoning | 57.37 | 11.70 | 14.60 | 19.56 | 22.14 | 19.53 | 33.13 | 45.52 |
| Comparative Analysis | 71.24 | 9.73 | 13.61 | 20.72 | 20.05 | 17.93 | 30.53 | 52.87 |
| Avg. | 61.65 | 8.06 | 11.39 | 19.59 | 20.20 | 18.53 | 29.16 | 49.45 |
| Information Tasks | | | | | | | | |
| Information Retrieval | 61.02 | 9.14 | 13.42 | 18.78 | 21.41 | 18.60 | 29.58 | 48.87 |
| Summarization | 58.86 | 12.92 | 18.22 | 18.61 | 22.72 | 19.16 | 28.84 | 60.17 |
| Instruction Extraction | 46.53 | 8.33 | 9.90 | 14.62 | 15.91 | 13.04 | 23.50 | 38.47 |
| Sentiment Analysis | 52.18 | 5.31 | 7.86 | 13.10 | 14.03 | 15.23 | 22.59 | 33.70 |
| Avg. | 54.65 | 8.92 | 12.35 | 16.28 | 18.52 | 16.51 | 26.13 | 45.30 |
| Multimodal Tasks | | | | | | | | |
| Multimodal Synthesis | 55.38 | 8.92 | 11.34 | 19.14 | 18.99 | 16.59 | 27.95 | 44.15 |
| Cross Modal Verification | 50.45 | 4.60 | 10.91 | 10.81 | 12.31 | 11.21 | 16.82 | 37.79 |
| Audio Visual Alignment | 50.89 | 9.33 | 15.19 | 21.05 | 24.23 | 22.41 | 29.58 | 44.26 |
| Motion Analysis | 61.22 | 16.98 | 47.17 | 40.57 | 54.72 | 42.45 | 71.70 | 64.15 |
| Avg. | 54.49 | 9.96 | 13.82 | 22.89 | 27.56 | 23.17 | 36.51 | 47.59 |
| Overall | 52.95 | 8.51 | 13.76 | 18.39 | 20.49 | 18.39 | 29.12 | 44.66 |
| Agentic Tasks | 40.27 | - | - | - | - | - | - | 38.25 |
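As a consistency check on Table 2, the Avg. rows appear to be unweighted means of their four tasks, and Overall appears to be the mean of the 16 non-agentic tasks; that aggregation rule is our inference from the numbers, not something the table states. The snippet below reproduces the Gemini column's aggregates under that assumption (half-up rounding recovers every printed figure).

```python
# Reproduce Table 2's Gemini aggregates, assuming unweighted means.
# Decimal with half-up rounding matches the table's two-decimal figures.
from decimal import Decimal, ROUND_HALF_UP

gemini = {
    "Core Perception": ["43.62", "41.84", "41.23", "37.46"],
    "Reasoning":       ["68.41", "49.56", "57.37", "71.24"],
    "Information":     ["61.02", "58.86", "46.53", "52.18"],
    "Multimodal":      ["55.38", "50.45", "50.89", "61.22"],
}

def mean2(scores: list[str]) -> Decimal:
    """Unweighted mean, rounded half-up to two decimals as in Table 2."""
    m = sum(Decimal(s) for s in scores) / len(scores)
    return m.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

for group, scores in gemini.items():
    print(group, mean2(scores))   # 41.04, 61.65, 54.65, 54.49
print("Overall", mean2([s for g in gemini.values() for s in g]))  # 52.95
```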