The Comprehensive Arabic Multimodal Reasoning Benchmark (ARB) is the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. Covering 11 diverse domains, ARB comprises 1,356 multimodal samples paired with 5,119 curated reasoning steps. It provides a structured framework for assessing open- and closed-source large multimodal models (LMMs), targeting gaps in reasoning coherence, cultural grounding, and faithfulness that English-only benchmarks often overlook.
Closed-source models:

Model | GPT-4o | GPT-4o-mini | GPT-4.1 | o4-mini | Gemini 1.5 Pro | Gemini 2.0 Flash |
---|---|---|---|---|---|---|
Final Answer (%) | 60.22🏅 | 52.22 | 59.43 | 58.93 | 56.70 | 57.80 |
Reasoning Steps (%) | 64.29 | 61.02 | 80.41 | 80.75🏅 | 64.34 | 64.09 |
Open-source models:

Model | Qwen2.5-VL-7B | Llama-3.2-11B-Vis-Inst. | AIN | Llama-4-Scout-17B-16E | Aya-Vision-8B | InternVL3-8B |
---|---|---|---|---|---|---|
Final Answer (%) | 37.02 | 25.58 | 27.35 | 48.52🏅 | 28.81 | 31.04 |
Reasoning Steps (%) | 64.03 | 53.20 | 52.77 | 77.70🏅 | 63.64 | 54.50 |
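To make the two reported metrics concrete, here is a minimal sketch of how aggregate scores like those in the tables might be computed. The record layout (`final_correct`, `step_scores`) and the helper name `aggregate_scores` are illustrative assumptions, not ARB's actual evaluation code, which scores curated reasoning steps with a judge model:

```python
# Hypothetical sketch: aggregate per-sample judgments into the two
# benchmark-style metrics, "Final Answer (%)" and "Reasoning Steps (%)".
# The record structure below is an assumption for illustration only.

def aggregate_scores(records):
    """Return (final-answer accuracy %, mean per-step score %)."""
    # Fraction of samples whose final answer was judged correct.
    final_acc = 100.0 * sum(r["final_correct"] for r in records) / len(records)
    # Mean score over every individual reasoning step, pooled across samples.
    all_steps = [s for r in records for s in r["step_scores"]]
    step_score = 100.0 * sum(all_steps) / len(all_steps)
    return final_acc, step_score

# Toy example with two hypothetical samples:
records = [
    {"final_correct": 1, "step_scores": [1.0, 0.5, 1.0]},
    {"final_correct": 0, "step_scores": [0.5, 0.0]},
]
final_acc, step_score = aggregate_scores(records)
print(final_acc, step_score)  # 50.0 60.0
```

Note that the two metrics can diverge, as they do in the tables: a model may reach the right answer through weak intermediate steps, or produce sound steps yet a wrong final answer.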
@misc{ghaboura2025arbcomprehensivearabicmultimodal,
  title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark},
  author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
  year={2025},
  eprint={2505.17021},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17021},
}