ARB

A Comprehensive Arabic Multimodal Reasoning Benchmark

Sara Ghaboura1*, Ketan More1*, Wafa Alghallabi1, Omkar Thawakar1, Jorma Laaksonen2,
Hisham Cholakkal1, Salman Khan1,3, Rao M. Anwer1,2

1Mohamed bin Zayed University of AI, 2Aalto University, 3Australian National University

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

ARB Overview

Figure 1. ARB Scope and Diversity. ARB comprises a wide array of multimodal reasoning samples, each combining a visual input with an Arabic question and detailed step-by-step reasoning, including the actions taken at each step. The dataset spans 11 distinct domains, including visual reasoning, OCR and document understanding, chart and diagram interpretation, mathematical and logical inference, scientific and medical analysis, cultural and historical interpretation, remote sensing, agricultural image analysis, and complex visual perception, capturing the linguistic richness, cultural depth, and cross-domain complexity essential for evaluating reasoning in Arabic.

Overview

The Comprehensive Arabic Multimodal Reasoning Benchmark (ARB) is the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. Covering 11 diverse domains, ARB includes 1,356 multimodal samples and 5,119 curated reasoning steps. It provides a structured framework to assess the capabilities of open- and closed-source large multimodal models (LMMs), addressing gaps in coherence, cultural grounding, and faithfulness often overlooked in benchmarks focused solely on English.
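To make the benchmark's structure concrete, the sketch below shows how a single ARB-style sample could be represented in code. This is a minimal illustration only: the class and field names (ARBSample, question_ar, reasoning_steps_ar, and so on) are assumptions made for exposition, not the released data schema.

# Illustrative sketch of one ARB-style sample record. Field names are
# assumptions for exposition, not the benchmark's released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ARBSample:
    sample_id: str                 # unique identifier
    domain: str                    # one of the 11 ARB domains, e.g. "Math & Logic"
    image_path: str                # path to the visual input
    question_ar: str               # Arabic question paired with the image
    reasoning_steps_ar: List[str] = field(default_factory=list)  # curated step-by-step Arabic reasoning
    final_answer_ar: str = ""      # reference final answer in Arabic

# Hypothetical example instance (not a real ARB sample).
example = ARBSample(
    sample_id="arb_0001",
    domain="Charts, Diagrams & Tables",
    image_path="images/chart_0001.png",
    question_ar="ما هي أعلى قيمة في الرسم البياني؟",  # "What is the highest value in the chart?"
    reasoning_steps_ar=["الخطوة 1: ...", "الخطوة 2: ..."],
    final_answer_ar="42",
)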

ARB Construction Pipeline

ARBpipeline

Figure 2. ARB Pipeline. The figure illustrates the ARB pipeline for evaluating Arabic multimodal reasoning in LMMs. It begins with data collection across 11 domains, such as medical imaging, historical interpretation, visual reasoning, and agriculture, sourced from curated datasets (e.g., VRC-Bench, CAMEL-Bench), synthetic content, tool-augmented outputs, and web scraping. Data is generated across five categories: English reasoning chains, Arabic Q&A, English captions, synthetic samples, and tool-enhanced content. Reasoning steps are refined via human-in-the-loop feedback and filtered for logical consistency and cultural alignment. The benchmark supports fine-grained evaluation of open- and closed-source models on Arabic step-by-step reasoning.

ARB Data Collection

ARBcollect

Figure 3. ARB Data Collection. Overview of the ARB data collection, generation, and verification framework. The ARB benchmark is constructed from five primary data sources: (1) English reasoning benchmarks, (2) Arabic question–answer benchmarks, (3) English-captioned datasets, (4) synthetic data, and (5) tool-augmented data. All data undergoes iterative refinement through human-in-the-loop feedback and validation by native Arabic speakers to ensure cultural and linguistic fidelity.

ARB Data Distribution

ARBdistt

Figure 4. Domain Distribution in ARB. The figure shows the distribution of ARB samples across 11 domains. Math & Logic (41%) and Charts, Diagrams, & Tables (24%) dominate, reflecting the dataset's emphasis on structured reasoning. Other domains, including Social & Cultural, Scientific, and Medical, add thematic diversity.

Quantitative Evaluation and Results

Evaluation Metric

ARBeval

Figure 5. Arabic Reasoning Evaluation Metrics. We assess step-by-step reasoning using five core Arabic-specific dimensions: Faithfulness (At-Tatābuq), Informativeness (Al-Ithrā' Al-Ma'lūmātī), Coherence (At-Tawāfuq), Commonsense (Al-Mantiq Al-'Āmm), and Reasoning Alignment (At-Tawāfuq Al-Istidlālī). Auxiliary checks cover hallucinations, redundancy, semantic gaps, and missing steps. Metrics are defined at the step and/or token level. The full evaluation rubric is provided in English in Appendix E.
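These dimensions lend themselves to a reference-based LLM-as-judge protocol in which each model reasoning step is rated against its reference step. The sketch below illustrates one way such per-step, per-dimension scores could be collected and averaged; the judge wrapper, prompt wording, and 0–1 score scale are assumptions for illustration and do not reproduce the paper's exact rubric (see Appendix E).

# Minimal sketch of a reference-based, attribute-level LLM-as-judge pass.
# The judge() wrapper, prompt wording, and 0-1 scale are illustrative
# assumptions, not the paper's exact rubric.
from statistics import mean

DIMENSIONS = [
    "faithfulness",         # At-Tatābuq
    "informativeness",      # Al-Ithrā' Al-Ma'lūmātī
    "coherence",            # At-Tawāfuq
    "commonsense",          # Al-Mantiq Al-'Āmm
    "reasoning_alignment",  # At-Tawāfuq Al-Istidlālī
]

def judge(prompt: str) -> float:
    """Hypothetical wrapper around a judge LLM that returns a score in [0, 1]."""
    raise NotImplementedError  # placeholder: call your judge model here

def score_step(model_step: str, reference_step: str, question: str) -> dict:
    """Rate one model reasoning step against its reference step on every dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Question: {question}\n"
            f"Reference step: {reference_step}\n"
            f"Model step: {model_step}\n"
            f"Rate the model step's {dim} from 0 to 1."
        )
        scores[dim] = judge(prompt)
    return scores

def score_chain(model_steps, reference_steps, question) -> float:
    """Aggregate per-step dimension scores into one chain-level quality score."""
    per_step = [mean(score_step(m, r, question).values())
                for m, r in zip(model_steps, reference_steps)]
    return mean(per_step) if per_step else 0.0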



Model                 GPT-4o    GPT-4o-mini   GPT-4.1   o4-mini   Gemini 1.5 Pro   Gemini 2.0 Flash
Final Answer (%)      60.22🏅   52.22         59.43     58.93     56.70            57.80
Reasoning Steps (%)   64.29     61.02         80.41     80.75🏅   64.34            64.09
Table 1: Stepwise Evaluation Using LLM-as-Judge – Closed-Source Models. Comparison of closed-weight models based on final answer accuracy and aggregated quality scores of reasoning steps, using our LLM-as-Judge framework with Arabic prompts and evaluation metrics. The evaluation follows a reference-based, attribute-level protocol for assessing reasoning quality.

Model                 Qwen2.5-VL-7B   Llama-3.2-11B-Vis-Inst.   AIN     Llama-4-Scout-17Bx16E   Aya-Vision-8B   InternVL3-8B
Final Answer (%)      37.02           25.58                     27.35   48.52🏅                 28.81           31.04
Reasoning Steps (%)   64.03           53.20                     52.77   77.70🏅                 63.64           54.50
Table 2: Stepwise Evaluation Using LLM-as-Judge – Open-Source Models. Comparison of open-weight models based on final answer accuracy and aggregated quality scores of reasoning steps, using our LLM-as-Judge framework with Arabic prompts and evaluation metrics. The evaluation follows a reference-based, attribute-level protocol for assessing reasoning quality.
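Read together, the two rows in each table correspond to two aggregates over the benchmark: the share of samples whose final answer is judged correct, and the mean chain-level quality of the reasoning steps. The snippet below sketches that aggregation, assuming per-sample judge outputs are already available; the field names are hypothetical and the toy numbers are not real ARB results.

# Sketch of turning per-sample judge outputs into the two reported rows.
# Field names are hypothetical; the toy records below are not real ARB data.

def aggregate(results: list) -> dict:
    n = len(results)
    final_answer_pct = 100.0 * sum(r["final_answer_correct"] for r in results) / n
    reasoning_steps_pct = 100.0 * sum(r["reasoning_quality"] for r in results) / n
    return {
        "Final Answer (%)": round(final_answer_pct, 2),
        "Reasoning Steps (%)": round(reasoning_steps_pct, 2),
    }

print(aggregate([
    {"final_answer_correct": True,  "reasoning_quality": 0.80},
    {"final_answer_correct": False, "reasoning_quality": 0.64},
]))
# -> {'Final Answer (%)': 50.0, 'Reasoning Steps (%)': 72.0}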

Qualitative Examples

Qualitative Errors in Closed-Source Models

ARBclosed

Figure 6. Qualitative Errors in Closed-Source Models. This figure highlights reasoning failures by closed-source LMMs across various Arabic multimodal tasks. Common issues include incorrect numerical comparisons, invalid assumptions, misinterpreted constraints, and logically inconsistent step sequences. These errors often lead to incorrect conclusions despite the appearance of structured reasoning, underscoring the limitations of current closed models when operating in Arabic.

Qualitative Errors in Open-Source Models

ARBopen

Figure 7. Qualitative Errors in Open-Source Models. This figure showcases common reasoning flaws in open-source LMMs across diverse Arabic multimodal tasks. Errors include incomplete reasoning steps, inconsistent logic, and hallucinated interpretations not grounded in the input. These issues often result in incorrect answers or unreliable outputs, reflecting the challenges open models face in structured Arabic reasoning.

Cross-Lingual Reasoning Comparison

ARBcross

Figure 8. Cross-Lingual Reasoning Comparison (Arabic vs. English). This figure compares an LMM's (GPT-4o) reasoning steps in Arabic and English for the same visual task. In the Arabic version, the model misinterprets structural constraints: yellow highlights incorrect assumptions about equal line counts across boxes, green marks miscounted lines within the boxes, and cyan marks an irrelevant search for a box with exactly 4 lines. These reasoning flaws lead to the wrong answer (C). In contrast, the English reasoning is structured, accurate, and constraint-aware, correctly identifying the answer (A), highlighting the performance gap in Arabic.

BibTeX

@misc{ghaboura2025arbcomprehensivearabicmultimodal,
  title={ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark}, 
  author={Sara Ghaboura and Ketan More and Wafa Alghallabi and Omkar Thawakar and Jorma Laaksonen and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
  year={2025},
  eprint={2505.17021},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.17021}, 
}