CAMEL-Bench

A Comprehensive Arabic LMM Benchmark

Sara Ghaboura1*, Ahmed Heakl1*, Omkar Thawakar1, Ali Alharthi1, Ines Riahi2, Abduljalil Saif2, Jorma Laaksonen2, Fahad S. Khan1, 3, Salman Khan1, 4, Rao M. Anwer1, 2

* Equal Contributions

1Mohamed bin Zayed University of AI, 2Aalto University, 3Linköping University, 4Australian National University


The proposed CAMEL-Bench covers eight diverse and challenging domains: multimodal understanding and reasoning, OCR and document understanding, chart and diagram understanding, video understanding, cultural-specific understanding, medical imaging understanding, agricultural image understanding, and remote sensing understanding in Arabic. CAMEL-Bench covers 38 sub-domains with over 29K questions carefully curated by native Arabic speakers to rigorously evaluate essential skills desired in Arabic LMMs.

News

[2024-10-24]: Our CAMEL-Bench is now available on HuggingFace. We welcome all contributions and look forward to your participation!

Abstract

Recent years have witnessed significant interest in developing large multi-modal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple benchmarks that evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers.

The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. CAMEL-Bench comprises 29,036 questions filtered from a larger pool of samples, with quality manually verified by native speakers to ensure reliable model assessment. We evaluate both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of only 62%. Our benchmark will be publicly released.

Leaderboard on CAMEL-Bench

Performance comparison of different closed- and open-source LMMs on CAMEL-Bench.

| # | Model | Source | ALL | MM Understand. & Reasoning | OCR & Document Understanding | Charts & Diagrams Understanding | Video Understanding | Cultural Specific Understanding | Medical Imaging | Agro Specific | Remote Sensing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o 🥇 | Link | 62.40 | 57.90 | 59.11 | 73.57 | 74.27 | 80.86 | 49.90 | 80.75 | 22.85 |
| 2 | GPT-4o-mini 🥈 | Link | 54.54 | 48.82 | 42.89 | 64.98 | 68.11 | 65.92 | 47.37 | 79.58 | 16.93 |
| 3 | Qwen2-VL-7B 🥉 | Link | 54.45 | 51.35 | 49.06 | 55.39 | 62.64 | 75.64 | 39.42 | 79.84 | 22.28 |
| 4 | Gemini-1.5-Pro | Link | 52.38 | 46.67 | 36.59 | 47.06 | 42.94 | 56.24 | 33.77 | 72.12 | 17.07 |
| 5 | Gemini-1.5-Flash | Link | 45.14 | 45.58 | 33.59 | 48.25 | 53.31 | 46.54 | 42.86 | 76.06 | 14.95 |
| 6 | LLaVa-OneVision-7B | Link | 40.45 | 42.90 | 31.35 | 40.86 | 29.41 | 66.02 | 27.29 | 75.03 | 10.72 |
| 7 | Pangea-7B-Instruct | Link | 34.90 | 40.09 | 17.75 | 38.75 | 49.01 | 20.34 | 31.99 | 74.51 | 6.67 |
| 8 | Qwen2-VL-2B | Link | 32.62 | 40.59 | 25.68 | 27.83 | 38.90 | 34.27 | 29.12 | 52.02 | 12.56 |
| 9 | InternVL2-8B | Link | 30.26 | 30.41 | 15.91 | 30.27 | 51.42 | 20.88 | 29.48 | 44.47 | 5.36 |
| 10 | LLaVa-NeXt-7B | Link | 27.38 | 26.33 | 19.12 | 27.56 | 44.90 | 28.30 | 22.54 | 42.00 | 8.33 |
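
The page does not state how the ALL column is aggregated. As a rough sketch, assuming it is the unweighted mean of the eight per-domain scores, the Python snippet below recomputes it for two rows copied from the leaderboard. The names `DOMAIN_SCORES` and `overall_score` are illustrative and not part of any CAMEL-Bench tooling. Under this assumption the GPT-4o and Qwen2-VL-7B rows reproduce their reported ALL values; the assumption may not hold for every row.

```python
from statistics import mean

# Per-domain scores copied from the leaderboard, in column order:
# MM Understanding & Reasoning, OCR & Document, Charts & Diagrams, Video,
# Cultural Specific, Medical Imaging, Agro Specific, Remote Sensing.
DOMAIN_SCORES = {
    "GPT-4o": [57.90, 59.11, 73.57, 74.27, 80.86, 49.90, 80.75, 22.85],
    "Qwen2-VL-7B": [51.35, 49.06, 55.39, 62.64, 75.64, 39.42, 79.84, 22.28],
}


def overall_score(scores):
    """Unweighted mean of the domain scores, rounded to two decimals.

    Assumption: this is how ALL is aggregated; the source does not confirm it.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    for model, scores in DOMAIN_SCORES.items():
        print(f"{model}: {overall_score(scores):.2f}")
```

An unweighted mean treats every domain equally regardless of how many questions it contains, so a small domain such as remote sensing can pull the overall score down as much as a large one.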

CAMEL-Bench Diversity

[Figure: CAMEL-Bench diversity]