The Arabic INclusive Multimodal Model

Ahmed Heakl1*, Sara Ghaboura1*, Omkar Thawakar1,
Fahad S. Khan1, 2, Hisham Cholakkal1, Rao M. Anwer1, 3, Salman Khan1, 4

1Mohamed bin Zayed University of AI, 3Aalto University, 2Linköping University, 4Australian National University

AIN Can See

AIN: A versatile LMM excelling in visual and contextual understanding across diverse domains, including VQA on complex topics, OCR for various fonts and handwriting, cultural insights (traditions, food, places), agricultural tasks (crop identification, fruit classification, disease detection), remote sensing (multi-scale objects), medical imaging (various modalities), and video analysis (animation, human activities)

Abstract

Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored and often focus narrowly on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, an English-Arabic bilingual LMM designed to excel across diverse domains in both languages. It leverages 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities.

📢 Latest Updates

🔥 Jan 2025: AIN-7B, the first Arabic Inclusive LMM, is released 🤗
🚀 4 Mar 2025: Model weights released on Hugging Face.

🌟 Key Features

Performance Analysis

Figure 1: Performance of AIN-7B across CAMEL-Bench domains, compared with prominent closed-source models and open-source counterparts. OCR: "OCR & Document Understanding", Video: "General Video & Multi-Image Understanding", RS: "Remote Sensing Understanding", CDT: "Chart, Diagram & Table Understanding", Agro.: "Agricultural Image Understanding", Cultural: "Cultural-Specific Understanding", Medical: "Medical Image Understanding".

Comparative Performance

Figure 2: Comparative performance of AIN-7B against other models across key domains, including OCR & Document Understanding, Remote Sensing, Agricultural Understanding, and overall performance across all domains.

| Models | VQA | OCR | Video | RS | CDT | Agro. | Cult. | Med. | Total |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 🥈55.15 | 🥈54.98 | 🥇69.65 | 🥈27.36 | 🥈62.35 | 🥈80.75 | 🥇80.86 | 🥇49.91 | 🥈60.13 |
| GPT-4o-mini | 48.83 | 39.38 | 🥈66.28 | 16.93 | 56.37 | 78.80 | 65.92 | 🥈47.37 | 52.49 |
| Gemini-1.5-Pro | 46.68 | 28.68 | 42.95 | 17.07 | 47.06 | 72.14 | 56.24 | 33.78 | 43.08 |
| Gemini-1.5-flash | 45.59 | 27.58 | 53.31 | 14.95 | 48.26 | 76.07 | 46.54 | 42.87 | 44.40 |
| InternVL-8B | 30.41 | 15.91 | 51.42 | 5.36 | 30.27 | 44.47 | 20.88 | 29.48 | 28.52 |
| InternVL2.5-1B | 27.22 | 19.45 | 38.20 | 3.39 | 30.75 | 39.53 | 35.68 | 21.27 | 26.94 |
| Qwen-VL-2B | 41.02 | 22.93 | 38.90 | 12.56 | 27.83 | 52.02 | 34.28 | 29.12 | 32.33 |
| AIN-7B (ours) | 🥇56.78 | 🥇72.35 | 64.09 | 🥇45.92 | 🥇64.10 | 🥇85.05 | 🥈78.09 | 43.77 | 🏆63.77 |

Table 1. Performance comparison of AIN and different closed- and open-source LMMs across CAMEL-Bench domains. Best performance is marked with 🥇; second-best is 🥈.
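
The Total column in Table 1 appears to be the unweighted mean of the eight domain scores. This is an inference from the numbers, not something stated in the source. The short check below reproduces the reported totals for the GPT-4o and AIN-7B rows within two-decimal rounding; the helper name `domain_mean` is illustrative only.

```python
# Domain scores copied from Table 1, in column order:
# VQA, OCR, Video, RS, CDT, Agro., Cult., Med.
# Each entry pairs the eight domain scores with the reported Total.
scores = {
    "GPT-4o": ([55.15, 54.98, 69.65, 27.36, 62.35, 80.75, 80.86, 49.91], 60.13),
    "AIN-7B": ([56.78, 72.35, 64.09, 45.92, 64.10, 85.05, 78.09, 43.77], 63.77),
}


def domain_mean(values):
    """Unweighted arithmetic mean of per-domain scores (assumed aggregation)."""
    return sum(values) / len(values)


for model, (domains, reported_total) in scores.items():
    mean = domain_mean(domains)
    # The table reports two decimals, so allow a small rounding tolerance.
    assert abs(mean - reported_total) < 0.01, model
    print(f"{model}: mean={mean:.2f}, reported={reported_total:.2f}")
```

Because the mean is unweighted, each domain contributes equally to the Total regardless of how many samples it contains.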

Qualitative Examples

Figure 3: Qualitative examples showcasing AIN-7B's capabilities across various domains, including general VQA, OCR & Document Understanding, Remote Sensing, Medical Imaging, Agricultural Understanding, and Cultural-Specific tasks.

🧐 Data Verification and Toxicity Filtering

Verification Pipeline

Figure 4: Data verification and filtering pipeline for textual and visual data, ensuring high-quality training data through semantic similarity checks, translation quality evaluations, and toxicity screening for safety compliance.
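
As a rough illustration of the three checks named in the Figure 4 caption (semantic similarity, translation quality, toxicity screening), the sketch below filters text samples against three thresholds. It is not the authors' pipeline: the `Sample` fields, the precomputed scores, and every threshold value are hypothetical placeholders, and the visual side of the pipeline is not covered.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Sample:
    text_en: str
    text_ar: str
    emb_en: List[float]       # hypothetical embedding of the English text
    emb_ar: List[float]       # hypothetical embedding of the Arabic text
    translation_score: float  # hypothetical quality score in [0, 1]
    toxicity: float           # hypothetical toxicity probability in [0, 1]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep(sample: Sample, min_sim: float = 0.8,
         min_quality: float = 0.7, max_toxicity: float = 0.2) -> bool:
    """Keep a sample only if it passes all three checks (thresholds are assumptions)."""
    return (
        cosine(sample.emb_en, sample.emb_ar) >= min_sim
        and sample.translation_score >= min_quality
        and sample.toxicity <= max_toxicity
    )


def filter_samples(samples: List[Sample], **thresholds: float) -> List[Sample]:
    """Return the samples that pass every check."""
    return [s for s in samples if keep(s, **thresholds)]


# Toy data: the second sample fails the toxicity check.
samples = [
    Sample("A red apple.", "تفاحة حمراء.", [1.0, 0.0], [0.9, 0.1], 0.9, 0.01),
    Sample("Some text.", "نص ما.", [1.0, 0.0], [0.9, 0.1], 0.9, 0.9),
]
kept = filter_samples(samples)
print(len(kept), "of", len(samples), "samples kept")
```

In practice the embeddings and scores would come from whatever multilingual encoder, translation-quality metric, and toxicity classifier the pipeline uses; the sketch only shows how the three gates combine.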

📚 Citation

@misc{heakl2025ainarabicinclusivelarge,
  title={AIN: The Arabic INclusive Large Multimodal Model},
  author={Ahmed Heakl and Sara Ghaboura and Omkar Thawakar and Fahad Shahbaz Khan and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan},
  year={2025},
  url={https://arxiv.org/abs/2502.00094},
}