AIN: A versatile LMM excelling in visual and contextual understanding across diverse domains, including VQA on complex topics, OCR for various fonts and handwriting, cultural insights (traditions, food, places), agricultural tasks (crop identification, fruit classification, disease detection), remote sensing (multi-scale objects), medical imaging (various modalities), and video analysis (animation, human activities).
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, an English-Arabic bilingual LMM designed to excel across diverse domains. AIN is trained on a carefully constructed corpus of 3.6 million high-quality Arabic-English multimodal samples, and it demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities.
🔥 Jan 2025: The AIN-7B model, the first Arabic Inclusive LMM, is released 🤗
🚀 4 Mar 2025: Model weights released on Hugging Face.
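With the weights on Hugging Face, the model can be tried with a few lines of `transformers` code. The sketch below is a minimal example, not official usage: the repo id `MBZUAI/AIN` and the Qwen2-VL-style interface are our assumptions, so consult the official model card for the exact loading instructions.

```python
# Minimal inference sketch. Assumptions (not confirmed by this README):
# the Hugging Face repo id "MBZUAI/AIN" and a Qwen2-VL-style checkpoint.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image

model_id = "MBZUAI/AIN"  # assumed repo id; check the official model card
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# AIN is bilingual: prompts can be written in Arabic or English.
image = Image.open("example.jpg")  # any local test image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "ما الذي يظهر في هذه الصورة؟"},  # "What appears in this image?"
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```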
Figure 1: Performance analysis of AIN-7B across CAMEL-Bench domains, compared with prominent closed-source models and open-source counterparts. Abbreviations: OCR: "OCR & Document Understanding"; Video: "General Video & Multi-Image Understanding"; RS: "Remote Sensing Understanding"; CDT: "Chart, Diagram & Table Understanding"; Agro.: "Agricultural Image Understanding"; Cultural: "Cultural-Specific Understanding"; Medical: "Medical Image Understanding".
Figure 2: Comparative performance of AIN-7B against other models in key domains (OCR & Document Understanding, Remote Sensing, and Agricultural Image Understanding), along with overall performance across all domains.
| Models | VQA | OCR | Video | RS | CDT | Agro. | Cult. | Med. | Total |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 🥈 55.15 | 🥈 54.98 | 🥇 69.65 | 🥈 27.36 | 🥈 62.35 | 🥈 80.75 | 🥇 80.86 | 🥇 49.91 | 🥈 60.13 |
| GPT-4o-mini | 48.83 | 39.38 | 🥈 66.28 | 16.93 | 56.37 | 78.80 | 65.92 | 🥈 47.37 | 52.49 |
| Gemini-1.5-Pro | 46.68 | 28.68 | 42.95 | 17.07 | 47.06 | 72.14 | 56.24 | 33.78 | 43.08 |
| Gemini-1.5-Flash | 45.59 | 27.58 | 53.31 | 14.95 | 48.26 | 76.07 | 46.54 | 42.87 | 44.40 |
| InternVL-8B | 30.41 | 15.91 | 51.42 | 5.36 | 30.27 | 44.47 | 20.88 | 29.48 | 28.52 |
| InternVL2.5-1B | 27.22 | 19.45 | 38.20 | 3.39 | 30.75 | 39.53 | 35.68 | 21.27 | 26.94 |
| Qwen2-VL-2B | 41.02 | 22.93 | 38.90 | 12.56 | 27.83 | 52.02 | 34.28 | 29.12 | 32.33 |
| AIN-7B (ours) | 🥇 56.78 | 🥇 72.35 | 64.09 | 🥇 45.92 | 🥇 64.10 | 🥇 85.05 | 🥈 78.09 | 43.77 | 🏆 63.77 |
Table 1. Performance comparison of AIN and different closed- and open-source LMMs across CAMEL-Bench domains. Best performance is marked with 🥇; second-best is 🥈.
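The Total column is the unweighted mean of the eight domain scores; a quick check for the AIN-7B row reproduces it exactly, and the same average recovers the other rows (e.g. 60.13 for GPT-4o):

```python
# Verify that "Total" is the unweighted mean of the eight domain scores
# (AIN-7B row from Table 1).
ain_scores = [56.78, 72.35, 64.09, 45.92, 64.10, 85.05, 78.09, 43.77]
print(round(sum(ain_scores) / len(ain_scores), 2))  # -> 63.77
```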
Figure 3: Qualitative examples showcasing AIN-7B's capabilities across various domains, including general VQA, OCR & Document Understanding, Remote Sensing, Medical Imaging, Agricultural Understanding, and Cultural-Specific tasks.
Figure 4: Data verification and filtering pipeline for textual and visual data, ensuring high-quality training data through semantic similarity checks, translation quality evaluations, and toxicity screening for safety compliance.
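To make the Figure 4 pipeline concrete, the sketch below mimics its textual checks on a single English-Arabic pair. The specific models (LaBSE for cross-lingual semantic similarity, Detoxify for toxicity screening) and the thresholds are illustrative assumptions on our part, not the tools the authors report using.

```python
# Illustrative sketch of Figure 4's textual filtering stages.
# Assumptions: LaBSE and Detoxify as stand-ins for the (unspecified)
# similarity and toxicity models; thresholds are arbitrary examples.
from sentence_transformers import SentenceTransformer, util
from detoxify import Detoxify

sim_model = SentenceTransformer("sentence-transformers/LaBSE")  # cross-lingual embeddings
tox_model = Detoxify("multilingual")

SIM_THRESHOLD = 0.85  # assumed minimum English-Arabic semantic similarity
TOX_THRESHOLD = 0.10  # assumed maximum tolerated toxicity score

def keep_pair(english: str, arabic: str) -> bool:
    """Keep an English-Arabic pair only if the translation is semantically
    faithful and both sides pass the toxicity screen."""
    emb = sim_model.encode([english, arabic], convert_to_tensor=True)
    if util.cos_sim(emb[0], emb[1]).item() < SIM_THRESHOLD:
        return False  # translation quality / semantic similarity check failed
    toxicity = max(tox_model.predict(english)["toxicity"],
                   tox_model.predict(arabic)["toxicity"])
    return toxicity <= TOX_THRESHOLD  # toxicity screening for safety compliance

print(keep_pair("A farmer inspects wheat for disease.",
                "مزارع يفحص القمح بحثًا عن المرض."))
```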