KITAB-Bench

A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Ahmed Heakl1,2*, Abdullah Sohail1*, Mukul Ranjan1*, Rania Hossam1*, Ghazi Ahmed1, Mohamed El-Geish2, Omar Maher2, Zhiqiang Shen1, Fahad S. Khan1,3, Salman Khan1,4

* Equal Contributions

1Mohamed bin Zayed University of AI, 2Monta AI, 3Linköping University, 4Australian National University

KITAB-Bench Domains

The proposed KITAB-Bench covers nine diverse and challenging domains: layout detection, line recognition, image-to-text, VQA, diagram-to-code, table recognition, charts-to-JSON, PDF-to-Markdown conversion, and Arabic numerals. KITAB-Bench spans 36 sub-domains with 8,809 samples, carefully curated to rigorously evaluate essential skills required for Arabic OCR and document analysis. The benchmark is designed to assess structured text recognition, complex layouts, handwritten text, tables, and chart understanding, providing a comprehensive evaluation framework for modern Arabic OCR models and vision-language systems.
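For the structured-output domains (charts-to-JSON, diagram-to-code, table recognition), a prediction is judged against a machine-readable target rather than a plain transcript. The sketch below is purely illustrative: the field names and the exact-match check are assumptions made for exposition, not the benchmark's official schema or scoring code.

```python
import json

# Hypothetical reference annotation for a bar chart in a charts-to-JSON task
# (field names are illustrative, not the official KITAB-Bench schema).
reference = {
    "chart_type": "bar",
    "title": "Quarterly Sales",
    "data": [{"label": "Q1", "value": 120}, {"label": "Q2", "value": 150}],
}

# A model's raw JSON prediction for the same chart image.
prediction_text = (
    '{"chart_type": "bar", "title": "Quarterly Sales", '
    '"data": [{"label": "Q1", "value": 120}, {"label": "Q2", "value": 140}]}'
)
prediction = json.loads(prediction_text)

# Naive exact-match comparison; a real chart-extraction metric would score
# individual fields and data points rather than the whole object at once.
print("exact match:", prediction == reference)
```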

News

[2025-02-10]: Our KITAB-Bench is now available on HuggingFace (a minimal loading sketch is shown after this list). We welcome all contributions and look forward to your participation!
[2025-02-15]: KITAB-Bench has been submitted to ACL 2025.
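Since the benchmark is distributed through HuggingFace, one plausible way to pull a subset is via the datasets library. The repository id and field names below are placeholders; the actual names are listed on the KITAB-Bench HuggingFace page.

```python
from datasets import load_dataset

# Placeholder repository id; substitute the actual KITAB-Bench repo/config
# names from the HuggingFace page.
ds = load_dataset("your-org/kitab-bench-subset", split="train")

# Each sample is expected to pair a document image with its ground truth;
# the field names here are assumptions, so inspect the keys first.
sample = ds[0]
print(sample.keys())
```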

Abstract

With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best-performing model, Gemini-2.0-Flash, achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.

Challenges in Arabic OCR

Task Categories in KITAB-Bench

Data Generation Pipeline

Arabic OCR Benchmark Results

Performance Comparison of SOTA Models on KITAB-Bench

Dataset           Size   GPT-4o        GPT-4o-mini   Gemini-2.0-Flash   Qwen2-VL
                         CER    WER    CER    WER    CER    WER         CER    WER
PATS               500   0.23   0.30   0.53   0.71   0.01   0.02        1.02   1.02
SythenAR           500   0.09   0.20   0.14   0.32   0.07   0.17        0.59   1.13
HistoryAr          200   0.51   0.82   0.67   0.96   0.28   0.64        3.46   2.86
HistoricalBooks     10   0.41   0.76   0.59   0.88   0.05   0.22        1.90   2.16
Khatt              200   0.45   0.74   0.64   0.91   0.19   0.45        1.12   5.04
Adab               200   0.30   0.73   0.35   0.83   0.19   0.56        0.63   1.08
Muharaf            200   0.56   0.90   0.63   0.94   0.33   0.69        3.57   2.87
OnlineKhatt        200   0.29   0.63   0.41   0.76   0.17   0.44        1.30   2.01
ISI-PPT            500   0.08   0.18   0.15   0.31   0.06   0.15        1.03   1.06
ArabicOCR           50   0.06   0.26   0.16   0.46   0.00   0.02        1.25   1.50
Hindawi            200   0.34   0.56   0.48   0.71   0.01   0.04        1.82   2.05
EvArest            800   0.20   0.38   0.25   0.51   0.18   0.36        0.41   0.95
Total            3,760   0.31   0.55   0.43   0.71   0.13   0.32        1.48   1.20
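The CER and WER columns are normalized edit-distance metrics: the Levenshtein distance between the predicted and reference text, divided by the reference length in characters or words respectively, which is why scores can exceed 1.0 when a model outputs far more text than the reference (as in several Qwen2-VL rows). A minimal sketch of how such scores could be computed with the jiwer library is shown below; this is an assumption about tooling, not necessarily the benchmark's exact evaluation code.

```python
from jiwer import cer, wer

# Reference transcription and a model's OCR output (toy Latin-script example;
# KITAB-Bench itself evaluates Arabic text).
reference = "retrieval augmented generation for documents"
hypothesis = "retrieval augmentd generation for document"

# Character Error Rate: character-level edit distance / reference length.
print(f"CER: {cer(reference, hypothesis):.3f}")
# Word Error Rate: word-level edit distance / reference word count.
print(f"WER: {wer(reference, hypothesis):.3f}")
```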