DuwatBench EACL 2026

دواة (Duwat): Arabic Calligraphy Benchmark

Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

*Equal Contribution

1 Mohamed bin Zayed University of AI, 2 NUCES, 3 NUST, 4 Australian National University

Latest Updates

[04 Jan 2026] DuwatBench accepted to EACL 2026 Main Track!
[22 Jan 2026] DuwatBench, the open-source Arabic calligraphy benchmark for multimodal understanding, is released.
[23 Jan 2026] The DuwatBench dataset is available on Hugging Face.

DuwatBench Taxonomy

DuwatBench encompasses six principal Arabic calligraphy styles with 1,272 curated samples and ~1,475 unique words, including bounding box annotations for detection-oriented evaluation.

1,272 Curated Samples
6 Calligraphy Styles
13 Models Evaluated
~1,475 Unique Words

Overview

DuwatBench is a benchmark for evaluating large multimodal models (LMMs) on Arabic calligraphy recognition. Arabic calligraphy is one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. DuwatBench addresses a gap in evaluation: how well modern AI systems can read stylized Arabic text.

Figure 1. Left: Proportional breakdown of calligraphic styles (Style Statistics). Right: Proportional breakdown of textual categories (Category Statistics).

Key Features

1,272 Curated Samples: spanning 6 classical and modern calligraphic styles
9.5k+ Word Instances: approximately 1,475 unique words spanning religious and cultural domains
Bounding Box Annotations: for detection-level evaluation
Full Transcriptions: with style and theme labels
Artistic Backgrounds: preserving real-world visual complexity
Multi-tier Verification: ensuring annotation quality

DuwatBench Creation Pipeline

Figure 2. End-to-end pipeline for constructing DuwatBench, from data collection and manual transcription with bounding boxes to multi-tier verification and style/theme aggregation.

Six Calligraphic Styles

Thuluth (الثلث): 706 samples (55%). Ornate script used in mosque decorations.
Diwani (الديواني): 230 samples (18%). Flowing Ottoman court script.
Naskh (النسخ): 110 samples (9%). Standard readable script.
Kufic (الكوفي): 83 samples (7%). Geometric angular early Arabic script.
Ruq'ah (الرقعة): 76 samples (6%). Modern everyday handwriting.
Nasta'liq (النستعليق): 67 samples (5%). Persian-influenced flowing script.
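The per-style sample counts above can be checked against the stated total of 1,272 with a few lines of arithmetic. The snippet below is a minimal sketch that uses only the counts listed on this page; the variable names are illustrative and not part of any DuwatBench tooling.

```python
# Sample counts per style, copied from the list above.
style_counts = {
    "Thuluth": 706,
    "Diwani": 230,
    "Naskh": 110,
    "Kufic": 83,
    "Ruq'ah": 76,
    "Nasta'liq": 67,
}

total = sum(style_counts.values())

# Share of each style as a percentage of all samples.
shares = {style: 100 * count / total for style, count in style_counts.items()}

for style, count in style_counts.items():
    print(f"{style:<10} {count:>4}  {shares[style]:.1f}%")
print(f"{'Total':<10} {total:>4}")
```

The counts sum to the 1,272 curated samples, and each computed share agrees with the rounded percentage listed above to within one percentage point.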

Dataset Examples

[Figure: sample calligraphy images from the dataset]

Evaluation Metrics

| Metric | Description |
|---|---|
| CER | Character Error Rate - edit distance at character level |
| WER | Word Error Rate - edit distance at word level |
| chrF | Character n-gram F-score - partial match robustness |
| ExactMatch | Strict full-sequence accuracy |
| NLD | Normalized Levenshtein Distance - balanced error measure |
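To make the edit-distance metrics concrete, here is a minimal sketch of CER, WER, ExactMatch, and NLD in plain Python. The normalizations are assumptions, since this page does not spell them out: CER and WER divide by the reference length, and NLD divides by the longer of the two strings. Any text normalization the benchmark applies before scoring (for example diacritic handling) is not modeled. chrF is omitted; libraries such as sacreBLEU provide an implementation.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character-level edit distance divided by reference length (assumed)."""
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word-level edit distance divided by reference word count (assumed)."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)


def nld(ref: str, hyp: str) -> float:
    """Edit distance divided by the longer string's length (assumed)."""
    longest = max(len(ref), len(hyp))
    return levenshtein(ref, hyp) / longest if longest else 0.0


def exact_match(ref: str, hyp: str) -> bool:
    """True only when the full sequences are identical."""
    return ref == hyp


if __name__ == "__main__":
    reference = "بسم الله الرحمن"
    prediction = "بسم الله الرحيم"
    print("CER:", cer(reference, prediction))
    print("WER:", wer(reference, prediction))
    print("NLD:", nld(reference, prediction))
    print("ExactMatch:", exact_match(reference, prediction))
```

Because CER and WER are normalized by the reference length, they can exceed 1.0 when a prediction is much longer than the reference, whereas NLD as defined here stays within [0, 1].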

Benchmark Results

Open-Source Models (8)

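| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| MBZUAI/AIN* | 0.5494 | 0.6912 | 42.67 | 0.1895 | 0.5134 |
| Gemma-3-27B-IT | 0.5556 | 0.6591 | 51.53 | 0.2398 | 0.4741 |
| Qwen2.5-VL-72B | 0.5709 | 0.7039 | 43.98 | 0.1761 | 0.5298 |
| Qwen2.5-VL-7B | 0.6453 | 0.7768 | 36.97 | 0.1211 | 0.5984 |
| InternVL3-8B | 0.7588 | 0.8822 | 21.75 | 0.0574 | 0.7132 |
| EasyOCR | 0.8538 | 0.9895 | 12.30 | 0.0031 | 0.8163 |
| TrOCR-Arabic* | 0.9728 | 0.9998 | 1.79 | 0.0000 | 0.9632 |
| LLaVA-v1.6-Mistral-7B | 0.9932 | 0.9998 | 9.16 | 0.0000 | 0.9114 |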

* Arabic-specific models

Closed-Source Models (5)

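| Model | CER ↓ | WER ↓ | chrF ↑ | ExactMatch ↑ | NLD ↓ |
|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.3700 | 0.4478 | 71.82 | 0.4167 | 0.3166 |
| Gemini-1.5-flash | 0.3933 | 0.5112 | 63.28 | 0.3522 | 0.3659 |
| GPT-4o | 0.4766 | 0.5692 | 56.85 | 0.3388 | 0.4245 |
| GPT-4o-mini | 0.6039 | 0.7077 | 42.67 | 0.2115 | 0.5351 |
| Claude-Sonnet-4.5 | 0.6494 | 0.7255 | 42.97 | 0.2225 | 0.5599 |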

Per-Style WER Performance

Word Error Rate (WER ↓) across calligraphy styles - Full Image mode

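| Model | Kufic | Thuluth | Diwani | Naskh | Ruq'ah | Nasta'liq |
|---|---|---|---|---|---|---|
| Gemini-2.5-flash | 0.7067 | 0.3527 | 0.5698 | 0.4765 | 0.5817 | 0.5222 |
| Gemini-1.5-flash | 0.7212 | 0.4741 | 0.5783 | 0.4444 | 0.5445 | 0.5023 |
| GPT-4o | 0.8041 | 0.5540 | 0.6370 | 0.4189 | 0.5507 | 0.4434 |
| Gemma-3-27B-IT | 0.7802 | 0.6315 | 0.7326 | 0.5138 | 0.7571 | 0.6637 |
| MBZUAI/AIN | 0.7916 | 0.7036 | 0.7130 | 0.5367 | 0.6111 | 0.6916 |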
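One simple way to summarize the per-style numbers is to average WER across the five listed models and rank the styles. The sketch below does that with the values copied from the table. The unweighted mean over these five models is an illustrative choice, not an aggregation the authors report.

```python
# WER values copied from the per-style table (Full Image mode).
styles = ["Kufic", "Thuluth", "Diwani", "Naskh", "Ruq'ah", "Nasta'liq"]
wer_by_model = {
    "Gemini-2.5-flash": [0.7067, 0.3527, 0.5698, 0.4765, 0.5817, 0.5222],
    "Gemini-1.5-flash": [0.7212, 0.4741, 0.5783, 0.4444, 0.5445, 0.5023],
    "GPT-4o":           [0.8041, 0.5540, 0.6370, 0.4189, 0.5507, 0.4434],
    "Gemma-3-27B-IT":   [0.7802, 0.6315, 0.7326, 0.5138, 0.7571, 0.6637],
    "MBZUAI/AIN":       [0.7916, 0.7036, 0.7130, 0.5367, 0.6111, 0.6916],
}

# Unweighted mean WER per style across the listed models (an assumption).
mean_wer = {
    style: sum(row[i] for row in wer_by_model.values()) / len(wer_by_model)
    for i, style in enumerate(styles)
}

# Styles ordered from lowest (easiest) to highest (hardest) mean WER.
ranked = sorted(styles, key=mean_wer.get)

for style in ranked:
    print(f"{style:<10} mean WER = {mean_wer[style]:.4f}")
```

Per-model rankings can differ from this average, so the table itself remains the reference for any single model.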

Key Findings

Best Performance

Gemini-2.5-flash achieves the best overall performance with 41.67% exact match accuracy and the lowest CER of 0.37.

Best Styles

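Models perform best on Naskh and Ruq'ah scripts, which have standardized strokes and clear letterforms. In the Per-Style WER Performance results, Naskh has the lowest WER for four of the five listed models.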

Challenging Styles

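Diwani and Thuluth (ornate scripts with dense ligatures) remain challenging for all models. In the Per-Style WER Performance results, Kufic has the highest WER for every listed model (0.7067 to 0.8041).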

BBox Improvement

Bounding box localization improves performance across most models.
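One plausible reading of bounding-box localization is that the annotated text region is cropped out and passed to the model instead of the full image. The sketch below illustrates that idea on a plain 2D pixel grid. The `(x, y, width, height)` box format and the function names are hypothetical; the page does not specify the annotation format or how the bounding-box evaluation mode is implemented.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def clamp_box(box: Box, img_w: int, img_h: int) -> Box:
    """Clip a box so it lies entirely inside an img_w x img_h image."""
    x, y, w, h = box
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(img_w, x + w), min(img_h, y + h)
    return x0, y0, max(0, x1 - x0), max(0, y1 - y0)


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-grid of `pixels` (rows of values) covered by `box`."""
    img_h = len(pixels)
    img_w = len(pixels[0]) if pixels else 0
    x, y, w, h = clamp_box(box, img_w, img_h)
    return [row[x:x + w] for row in pixels[y:y + h]]


if __name__ == "__main__":
    # A toy 4x5 "image" whose values encode row and column (row * 10 + col).
    image = [[r * 10 + c for c in range(5)] for r in range(4)]
    print(crop(image, (1, 1, 2, 2)))
```

In practice the same crop would be applied with an image library before the region is sent to the model. Clamping first avoids errors when an annotated box extends slightly past the image border.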

Qualitative Comparison

Figure 3. Qualitative results comparing open- and closed-source models on DuwatBench calligraphy samples.

Citation

@article{duwatbench2025,
  title={DuwatBench: Bridging Language and Visual Heritage through an
         Arabic Calligraphy Benchmark for Multimodal Understanding},
  author={Patle, Shubham and Ghaboura, Sara and Tariq, Hania and
          Khan, Mohammad Usman and Thawakar, Omkar and
          Anwer, Rao Muhammad and Khan, Salman},
  journal={arXiv preprint arXiv:2502.14865},
  year={2025}
}
    

Acknowledgments

IVAL, Oryx, MBZUAI