Performance comparison of closed- and open-source LLMs on FannOrFlop.
Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
---|---|---|---|---|---|---|---|
**Closed Models** | | | | | | | |
GPT-4o-2024-08-06 (OpenAI, 2024) | 0.0395 | 0.2882 | 0.6410 | 0.6775 | 3.92 (± 0.99) | 4.96 (± 0.20) | 7.52 |
GPT-4o-mini-2024-07-18 (OpenAI, 2024) | 0.0395 | 0.2542 | 0.6124 | 0.4383 | 2.91 (± 0.75) | 4.28 (± 0.57) | 7.50 |
Gemini-2.5-Flash (Google AI, 2025b) | 0.0153 | 0.2618 | 0.6319 | 0.7475 | 4.25 (± 1.00) | 4.98 (± 0.16) | 7.22 |
Gemini-2.0-Flash (Google AI, 2025a) | 0.0395 | 0.2618 | 0.6393 | 0.7154 | 3.99 (± 1.04) | 4.95 (± 0.22) | 6.50 |
Gemini-1.5-Pro (Reid et al., 2024) | 0.0395 | 0.2618 | 0.6333 | 0.6180 | 3.59 (± 1.00) | 4.80 (± 0.41) | 5.38 |
Fanar-Star (Fanar Team et al., 2025) | 0.0138 | 0.1538 | 0.5677 | 0.6468 | 2.16 (± 0.92) | 3.40 (± 0.76) | 2.88 |
**Open Models** | | | | | | | |
Deepseek-V3 (Liu et al., 2024) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.36 (± 0.91) | 4.98 (± 0.16) | 4.75 |
Deepseek-R1 (Guo et al., 2025) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.38 (± 0.92) | 4.98 (± 0.16) | 4.25 |
Llama-3.3-70B (Meta AI, 2024) | 0.0153 | 0.2618 | 0.6393 | 0.5364 | 2.51 (± 0.90) | 3.37 (± 0.73) | 7.20 |
Qwen-3 (Qwen Team, 2025) | 0.0296 | 0.2837 | 0.6158 | 0.6468 | 3.98 (± 0.90) | 4.73 (± 0.45) | 6.50 |
Aya-Expanse (Dang et al., 2024) | 0.0329 | 0.2771 | 0.6328 | 0.6468 | 3.76 (± 0.90) | 4.68 (± 0.47) | 5.88 |
Jais (Sengupta et al., 2023) | 0.0312 | 0.2698 | 0.6245 | 0.6023 | 3.21 (± 0.88) | 4.35 (± 0.52) | 5.35 |
ALLaM-7B (Bari et al., 2024) | 0.0119 | 0.0463 | 0.5375 | 0.5997 | 1.32 (± 0.62) | 2.11 (± 0.89) | 3.12 |
AceGPT-v2-70B-Chat (Huang et al., 2023) | 0.0402 | 0.0412 | 0.5759 | 0.6061 | 2.52 (± 0.91) | 3.46 (± 0.95) | 4.12 |
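
For reference, the surface-level metrics in the table (BLEU, chrF++, BERTScore) follow standard definitions and can be computed with the widely used `sacrebleu` and `bert-score` packages. The sketch below is illustrative only: the file paths, the Arabic language setting, and the corpus-level scoring setup are assumptions for demonstration, not a description of the benchmark's official evaluation harness.

```python
# Minimal sketch of the automatic metrics reported above (BLEU, chrF++, BERTScore).
# Assumptions: one hypothesis and one reference per line, Arabic text, and the
# `sacrebleu` / `bert-score` packages; this is not the benchmark's official harness.
import sacrebleu
from bert_score import score as bert_score

def load_lines(path: str) -> list[str]:
    """Read one segment per line, stripping the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

hyps = load_lines("model_outputs.txt")  # hypothetical path
refs = load_lines("references.txt")     # hypothetical path

# Corpus-level BLEU; sacrebleu reports scores on a 0-100 scale,
# while the table appears to report them on a 0-1 scale.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score / 100:.4f}")

# chrF++ is chrF extended with word n-grams up to order 2 (word_order=2).
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)
print(f"chrF++: {chrf.score / 100:.4f}")

# BERTScore F1 averaged over the corpus; lang="ar" selects a multilingual model.
P, R, F1 = bert_score(hyps, refs, lang="ar")
print(f"BERTScore (F1): {F1.mean().item():.4f}")
```

The judge-based columns (Textual Entailment, Faithfulness / Consistency, Fluency / Grammaticality, Interpretive Depth) are reported as means, with standard deviations in parentheses where available, and are not reproducible from these string-overlap metrics alone.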