Performance comparison of closed- and open-source LLMs on FannOrFlop.
Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
---|---|---|---|---|---|---|---|
**Closed Models** | | | | | | | |
GPT-4o-2024-08-06 (OpenAI, 2024) | 0.0395 | 0.2882 | 0.6410 | 0.6775 | 3.92 (± 0.99) | 4.96 (± 0.20) | 7.52 |
GPT-4o-mini-2024-07-18 (OpenAI, 2024) | 0.0395 | 0.2542 | 0.6124 | 0.4383 | 2.91 (± 0.75) | 4.28 (± 0.57) | 7.50 |
Gemini-2.5-Flash (Google AI, 2025b) | 0.0153 | 0.2618 | 0.6319 | 0.7475 | 4.25 (± 1.00) | 4.98 (± 0.16) | 7.22 |
Gemini-2.0-Flash (Google AI, 2025a) | 0.0395 | 0.2618 | 0.6393 | 0.7154 | 3.99 (± 1.04) | 4.95 (± 0.22) | 6.50 |
Gemini-1.5-Pro (Reid et al., 2024) | 0.0395 | 0.2618 | 0.6333 | 0.6180 | 3.59 (± 1.00) | 4.80 (± 0.41) | 5.38 |
Fanar-Star (Fanar Team et al., 2025) | 0.0138 | 0.1538 | 0.5677 | 0.6468 | 2.16 (± 0.92) | 3.40 (± 0.76) | 2.88 |
**Open Models** | | | | | | | |
Deepseek-V3 (Liu et al., 2024) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.36 (± 0.91) | 4.98 (± 0.16) | 4.75 |
Deepseek-R1 (Guo et al., 2025) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.38 (± 0.92) | 4.98 (± 0.16) | 4.25 |
Llama-3.3-70B (Meta AI, 2024) | 0.0153 | 0.2618 | 0.6393 | 0.5364 | 2.51 (± 0.90) | 3.37 (± 0.73) | 7.20 |
Qwen-3 (Qwen Team, 2025) | 0.0296 | 0.2837 | 0.6158 | 0.6468 | 3.98 (± 0.90) | 4.73 (± 0.45) | 6.50 |
Aya-Expanse (Dang et al., 2024) | 0.0329 | 0.2771 | 0.6328 | 0.6468 | 3.76 (± 0.90) | 4.68 (± 0.47) | 5.88 |
Jais (Sengupta et al., 2023) | 0.0312 | 0.2698 | 0.6245 | 0.6023 | 3.21 (± 0.88) | 4.35 (± 0.52) | 5.35 |
ALLaM-7B (Bari et al., 2024) | 0.0119 | 0.0463 | 0.5375 | 0.5997 | 1.32 (± 0.62) | 2.11 (± 0.89) | 3.12 |
AceGPT-v2-70B-Chat (Huang et al., 2023) | 0.0402 | 0.0412 | 0.5759 | 0.6061 | 2.52 (± 0.91) | 3.46 (± 0.95) | 4.12 |
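
For reference, the surface-level metrics in the table (BLEU, chrF++, BERTScore) follow standard definitions and can be computed with the widely used `sacrebleu` and `bert-score` packages. The sketch below is illustrative only: the file paths, the Arabic language setting, and the corpus-level scoring setup are assumptions for demonstration, not a description of the benchmark's official evaluation harness.

```python
# Minimal sketch of the automatic metrics reported above (BLEU, chrF++, BERTScore).
# Assumptions: one hypothesis and one reference per line, Arabic text, and the
# `sacrebleu` / `bert-score` packages; this is not the benchmark's official harness.
import sacrebleu
from bert_score import score as bert_score

def load_lines(path: str) -> list[str]:
    """Read one segment per line, stripping the trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

hyps = load_lines("model_outputs.txt")  # hypothetical path
refs = load_lines("references.txt")     # hypothetical path

# Corpus-level BLEU; sacrebleu reports scores on a 0-100 scale,
# while the table appears to report them on a 0-1 scale.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU: {bleu.score / 100:.4f}")

# chrF++ is chrF extended with word n-grams up to order 2 (word_order=2).
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)
print(f"chrF++: {chrf.score / 100:.4f}")

# BERTScore F1 averaged over the corpus; lang="ar" selects a multilingual model.
P, R, F1 = bert_score(hyps, refs, lang="ar")
print(f"BERTScore (F1): {F1.mean().item():.4f}")
```

The judge-based columns (Textual Entailment, Faithfulness / Consistency, Fluency / Grammaticality, Interpretive Depth) are reported as means, with standard deviations in parentheses where available, and are not reproducible from these string-overlap metrics alone.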