Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which answers them, with learning driven by a continuous self-rewarding process based on the Solver's internal consistency. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ∼3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on self-improving LMMs in a fully unsupervised fashion.
EvoLMM Pipeline
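The pipeline instantiates the Proposer and the Solver from the same backbone and trains both from self-rewards alone. The minimal sketch below illustrates one such step on a raw image; the `backbone.generate` interface, role prompts, and sample count are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of one EvoLMM self-evolution step (hypothetical API,
# not the official implementation). A single backbone LMM plays two roles:
# a Proposer that writes an image-grounded question and a Solver that
# answers it several times, so agreement can act as an internal reward.
from collections import Counter

def self_evolution_step(backbone, image, num_solver_samples=8):
    # Proposer role: generate a question grounded in the raw image.
    question = backbone.generate(image, role="proposer")

    # Solver role: sample several independent answers to the same question.
    answers = [
        backbone.generate(image, question, role="solver")
        for _ in range(num_solver_samples)
    ]

    # Internal consistency: fraction of answers matching the modal answer.
    agreement = Counter(answers).most_common(1)[0][1] / len(answers)

    # Self-rewards (no ground truth): the Solver is rewarded for agreement,
    # the Proposer for moderate difficulty (a simple quadratic stand-in here;
    # the entropy band-pass version is sketched in the next section).
    solver_reward = agreement
    proposer_reward = 4.0 * agreement * (1.0 - agreement)  # peaks at 0.5
    return question, answers, solver_reward, proposer_reward
```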
Continuous Reward: Why it Stabilizes Self-Evolution
Early in training, the Solver's answers to a given image-grounded question are diverse, so discrete majority-vote rewards become sparse and unstable. EvoLMM replaces them with a continuous self-consistency signal that scales smoothly with agreement and lightly penalizes verbosity. The Proposer uses an entropy-guided band-pass reward that peaks at moderate difficulty, discouraging trivial or unsolvable questions and inducing an emergent curriculum as the Solver improves. Both roles are optimized via KL-regularized REINFORCE with moving baselines for stability. This continuous self-evolving design thus provides a non-zero learning signal even when the model is uncertain, avoiding the stagnation observed with the discrete-reward self-questioning scheme. Further, it enables the Proposer to continuously adjust question difficulty to match the Solver's evolving capability, thereby mitigating model collapse.
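A minimal sketch of how these rewards and updates could look is given below; the agreement measure, band edges, verbosity coefficient, and EMA momentum are assumptions made for illustration, not the exact formulation.

```python
# Hedged sketch of the continuous rewards and KL-regularized REINFORCE update
# described above. Coefficients, band edges, and function names are assumptions
# for illustration; they are not taken from the released EvoLMM code.
import math
from collections import Counter

def solver_rewards(answers, lengths, max_len=512, len_coef=0.05):
    """Continuous self-consistency reward for each Solver sample.

    Each sample is rewarded by the fraction of all samples that agree with it
    (a smooth, non-zero signal even under high diversity), minus a light
    verbosity penalty proportional to answer length.
    """
    counts = Counter(answers)
    return [
        counts[a] / len(answers) - len_coef * min(n / max_len, 1.0)
        for a, n in zip(answers, lengths)
    ]

def proposer_reward(answers, low=0.25, high=0.75):
    """Entropy-guided band-pass reward for the Proposer.

    Normalized entropy of the Solver's answer distribution is low for trivial
    questions (near-unanimous answers) and high for unsolvable ones
    (near-uniform answers); the reward peaks when it falls in a moderate band.
    """
    counts = Counter(answers)
    probs = [c / len(answers) for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(answers)) if len(answers) > 1 else 1.0
    h = entropy / max_entropy                      # normalized to [0, 1]
    center, width = (low + high) / 2.0, (high - low) / 2.0
    return math.exp(-((h - center) / width) ** 2)  # smooth band-pass bump

def reinforce_loss(log_prob, ref_log_prob, reward, baseline, kl_coef=0.05):
    """KL-regularized REINFORCE objective for one sampled response.

    `log_prob` / `ref_log_prob` are summed token log-probabilities under the
    trained policy and a frozen reference copy; the moving baseline reduces
    the variance of the policy-gradient term. Works with plain floats for
    illustration, or with autograd tensors in an actual training loop.
    """
    advantage = reward - baseline                  # treated as a constant
    kl = log_prob - ref_log_prob                   # per-sample KL estimate
    return -advantage * log_prob + kl_coef * kl

def update_baseline(baseline, reward, momentum=0.9):
    """Exponential-moving-average baseline, maintained per role."""
    return momentum * baseline + (1.0 - momentum) * reward
```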
Main Results
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Vision-Zero†(CLEVR) | 84.24 | 68.43 | 23.96 | 43.86 | 80.35 | 82.64 | 88.50 | 51.44 |
| Qwen2.5-VL-7B (Baseline) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B + Discrete Reward | 84.62 | 68.88 | 22.52 | 42.10 | 80.52 | 82.18 | 87.98 | 50.84 |
| Qwen2.5-VL-7B + Continuous Reward (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Δ Improvement | +2.70 | +2.06 | +0.90 | +1.10 | +0.62 | +0.80 | +1.20 | +0.90 |
† Uses external supervision.
Comparison of fine-tuning strategies (Qwen2.5-VL-7B)
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Baseline) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B + LoRA (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Qwen2.5-VL-7B + QLoRA | 85.32 | 68.92 | 23.97 | 43.82 | 80.83 | 82.75 | 88.73 | 51.71 |
| Qwen2.5-VL-7B + Full-Finetune | 84.20 | 68.41 | 23.37 | 43.77 | 80.37 | 82.64 | 88.12 | 51.23 |
EvoLMM applied to different base LMMs
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| InternVL3-8B-Instruct (Base) | 82.40 | 65.20 | 25.36 | 31.62 | 68.77 | 83.19 | 97.77 | 52.78 |
| InternVL3-8B-Instruct (EvoLMM) | 84.97 | 67.20 | 26.44 | 32.92 | 69.39 | 83.95 | 98.13 | 53.77 |
| Gemma-3-12B-It (Base) | 55.64 | 60.13 | 24.53 | 28.96 | 50.69 | 79.05 | 83.89 | 48.11 |
| Gemma-3-12B-It (EvoLMM) | 58.61 | 62.13 | 25.61 | 30.26 | 51.37 | 79.85 | 84.97 | 49.10 |
| Llama-3.2-11B-Vision-Instruct (Base) | 29.24 | 46.59 | 23.47 | 37.23 | 56.69 | 46.44 | 56.87 | 47.93 |
| Llama-3.2-11B-Vision-Instruct (EvoLMM) | 32.24 | 48.59 | 24.55 | 38.53 | 57.37 | 47.32 | 58.07 | 48.92 |
Effect of model scale (Qwen2.5-VL-7B vs. 72B)
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Qwen2.5-VL-72B (Base) | 88.20 | 73.93 | 36.92 | 54.09 | 85.97 | 87.34 | 93.36 | 65.86 |
| Qwen2.5-VL-72B (EvoLMM) | 91.04 | 76.44 | 38.31 | 55.45 | 86.63 | 88.19 | 94.63 | 67.02 |
Citation
@misc{thawakar2025evolmmselfevolvinglargemultimodal,
title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards},
author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
year={2025},
eprint={2511.16672},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.16672},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the open-source LMM projects for releasing their models, code, and templates.