
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Mohamed bin Zayed University of AI (MBZUAI), Australian National University, Aalto University, Linköping University
*Equally contributing first authors

EvoLMM is a fully unsupervised self-evolving framework for LMMs that improves visual reasoning from raw images only, by coupling a Proposer and a Solver trained via continuous self-consistency rewards.

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which answers them, with learning proceeding through a continuous self-rewarding process based on internal answer consistency. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth labels or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ∼3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline that eases future research on self-improving LMMs in a fully unsupervised fashion.

🔥Highlights

  1. We introduce a self-evolving multimodal framework, named EvoLMM, that enables a base LMM to improve without human labels, metadata, or external reward models. The framework decomposes the model into two internal roles, Proposer and Solver, forming a closed-loop propose-solve cycle trained solely through internal consistency feedback.

  2. We develop a continuous self-rewarding mechanism based on multi-sample answer consistency, which replaces both learned discrete reward models and semantic similarity scoring used in prior LMM self-evolution approaches. This continuous internal reward signal provides smooth gradients and stable optimization, enabling consistent improvement in performance.

  3. We empirically validate EvoLMM on mathematical visual reasoning benchmarks, with absolute gains of ~2–3% over the Qwen2.5-VL-7B baseline using only raw images during training. We further analyze the evolution of our propose-solve mechanism, in which question difficulty gradually increases while learning remains stable, showing that the model naturally develops more structured and grounded reasoning behaviors over time. Furthermore, we show that internal consistency can serve as a viable supervision signal for open-ended multimodal learning.

EvoLMM Pipeline

The Proposer generates visually grounded questions from raw images; the Solver answers them multiple times. Self-consistency among answers yields continuous rewards for the Solver, while the Proposer is rewarded for mid-entropy (moderate difficulty) questions—forming a closed-loop self-evolving curriculum.
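To make the cycle concrete, below is a minimal Python sketch of one propose-solve step; the `generate` calls, prompts, and the 8-answer sampling budget are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of one propose-solve cycle (illustrative only: the `generate`
# API, prompts, and sampling budget are assumptions, not the authors' code).
# Proposer and Solver are two roles instantiated from the same backbone LMM.
from collections import Counter

def propose_solve_step(proposer, solver, image, num_answers=8):
    # 1) Proposer writes a visually grounded question from the raw image.
    question = proposer.generate(image, "Ask a question grounded in this image.")

    # 2) Solver answers the same question several times independently.
    answers = [solver.generate(image, question) for _ in range(num_answers)]

    # 3) Agreement among the sampled answers is the only feedback signal.
    #    A simple majority-vote fraction is shown here; the next section
    #    replaces it with a smoother, continuous reward.
    top_count = Counter(answers).most_common(1)[0][1]
    agreement = top_count / num_answers

    return question, answers, agreement
```

In this loop, the agreement signal drives the Solver update, while the Proposer is rewarded when agreement is neither near 0 nor near 1, i.e., when the question is of moderate difficulty.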

Continuous Reward: Why it Stabilizes Self-Evolution

Early in training, Solver outputs over images are diverse; discrete majority-vote rewards become sparse and unstable. EvoLMM replaces them with a continuous self-consistency signal that scales smoothly with agreement and penalizes verbosity lightly. The Proposer uses an entropy-guided band-pass reward that peaks at moderate difficulty, discouraging trivial or unsolvable questions and inducing an emergent curriculum as the Solver improves. Both roles are optimized via KL-regularized REINFORCE with moving baselines for stability. Thus, our proposed continuous self-evolving design provides a non-zero learning signal even when the model is uncertain, avoiding stagnation observed in the discrete-reward self-questioning scheme. Further, our design enables the Proposer to continuously adjust question difficulty to match the Solver’s evolving capability, thereby mitigating model collapse.
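As a rough illustration of how these signals could be computed, the Python sketch below uses mean pairwise answer agreement with a light length penalty for the Solver, a Gaussian band-pass over answer entropy for the Proposer, and a scalar KL-regularized REINFORCE objective with a moving baseline. The specific functional forms and hyperparameters here are our assumptions, not the paper's exact equations.

```python
import math
from collections import Counter

def solver_reward(answers, length_penalty=0.01):
    """Continuous self-consistency reward: grows smoothly with agreement
    among sampled answers and lightly penalizes verbosity."""
    n = len(answers)
    counts = Counter(answers)
    # Probability that two distinct samples agree (in [0, 1]); smoother
    # than a hard 0/1 majority vote, so it stays informative early on.
    agreement = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    mean_len = sum(len(a.split()) for a in answers) / n
    return agreement - length_penalty * mean_len

def proposer_reward(answers, target_entropy=1.0, width=0.5):
    """Entropy-guided band-pass reward: peaks at moderate difficulty and
    decays for trivial (unanimous) or unsolvable (fully scattered) questions."""
    n = len(answers)
    probs = [c / n for c in Counter(answers).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(-((entropy - target_entropy) ** 2) / (2 * width ** 2))

def reinforce_loss(log_prob, reward, baseline, kl_to_ref, beta=0.05):
    """KL-regularized REINFORCE with a moving baseline (scalar sketch;
    in practice these would be per-token tensors)."""
    advantage = reward - baseline
    return -advantage * log_prob + beta * kl_to_ref
```

In such a setup, the baseline would typically be maintained as an exponential moving average of recent rewards for each role, which keeps the advantage estimates centered as question difficulty drifts.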

Main Results

Evaluation results across eight multimodal mathematical and visual reasoning benchmarks.
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Vision-Zero† (CLEVR) | 84.24 | 68.43 | 23.96 | 43.86 | 80.35 | 82.64 | 88.50 | 51.44 |
| Qwen2.5-VL-7B (Baseline) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B + Discrete Reward | 84.62 | 68.88 | 22.52 | 42.10 | 80.52 | 82.18 | 87.98 | 50.84 |
| Qwen2.5-VL-7B + Continuous Reward (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Δ Improvement | +2.70 | +2.06 | +0.90 | +1.10 | +0.62 | +0.80 | +1.20 | +0.90 |

† uses external supervision.


Comparison of our EvoLMM self-evolving framework under different parameter update strategies.
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Baseline) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B + LoRA (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Qwen2.5-VL-7B + QLoRA | 85.32 | 68.92 | 23.97 | 43.82 | 80.83 | 82.75 | 88.73 | 51.71 |
| Qwen2.5-VL-7B + Full-Finetune | 84.20 | 68.41 | 23.37 | 43.77 | 80.37 | 82.64 | 88.12 | 51.23 |

Effectiveness of our EvoLMM self-evolving framework across different large multimodal backbones.
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 84.00 | 68.46 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| InternVL3-8B-Instruct (Base) | 82.40 | 65.20 | 25.36 | 31.62 | 68.77 | 83.19 | 97.77 | 52.78 |
| InternVL3-8B-Instruct (EvoLMM) | 84.97 | 67.20 | 26.44 | 32.92 | 69.39 | 83.95 | 98.13 | 53.77 |
| Gemma-3-12B-It (Base) | 55.64 | 60.13 | 24.53 | 28.96 | 50.69 | 79.05 | 83.89 | 48.11 |
| Gemma-3-12B-It (EvoLMM) | 58.61 | 62.13 | 25.61 | 30.26 | 51.37 | 79.85 | 84.97 | 49.10 |
| Llama-3.2-11B-Vision-Instruct (Base) | 29.24 | 46.59 | 23.47 | 37.23 | 56.69 | 46.44 | 56.87 | 47.93 |
| Llama-3.2-11B-Vision-Instruct (EvoLMM) | 32.24 | 48.59 | 24.55 | 38.53 | 57.37 | 47.32 | 58.07 | 48.92 |

Scaling behaviour of our EvoLMM self-evolving framework across model sizes in the Qwen2.5-VL family.
| Model | ChartQA | MathVista | MathVision | MathVerse | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 84.00 | 68.20 | 23.91 | 43.78 | 80.44 | 82.61 | 88.30 | 51.11 |
| Qwen2.5-VL-7B (EvoLMM) | 86.70 | 70.52 | 24.81 | 44.88 | 81.06 | 83.41 | 89.50 | 52.01 |
| Qwen2.5-VL-72B (Base) | 88.20 | 73.93 | 36.92 | 54.09 | 85.97 | 87.34 | 93.36 | 65.86 |
| Qwen2.5-VL-72B (EvoLMM) | 91.04 | 76.44 | 38.31 | 55.45 | 86.63 | 88.19 | 94.63 | 67.02 |

Citation


@misc{thawakar2025evolmmselfevolvinglargemultimodal,
      title={EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards}, 
      author={Omkar Thawakar and Shravan Venkatraman and Ritesh Thawkar and Abdelrahman Shaker and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Khan},
      year={2025},
      eprint={2511.16672},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.16672}, 
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are grateful to the open-source LMM projects for releasing their models, code, and templates.
