Self-evolving multimodal post-training

Ask, Solve, Generate

Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards

Ritesh Thawkar¹, Shravan Venkatraman¹, Omkar Thawakar¹, Abdelrahman M Shaker¹, Fahad Shahbaz Khan¹,³, Hisham Cholakkal¹, Salman Khan¹, Rao Muhammad Anwer¹,²
¹Mohamed Bin Zayed University of Artificial Intelligence, ²Aalto University, ³Linköping University
High-level overview of the self-evolving framework for unified understanding and generation.
Three LoRA roles operate on a frozen unified backbone: the Proposer asks, the Solver answers and evaluates, and the Generator synthesizes images under Solver-mediated rewards.

TL;DR

Can a unified model improve understanding and generation using only unlabeled images?

Most unified understanding-and-generation models still depend on curated post-training supervision — VQA labels, preference data, captions, or external reward models. We show that a single self-evolving recipe can improve both capabilities from raw images alone, with no human annotations and no task-trained reward or judge models, and that the same recipe transfers across diffusion, rectified-flow, and autoregressive backbones.

No labels

Training uses a 10k-image unlabeled pool. All captions, boxes, and QA annotations are discarded — supervision is derived entirely from the model's own consistency.

No external judges

Rewards come from the model itself through self-consistency and a Solver that evaluates its own generations — no GPT-as-judge, reward model, or preference data.

One recipe, three paradigms

The same role decomposition, reward design, and schedule improve all eight understanding metrics and raise the overall GenEval score by 3 points on each of BLIP3o-8B, BAGEL, and VARGPT-v1.1.

Why it is hard

Removing labels exposes two failure modes.

Naive self-consistency training lets the learning signal collapse and leaves the two tasks decoupled. Our two core ideas directly target these failures.

01

Reward degeneracy

Sample-level self-consistency saturates: when every prompt framing yields the same answer, entropy hits zero even if the model is confidently wrong, leaving no curriculum signal. Our fix: Solver Token Entropy (STE), a continuous token-level difficulty signal that stays informative after agreement saturates.
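The saturation described above can be illustrated with a minimal sketch that computes the entropy of the answer distribution across prompt framings. The function name `answer_entropy` and the example answers are hypothetical and chosen only for illustration; the sketch shows the failure mode, not the STE fix.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Seven framings of one question (N=7 mirrors the Solver setup on this page).
unanimous = ["cat"] * 7              # all framings agree, right or wrong
split = ["cat"] * 4 + ["dog"] * 3    # framings disagree

print(answer_entropy(unanimous))     # 0.0: no signal, even if "cat" is wrong
print(answer_entropy(split))         # roughly 0.985 bits
```

Because the entropy depends only on agreement, a unanimous but incorrect answer is indistinguishable from a unanimous correct one.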

02

Weak cross-task coupling

Understanding and generation are usually optimized with separate objectives, so better comprehension does little for image synthesis. Our fix: the Solver becomes the internal evaluator for generation, so understanding gains directly sharpen generation-side rewards.

Framework

Three internal roles on one frozen backbone.

We add no architecture-specific modules. The base model stays frozen, and three lightweight LoRA adapters take on the Proposer, Solver, and Generator roles under a shared reward design.
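As a rough illustration of the frozen-backbone setup, the sketch below applies the standard LoRA update (frozen weight plus a low-rank product) with one adapter per role. The dimensions, rank, initialization scale, and all names (`LoRARole`, `forward`, `matvec`) are assumptions made for this toy example and do not describe the actual training code.

```python
import random

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRARole:
    """Low-rank update B @ A for one role; rank and init scale are illustrative."""
    def __init__(self, d_in, d_out, rank=2, seed=0):
        rng = random.Random(seed)
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op

    def delta(self, x):
        return matvec(self.B, matvec(self.A, x))

def forward(W_frozen, x, adapter):
    """Frozen output plus the selected role's low-rank correction."""
    base = matvec(W_frozen, x)
    return [b + d for b, d in zip(base, adapter.delta(x))]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight, never updated
roles = {name: LoRARole(3, 2, seed=i)
         for i, name in enumerate(["proposer", "solver", "generator"])}
x = [0.5, -1.0, 2.0]

# Before any training, every role reproduces the frozen output exactly.
assert all(forward(W, x, r) == matvec(W, x) for r in roles.values())

# Updating one adapter changes only that role's behavior.
roles["solver"].B[0][0] = 1.0
```

Switching roles amounts to choosing which adapter is passed to `forward`; the shared weight `W` is read but never written.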

Detailed Proposer, Solver, and Generator framework with understanding and generation loops.
01

Ask: Proposer

Samples visual questions from unlabeled images, rewarded for questions near the Solver's competence frontier — neither trivial nor unsolvable.
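One simple way to express a "neither trivial nor unsolvable" preference is a reward that peaks at intermediate difficulty. The sketch below uses a tent-shaped function; the shape, the `target=0.5` peak, and the name `frontier_reward` are illustrative assumptions, since this page does not specify the exact Proposer reward.

```python
def frontier_reward(difficulty, target=0.5):
    """Tent-shaped reward in [0, 1], highest when difficulty equals target.

    The tent shape and target=0.5 are illustrative assumptions.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    width = max(target, 1.0 - target)
    return 1.0 - abs(difficulty - target) / width

for d in (0.0, 0.25, 0.5, 1.0):
    print(d, frontier_reward(d))
```

Here `difficulty` stands for any normalized difficulty estimate in [0, 1], such as a percentile-normalized uncertainty score.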

02

Solve: Solver

Answers under N=7 prompt framings, supplies Solver Token Entropy, and doubles as the internal evaluator that scores generated images.

03

Generate: Generator

Synthesizes candidate images and is optimized from Solver-derived QA fidelity, cycle-consistency, diversity, and contradiction rewards.

BLIP3o-8B: diffusion generation
BAGEL: rectified-flow, 7B active
VARGPT-v1.1: autoregressive, 7B+2B

Key ideas

What makes label-free self-evolution work.

Three components turn raw images into a stable training signal for both tasks.

1

Solver Token Entropy (STE)

A token-level uncertainty score from the Solver's next-token distributions, percentile-normalized over a rolling window. It recovers a continuous difficulty measure precisely when sample-level self-consistency goes degenerate.
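A minimal sketch of this kind of signal is given below: per-token Shannon entropy from next-token distributions, pooled over the answer, then mapped to a percentile rank within a rolling window. Mean pooling, the window size, the rank definition, and all names are assumptions for illustration; the page does not state these details.

```python
import math
from collections import deque

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(step_distributions):
    """Average entropy over decoding steps (mean pooling is an assumption)."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

class RollingPercentile:
    """Maps a raw score to its percentile rank among the last `window` scores."""
    def __init__(self, window=256):
        self.history = deque(maxlen=window)

    def normalize(self, score):
        self.history.append(score)
        at_or_below = sum(1 for s in self.history if s <= score)
        return at_or_below / len(self.history)  # first score in an empty window maps to 1.0

# Two answers that could both look "unanimous" at the sample level.
confident = [[0.98, 0.01, 0.01]] * 3
hesitant = [[0.40, 0.35, 0.25]] * 3

normalizer = RollingPercentile(window=4)
for name, dists in (("confident", confident), ("hesitant", hesitant)):
    raw = mean_token_entropy(dists)
    print(name, round(raw, 3), normalizer.normalize(raw))
```

The raw entropies differ even when the sampled answers agree, which is the property that keeps the difficulty signal continuous.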

2

Solver-as-evaluator generation

Generated images are scored by QA fidelity (do Solver answers match the source image?) and cycle-consistent captioning (does the caption round-trip?), giving a multi-scale, label-free generation reward.
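The two reward terms can be sketched as simple comparison functions over Solver outputs. In the sketch below, exact string matching for QA fidelity and Jaccard word overlap for caption round-tripping are stand-in similarity measures chosen for illustration; the actual matching rules are not given on this page, and the function names are hypothetical.

```python
def normalize(text):
    """Lowercase and trim so trivially different strings compare equal."""
    return text.strip().lower()

def qa_fidelity(source_answers, generated_answers):
    """Fraction of questions answered identically on source and generated images."""
    if len(source_answers) != len(generated_answers) or not source_answers:
        raise ValueError("need two equal-length, non-empty answer lists")
    matches = sum(normalize(a) == normalize(b)
                  for a, b in zip(source_answers, generated_answers))
    return matches / len(source_answers)

def cycle_consistency(source_caption, roundtrip_caption):
    """Jaccard word overlap between two captions (a stand-in similarity)."""
    a = set(normalize(source_caption).split())
    b = set(normalize(roundtrip_caption).split())
    return len(a & b) / len(a | b) if a | b else 1.0

# Solver answers on the source image vs. on the generated image.
print(qa_fidelity(["red", "two", "left"], ["Red", "three", "left"]))
# Caption of the source image vs. caption of the image generated from it.
print(cycle_consistency("a red cube on a table", "a red cube on a chair"))
```

Both functions consume only Solver outputs, so no external labels or judge models enter the reward.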

3

Solver-mediated coupling

The loops are coupled only through the Solver: understanding updates improve the evaluator, and a stronger evaluator supplies sharper generation rewards — an asymmetric link, not a symmetric gradient exchange.

Results

Consistent base-versus-ours gains under matched conditions.

The strongest evidence is the within-backbone delta: every base checkpoint and its self-evolved counterpart are evaluated under the same inference setting.

Visual understanding

Eight benchmarks · Δ vs. base in parentheses on our rows
Method  Params  MMMU  MMBench  TextVQA  SEED  RWQA  MM-Vet  MME-P  MME-C
Unified understanding and generation models
Janus-Pro-7B  7B  41.0  79.2  72.1  50.0  1567.1
Emu3  31.6  58.5  64.7  68.2  57.4  37.2  1243.8  266.1
TokenFlow  43.2  76.8  62.3  72.6  56.6  48.2  1551.1  371.1
MetaMorph  41.8  75.2  60.5  71.8  58.3
Self-evolving methods
UniGame  7B  43.8  83.2
SUDER  7B  80.1  71.9
UniCorn  7B active  53.8  84.1  1660.0  677.0
ILLUME  7B  38.2  75.1  72.1  72.9  37.0  1445.3
Our backbones (base → ours)
BLIP3o-8B  8B  50.6  83.5  83.1  77.5  69.0  66.6  1682.6  647.1
BLIP3o-8B (Ours)  8B  52.8 (+2.2)  86.1 (+2.6)  85.2 (+2.1)  79.4 (+1.9)  70.9 (+1.9)  68.7 (+2.1)  1698.4 (+15.8)  660.3 (+13.2)
BAGEL  7B active  55.3  85.0  86.0  79.3  71.2  67.2  1687.0  701.0
BAGEL (Ours)  7B active  58.8 (+3.5)  87.1 (+2.1)  88.5 (+2.5)  81.8 (+2.5)  73.9 (+2.7)  69.5 (+2.3)  1701.7 (+14.7)  715.9 (+14.9)
VARGPT-v1.1  7B+2B  48.6  81.0  82.0  76.1  67.5  51.9  1678.3  592.9
VARGPT-v1.1 (Ours)  7B+2B  51.6 (+3.0)  83.7 (+2.7)  84.8 (+2.8)  79.2 (+3.1)  71.1 (+3.6)  54.0 (+2.1)  1695.7 (+17.4)  606.4 (+13.5)
Same recipe, data, roles, and schedule across backbones.

Image generation · GenEval

Six subcategories + overall · Δ vs. base in parentheses on our rows
Method              Params     Single    Two obj.  Counting  Colors    Position  Color attr.  Overall
Unified understanding and generation models
Show-o              -          98        85        67        81        28        55           69
Janus-Pro-7B        7B         99        89        59        90        79        66           80
Emu3                -          98        71        34        81        17        21           54
TokenFlow           -          97        66        40        84        17        26           55
Self-evolving methods
UniGame             7B         99        91        62        93        80        68           82
SUDER               7B         99        89        70        92        82        71           84
UniRL (SFT)         -          99        93        62        89        55        68           77
UniCorn             7B active  99        94        80        88        61        73           82
ILLUME              7B         99        86        45        71        39        28           61
Our backbones (base → ours)
BLIP3o-8B           8B         100       85        63        92        90        74           84
BLIP3o-8B (Ours)    8B         99 (-1)   93 (+8)   71 (+8)   94 (+2)   90 (+0)   75 (+1)      87 (+3)
BAGEL               7B active  99        94        81        88        64        63           82
BAGEL (Ours)        7B active  99 (+0)   95 (+1)   87 (+6)   90 (+2)   67 (+3)   72 (+9)      85 (+3)
VARGPT-v1.1         7B+2B      96        53        48        83        13        21           53
VARGPT-v1.1 (Ours)  7B+2B      97 (+1)   59 (+6)   56 (+8)   85 (+2)   15 (+2)   24 (+3)      56 (+3)
A uniform +3-point overall gain on all three backbones, concentrated on composition-heavy subcategories.
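The claim can be checked directly from the GenEval table. The short script below recomputes per-subcategory gains from the base and self-evolved rows and averages them across the three backbones; the scores are copied from the table, while the variable names are only illustrative.

```python
SUBCATEGORIES = ["Single", "Two obj.", "Counting", "Colors",
                 "Position", "Color attr.", "Overall"]

# (base, ours) GenEval scores copied from the table above.
SCORES = {
    "BLIP3o-8B":   ([100, 85, 63, 92, 90, 74, 84], [99, 93, 71, 94, 90, 75, 87]),
    "BAGEL":       ([99, 94, 81, 88, 64, 63, 82],  [99, 95, 87, 90, 67, 72, 85]),
    "VARGPT-v1.1": ([96, 53, 48, 83, 13, 21, 53],  [97, 59, 56, 85, 15, 24, 56]),
}

# Per-backbone gain in each column.
deltas = {name: [o - b for b, o in zip(base, ours)]
          for name, (base, ours) in SCORES.items()}

# Mean gain per subcategory across the three backbones.
mean_gain = {
    cat: sum(d[i] for d in deltas.values()) / len(deltas)
    for i, cat in enumerate(SUBCATEGORIES)
}

for cat, gain in sorted(mean_gain.items(), key=lambda kv: -kv[1]):
    print(f"{cat:<12} {gain:+.2f}")
```

Averaged over backbones, the largest gains fall on Counting (+7.33), Two obj. (+5.00), and Color attr. (+4.33), while Single stays flat at +0.00.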

Mechanism evidence

Diagnostics explain where the gains come from.

The diagnostic figures show complementary understanding signals, asymmetric loop coupling, and stable reward trajectories across model families.

Signal analysis showing STE difficulty, self-consistency entropy, and generation reward trends.
Signal analysis. STE and prompt-perturbed self-consistency occupy complementary difficulty regions, while QA fidelity and cycle consistency rise during generation-side training.
Loop coupling analysis comparing joint training, understanding-only, and generation-only variants.
Loop coupling. Joint alternating training outperforms isolated and staged controls. Generation-only training improves GenEval but leaves Solver-side understanding at the base level.
Training dynamics over 10k steps across BLIP3o, BAGEL, and VARGPT-v1.1.
Training dynamics. Across diffusion, rectified-flow, and autoregressive backbones, generation rewards rise without collapse and understanding signals stabilize during the 10k-step run.

In context

Matched gains without judges or curated data.

The full result tables above already place our runs alongside prior unified and self-evolving methods. The distinction is not only the scores, but what we deliberately avoid using to reach them.

No curated supervision

Only raw images — no VQA labels, captions, or preference data at any stage of post-training.

No task-trained judge

No GPT-as-judge, reward model, or trained verifier; the Solver evaluates its own generations.

One portable recipe

Identical roles, rewards, and schedule across diffusion, rectified-flow, and autoregressive backbones.

Qualitative results

Self-evolving training changes both answers and generated images.

Representative examples show corrections in object recognition, action understanding, and spatial reasoning, along with better color, count, and compositional fidelity in generation.

Overview. Paired before/after results for understanding and generation across all three backbones.
Understanding corrections across BLIP3o-8B, BAGEL, and VARGPT-v1.1 on grounding, counting, and spatial reasoning questions.
Understanding. Before/after answers on fine-grained recognition, counting, text grounding, and spatial reasoning. Red marks errors made by the base models before self-evolving training; green marks self-evolved corrections shared across all three backbones.
Natural compositional generation prompts showing improved attribute binding, color grounding, and positioning.
Generation — natural prompts. Compositional prompts (attribute–object binding, local color grounding, multi-person interaction, relative positioning) where self-evolving training tightens prompt fidelity.
Controlled GenEval-style generation prompts covering counting, attribute binding, multi-object composition, and spatial arrangement.
Generation — GenEval-style prompts. Controlled prompts for counting, cross-object attribute binding, multi-object composition, and spatial arrangement, isolating the skills GenEval measures.

Citation

Cite this work.

@article{thawkar2026asksolvegenerate,
  title={Ask, Solve, Generate: Self-Evolving Unified Multimodal Understanding and Generation via Self-Consistency Rewards},
  author={Thawkar, Ritesh and Venkatraman, Shravan and Thawakar, Omkar and Shaker, Abdelrahman and Khan, Fahad and Cholakkal, Hisham and Khan, Salman and Anwer, Rao Muhammad},
  journal={arXiv preprint arXiv:2606.27376},
  year={2026}
}