
Paying More Attention to Visual Tokens
in Self-Evolving Large Multimodal Models

ECCV 2026

¹Mohamed bin Zayed University of Artificial Intelligence  •  ²Aalto University
³Australian National University  •  ⁴Linköping University
[Figure: VISE overview teaser]

Prior self-evolving methods use specialist roles optimized for answer consistency. On chart queries (minimal visual dependence) both prior work and VISE answer correctly, but on real-scene understanding prior methods fall back on statistically plausible guesses (a “ramp surface”), while VISE reads the actual evidence (a “metal ledge”).

Abstract

Recently, self-evolving large multimodal models (LMMs) have received attention for improving visual reasoning in a purely unsupervised setting. However, multi-role self-play and self-consistency reward schemes in existing self-evolving LMMs optimize answer agreement without ensuring the decoder attends to visual content, relying instead on statistical language priors to produce self-consistent outputs. This leads to a persistent failure mode we term visual under-conditioning, where the decoder relies on language priors rather than the image during generation, manifesting as insufficient attention to visual tokens. As a result, current self-evolving LMMs struggle on vision–language understanding tasks such as image captioning and visual question answering. To address this, we propose VISE (Visual Invariance Self-Evolution), a purely unsupervised self-evolving framework that directly regularizes the model's visual conditioning policy through two complementary invariance-based rewards: a geometric invariance reward that enforces spatial consistency under known transformations, and a semantic invariance reward that penalizes evidence-agnostic generation by requiring the model to recognize the absence of evidence when predicted regions are perturbed. VISE operates within a single model without specialist roles, external reward models, or annotations, and is trained on raw unlabeled images. Experiments on 18 benchmarks demonstrate the efficacy of our approach. Using Qwen3-VL-2B as the base model, VISE achieves gains of +16.85 CIDEr on COCO and +19.66 CIDEr on TextCaps, reduces object hallucination by 5.0 CHAIR-I points, and generalizes across multiple model families and scales.

Method

VISE runs entirely within a single model. Given a raw unlabeled image, the model generates a localization query and predicts a bounding box. Two complementary invariance branches then produce a reward computed entirely from the model's own predictions, which is optimized with KL-regularized REINFORCE against a frozen reference policy.
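To make the optimization step concrete, here is a minimal sketch of one KL-regularized REINFORCE update for a toy softmax policy over a handful of discrete actions. It illustrates the general technique only and is not the authors' training code: the function names, the `baseline`, the KL weight `beta`, and the learning rate `lr` are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reinforce_kl_step(logits, ref_logits, action, reward,
                      baseline=0.0, beta=0.1, lr=0.5):
    """One gradient-ascent step on
        (reward - baseline) * log pi(action) - beta * KL(pi || pi_ref).
    ref_logits belong to the frozen reference policy and are never updated.
    baseline, beta and lr are illustrative values (assumptions).
    """
    p = softmax(logits)
    q = softmax(ref_logits)
    kl = kl_divergence(p, q)
    advantage = reward - baseline
    new_logits = []
    for j, (pj, qj) in enumerate(zip(p, q)):
        # d log pi(action) / d z_j for a softmax policy
        grad_logp = (1.0 if j == action else 0.0) - pj
        # d KL(pi || pi_ref) / d z_j for a softmax policy
        grad_kl = pj * (math.log(pj / qj) - kl)
        new_logits.append(logits[j] + lr * (advantage * grad_logp - beta * grad_kl))
    return new_logits

# Toy usage: the sampled action 1 earned reward 1.0, so its probability rises.
policy = [0.0, 0.0, 0.0]
reference = [0.0, 0.0, 0.0]
updated = reinforce_kl_step(policy, reference, action=1, reward=1.0)
print(softmax(updated))
```

The KL term pulls the policy back toward the frozen reference whenever the two diverge, which is what keeps a self-generated reward from driving the model arbitrarily far from its starting point.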

[Figure: VISE architecture]

The Geometric branch transforms the view, re-predicts a box, and rewards agreement with the analytically projected original box. The Semantic branch ghosts the predicted region and rewards the model only if it detects the object before perturbation and not after.

Geometric Invariance, $R_{\text{geo}}$

A correctly-conditioned model should localize the same object consistently under a known spatial transform (affine, crop, or flip). We project the original box through the transform's matrix and reward overlap with the box predicted on the transformed view.

$$R_{\text{geo}} = \frac{\mathrm{GIoU}(B_{\text{proj}}, B_{\text{new}}) + 1}{2}$$
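The geometric reward can be sketched in a few lines. Since GIoU lies in $[-1, 1]$, the formula above maps it to a reward in $[0, 1]$. The code below is an illustration under stated assumptions, not the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples, the transform is a 2×3 affine matrix, the projected box is taken as the axis-aligned bounding box of the four transformed corners, and degenerate inputs receive the worst GIoU value.

```python
def project_box(box, matrix):
    """Project an (x1, y1, x2, y2) box through a 2x3 affine matrix.
    Assumption: the result is the axis-aligned box around the 4 mapped corners.
    """
    x1, y1, x2, y2 = box
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
    (a, b, tx), (c, d, ty) = matrix
    xs = [a * x + b * y + tx for x, y in corners]
    ys = [c * x + d * y + ty for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))

def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes, in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0 or hull <= 0:
        return -1.0  # assumption: degenerate boxes get the worst score
    return inter / union - (hull - union) / hull

def r_geo(b_proj, b_new):
    """Geometric invariance reward: GIoU rescaled from [-1, 1] to [0, 1]."""
    return (giou(b_proj, b_new) + 1.0) / 2.0

# Horizontal flip of an image of width 100 (hypothetical example values).
flip = [(-1, 0, 100), (0, 1, 0)]
original_box = (10, 20, 40, 60)
projected = project_box(original_box, flip)   # (60, 20, 90, 60)
print(r_geo(projected, (60, 20, 90, 60)))     # perfect agreement gives 1.0
```

Because GIoU stays informative for non-overlapping boxes, the reward still distinguishes a near miss from a far miss, which plain IoU would not.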

Semantic Invariance, $R_{\text{sem}}$

If the predicted region is blurred away (“ghosting”), the evidence for the object disappears, so a well-conditioned model should notice. We reward the model only when the object is judged visible before ghosting and not visible after.

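$$R_{\text{sem}} = \begin{cases} 1 & \text{if } v = 1 \text{ and } \tilde{v} = 0 \\ 0 & \text{otherwise} \end{cases}$$

Here $v$ and $\tilde{v}$ are the model's visibility judgments before and after ghosting, respectively.

Composite reward:

$$R = 0.5\,R_{\text{geo}} + 0.5\,R_{\text{sem}}$$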
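The semantic reward and the composite can be written directly from the two formulas. This is a small sketch rather than the authors' code: the function names are hypothetical, and the visibility judgments are passed in as booleans on the assumption that they have already been parsed from the model's output.

```python
def r_sem(visible_before, visible_after):
    """Semantic invariance reward: 1 only if the object is judged visible
    before ghosting (v = 1) and not visible after it (v~ = 0)."""
    return 1.0 if (visible_before and not visible_after) else 0.0

def composite_reward(r_geo_value, r_sem_value, w_geo=0.5, w_sem=0.5):
    """Equal-weight combination of the two rewards, R = 0.5*Rgeo + 0.5*Rsem."""
    return w_geo * r_geo_value + w_sem * r_sem_value

# Enumerate the four possible (v, v~) outcomes.
for before in (True, False):
    for after in (True, False):
        print(before, after, r_sem(before, after))
```

Only one of the four outcomes is rewarded. A model that answers "visible" regardless of the image, the evidence-agnostic behavior this reward is meant to penalize, earns nothing from this term.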

Results

VISE improves all 18 benchmarks (captioning, VQA, reasoning, and hallucination) with no task tradeoffs, across four scales (2B/4B/8B/32B) and four backbone families (Qwen3-VL, InternVL3, Gemma-3, Llama-3.2-Vision). Prior consistency-based methods trained on math/science images regress on captioning, while VISE achieves the largest captioning gains and remains competitive on reasoning.

Performance deltas vs base (Qwen3-VL-2B)

COCO CIDEr (2B): +16.85
TextCaps CIDEr (2B): +19.66
CHAIR-I hallucination (2B): −5.00
Visual-token attention (2B): +2.84%

Results Across Model Scales

Base vs. VISE on Qwen3-VL at three scales. VISE improves every benchmark with no task tradeoffs. For CHAIR (↓), lower is better. All numbers are from the paper.

Benchmark                      Base     VISE

Image Captioning (CIDEr)
COCO                           21.54    38.39
NoCaps                         19.52    34.25
Flickr30k                      26.09    42.64
TextCaps                       22.20    41.86
Reasoning & VQA (Accuracy)
ScienceQA                      79.42    83.61
MMMU                           38.92    40.67
InfoVQA                        69.02    71.43
AI2D                           73.67    76.42
ChartQA                        79.16    80.08
GQA                            58.25    59.41
Hallucination
CHAIR-I ↓                      13.21     8.21
CHAIR-S ↓                      45.96    40.51
POPE                           89.01    90.03

Benchmark                      Base     VISE

Image Captioning (CIDEr)
COCO                           27.35    39.65
NoCaps                         22.36    34.97
Flickr30k                      31.10    37.17
TextCaps                       34.54    38.59
Reasoning & VQA (Accuracy)
ScienceQA                      87.51    90.04
MMMU                           45.17    48.89
InfoVQA                        77.73    81.45
AI2D                           80.10    82.16
ChartQA                        83.18    84.96
GQA                            60.32    61.82
Hallucination
CHAIR-I ↓                      12.91    11.90
CHAIR-S ↓                      44.58    43.02
POPE                           89.73    89.86

Benchmark                      Base     VISE

Image Captioning (CIDEr)
COCO                           29.01    38.49
NoCaps                         24.46    34.98
Flickr30k                      34.02    38.62
TextCaps                       36.21    38.42
Reasoning & VQA (Accuracy)
ScienceQA                      90.88    92.81
MMMU                           50.12    52.69
InfoVQA                        81.23    82.83
AI2D                           83.31    84.10
ChartQA                        84.87    85.41
GQA                            61.54    62.43
Hallucination
CHAIR-I ↓                      11.20    10.84
CHAIR-S ↓                      43.42    41.53
POPE                           89.91    90.32
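As a quick arithmetic check, the headline deltas quoted for Qwen3-VL-2B can be re-derived from the first table's rows, which suggests that the first table is the 2B one. The dictionary name and layout below are just for this example; the numbers are copied from the page.

```python
# (base, VISE) pairs copied from the first results table.
first_table = {
    "COCO CIDEr": (21.54, 38.39),
    "TextCaps CIDEr": (22.20, 41.86),
    "CHAIR-I": (13.21, 8.21),
}

def delta(base, vise):
    """VISE minus base, rounded to the two decimals used on the page."""
    return round(vise - base, 2)

for name, (base, vise) in first_table.items():
    print(f"{name}: {delta(base, vise):+.2f}")
```

For CHAIR-I the delta is negative, which is an improvement because lower is better on that metric.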

Qualitative Comparisons

Why it works: more attention to visual tokens


Generation-time visual attention per decoder layer. VISE-trained models (orange) assign more attention to image tokens across mid-to-late decoder layers (mean +2.84% on 2B, +2.56% on 4B), reflecting the shift from language-prior-driven to image-conditioned decoding.
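One plausible way to compute such a per-layer statistic is to take, for each generated token, the share of its attention mass that lands on image-token positions, then average over heads. The page does not spell out the exact measurement protocol, so the sketch below is an assumption throughout: the function names, the nested-list layout `layers[layer][head]`, and the head-averaging are all illustrative choices.

```python
def visual_attention_share(attn_row, image_positions):
    """Fraction of one attention row's mass that falls on image tokens.
    attn_row holds the attention weights of a generated token over all
    context positions; image_positions lists the image-token indices.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in image_positions) / total

def per_layer_share(layers, image_positions):
    """Average the image-token share over heads, separately for each layer.
    layers[l][h] is the attention row of head h in layer l (assumed layout).
    """
    shares = []
    for heads in layers:
        head_shares = [visual_attention_share(row, image_positions) for row in heads]
        shares.append(sum(head_shares) / len(head_shares))
    return shares

# Toy example: 2 layers, 2 heads, 4 context positions; positions 1-2 are image tokens.
toy_layers = [
    [[0.4, 0.1, 0.1, 0.4], [0.5, 0.2, 0.2, 0.1]],
    [[0.1, 0.4, 0.4, 0.1], [0.2, 0.3, 0.3, 0.2]],
]
print(per_layer_share(toy_layers, image_positions=[1, 2]))
```

Comparing this curve between a base model and its trained counterpart, layer by layer, gives the kind of difference that the figure reports.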


Per-layer CKA similarity between original and geometrically augmented views. VISE gains are confined to final layers on 2B (peak Δ=+0.069 at layer 27) and span layers 19–33 on 4B (peak Δ=+0.253), localizing where invariance correction takes effect.
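For readers unfamiliar with CKA, the following is a minimal pure-Python sketch of linear CKA between two feature matrices, for example one layer's activations on the original view and on the augmented view. Whether the paper uses the linear or a kernel variant is not stated on this page, so the linear form is an assumption, as are the row-per-sample layout and the zero-variance fallback.

```python
import math

def center_columns(m):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_sq_norm(a, b):
    """Squared Frobenius norm of A^T B, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.
    x and y must have the same number of rows; feature widths may differ.
    """
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    if denom == 0:
        return 0.0  # assumption: constant features count as zero similarity
    return cross_sq_norm(xc, yc) / denom

# Toy features: 4 samples, 2 dimensions (hypothetical values).
feats_original = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
feats_augmented = [[2.0 * v for v in row] for row in feats_original]
print(linear_cka(feats_original, feats_augmented))
```

Linear CKA is invariant to isotropic rescaling of either representation, so the toy example, where one matrix is a scaled copy of the other, scores as maximally similar up to floating-point error.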

BibTeX

@inproceedings{venkatraman2026vise,
  title     = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
  author    = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
               Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}