CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Abstract

TL;DR

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with a contrastive image-text objective, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and are more robust on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion through (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. Extensive experiments across diverse vision-language benchmarks show that CoME-VL consistently outperforms single-encoder baselines, with average improvements of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method also achieves state-of-the-art detection performance on RefCOCO, improving over the baseline by a large margin.

SigLIP2 → Understanding

Contrastive encoder excels at semantic understanding — chart, diagram, table, and document comprehension tasks.

DINOv3 → Grounding

Self-supervised encoder captures fine-grained spatial cues crucial for pointing, counting, and object localization.

Entropy-Guided Selection

Layer-wise entropy analysis reveals which layers are informative — guiding optimal multi-scale feature selection from each encoder.

Orthogonal Fusion + RoPE

Orthogonality-regularized mixing removes redundancy; RoPE cross-attention spatially aligns heterogeneous token grids.
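To make the redundancy argument concrete, here is a minimal sketch of one common orthogonality regularizer: the squared Frobenius norm of W Wᵀ minus the identity, where the rows of W are projection directions. This formulation is an assumption for illustration only; the page does not state the exact loss used in CoME-VL, and `orthogonality_penalty` is a hypothetical name.

```python
def gram(rows):
    """Gram matrix of a list of row vectors: G[i][j] = <rows[i], rows[j]>."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def orthogonality_penalty(weight_rows):
    """Squared Frobenius norm of (W W^T - I). Zero iff the rows are orthonormal.

    Assumed formulation, not necessarily the regularizer used by CoME-VL.
    """
    g = gram(weight_rows)
    n = len(g)
    return sum(
        (g[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )


# Two orthonormal projection directions: no redundancy, zero penalty.
orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Two identical directions: fully redundant, positive penalty.
redundant = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

penalty_orthonormal = orthogonality_penalty(orthonormal)
penalty_redundant = orthogonality_penalty(redundant)
```

Adding a term like this to the training loss pushes the projection directions apart, so each one carries information the others do not.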

Overview

Entropy Analysis & Performance Gains

SigLIP2 and DINOv3 exhibit distinct entropy profiles across depth: SigLIP2 maintains high entropy (rich semantic diversity), while DINOv3's deeper layers concentrate on spatially discriminative regions. CoME-VL therefore uses all SigLIP2 layers for understanding and DINOv3 layers 10–23 for grounding.

Figure 1. CoME-VL overview — complementary encoder fusion guided by layer-wise entropy analysis yields consistent improvements across understanding and grounding benchmarks.
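The layer-wise entropy profile described above can be illustrated with a small sketch. It assumes entropy is measured as the Shannon entropy of each attention row, averaged over queries; the page does not specify how CoME-VL computes entropy, so this definition, the toy attention maps, and the function names are all hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(attn_rows):
    """Average entropy over the query rows of one (head-averaged) attention map."""
    return sum(shannon_entropy(row) for row in attn_rows) / len(attn_rows)


# Hypothetical toy attention maps for three layers (each row sums to 1).
layers = {
    0: [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # diffuse
    1: [[0.70, 0.10, 0.10, 0.10], [0.10, 0.70, 0.10, 0.10]],  # more focused
    2: [[1.00, 0.00, 0.00, 0.00], [0.00, 1.00, 0.00, 0.00]],  # fully peaked
}

# Entropy profile across depth: high values mean diffuse attention,
# low values mean attention concentrated on a few tokens.
profile = {idx: mean_attention_entropy(rows) for idx, rows in layers.items()}
for idx in sorted(profile):
    print(f"layer {idx}: entropy = {profile[idx]:.3f} nats")
```

A profile like this, computed per encoder, is the kind of signal that could inform which layer ranges to aggregate.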

Motivation

Why Two Encoders?

Contrastive (SigLIP2) and self-supervised (DINOv3) encoders learn fundamentally different visual representations. SigLIP2 excels at semantic alignment with language, while DINOv3 captures spatially coherent, fine-grained features ideal for grounding. Combining both unlocks complementary strengths.

Figure 2. Complementary feature analysis — SigLIP2 and DINOv3 encode qualitatively different information. Their fusion provides richer visual representations for downstream vision-language tasks.

Architecture

CoME-VL Framework

CoME-VL integrates SigLIP2 and DINOv3 through a modular fusion pipeline: entropy-guided layer selection identifies the most informative features, orthogonality-regularized projections reduce redundancy, and RoPE-enhanced cross-attention aligns heterogeneous token grids into compact visual tokens.

Figure 3. CoME-VL architecture — a modular fusion framework that combines contrastive and self-supervised vision encoders through entropy-guided layer selection, orthogonality-constrained projections, and RoPE cross-attention.
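One way to see why rotary position embeddings (RoPE) help align token grids of different resolutions is the sketch below. It places every token at its normalized grid center and rotates channel pairs by angles proportional to those coordinates, so a query-key score depends only on the relative offset between the two tokens, regardless of which grid each came from. The normalized-coordinate scheme, the single frequency per pair, and all names are assumptions for illustration; the page does not describe how CoME-VL parameterizes RoPE.

```python
import math


def grid_centers(height, width):
    """Normalized (y, x) centers of an H x W token grid, row-major order."""
    return [((r + 0.5) / height, (c + 0.5) / width)
            for r in range(height) for c in range(width)]


def rotate_pairs(vec, position, base_freq):
    """Rotate consecutive channel pairs by angles that scale with position."""
    out = []
    for k in range(0, len(vec), 2):
        angle = position * base_freq / (2.0 ** (k // 2))
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[k] * c - vec[k + 1] * s, vec[k] * s + vec[k + 1] * c]
    return out


def rope_2d(vec, yx, base_freq=10.0):
    """First half of the channels encodes y, second half encodes x.

    len(vec) must be a multiple of 4.
    """
    half = len(vec) // 2
    y, x = yx
    return rotate_pairs(vec[:half], y, base_freq) + rotate_pairs(vec[half:], x, base_freq)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Hypothetical setup: a coarse 2x2 query grid attends over a finer 4x4 key grid.
query_pos = grid_centers(2, 2)
key_pos = grid_centers(4, 4)
q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 0.0, 1.0, 0.0]

# Both pairs share the same relative offset (0.125, 0.125) in normalized space.
score_a = dot(rope_2d(q, query_pos[0]), rope_2d(k, key_pos[0]))
score_b = dot(rope_2d(q, query_pos[3]), rope_2d(k, key_pos[10]))
# A pair with a different offset yields a different score.
score_c = dot(rope_2d(q, query_pos[0]), rope_2d(k, key_pos[3]))
```

Here `score_a` and `score_b` coincide because the rotation applied to the query and the key cancels down to their coordinate difference, which is what lets attention compare tokens across grids with different sizes.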

Analysis

Semantic Feature Analysis

DINOv3 maintains spatially coherent object-level attention throughout, while SigLIP2 transitions from broad spatial coverage in early layers to focused semantic discrimination in deeper layers.

DINOv3 — Layer-wise attention rollout showing spatially coherent object-level focus.

SigLIP2 — Layer-wise attention rollout showing transition from spatial to semantic focus.
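For readers unfamiliar with attention rollout, the following sketch shows the standard recipe: average each layer's attention with the identity to account for residual connections, renormalize the rows, and multiply the resulting matrices from the first layer upward. The toy attention maps are hypothetical, head-averaged attention is assumed, and this is not necessarily the exact procedure used to produce the visualizations on this page.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_residual(attn):
    """Mix attention with the identity (residual path) and renormalize rows."""
    n = len(attn)
    mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[v / sum(row) for v in row] for row in mixed]


def attention_rollout(layer_attns):
    """Accumulate attention across layers, first layer applied first."""
    n = len(layer_attns[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attns:
        rollout = matmul(add_residual(attn), rollout)
    return rollout


# Hypothetical head-averaged attention maps for two layers over three tokens.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
peaked = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
rollout = attention_rollout([uniform, peaked])

# Sanity check: layers that attend only to themselves leave the rollout as identity.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity_rollout = attention_rollout([identity, identity])
```

Each row of `rollout` stays a probability distribution over input tokens, which is what makes the per-layer maps comparable across depth.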

Experiments

Benchmark Results

CoME-VL achieves consistent improvements across diverse vision-language benchmarks, outperforming both single-encoder baselines and competitive multi-encoder alternatives.

+4.9% average improvement on understanding tasks

+5.4% average improvement on grounding tasks

92.57 top accuracy on the RefCOCO benchmark

Main Comparison

Model	Chart	Diagrams	Tables	Others	Counting	Pointing (@3/5px)
LLAVA1.5-7B	20.31	28.32	20.80	29.68	78.27	-
LLAVA1.5-13B	23.33	33.59	23.24	34.86	70.95	-
LLAVA-Mistral 7B	22.16	33.98	26.95	48.14	77.82	-
Intern-VL2 8B	57.71	72.85	73.82	90.62	74.05	-
QWEN2-VL 7B	45.21	64.25	61.13	86.91	57.42	-
Pixtral-12B	38.28	54.00	63.96	64.94	71.66	-
Paligemma-3B	16.50	26.26	20.80	20.11	8.57	-
Kosmos-2 8B	7.81	8.88	12.50	8.00	26.19	-
Instruct-BLIP 7B	13.28	17.08	16.60	10.54	36.19	-
Phi3 7B	10.54	9.37	9.57	7.22	12.61	-
GLM-4V 9B	40.23	58.65	54.12	84.37	84.76	-
Molmo2 7B	52.39	62.41	66.25	76.26	83.31	53.79 / 68.94
CoME-VL 7B (Ours)	57.24	66.94	70.75	81.84	87.83	58.56 / 75.94
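The headline averages can be checked against the table with a few lines of arithmetic. The sketch below assumes the gains are absolute point differences over the Molmo2 7B row, that Chart, Diagrams, Tables, and Others count as understanding tasks, and that Counting plus the two Pointing scores count as grounding tasks. That grouping is our reading of the table, not something the page states explicitly.

```python
# Scores copied from the Molmo2 7B and CoME-VL 7B rows of the table above.
molmo = {
    "Chart": 52.39, "Diagrams": 62.41, "Tables": 66.25, "Others": 76.26,
    "Counting": 83.31, "Pointing@3px": 53.79, "Pointing@5px": 68.94,
}
come_vl = {
    "Chart": 57.24, "Diagrams": 66.94, "Tables": 70.75, "Others": 81.84,
    "Counting": 87.83, "Pointing@3px": 58.56, "Pointing@5px": 75.94,
}

# Assumed task grouping (see lead-in).
UNDERSTANDING = ["Chart", "Diagrams", "Tables", "Others"]
GROUNDING = ["Counting", "Pointing@3px", "Pointing@5px"]


def mean_gain(columns):
    """Mean absolute point gain of CoME-VL over Molmo on the given columns."""
    gains = [come_vl[c] - molmo[c] for c in columns]
    return sum(gains) / len(gains)


understanding_gain = mean_gain(UNDERSTANDING)
grounding_gain = mean_gain(GROUNDING)
print(f"{understanding_gain:.1f} {grounding_gain:.1f}")  # prints "4.9 5.4"
```

Under these assumptions the computed gains match the reported +4.9 and +5.4, which suggests the percentages are absolute point differences rather than relative improvements.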

RefCOCO Benchmark

RefCOCO	val	testA	testB
Molmo	0.10	0.27	0.27
Clip-to-DINO	91.73	94.06	88.85
Qwen-VL	89.36	92.23	85.36
CoME-VL (Ours)	92.57	95.36	90.51

Component Contribution Analysis

Figure 5. Stacked performance breakdown showing the additive gains from the RoPE and orthogonality (OL) components.

Performance Analysis

Model	Chart	Diagrams	Tables	Others	Counting	Pointing (@3/5px)	Avg. Time
Molmo	52.39	62.41	66.25	76.26	83.31	53.79 / 68.94	1.26s
SigLIP2 (0-22) + DINOv3 (0-9)	56.17	66.12	69.86	81.12	86.97	56.68 / 74.59	1.37s
SigLIP2 (22-27) + DINOv3 (0-9)	54.96	63.72	68.28	77.16	84.23	52.41 / 67.65	1.33s
SigLIP2 (0-22) + DINOv3 (10-23)	56.91	66.46	70.02	81.46	87.67	57.22 / 75.13	1.40s
SigLIP2 (22-27) + DINOv3 (10-23)	56.06	65.98	69.72	80.88	87.21	56.95 / 74.87	1.34s
SigLIP2 (0-27) + DINOv3 (10-23)	57.24	66.94	70.75	81.84	87.83	58.56 / 75.94	1.52s
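To put the timing column in perspective, the sketch below computes each configuration's relative latency overhead against the Molmo baseline. The times are copied from the table; the dictionary keys are shorthand labels for the layer ranges, and expressing overhead as a ratio to the baseline is our choice rather than a metric reported on the page.

```python
# Average inference time per sample (seconds), copied from the table above.
baseline_time = 1.26  # Molmo
config_times = {
    "SigLIP2 0-22 + DINOv3 0-9": 1.37,
    "SigLIP2 22-27 + DINOv3 0-9": 1.33,
    "SigLIP2 0-22 + DINOv3 10-23": 1.40,
    "SigLIP2 22-27 + DINOv3 10-23": 1.34,
    "SigLIP2 0-27 + DINOv3 10-23": 1.52,
}

# Relative overhead: how much slower each configuration is than the baseline.
overhead = {name: t / baseline_time - 1.0 for name, t in config_times.items()}

for name, value in overhead.items():
    print(f"{name}: +{value * 100:.1f}% latency")

slowest = max(overhead, key=overhead.get)
```

The full configuration (all SigLIP2 layers plus DINOv3 layers 10-23) is both the most accurate and the slowest row, at roughly 20.6% more time per sample than the baseline.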

Applications

Supported Tasks

CoME-VL supports a wide range of vision-language tasks, leveraging the complementary strengths of both encoders for superior performance across understanding and grounding domains.

Figure 7. CoME-VL supports diverse downstream tasks including visual question answering, document understanding, chart comprehension, counting, pointing, and object detection.

Citation

BibTeX

If you find our work useful, please cite:

@article{deria2026come,
  title={CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning},
  author={Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint arXiv:2604.03231},
  year={2026}
}
