CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

How effectively can we combine two complementary vision encoders for vision-language modeling? We show that fusing SigLIP2 (strong at understanding) with DINOv3 (strong at grounding) through entropy-guided layer selection and orthogonality-regularized mixing yields consistent gains on both tasks.

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

TL;DR

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with a contrastive image-text objective, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder (SigLIP2) with a self-supervised encoder (DINOv3). Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines, with average improvements of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method also achieves state-of-the-art performance on RefCOCO while improving over the baseline by a large margin.

SigLIP2 → Understanding

Contrastive encoder excels at semantic understanding — chart, diagram, table, and document comprehension tasks.

DINOv3 → Grounding

Self-supervised encoder captures fine-grained spatial cues crucial for pointing, counting, and object localization.

Entropy-Guided Selection

Layer-wise entropy analysis reveals which layers are informative — guiding optimal multi-scale feature selection from each encoder.

Orthogonal Fusion + RoPE

Orthogonality-regularized mixing removes redundancy; RoPE cross-attention spatially aligns heterogeneous token grids.

Entropy Analysis & Performance Gains

SigLIP2 and DINOv3 exhibit distinct entropy profiles across depth — SigLIP2 maintains high entropy (rich semantic diversity) while DINOv3’s deeper layers concentrate on spatially discriminative regions. By leveraging all SigLIP2 layers for understanding and DINOv3 layers 10–23 for grounding, CoME-VL harnesses the best of both worlds.
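The layer-selection idea above can be made concrete with a small sketch. The snippet below is an illustrative stand-in, not the paper's implementation: it scores one encoder layer's token activations by Shannon entropy over a value histogram, so that diverse (high-entropy) layers can be ranked above degenerate (low-entropy) ones. The function name, bin count, and the histogram-based estimator are all assumptions for illustration.

```python
import numpy as np

def layer_entropy(features, n_bins=32):
    """Shannon entropy of one encoder layer's token activations.

    features: (num_tokens, dim) array of a single layer's outputs.
    A histogram over all activation values approximates their
    distribution; higher entropy suggests more diverse features.
    """
    hist, _ = np.histogram(features, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before log
    return float(-(p * np.log(p)).sum())

# Toy check: spread-out activations vs. a nearly constant layer.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(196, 64))          # 14x14 token grid
peaked = np.zeros((196, 64))
peaked[0, 0] = 1.0                            # almost all mass in one bin
```

In this toy setup, `layer_entropy(diverse)` is far larger than `layer_entropy(peaked)`; ranking layers by such a score is one plausible way to realize entropy-guided selection.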

CoME-VL Teaser: Entropy analysis and performance comparison
Figure 1. CoME-VL overview — complementary encoder fusion guided by layer-wise entropy analysis yields consistent improvements across understanding and grounding benchmarks.

Why Two Encoders?

Contrastive (SigLIP2) and self-supervised (DINOv3) encoders learn fundamentally different visual representations. SigLIP2 excels at semantic alignment with language, while DINOv3 captures spatially coherent, fine-grained features ideal for grounding. Combining both unlocks complementary strengths.

Complementary features analysis of SigLIP2 and DINOv3
Figure 2. Complementary feature analysis — SigLIP2 and DINOv3 encode qualitatively different information. Their fusion provides richer visual representations for downstream vision-language tasks.

CoME-VL Framework

CoME-VL integrates SigLIP2 and DINOv3 through a modular fusion pipeline: entropy-guided layer selection identifies the most informative features, orthogonality-regularized projections reduce redundancy, and RoPE-enhanced cross-attention aligns heterogeneous token grids into compact visual tokens.
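To illustrate the orthogonality-regularized projection step, a common formulation penalizes the deviation of a projection's Gram matrix from the identity, pushing its columns toward orthonormality and thus reducing redundancy between mixed features. This is a minimal sketch of that generic regularizer; the exact constraint used in CoME-VL may differ.

```python
import numpy as np

def orthogonality_penalty(W):
    """Frobenius-norm penalty ||W^T W - I||_F^2.

    W: (d_in, d_out) projection matrix. The penalty is zero iff the
    columns of W are orthonormal, so adding it to the training loss
    discourages redundant (correlated) projection directions.
    """
    d_out = W.shape[1]
    gram = W.T @ W
    return float(np.sum((gram - np.eye(d_out)) ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(128, 16))     # unconstrained projection
Q, _ = np.linalg.qr(A)             # orthonormal columns via QR
```

Here `orthogonality_penalty(Q)` is numerically zero while `orthogonality_penalty(A)` is large, which is the behavior a redundancy-reducing regularizer needs.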

CoME-VL architecture diagram
Figure 3. CoME-VL architecture — a modular fusion framework that combines contrastive and self-supervised vision encoders through entropy-guided layer selection, orthogonality-constrained projections, and RoPE cross-attention.
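The RoPE cross-attention component builds on the standard rotary position embedding, which rotates paired feature dimensions by position-dependent angles so that attention scores depend on relative token offsets. The sketch below shows that standard rotation in NumPy, not CoME-VL's exact variant; the split-half pairing and base frequency are conventional choices assumed here.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embeddings to token vectors.

    x: (num_tokens, dim) with even dim; positions: (num_tokens,) indices.
    Each (x1[i], x2[i]) pair of dimensions is rotated by an angle
    proportional to the token's position, so tokens from differently
    sized grids can be compared through their relative offsets.
    """
    num_tokens, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]   # (num_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)

x = np.ones((4, 8))
out = rope_rotate(x, np.arange(4, dtype=float))
```

Because each dimension pair undergoes a pure rotation, token norms are preserved: only the relative phase between query and key tokens changes, which is what makes RoPE suitable for aligning heterogeneous token grids before cross-attention.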

Semantic Feature Analysis

DINOv3 maintains spatially coherent object-level attention throughout, while SigLIP2 transitions from broad spatial coverage in early layers to focused semantic discrimination in deeper layers.

Benchmark Results

CoME-VL achieves consistent improvements across diverse vision-language benchmarks, outperforming both single-encoder baselines and competitive multi-encoder alternatives.

+4.9%
↑ avg. improvement
Understanding Tasks
+5.4%
↑ avg. improvement
Grounding Tasks
92.57
Top Accuracy
RefCOCO Benchmark

Main Comparison

Model              | Chart | Diagrams | Tables | Others | Counting | Pointing
LLaVA-1.5 7B       | 20.31 | 28.32    | 20.80  | 29.68  | 78.27    | -
LLaVA-1.5 13B      | 23.33 | 33.59    | 23.24  | 34.86  | 70.95    | -
LLaVA-Mistral 7B   | 22.16 | 33.98    | 26.95  | 48.14  | 77.82    | -
InternVL2 8B       | 57.71 | 72.85    | 73.82  | 90.62  | 74.05    | -
Qwen2-VL 7B        | 45.21 | 64.25    | 61.13  | 86.91  | 57.42    | -
Pixtral-12B        | 38.28 | 54.00    | 63.96  | 64.94  | 71.66    | -
PaliGemma-3B       | 16.50 | 26.26    | 20.80  | 20.11  | 8.57     | -
Kosmos-2 8B        | 7.81  | 8.88     | 12.50  | 8.00   | 26.19    | -
InstructBLIP 7B    | 13.28 | 17.08    | 16.60  | 10.54  | 36.19    | -
Phi-3 7B           | 10.54 | 9.37     | 9.57   | 7.22   | 12.61    | -
GLM-4V 9B          | 40.23 | 58.65    | 54.12  | 84.37  | 84.76    | -
Molmo2 7B          | 52.39 | 62.41    | 66.25  | 76.26  | 83.31    | 53.79 / 68.94
CoME-VL 7B (Ours)  | 57.24 | 66.94    | 70.75  | 81.84  | 87.83    | 58.56 / 75.94

RefCOCO Benchmark

Model           | val   | testA | testB
Molmo           | 0.10  | 0.27  | 0.27
CLIP-to-DINO    | 91.73 | 94.06 | 88.85
Qwen-VL         | 89.36 | 92.23 | 85.36
CoME-VL (Ours)  | 92.57 | 95.36 | 90.51

Component Contribution Analysis

Component Contribution Analysis: Stacked Performance Breakdown

Figure 5. Stacked performance breakdown showing additive gains from RoPE and OL components.

Performance Analysis

Model                        | Chart | Diagrams | Tables | Others | Counting | Pointing (@3/5px) | Avg. Time
Molmo                        | 52.39 | 62.41    | 66.25  | 76.26  | 83.31    | 53.79 / 68.94     | 1.26s
SigLIP(0-22) + DINO(0-9)     | 56.17 | 66.12    | 69.86  | 81.12  | 86.97    | 56.68 / 74.59     | 1.37s
SigLIP(22-27) + DINO(0-9)    | 54.96 | 63.72    | 68.28  | 77.16  | 84.23    | 52.41 / 67.65     | 1.33s
SigLIP(0-22) + DINO(10-23)   | 56.91 | 66.46    | 70.02  | 81.46  | 87.67    | 57.22 / 75.13     | 1.40s
SigLIP(22-27) + DINO(10-23)  | 56.06 | 65.98    | 69.72  | 80.88  | 87.21    | 56.95 / 74.87     | 1.34s
SigLIP(0-27) + DINO(10-23)   | 57.24 | 66.94    | 70.75  | 81.84  | 87.83    | 58.56 / 75.94     | 1.52s

Qualitative Results

CoME-VL produces more accurate and spatially precise predictions across diverse vision-language tasks, including visual question answering, object grounding, and document understanding.

Qualitative comparison of CoME-VL results
Figure 6. Qualitative comparison — CoME-VL generates more contextually accurate and spatially grounded responses compared to single-encoder baselines.

Supported Tasks

CoME-VL supports a wide range of vision-language tasks, leveraging the complementary strengths of both encoders for superior performance across understanding and grounding domains.

Downstream tasks supported by CoME-VL
Figure 7. CoME-VL supports diverse downstream tasks including visual question answering, document understanding, chart comprehension, counting, pointing, and object detection.

BibTeX

If you find our work useful, please cite:

@article{comevl2026,
  title={CoME-VL: Scaling Complementary Multi-Encoder Vision-Language},
  author={Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint},
  year={2026}
}