CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

How effectively can we combine two complementary vision encoders for vision-language modeling? We show that fusing SigLIP2 (strong at understanding) with DINOv3 (strong at grounding) through entropy-guided layer selection and orthogonality-regularized mixing yields consistent gains on both understanding and grounding tasks.

Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

TL;DR

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion through (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method also achieves state-of-the-art detection performance on RefCOCO, improving over the baseline by a large margin.

SigLIP2 → Understanding

Contrastive encoder excels at semantic understanding — chart, diagram, table, and document comprehension tasks.

DINOv3 → Grounding

Self-supervised encoder captures fine-grained spatial cues crucial for pointing, counting, and object localization.

Entropy-Guided Selection

Layer-wise entropy analysis reveals which layers are informative — guiding optimal multi-scale feature selection from each encoder.

Orthogonal Fusion + RoPE

Orthogonality-regularized mixing removes redundancy; RoPE cross-attention spatially aligns heterogeneous token grids.
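
One common way to express an orthogonality regularizer is to penalize how far a projection's Gram matrix deviates from the identity. The sketch below illustrates that generic form in plain Python. The function names and the exact penalty are assumptions for illustration, not the loss used by CoME-VL.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def orthogonality_penalty(w):
    """Squared Frobenius norm of (W W^T - I).

    Hypothetical regularizer: it is zero when the rows of W are
    orthonormal and grows as rows become correlated (redundant).
    """
    gram = matmul(w, transpose(w))
    n = len(gram)
    return sum(
        (gram[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )


# Rows that are orthonormal incur no penalty.
orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Two identical rows project onto the same direction, so they are penalized.
redundant = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(orthogonality_penalty(orthonormal))  # 0.0
print(orthogonality_penalty(redundant))    # 2.0
```

In a training setup, a term of this kind would typically be added to the task loss with a small weight, so that projections of different features are nudged toward non-overlapping directions.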

Entropy Analysis & Performance Gains

SigLIP2 and DINOv3 exhibit distinct entropy profiles across depth — SigLIP2 maintains high entropy (rich semantic diversity) while DINOv3’s deeper layers concentrate on spatially discriminative regions. By leveraging all SigLIP2 layers for understanding and DINOv3 layers 10–23 for grounding, CoME-VL harnesses the best of both worlds.

Figure 1. CoME-VL overview — complementary encoder fusion guided by layer-wise entropy analysis yields consistent improvements across understanding and grounding benchmarks.
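
To make the idea of a layer-wise entropy profile concrete, the sketch below computes a normalized Shannon entropy per layer from token-level scores. The page does not specify how CoME-VL defines its entropy measure or its selection rule, so the softmax over token scores, the `layer_entropy_profile` helper, and the toy inputs are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(scores):
    """Shannon entropy of softmax(scores), divided by log(n).

    Dividing by log(n) maps the result to [0, 1], which keeps layers
    with different token counts comparable. Needs at least two scores.
    """
    probs = softmax(scores)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def layer_entropy_profile(per_layer_scores):
    """Return one normalized entropy value per layer (hypothetical helper)."""
    return [normalized_entropy(scores) for scores in per_layer_scores]


# Toy input: one list of token scores per layer.
toy_scores = [
    [0.0, 0.0, 0.0, 0.0],   # evenly spread scores: entropy near 1
    [2.0, 0.5, 0.1, 0.0],   # moderately concentrated
    [10.0, 0.0, 0.0, 0.0],  # concentrated on one token: entropy near 0
]

for layer, value in enumerate(layer_entropy_profile(toy_scores)):
    print(f"layer {layer}: normalized entropy = {value:.3f}")
```

A profile like this only describes how spread out each layer's response is. Which layers to keep, such as the DINOv3 range of 10–23 mentioned above, is a separate decision made on top of it.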

Why Two Encoders?

Contrastive (SigLIP2) and self-supervised (DINOv3) encoders learn fundamentally different visual representations. SigLIP2 excels at semantic alignment with language, while DINOv3 captures spatially coherent, fine-grained features ideal for grounding. Combining both unlocks complementary strengths.

Figure 2. Complementary feature analysis — SigLIP2 and DINOv3 encode qualitatively different information. Their fusion provides richer visual representations for downstream vision-language tasks.

CoME-VL Framework

CoME-VL integrates SigLIP2 and DINOv3 through a modular fusion pipeline: entropy-guided layer selection identifies the most informative features, orthogonality-regularized projections reduce redundancy, and RoPE-enhanced cross-attention aligns heterogeneous token grids into compact visual tokens.

Figure 3. CoME-VL architecture — a modular fusion framework that combines contrastive and self-supervised vision encoders through entropy-guided layer selection, orthogonality-constrained projections, and RoPE cross-attention.
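
The role of RoPE in aligning token grids of different resolutions can be illustrated with a small one-dimensional sketch. It assumes that token indices from each grid are first mapped to a shared coordinate frame (the hypothetical `shared_position` helper) and then encoded with standard rotary embeddings, so that attention logits depend only on relative offsets. This is a simplified illustration under those assumptions, not the CoME-VL implementation.

```python
import math


def shared_position(index, grid_size, extent=16.0):
    """Map a token index to the center of its cell in a common frame.

    Hypothetical helper: grids of any size are placed on [0, extent].
    """
    return (index + 0.5) / grid_size * extent


def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # frequency for this feature pair
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query, query_pos, keys, key_positions):
    """Softmax attention of one query over keys, with RoPE on both sides."""
    rotated_query = rope(query, query_pos)
    scale = math.sqrt(len(query))
    logits = [
        dot(rotated_query, rope(k, p)) / scale
        for k, p in zip(keys, key_positions)
    ]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One query from a coarse 2-token grid attends to keys from a 4-token grid.
query = [1.0, 2.0, 3.0, 4.0]
keys = [[0.5, -1.0, 2.0, 0.0], [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 1.0, 0.0], [2.0, 0.5, -0.5, 1.0]]
query_pos = shared_position(0, grid_size=2)
key_positions = [shared_position(i, grid_size=4) for i in range(4)]

print(attention_weights(query, query_pos, keys, key_positions))
```

Because rotary embeddings accept real-valued positions, tokens from grids with different resolutions can share one coordinate frame without resampling either feature map.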

Semantic Feature Analysis

DINOv3 maintains spatially coherent object-level attention across its layers, while SigLIP2 transitions from broad spatial coverage in early layers to focused semantic discrimination in deeper layers.