How effectively can we combine two complementary vision encoders for vision-language modeling? We show that fusing SigLIP2 (strong at understanding) with DINOv3 (strong at grounding) through entropy-guided layer selection and orthogonality-regularized mixing yields consistent gains on both understanding and grounding tasks.
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
The contrastive encoder (SigLIP2) excels at semantic understanding: chart, diagram, table, and document comprehension tasks.
The self-supervised encoder (DINOv3) captures fine-grained spatial cues crucial for pointing, counting, and object localization.
Layer-wise entropy analysis reveals which layers are informative, guiding multi-scale feature selection from each encoder.
Orthogonality-regularized mixing removes redundancy; RoPE cross-attention spatially aligns heterogeneous token grids.
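To make the idea of layer-wise entropy analysis concrete, here is a minimal sketch that scores each layer by the mean Shannon entropy of its attention rows and then keeps k layers. It is an illustration under stated assumptions, not the procedure used in CoME-VL: the choice of attention maps as the entropy source, the function names `layer_entropy` and `select_layers`, and the `prefer` switch are all hypothetical, and this page does not state whether low-entropy or high-entropy layers are the ones selected.

```python
import numpy as np

def layer_entropy(attn):
    """Mean Shannon entropy (in nats) of the attention rows of one layer.

    attn: array of shape (heads, queries, keys); each row sums to 1.
    """
    eps = 1e-12  # avoids log(0) for exactly-zero attention weights
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return float(row_entropy.mean())

def select_layers(attn_per_layer, k, prefer="low"):
    """Return (sorted indices of k selected layers, per-layer entropy scores).

    prefer="low" keeps the most focused layers, prefer="high" the most
    diffuse ones. Which criterion the paper uses is NOT stated on this page,
    so it is exposed as a parameter (assumption).
    """
    scores = np.array([layer_entropy(a) for a in attn_per_layer])
    order = np.argsort(scores)          # ascending entropy
    if prefer == "high":
        order = order[::-1]             # descending entropy
    return sorted(order[:k].tolist()), scores
```

As a sanity check on the measure itself, a layer whose attention is uniform over n keys has entropy log(n), while a layer that attends to a single key has entropy close to zero.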
Contrastive (SigLIP2) and self-supervised (DINOv3) encoders learn fundamentally different visual representations. SigLIP2 excels at semantic alignment with language, while DINOv3 captures spatially coherent, fine-grained features ideal for grounding. Combining both unlocks complementary strengths.
CoME-VL integrates SigLIP2 and DINOv3 through a modular fusion pipeline: entropy-guided layer selection identifies the most informative features, orthogonality-regularized projections reduce redundancy, and RoPE-enhanced cross-attention aligns heterogeneous token grids into compact visual tokens.
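One common way to regularize a projection toward orthogonality is a soft penalty on its Gram matrix, sketched below. This is an assumption about the general technique, not the exact loss used in CoME-VL, which this page does not specify; the function name `orthogonality_penalty` and the weight `lam` mentioned afterwards are hypothetical.

```python
import numpy as np

def orthogonality_penalty(W):
    """Soft orthogonality penalty ||W^T W - I||_F^2.

    W: projection matrix of shape (d_in, d_out). The penalty is zero when
    the columns of W are orthonormal and grows as columns become
    correlated, i.e. as the projected features become redundant.
    """
    gram = W.T @ W                          # (d_out, d_out) column correlations
    identity = np.eye(W.shape[1])
    return float(np.sum((gram - identity) ** 2))
```

In training, such a term would typically be added to the task loss with a small weight, for example `loss = task_loss + lam * orthogonality_penalty(W)`, so that the projection is nudged toward non-redundant directions without being forced to be exactly orthogonal.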
DINOv3 maintains spatially coherent object-level attention throughout, while SigLIP2 transitions from broad spatial coverage in early layers to focused semantic discrimination in deeper layers.