How effectively can we combine two complementary vision encoders for vision-language modeling? We show that fusing SigLIP2 (strong at semantic understanding) with DINOv3 (strong at spatial grounding) through entropy-guided layer selection and orthogonality-regularized mixing yields consistent gains on both understanding and grounding tasks.
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
The contrastive encoder (SigLIP2) excels at semantic understanding: chart, diagram, table, and document comprehension tasks.
The self-supervised encoder (DINOv3) captures fine-grained spatial cues crucial for pointing, counting, and object localization.
Layer-wise entropy analysis reveals which layers are informative, guiding multi-scale feature selection from each encoder (see the sketch below).
Orthogonality-regularized mixing removes redundancy between the two feature streams; RoPE cross-attention spatially aligns their heterogeneous token grids.
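As a rough illustration of the entropy-guided selection idea (not the exact procedure used in the paper), the sketch below scores each encoder layer by the Shannon entropy of its token-activation magnitudes, a simple proxy for how informative the layer is, and keeps the highest-scoring layers. The function names, the histogram-based entropy estimator, and the keep ratio are illustrative assumptions.

```python
import torch

def layer_entropy(features: torch.Tensor, num_bins: int = 64) -> float:
    """Shannon entropy of a layer's token-activation magnitudes.
    features: [num_tokens, dim] hidden states from one encoder layer."""
    mags = features.norm(dim=-1)
    # Histogram the magnitudes, then compute entropy of the bin distribution.
    hist = torch.histc(mags, bins=num_bins,
                       min=float(mags.min()), max=float(mags.max()) + 1e-6)
    p = hist / hist.sum().clamp(min=1)
    p = p[p > 0]
    return float(-(p * p.log()).sum())

def select_layers(hidden_states, keep_ratio: float = 0.5):
    """Rank layers by entropy and keep the top fraction.
    hidden_states: list of [num_tokens, dim] tensors, one per layer."""
    scores = [layer_entropy(h) for h in hidden_states]
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # layer indices to pass on to the fusion module
```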
Contrastive (SigLIP2) and self-supervised (DINOv3) encoders learn fundamentally different visual representations. SigLIP2 excels at semantic alignment with language, while DINOv3 captures spatially coherent, fine-grained features ideal for grounding. Combining both unlocks complementary strengths.
CoME-VL integrates SigLIP2 and DINOv3 through a modular fusion pipeline: entropy-guided layer selection identifies the most informative features, orthogonality-regularized projections reduce redundancy, and RoPE-enhanced cross-attention aligns heterogeneous token grids into compact visual tokens.
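To make the pipeline concrete, here is a minimal PyTorch sketch of the orthogonality-regularized mixing and query-based cross-attention compression. It is a simplified stand-in rather than the released implementation: the class name, the dimensions, the pooled-cosine form of the orthogonality penalty, and the use of vanilla `nn.MultiheadAttention` (with RoPE omitted) are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthoFusion(nn.Module):
    """Project each encoder's tokens, penalize redundancy between the two
    projections with a soft orthogonality loss, then let learnable queries
    cross-attend to the concatenated tokens to form compact visual tokens."""

    def __init__(self, dim_sig: int, dim_dino: int, dim: int = 1024, num_queries: int = 144):
        super().__init__()
        self.proj_sig = nn.Linear(dim_sig, dim)
        self.proj_dino = nn.Linear(dim_dino, dim)
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, sig_tokens, dino_tokens):
        # sig_tokens: [B, N_s, dim_sig], dino_tokens: [B, N_d, dim_dino]
        s = self.proj_sig(sig_tokens)
        d = self.proj_dino(dino_tokens)

        # Soft orthogonality penalty: discourage the two projected streams
        # from encoding the same directions (pooled per image, cosine overlap).
        s_mean = F.normalize(s.mean(dim=1), dim=-1)
        d_mean = F.normalize(d.mean(dim=1), dim=-1)
        ortho_loss = (s_mean * d_mean).sum(dim=-1).pow(2).mean()

        # Compress the concatenated token grid with query-based cross-attention.
        kv = torch.cat([s, d], dim=1)                            # [B, N_s + N_d, dim]
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                    # [B, num_queries, dim]
        return fused, ortho_loss
```

In training, the auxiliary `ortho_loss` would typically be added to the main objective with a small weight so that redundancy reduction does not dominate the task loss.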
DINOv3 maintains spatially coherent object-level attention throughout, while SigLIP2 transitions from broad spatial coverage in early layers to focused semantic discrimination in deeper layers.
CoME-VL achieves consistent improvements across diverse vision-language benchmarks, outperforming both single-encoder baselines and competitive multi-encoder alternatives.
| Model | Chart | Diagrams | Tables | Others | Counting | Pointing (@3/5px) |
|---|---|---|---|---|---|---|
| LLaVA-1.5 7B | 20.31 | 28.32 | 20.80 | 29.68 | 78.27 | - |
| LLaVA-1.5 13B | 23.33 | 33.59 | 23.24 | 34.86 | 70.95 | - |
| LLaVA-Mistral 7B | 22.16 | 33.98 | 26.95 | 48.14 | 77.82 | - |
| InternVL2 8B | 57.71 | 72.85 | 73.82 | 90.62 | 74.05 | - |
| Qwen2-VL 7B | 45.21 | 64.25 | 61.13 | 86.91 | 57.42 | - |
| Pixtral-12B | 38.28 | 54.00 | 63.96 | 64.94 | 71.66 | - |
| PaliGemma-3B | 16.50 | 26.26 | 20.80 | 20.11 | 8.57 | - |
| Kosmos-2 8B | 7.81 | 8.88 | 12.50 | 8.00 | 26.19 | - |
| InstructBLIP 7B | 13.28 | 17.08 | 16.60 | 10.54 | 36.19 | - |
| Phi-3 7B | 10.54 | 9.37 | 9.57 | 7.22 | 12.61 | - |
| GLM-4V 9B | 40.23 | 58.65 | 54.12 | 84.37 | 84.76 | - |
| Molmo2 7B | 52.39 | 62.41 | 66.25 | 76.26 | 83.31 | 53.79 / 68.94 |
| CoME-VL 7B (Ours) | 57.24 | 66.94 | 70.75 | 81.84 | 87.83 | 58.56 / 75.94 |
| RefCOCO | val | testA | testB |
|---|---|---|---|
| Molmo | 0.10 | 0.27 | 0.27 |
| Clip-to-DINO | 91.73 | 94.06 | 88.85 |
| Qwen-VL | 89.36 | 92.23 | 85.36 |
| CoME-VL (Ours) | 92.57 | 95.36 | 90.51 |
Figure 5. Stacked performance breakdown showing additive gains from the RoPE and orthogonality-regularization (OL) components.
| Model | Chart | Diagrams | Tables | Others | Counting | Pointing (@3/5px) | Avg. Time |
|---|---|---|---|---|---|---|---|
| Molmo | 52.39 | 62.41 | 66.25 | 76.26 | 83.31 | 53.79 / 68.94 | 1.26s |
| SigLIP2(0-22) + DINOv3(0-9) | 56.17 | 66.12 | 69.86 | 81.12 | 86.97 | 56.68 / 74.59 | 1.37s |
| SigLIP2(22-27) + DINOv3(0-9) | 54.96 | 63.72 | 68.28 | 77.16 | 84.23 | 52.41 / 67.65 | 1.33s |
| SigLIP2(0-22) + DINOv3(10-23) | 56.91 | 66.46 | 70.02 | 81.46 | 87.67 | 57.22 / 75.13 | 1.40s |
| SigLIP2(22-27) + DINOv3(10-23) | 56.06 | 65.98 | 69.72 | 80.88 | 87.21 | 56.95 / 74.87 | 1.34s |
| SigLIP2(0-27) + DINOv3(10-23) | 57.24 | 66.94 | 70.75 | 81.84 | 87.83 | 58.56 / 75.94 | 1.52s |
CoME-VL produces more accurate and spatially precise predictions across diverse vision-language tasks, including visual question answering, object grounding, and document understanding.
CoME-VL supports a wide range of vision-language tasks, leveraging the complementary strengths of both encoders for superior performance across understanding and grounding domains.
If you find our work useful, please cite:
```bibtex
@article{comevl2026,
  title   = {CoME-VL: Scaling Complementary Multi-Encoder Vision-Language},
  author  = {Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
  journal = {arXiv preprint},
  year    = {2026}
}
```