LlamaV-o1

Rethinking Step-by-step Visual Reasoning in LLMs

Omkar Thawakar1*, Dinura Dissanayake1*, Ketan More1*, Ritesh Thawkar1*, Ahmed Heakl1*, Noor Ahsan1*, Yuhao Li1*, Mohammed Zumri1*, Jean Lahoud1*, Rao Muhammad Anwer1, Hisham Cholakkal1, Ivan Laptev1, Mubarak Shah2, Fahad Shahbaz Khan1,3, Salman Khan1,4

* Equal Contributions

1Mohamed bin Zayed University of AI, 2University of Central Florida, 3Linköping University, 4Australian National University


The figure illustrates a comprehensive dataset structure designed to evaluate diverse tasks across multiple domains. The dataset spans a wide range of categories, including mathematical and logical reasoning (e.g., MathVista with 231 samples and LogicVista with 158 samples), scientific reasoning (e.g., Science-QA with 83 samples), and visual perception (e.g., Blink-IQ-Test with 35 samples).


Additionally, it includes specialized areas such as medical imaging (e.g., MMMU-Medical with 29 samples), cultural and social understanding (e.g., ALM-Bench with 104 samples), and document understanding through OCR (e.g., Doc-VQA with 61 samples). By integrating tasks like chart and diagram comprehension (e.g., Chart-VQA with 107 samples), our dataset not only covers a broad spectrum of real-world applications but also expands LMMs' ability to reason, perceive, and interpret complex multimodal information.


The chart on the right presents a comparative evaluation of large multimodal models (LMMs) on VRC-Bench, highlighting both final-answer accuracy and step-by-step reasoning scores. It shows the performance of various models, such as GPT-4o, Gemini-2.0-Flash, Claude-3.5-Sonnet, and Llava-CoT, on complex reasoning tasks. Our benchmark evaluates models not only on their ability to generate accurate final answers but also on the coherence and logical flow of their reasoning steps. Our approach, LlamaV-o1, outperforms GPT-4o-mini, Gemini-1.5-Flash, and Llava-CoT on VRC-Bench, achieving superior final-answer accuracy across complex multimodal reasoning tasks.

News

[2025-01-09]: Our VRC-Bench is now available on HuggingFace. We welcome all contributions and look forward to your participation!
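For convenience, the snippet below sketches how the benchmark could be loaded with the HuggingFace datasets library; the repository id, split name, and field names are assumptions, so please refer to the dataset card on HuggingFace for the exact values.

```python
# Minimal sketch of loading VRC-Bench from the HuggingFace Hub.
# The repository id, split, and field names below are assumptions; check the
# dataset card on HuggingFace for the exact values.
from datasets import load_dataset

dataset = load_dataset("omkarthawakar/VRC-Bench", split="test")  # hypothetical repo id / split

sample = dataset[0]
print(sample.keys())  # e.g., image, question, reasoning steps, final answer (names may differ)
```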

Abstract

Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning through three key contributions: a new step-by-step visual reasoning benchmark, a novel evaluation metric, and an improved visual reasoning LLM, named LlamaV-o1, that is trained with curriculum learning.

First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-turn reasoning tasks. The benchmark provides a diverse set of challenges, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new visual reasoning model, LlamaV-o1, trained using a multi-turn curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-turn reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models, including the recent Llava-CoT, across multiple metrics. Notably, LlamaV-o1 exhibits better interpretability, robustness, and adaptability to complex visual reasoning tasks. Our benchmark, model, and code are publicly available.
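To make the step-level evaluation concrete, the snippet below sketches one way to score a predicted reasoning chain against reference steps; the token-overlap similarity and matching scheme are illustrative assumptions, not the exact metric used in the paper.

```python
# Illustrative sketch of step-level reasoning evaluation (not the paper's exact metric).
# Each reference step is paired with its closest predicted step and the best-match
# similarities are averaged, so both step correctness and coverage matter.
from difflib import SequenceMatcher

def step_similarity(pred: str, ref: str) -> float:
    """Cheap textual similarity stand-in for a semantic or LLM-based judge."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()

def step_score(pred_steps: list[str], ref_steps: list[str]) -> float:
    """Average best-match similarity of reference steps against predicted steps."""
    if not ref_steps:
        return 0.0
    matched = [max(step_similarity(p, r) for p in pred_steps) if pred_steps else 0.0
               for r in ref_steps]
    return sum(matched) / len(ref_steps)

# Toy usage with made-up reasoning chains
pred = ["Count the dots in each cell",
        "The count increases by one per row",
        "Option D continues the pattern"]
ref = ["Observe that the dot count grows by one across each row",
       "Therefore option D matches the pattern"]
print(round(step_score(pred, ref), 3))
```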

Teaser Figure

Comparison of the reasoning abilities of our model (LlamaV-o1) with the closed-source Gemini-1.5-Flash and Claude-3.5-Sonnet on an example pattern recognition task. While Claude-3.5-Sonnet also concludes "none of the options," its reasoning steps lack full alignment with the observed logic (highlighted in red). Gemini-1.5-Flash demonstrates weaker reasoning with less coherence (highlighted in red). Our LlamaV-o1 provides more accurate and systematic reasoning, identifying that option D follows the established pattern, thereby showcasing its logical reasoning.

Leaderboard on VRC-Bench

Performance comparison of different closed- and open-source LMMs on VRC-Bench.

| # | Model | Source | Steps Accuracy | Final Answer Accuracy |
|---|-------|--------|----------------|-----------------------|
| Closed-Source Models | | | | |
| 1 | GPT-4o 🥇 | Link | 76.68 | 59.28 |
| 2 | Gemini-2.0-Flash 🥈 | Link | 74.08 | 61.16 |
| 3 | GPT-4o-mini 🥉 | Link | 74.05 | 56.39 |
| 4 | Claude-3.5-Sonnet | Link | 72.12 | 61.35 |
| 5 | Gemini-1.5-Pro | Link | 72.12 | 61.35 |
| 6 | Gemini-1.5-Flash | Link | 71.86 | 54.99 |
| Open-Source Models | | | | |
| 1 | Llama-3.2-Vision-Instruct | Link | 56.37 | 48.40 |
| 2 | Mulberry | Link | 63.86 | 51.90 |
| 3 | Llava-CoT | Link | 66.21 | 54.09 |
| 4 | LlamaV-o1 (Ours) | Link | 68.93 | 56.49 |
Category Scores

A comprehensive comparison of category-wise and overall performance scores achieved by various models on diverse reasoning tasks. The evaluation spans multiple domains, including Math & Logic Reasoning, Scientific Reasoning, Complex Visual Perception, Chart & Diagram Understanding, Medical Imaging, Social & Cultural Context, Visual Reasoning, and OCR & Document Understanding. The models assessed include GPT-4o, Claude-3.5-Sonnet, Gemini variants, Llava-CoT, and our proposed model. Our model demonstrates consistently superior performance in critical categories such as Math & Logic Reasoning, Chart & Diagram Understanding, and Medical Imaging, achieving a balanced improvement across both step-by-step reasoning (Step Scores) and final answer accuracy (Final Answer Scores). Compared to Llava-CoT, our approach maintains high accuracy across tasks while showcasing robustness and interpretability in multi-turn reasoning challenges.

In-The-Wild Evaluations

Performance comparison on six benchmark datasets (MMStar, MMBench, MMVet, MathVista, AI2D, and Hallusion) along with their average scores. The comparison includes both closed-source and open-source models.
| Model | MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|-------|--------|---------|-------|-----------|------|-----------|---------|
| Closed-Source | | | | | | | |
| GPT-4o-0806 | 66.0 | 82.4 | 80.8 | 62.7 | 84.7 | 54.2 | 71.8 |
| Claude3.5-Sonnet-0620 | 64.2 | 75.4 | 68.7 | 61.6 | 80.2 | 49.9 | 66.7 |
| Gemini-1.5-Pro | 56.4 | 71.5 | 71.3 | 57.7 | 79.1 | 45.6 | 63.6 |
| GPT-4o-mini-0718 | 54.9 | 76.9 | 74.6 | 52.4 | 77.8 | 46.1 | 63.8 |
| Open-Source | | | | | | | |
| InternVL2-8B | 62.50 | 77.40 | 56.90 | 58.30 | 83.60 | 45.00 | 64.00 |
| Ovis1.5-Gemma2-9B | 58.70 | 76.30 | 50.90 | 65.60 | 84.50 | 48.20 | 64.00 |
| MiniCPM-V2.6-8B | 57.10 | 75.70 | 56.30 | 60.60 | 82.10 | 48.10 | 63.30 |
| Llama-3.2-90B-Vision-Inst | 51.10 | 76.80 | 74.10 | 58.30 | 69.50 | 44.10 | 62.30 |
| VILA-1.5-40B | 53.20 | 75.30 | 44.40 | 49.50 | 77.80 | 40.90 | 56.90 |
| Llava-CoT | 57.60 | 75.00 | 60.30 | 54.80 | 85.70 | 47.80 | 63.50 |
| Our Models | | | | | | | |
| Llama-3.2-11B-Vision-Inst | 49.80 | 65.80 | 57.60 | 48.60 | 77.30 | 40.30 | 56.90 |
| LlamaV-o1 (Ours) | 59.53 | 79.89 | 65.40 | 54.40 | 81.24 | 63.51 | 67.33 |

Ablations

The impact of proposed contributions on multimodal reasoning tasks across six benchmarks.
| Model | MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|-------|--------|---------|-------|-----------|------|-----------|---------|
| Llama-3.2-11B-Vision-Inst (base) | 49.80 | 65.80 | 57.60 | 48.60 | 77.30 | 40.30 | 56.90 |
| + Curriculum with Multi-Turn CoT Reasoning | 58.13 | 79.55 | 61.88 | 53.20 | 80.18 | 63.51 | 66.08 |
| + Beam Search | 59.53 | 79.89 | 65.40 | 54.40 | 81.24 | 63.51 | 67.33 |
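The "+ Curriculum with Multi-Turn CoT Reasoning" row above reflects the structured, easy-to-hard training schedule. The snippet below is a minimal sketch of such a curriculum, assuming reasoning-step count as a difficulty proxy and a generic staged training loop; the paper's actual stage definitions and data mix may differ.

```python
# Illustrative curriculum schedule: group training samples into progressively
# harder stages and consume them in order (not the exact LlamaV-o1 recipe).
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    target: str
    num_steps: int  # proxy for reasoning difficulty

def curriculum_stages(samples: list[Sample], boundaries=(2, 4)) -> list[list[Sample]]:
    """Split samples into easy / medium / hard stages by reasoning length."""
    easy = [s for s in samples if s.num_steps <= boundaries[0]]
    medium = [s for s in samples if boundaries[0] < s.num_steps <= boundaries[1]]
    hard = [s for s in samples if s.num_steps > boundaries[1]]
    return [easy, medium, hard]

def train(samples: list[Sample]) -> None:
    """Train stage by stage; train_one_stage is a stand-in for the fine-tuning loop."""
    for stage_id, stage in enumerate(curriculum_stages(samples), start=1):
        # train_one_stage(model, stage)  # placeholder for the actual fine-tuning call
        print(f"Stage {stage_id}: {len(stage)} samples")

train([Sample("q1", "a1", 1), Sample("q2", "a2", 3), Sample("q3", "a3", 6)])
```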
Llava-CoT using stage-level beam search (MMVet benchmark)
| Inference Scaling | # Beams | MMVet Score | Time (GPU Hours) |
|-------------------|---------|-------------|------------------|
| No Scaling | 1 | 60.3 | 3.8 |
| Stage-level | 2 | 61.7 | 20.1 |
| Stage-level | 3 | 62.3 | 31.5 |
| Stage-level | 4 | 62.9 | 46.1 |
Our approach using best-of-N beam search (MMVet benchmark)
| Inference Scaling | # Beams | MMVet Score | Time (GPU Hours) |
|-------------------|---------|-------------|------------------|
| No Scaling | 1 | 63.63 | 2.7 |
| Beam Search | 2 | 64.26 | 4.8 |
| Beam Search | 3 | 64.92 | 5.7 |
| Beam Search | 4 | 65.40 | 6.1 |
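The "Beam Search" rows above correspond to best-of-N selection at inference time: N candidate reasoning chains are generated for the same input and the highest-scoring one is kept, trading extra compute for accuracy. The snippet below is a minimal, self-contained sketch of this selection logic; the candidate generator and scoring function are placeholders, not the actual LlamaV-o1 inference code.

```python
# Minimal sketch of best-of-N selection at inference time (illustrative only).
def generate_candidate(question: str, seed: int) -> str:
    """Stand-in for sampling one reasoning chain from the model with a given seed."""
    return f"[seed {seed}] step-by-step reasoning for: {question}"

def score_candidate(candidate: str) -> float:
    """Stand-in for a quality score such as sequence log-probability or a verifier model."""
    return float(len(candidate))  # placeholder heuristic

def best_of_n(question: str, n: int = 4) -> str:
    """Generate n candidate chains and keep the highest-scoring one."""
    candidates = [generate_candidate(question, seed) for seed in range(n)]
    return max(candidates, key=score_candidate)

print(best_of_n("Which option continues the pattern?", n=4))
```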

BibTeX

@article{thawakar2025llamavo1,
            title = {LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs},
            author = {Thawakar, Omkar and Dissanayake, Dinura and More, Ketan and Thawkar, Ritesh and Heakl, Ahmed and Ahsan, Noor and Li, Yuhao and Zumri, Mohammed and Lahoud, Jean and Anwer, Rao Muhammad and Cholakkal, Hisham and Laptev, Ivan and Shah, Mubarak and Khan, Fahad Shahbaz and Khan, Salman},
            journal = {arXiv preprint arXiv:2501.06186},
            year = {2025},
            url = {https://arxiv.org/abs/2501.06186}
}