LlamaV-o1

Rethinking Step-by-step Visual Reasoning in LLMs

Omkar Thawakar1*, Dinura Dissanayake1*, Ketan More1*, Ritesh Thawkar1*, Ahmed Heakl1*, Noor Ahsan1*, Yuhao Li1*, Mohammed Zumri1*, Jean Lahoud1*, Rao Muhammad Anwer1, Hisham Cholakkal1, Ivan Laptev1, Mubarak Shah2, Fahad Shahbaz Khan1,3, Salman Khan1,4

* Equal Contributions

1Mohamed bin Zayed University of AI, 2University of Central Florida, 3Linköping University, 4Australian National University


The figure illustrates a comprehensive dataset structure designed to evaluate diverse tasks across multiple domains. The dataset spans a wide range of categories, including mathematical and logical reasoning (e.g., MathVista with 231 samples and LogicVista with 158 samples), scientific reasoning (e.g., Science-QA with 83 samples), and visual perception (e.g., Blink-IQ-Test with 35 samples).


Additionally, it includes specialized areas such as medical imaging (e.g., MMMU-Medical with 29 samples), cultural and social understanding (e.g., ALM-Bench with 104 samples), and document understanding through OCR (e.g., Doc-VQA with 61 samples). By integrating tasks like chart and diagram comprehension (e.g., Chart-VQA with 107 samples), our dataset not only covers a broad spectrum of real-world applications but also expands LMMs' ability to reason, perceive, and interpret complex multimodal information.


The chart on the right presents a comparative evaluation of large multimodal models (LMMs) on VRC-Bench, highlighting both final-answer accuracy and step-by-step reasoning scores. It shows the performance of various models, such as GPT-4o, Gemini-2.0-Flash, Claude-3.5-Sonnet, and Llava-CoT, on complex reasoning tasks. Our benchmark evaluates models not only on their ability to generate accurate final answers but also on the coherence and logical flow of their reasoning steps. Our approach, LlamaV-o1, outperforms GPT-4o-mini, Gemini-1.5-Flash, and Llava-CoT on VRC-Bench, achieving superior final-answer accuracy across complex multimodal reasoning tasks.

News

[2025-01-09]: Our VRC-Bench is now available on HuggingFace. We welcome all contributions and look forward to your participation!
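For convenience, the snippet below sketches how the benchmark could be loaded with the HuggingFace datasets library; the repository id, split name, and field names are assumptions, so please refer to the dataset card on HuggingFace for the exact values.

```python
# Minimal sketch of loading VRC-Bench from the HuggingFace Hub.
# The repository id, split, and field names below are assumptions; check the
# dataset card on HuggingFace for the exact values.
from datasets import load_dataset

dataset = load_dataset("omkarthawakar/VRC-Bench", split="test")  # hypothetical repo id / split

sample = dataset[0]
print(sample.keys())  # e.g., image, question, reasoning steps, final answer (names may differ)
```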

Abstract

Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning through three key contributions: a new step-by-step visual reasoning benchmark, a novel evaluation metric, and an improved visual reasoning LLM, named LlamaV-o1, that is trained with curriculum learning.

First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-turn reasoning tasks. The benchmark provides a diverse set of challenges, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new visual reasoning model, LlamaV-o1, trained using a multi-turn curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-turn reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models, including the recent Llava-CoT, across multiple metrics. Notably, LlamaV-o1 exhibits better interpretability, robustness, and adaptability to complex visual reasoning tasks. Our benchmark, model, and code are publicly available.
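To make the step-level evaluation concrete, the snippet below sketches one way to score a predicted reasoning chain against reference steps; the token-overlap similarity and matching scheme are illustrative assumptions, not the exact metric used in the paper.

```python
# Illustrative sketch of step-level reasoning evaluation (not the paper's exact metric).
# Each reference step is paired with its closest predicted step and the best-match
# similarities are averaged, so both step correctness and coverage matter.
from difflib import SequenceMatcher

def step_similarity(pred: str, ref: str) -> float:
    """Cheap textual similarity stand-in for a semantic or LLM-based judge."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()

def step_score(pred_steps: list[str], ref_steps: list[str]) -> float:
    """Average best-match similarity of reference steps against predicted steps."""
    if not ref_steps:
        return 0.0
    matched = [max(step_similarity(p, r) for p in pred_steps) if pred_steps else 0.0
               for r in ref_steps]
    return sum(matched) / len(ref_steps)

# Toy usage with made-up reasoning chains
pred = ["Count the dots in each cell",
        "The count increases by one per row",
        "Option D continues the pattern"]
ref = ["Observe that the dot count grows by one across each row",
       "Therefore option D matches the pattern"]
print(round(step_score(pred, ref), 3))
```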

Teaser Figure

Comparison of the reasoning abilities of our model (LlamaV-o1) with the closed-source Gemini-1.5-Flash and Claude-3.5-Sonnet on an example pattern recognition task. While Claude-3.5-Sonnet also concludes "none of the options," its reasoning steps lack full alignment with the observed logic (highlighted in red). Gemini-1.5-Flash demonstrates weaker reasoning with less coherence (highlighted in red). Our LlamaV-o1 provides more accurate and systematic reasoning, identifying that option D follows the established pattern, thereby showcasing its logical reasoning.

Leaderboard on VRC-Bench

Performance comparison of different closed- and open-source LMMs on VRC-Bench.

| # | Model | Source | Steps Accuracy | Final Answer Accuracy |
|---|-------|--------|----------------|-----------------------|
| Closed-Source Models | | | | |
| 1 | GPT-4o 🥇 | Link | 76.68 | 59.28 |
| 2 | Gemini-2.0-Flash 🥈 | Link | 74.08 | 61.16 |
| 3 | GPT-4o-mini 🥉 | Link | 74.05 | 56.39 |
| 4 | Claude-3.5-Sonnet | Link | 72.12 | 61.35 |
| 5 | Gemini-1.5-Pro | Link | 72.12 | 61.35 |
| 6 | Gemini-1.5-Flash | Link | 71.86 | 54.99 |
| Open-Source Models | | | | |
| 1 | Llama-3.2-Vision-Instruct | Link | 56.37 | 48.40 |
| 2 | Mulberry | Link | 63.86 | 51.90 |
| 3 | Llava-CoT | Link | 66.21 | 54.09 |
| 4 | LlamaV-o1 (Ours) | Link | 68.93 | 56.49 |
Category Scores

A comprehensive comparison of category-wise and overall performance scores achieved by various models on diverse reasoning tasks. The evaluation spans multiple domains, including Math & Logic Reasoning, Scientific Reasoning, Complex Visual Perception, Chart & Diagram Understanding, Medical Imaging, Social & Cultural Context, Visual Reasoning, and OCR & Document Understanding. The models assessed include GPT-4o, Claude-3.5-Sonnet, Gemini variants, Llava-CoT, and our proposed model. Our model demonstrates consistently superior performance in critical categories such as Math & Logic Reasoning, Chart & Diagram Understanding, and Medical Imaging, achieving a balanced improvement across both step-by-step reasoning (Step Scores) and final answer accuracy (Final Answer Scores). Compared to Llava-CoT, our approach maintains high accuracy across tasks while showcasing robustness and interpretability in multi-turn reasoning challenges.

In-The-Wild Evaluations

Performance comparison on six benchmark datasets (MMStar, MMBench, MMVet, MathVista, AI2D, and Hallusion) along with their average scores. The comparison includes both closed-source and open-source models.
| Model | MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|-------|--------|---------|-------|-----------|------|-----------|---------|
| Closed-Source | | | | | | | |
| GPT-4o-0806 | 66.0 | 82.4 | 80.8 | 62.7 | 84.7 | 54.2 | 71.8 |
| Claude3.5-Sonnet-0620 | 64.2 | 75.4 | 68.7 | 61.6 | 80.2 | 49.9 | 66.7 |
| Gemini-1.5-Pro | 56.4 | 71.5 | 71.3 | 57.7 | 79.1 | 45.6 | 63.6 |
| GPT-4o-mini-0718 | 54.9 | 76.9 | 74.6 | 52.4 | 77.8 | 46.1 | 63.8 |
| Open-Source | | | | | | | |
| InternVL2-8B | 62.50 | 77.40 | 56.90 | 58.30 | 83.60 | 45.00 | 64.00 |
| Ovis1.5-Gemma2-9B | 58.70 | 76.30 | 50.90 | 65.60 | 84.50 | 48.20 | 64.00 |
| MiniCPM-V2.6-8B | 57.10 | 75.70 | 56.30 | 60.60 | 82.10 | 48.10 | 63.30 |
| Llama-3.2-90B-Vision-Inst | 51.10 | 76.80 | 74.10 | 58.30 | 69.50 | 44.10 | 62.30 |
| VILA-1.5-40B | 53.20 | 75.30 | 44.40 | 49.50 | 77.80 | 40.90 | 56.90 |
| Llava-CoT | 57.60 | 75.00 | 60.30 | 54.80 | 85.70 | 47.80 | 63.50 |
| Our Models | | | | | | | |
| Llama-3.2-11B-Vision-Inst | 49.80 | 65.80 | 57.60 | 48.60 | 77.30 | 40.30 | 56.90 |
| LlamaV-o1 (Ours) | 59.53 | 79.89 | 65.40 | 54.40 | 81.24 | 63.51 | 67.33 |

Ablations

The impact of proposed contributions on multimodal reasoning tasks across six benchmarks.
| Model | MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|-------|--------|---------|-------|-----------|------|-----------|---------|
| Llama-3.2-11B-Vision-Inst (base) | 49.80 | 65.80 | 57.60 | 48.60 | 77.30 | 40.30 | 56.90 |
| + Curriculum with Multi-Turn CoT Reasoning | 58.13 | 79.55 | 61.88 | 53.20 | 80.18 | 63.51 | 66.08 |
| + Beam Search | 59.53 | 79.89 | 65.40 | 54.40 | 81.24 | 63.51 | 67.33 |
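The "+ Curriculum with Multi-Turn CoT Reasoning" row above reflects the structured, easy-to-hard training schedule. The snippet below is a minimal sketch of such a curriculum, assuming reasoning-step count as a difficulty proxy and a generic staged training loop; the paper's actual stage definitions and data mix may differ.

```python
# Illustrative curriculum schedule: group training samples into progressively
# harder stages and consume them in order (not the exact LlamaV-o1 recipe).
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    target: str
    num_steps: int  # proxy for reasoning difficulty

def curriculum_stages(samples: list[Sample], boundaries=(2, 4)) -> list[list[Sample]]:
    """Split samples into easy / medium / hard stages by reasoning length."""
    easy = [s for s in samples if s.num_steps <= boundaries[0]]
    medium = [s for s in samples if boundaries[0] < s.num_steps <= boundaries[1]]
    hard = [s for s in samples if s.num_steps > boundaries[1]]
    return [easy, medium, hard]

def train(samples: list[Sample]) -> None:
    """Train stage by stage; train_one_stage is a stand-in for the fine-tuning loop."""
    for stage_id, stage in enumerate(curriculum_stages(samples), start=1):
        # train_one_stage(model, stage)  # placeholder for the actual fine-tuning call
        print(f"Stage {stage_id}: {len(stage)} samples")

train([Sample("q1", "a1", 1), Sample("q2", "a2", 3), Sample("q3", "a3", 6)])
```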
Llava-CoT using stage-level beam search (MMVet benchmark)
| Inference Scaling | # Beams | MMVet Score | Time (GPU Hours) |
|-------------------|---------|-------------|------------------|
| No Scaling | 1 | 60.3 | 3.8 |
| Stage-level | 2 | 61.7 | 20.1 |
| Stage-level | 3 | 62.3 | 31.5 |
| Stage-level | 4 | 62.9 | 46.1 |
Our approach using best-of-N beam search (MMVet benchmark)
| Inference Scaling | # Beams | MMVet Score | Time (GPU Hours) |
|-------------------|---------|-------------|------------------|
| No Scaling | 1 | 63.63 | 2.7 |
| Beam Search | 2 | 64.26 | 4.8 |
| Beam Search | 3 | 64.92 | 5.7 |
| Beam Search | 4 | 65.40 | 6.1 |
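The "Beam Search" rows above correspond to best-of-N selection at inference time: N candidate reasoning chains are generated for the same input and the highest-scoring one is kept, trading extra compute for accuracy. The snippet below is a minimal, self-contained sketch of this selection logic; the candidate generator and scoring function are placeholders, not the actual LlamaV-o1 inference code.

```python
# Minimal sketch of best-of-N selection at inference time (illustrative only).
def generate_candidate(question: str, seed: int) -> str:
    """Stand-in for sampling one reasoning chain from the model with a given seed."""
    return f"[seed {seed}] step-by-step reasoning for: {question}"

def score_candidate(candidate: str) -> float:
    """Stand-in for a quality score such as sequence log-probability or a verifier model."""
    return float(len(candidate))  # placeholder heuristic

def best_of_n(question: str, n: int = 4) -> str:
    """Generate n candidate chains and keep the highest-scoring one."""
    candidates = [generate_candidate(question, seed) for seed in range(n)]
    return max(candidates, key=score_candidate)

print(best_of_n("Which option continues the pattern?", n=4))
```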

BibTeX

@article{thawakar2025llamavo1,
            title = {LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs},
            author = {Thawakar, Omkar and Dissanayake, Dinura and More, Ketan and Thawkar, Ritesh and Heakl, Ahmed and Ahsan, Noor and Li, Yuhao and Zumri, Mohammed and Lahoud, Jean and Anwer, Rao Muhammad and Cholakkal, Hisham and Laptev, Ivan and Shah, Mubarak and Khan, Fahad Shahbaz and Khan, Salman},
            journal = {arXiv preprint arXiv:2501.06186},
            year = {2025},
            url = {https://arxiv.org/abs/2501.06186}
}