
MATRIX

Multimodal Agent Tuning for Robust Tool-Use Reasoning

1 MBZUAI, 2 University of Oxford

Comparison of the baseline Qwen2-VL, MAT, and the proposed MATRIX agent on a visual reasoning task. MATRIX shows superior tool use, fewer hallucinations, and more consistent reasoning, while Qwen2-VL and MAT often struggle with tool coordination and fallback strategies.

Abstract

As vision-language models (VLMs) evolve into multimodal controllers capable of interacting with external tools, a key limitation remains: the scarcity of high-quality multimodal trajectories and the prohibitive cost of manual annotation. To overcome this, we present MATRIX, a vision-centric agent tuning framework designed for robust and scalable tool-use reasoning. MATRIX automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller to execute complex reasoning and decision-making tasks. Our pipeline introduces M-TRACE, a large-scale dataset comprising 28.5K multimodal tasks and 177K verified trajectories, enabling imitation-based trajectory tuning for grounded tool interaction. Building upon this, we develop MATRIX Agent, a controller fine-tuned on M-TRACE for step-wise reasoning over visual and textual inputs. To achieve finer behavioral alignment, we further propose Pref-X, a collection of 11K automatically generated preference pairs, and optimize MATRIX through step-wise preference learning. Evaluations across three multimodal tool-use benchmarks (Agent-X, GTA, and GAIA) demonstrate that MATRIX consistently outperforms both open- and closed-source VLMs, marking a significant step toward scalable, general-purpose multimodal agents with effective and reliable tool-use capabilities.

Introduction

MATRIX is a vision-centric agent tuning framework that teaches multimodal models to use tools effectively and safely. While existing agents rely on costly, narrow, manually curated tool-use data, MATRIX learns at scale from both traces (verified trajectories) and preferences (step-level refinements).

We first introduce M-TRACE — 28.5K multimodal tasks with 177K verified trajectories — for imitation-based grounding. Then, we propose Pref-X — 11K preference pairs — for fine-grained alignment via Direct Preference Optimization (DPO). This two-stage process helps agents plan, recover, and adapt tool usage with precision.

Evaluated on Agent-X, GTA, and GAIA, MATRIX outperforms both open and closed-source VLMs, improving accuracy by up to 23% and setting a new benchmark for robust multimodal tool-use reasoning.

MATRIX framework overview

Example Tasks

Below are example tasks where MATRIX demonstrates strong reasoning and accurate tool use across the Agent-X, GTA, and GAIA benchmarks. Each task involves multimodal contexts (images, text, or structured inputs) and multi-step reasoning, where the model must identify the right tools, plan actions, and execute them to reach the ground-truth solution. These examples highlight MATRIX's ability to combine perception, logic, and action in real-world settings. The dataset and full examples are available on Hugging Face.

Figure: Examples of MATRIX achieving accurate, grounded reasoning across Agent-X, GTA, and GAIA tasks — reaching near ground-truth performance through adaptive, step-wise tool selection.

MATRIX Agent

MATRIX is a vision-centric multimodal agent designed for reliable, step-wise reasoning and intelligent tool use. The key challenge for such agents lies in the scarcity of high-quality trajectories and the cost of manual annotations, both of which restrict scalability and generalization across diverse environments.

To address this, we introduce a two-stage training framework that combines trajectory supervision with preference optimization. In the first stage, Supervised Fine-Tuning (SFT) on automatically synthesized multimodal trajectories from M-TRACE equips the controller with structured tool-use and reasoning skills. In the second stage, preference optimization via Direct Preference Optimization (DPO) (Kong et al., 2025) on step-level exploration data (Pref-X) further refines decision-making beyond imitation, encouraging the agent to select accurate, consistent, and goal-directed actions.
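The two-stage schedule above can be sketched at a high level: first run supervised updates over the verified M-TRACE trajectories, then run step-wise preference updates over Pref-X pairs. This is a minimal illustrative sketch, not the authors' training code; the controller interface (`sft_update`, `dpo_update`) is a hypothetical abstraction.

```python
def train_matrix(controller, mtrace, prefx):
    """Two-stage tuning sketch: imitation first, then preference alignment.

    controller: any object exposing sft_update / dpo_update (hypothetical API).
    mtrace:     iterable of verified trajectories (stage 1 supervision).
    prefx:      iterable of (chosen, rejected) step-level action pairs (stage 2).
    """
    # Stage 1: Supervised Fine-Tuning on verified trajectories,
    # i.e. maximize the likelihood of each demonstrated tool-use step.
    for trajectory in mtrace:
        controller.sft_update(trajectory)

    # Stage 2: step-wise preference optimization (DPO-style),
    # pushing the policy toward the verified (chosen) action at each step.
    for chosen, rejected in prefx:
        controller.dpo_update(chosen, rejected)

    return controller
```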

M-TRACE Dataset

M-TRACE is a large-scale dataset of 28.5K multimodal tasks and 177K verified trajectories for grounded tool-use reasoning. Each trajectory is double-verified for semantic accuracy and execution validity, ensuring high-quality supervision for MATRIX.
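The double-verification step can be sketched as a filter that keeps a synthesized trajectory only if it passes both a semantic check and an execution check. All function and field names below are illustrative assumptions, not the paper's actual pipeline; in practice the semantic check would typically involve a judge model and the execution check a sandboxed re-run of each tool call.

```python
def semantically_valid(trajectory):
    # Illustrative stand-in: compare the trajectory's final answer
    # against the task's expected outcome.
    return trajectory.get("answer") == trajectory.get("expected_answer")

def executes_cleanly(trajectory):
    # Illustrative stand-in: require that no step recorded an
    # execution error when its tool call was run.
    return all(step.get("error") is None for step in trajectory["steps"])

def double_verify(trajectories):
    """Keep only trajectories that pass BOTH verification stages."""
    return [t for t in trajectories
            if semantically_valid(t) and executes_cleanly(t)]
```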

M-TRACE dataset analysis

Figure 3: Statistics of M-TRACE — file types, domains, tools, and step complexity.

Pref-X Dataset

Pref-X is a dataset of 11K step-wise preference pairs built to refine MATRIX beyond imitation. It enables the agent to compare candidate actions, favor accurate and consistent ones, and learn from step-level feedback. Each preference pair is derived through exploration and verification, allowing Direct Preference Optimization (DPO) to align the controller toward reliable, goal-directed tool use.
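At the level of a single preference pair, the standard DPO objective scores the policy's chosen action against the rejected one, each relative to a frozen reference model. The sketch below shows that loss for one step; it assumes per-action log-probabilities are already computed and uses the conventional `beta` temperature, so it is an illustration of the objective rather than the paper's implementation.

```python
import math

def dpo_step_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Step-wise DPO loss for one preference pair.

    logp_*     : policy log-probabilities of the chosen / rejected action.
    ref_logp_* : frozen reference model's log-probabilities of the same actions.
    beta       : temperature controlling deviation from the reference policy.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # action than the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Standard DPO objective: -log sigmoid(margin).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; increasing the relative likelihood of the chosen action drives the loss down, which is the behavior the step-level tuning exploits.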

Pref-X dataset pipeline

Figure 4: Pref-X pipeline — generating and verifying preference pairs for step-wise preference tuning via DPO.

Evaluation & Results

We evaluate MATRIX across three challenging multimodal reasoning benchmarks — Agent-X, GTA, and GAIA. These tasks test fine-grained step reasoning, long-horizon tool use, and open-ended multimodal understanding. MATRIX consistently surpasses both open- and closed-source agents, demonstrating stronger grounding, faithfulness, and decision consistency.

Agent-X results

Figure 5: MATRIX achieves state-of-the-art tool and reasoning accuracy on Agent-X, improving step-wise precision and grounding over all open models.


On GTA and GAIA, which require realistic tool interaction and complex multimodal reasoning, MATRIX delivers up to +23% higher answer accuracy through its preference-tuned controller, enabling reliable step planning and goal completion in real-world tasks.

GTA and GAIA results

Figure 6: MATRIX outperforms open- and closed-source agents on GTA and GAIA, showcasing superior generalization and preference-aligned reasoning.

BibTeX

@misc{matrix,
  title={MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning}, 
  author={Tajamul Ashraf and Umair Nawaz and Abdelrahman M. Shaker and Rao Muhammad Anwer and Philip Torr and Fahad Shahbaz Khan and Salman Khan},
  year={2025},
  eprint={2510.08567},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.08567}
}