How Good are Foundation Models in Step-by-Step Embodied Reasoning?

1Mohamed bin Zayed University of AI,
2Australian National University, 3Linköping University


Figure: Example illustrating the final answer and step-by-step reasoning from Gemini and Qwen for a given video and text prompt. While both models correctly identify the action as "withdraw bolt", the reasoning they provide differs significantly. This underscores the importance of evaluating not only the final answer but also the underlying reasoning.

Abstract

Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence.




Table: A comparison of various Embodied and Physical AI benchmarks. We summarize key features across benchmarks, including input modalities, question formats, presence of step-by-step reasoning trails, number of annotated questions, annotation methods, diversity of tasks and embodiments, and the types of robots involved. Our benchmark (last row) is distinguished by explicitly incorporating reasoning trails, supporting a variety of question types, and covering a broader set of tasks and robotic platforms than prior work.


Benchmark Curation: Building the FoMER-Bench

FoMER-Bench is designed to evaluate reasoning in physical AI scenarios. It covers multiple robot types and modalities, allowing assessment of capabilities across tasks such as Next-action prediction, Action affordance, Physical common sense, Temporal reasoning, Tool use and manipulation, Risk assessment, and Robot navigation.

  • Benchmark Assembly. To capture the full scope of physical reasoning, the benchmark was curated from multiple existing datasets. Some datasets (e.g., Cosmos-R1, Pbench) already contained QA pairs, while others required generating QA pairs and reasoning trails through a semi-automated pipeline. For datasets with existing QA pairs, reasoning trails were added to link each answer to a logical, step-by-step explanation.
  • QA and Reasoning Trail Generation. The benchmark is constructed using a structured pipeline built around Qwen2.5-VL-32B-Instruct. First, the model identifies all visible objects, dynamic elements, and interactions in each scenario. Based on this context, it generates QA pairs with chain-of-thought reasoning, targeting skills such as physical common sense and spatial and temporal reasoning (see the sketch after this list).
  • Manual Verification. To ensure quality, all generated QA pairs and reasoning trails were manually verified. Volunteers checked that questions were relevant, physically plausible, and aligned with the intended task categories. Reasoning trails were refined by adding or removing steps, and trivial or misaligned questions were removed.
  • Task Ontology. For detailed evaluation, questions are categorized into ten task types: Task completion verification, Next-action prediction, Action affordance, Physical common sense, Robot-centric reasoning, Temporal reasoning, Tool use and manipulation, Social navigation, Human-robot object interaction, and Risk assessment.
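
As a rough illustration of the two-stage generation described above, the sketch below assumes Qwen2.5-VL-32B-Instruct is served behind an OpenAI-compatible endpoint (e.g., via vLLM); the prompts, function names, and JSON schema here are illustrative and not the exact ones used to build FoMER-Bench.

```python
# Minimal sketch of the semi-automated QA / reasoning-trail generation stage.
# Assumes Qwen2.5-VL-32B-Instruct is served behind an OpenAI-compatible endpoint
# (e.g., via vLLM). Prompts, field names, and the JSON schema are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2.5-VL-32B-Instruct"

def describe_scene(frame_urls: list[str]) -> str:
    """Stage 1: list visible objects, dynamic elements, and interactions."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in frame_urls]
    content.append({"type": "text",
                    "text": "List all visible objects, dynamic elements, and "
                            "interactions between the robot and the scene."})
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": content}])
    return resp.choices[0].message.content

def generate_qa(scene_description: str, task_type: str) -> dict:
    """Stage 2: generate a QA pair plus a chain-of-thought reasoning trail."""
    prompt = (
        f"Scene context:\n{scene_description}\n\n"
        f"Write one question of type '{task_type}', its correct answer, and a "
        "numbered step-by-step reasoning trail that links the answer to the "
        "scene. Respond as JSON with keys: question, answer, reasoning_steps."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Example usage (frame URLs are placeholders); generated samples then pass
# through the manual verification step described above.
# scene = describe_scene(["https://example.com/frame_000.jpg"])
# sample = generate_qa(scene, "next-action prediction")
```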

FoMER-Bench Overview

Figure: Dataset distribution and question type composition of our benchmark. Question types include open-ended questions (Open), multiple choice questions (MCQ), and True/False questions (TF). For clarity, we decompose the Cosmos-R1 benchmark into its constituent sub-datasets Agibot, BridgeDataV2, HoloAssist, RoboVQA, and RoboFail to explicitly show the distribution of question types across these subsets.


Figure: Performance of open-source and closed-source SoTA models, showing both reasoning accuracy and final-answer accuracy. Reasoning steps are evaluated in detail using our proposed evaluation criteria.



Performance of Open- and Closed-Source LMMs on our Benchmark


Performance Analysis

Performance comparison of GPT-o4-mini and Cosmos-R1 across different categories of our benchmark.



We evaluated model outputs using both GPT-4o and Qwen3-32B as judges, and the two judges produce similar scores for both final-answer accuracy and reasoning accuracy.
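
As a rough sketch of this judge-based scoring, the snippet below asks GPT-4o for per-sample judgments and averages them; the prompt wording, rubric, and JSON fields are assumptions rather than the paper's exact evaluation criteria, and pointing the client at a served Qwen3-32B endpoint would give the second judge.

```python
# Hedged sketch of judge-based scoring with GPT-4o. The rubric, prompt, and
# JSON fields are illustrative assumptions, not the paper's exact criteria.
import json
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o"

def judge(question: str, gt_answer: str, gt_steps: list[str],
          pred_answer: str, pred_steps: list[str]) -> dict:
    """Ask the judge model to grade one prediction against the reference."""
    prompt = (
        "You are grading an embodied-reasoning response.\n"
        f"Question: {question}\n"
        f"Reference answer: {gt_answer}\n"
        f"Reference reasoning steps: {json.dumps(gt_steps)}\n"
        f"Predicted answer: {pred_answer}\n"
        f"Predicted reasoning steps: {json.dumps(pred_steps)}\n\n"
        "Return JSON with keys 'final_correct' (0 or 1) and 'reasoning_score' "
        "(fraction of predicted steps that are factually correct and "
        "consistent with the reference, between 0 and 1)."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def aggregate(scores: list[dict]) -> dict:
    """Average per-sample judgments into final and reasoning accuracy."""
    n = max(len(scores), 1)
    return {
        "final_accuracy": sum(s["final_correct"] for s in scores) / n,
        "reasoning_accuracy": sum(s["reasoning_score"] for s in scores) / n,
    }
```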

Conclusion

In this paper, we introduced the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the embodied reasoning capabilities of LMMs. The benchmark consists of over 1,000 samples with detailed reasoning steps spanning 10 diverse task categories. In addition, we proposed a new evaluation framework that jointly assesses both action validity and reasoning correctness. We analyzed the performance of nine state-of-the-art models, including both open-source and proprietary systems. Our results reveal significant limitations of current models on embodied reasoning tasks and underscore the importance of analyzing and evaluating reasoning trails to better understand model capabilities. As a result, FoMER could serve as a testbed to identify potentially unsafe or unreliable reasoning in LMMs and agentic models before real-world deployment.

For additional details about evaluation and experimental results, please refer to our main paper. Thank you!

BibTeX

@misc{dissanayake2025goodfoundationmodelsstepbystep,
      title={How Good are Foundation Models in Step-by-Step Embodied Reasoning?}, 
      author={Dinura Dissanayake and Ahmed Heakl and Omkar Thawakar and Noor Ahsan and Ritesh Thawkar and Ketan More and Jean Lahoud and Rao Anwer and Hisham Cholakkal and Ivan Laptev and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2509.15293},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.15293}, 
}