The FoMER benchmark is designed to evaluate reasoning in physical AI scenarios. It covers multiple robot types and modalities, enabling assessment of capabilities across tasks such as next-action prediction, action affordance, physical common sense, temporal reasoning, tool use and manipulation, risk assessment, and robot navigation.
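To make the task structure concrete, here is a minimal sketch of what a single benchmark sample could look like; all field names and values (`task_category`, `question_type`, `reasoning_steps`, etc.) are illustrative assumptions, not the released schema.

```python
# Hypothetical structure of one benchmark sample. Field names and values
# are illustrative assumptions, not the released data format.
sample = {
    "sample_id": "fomer-0001",
    "task_category": "next_action_prediction",  # one of the 10 task categories
    "question_type": "MCQ",                     # "Open", "MCQ", or "TF"
    "question": "The robot arm holds a full mug above the sink. "
                "What should it do next?",
    "choices": ["Release the mug", "Tilt the mug to pour", "Move to the stove"],
    "answer": "Tilt the mug to pour",
    "reasoning_steps": [                        # gold step-by-step rationale
        "The mug is full and positioned over the sink.",
        "Pouring requires tilting the mug before releasing it.",
    ],
}
```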
Figure: Dataset distribution and question-type composition of our benchmark. Question types include open-ended (Open), multiple-choice (MCQ), and True/False (TF) questions. For clarity, we decompose the Cosmos-R1 benchmark into its constituent sub-datasets (AgiBot, BridgeDataV2, HoloAssist, RoboVQA, and RoboFail) to explicitly show the distribution of question types across these subsets.
Figure: Performance of state-of-the-art open-source and closed-source models, reporting both reasoning accuracy and final accuracy. Reasoning steps are scored thoroughly using our proposed evaluation criteria.
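As a rough illustration of how reasoning accuracy and final accuracy can be scored separately, the sketch below shows one possible form of the joint evaluation. The `judge` callable, its prompt, and the 0/1 per-step rubric are assumptions for illustration, not the exact criteria from the paper.

```python
# Minimal sketch of a joint evaluation: an LLM judge scores each predicted
# reasoning step against the gold rationale, while the final answer is
# checked separately. `judge` is a placeholder for an LLM-as-judge call
# that returns 0 or 1; the prompt and rubric are assumptions.
from statistics import mean

def evaluate_sample(judge, sample, prediction):
    # Score each predicted reasoning step (0 = incorrect, 1 = correct).
    step_scores = [
        judge(
            f"Gold rationale: {sample['reasoning_steps']}\n"
            f"Predicted step: {step}\n"
            "Is this step correct? Answer 0 or 1."
        )
        for step in prediction["reasoning_steps"]
    ]
    reasoning_accuracy = mean(step_scores) if step_scores else 0.0
    # Final accuracy: does the predicted action match the gold answer?
    final_accuracy = float(prediction["answer"] == sample["answer"])
    return reasoning_accuracy, final_accuracy
```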
Figure: Performance comparison of GPT-o4-mini and Cosmos-R1 across the different categories of our benchmark.
We evaluated our models using both GPT-4o and Qwen3-32B as judges; the two judges yield consistent scores in both final accuracy and reasoning accuracy.
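A minimal sketch of how agreement between the two judges can be quantified; the per-sample score lists below are purely illustrative placeholders, not results from the paper.

```python
# Hedged sketch: comparing per-sample scores from two judge models
# (e.g., GPT-4o vs. Qwen3-32B) via mean absolute disagreement.
def mean_absolute_gap(scores_a, scores_b):
    assert len(scores_a) == len(scores_b)
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)

gpt4o_scores = [1.0, 0.0, 1.0, 1.0]  # illustrative values only
qwen3_scores = [1.0, 0.0, 1.0, 0.0]  # illustrative values only
print(f"judge disagreement: {mean_absolute_gap(gpt4o_scores, qwen3_scores):.2f}")
```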
In this paper, we introduced the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the embodied reasoning capabilities of LLMs. The benchmark consists of over 1,000 samples with detailed reasoning steps spanning 10 diverse task categories. In addition, we proposed a new evaluation framework that jointly assesses both action validity and reasoning correctness. We analyzed the performance of nine state-of-the-art models, including both open-source and proprietary systems. Our results reveal significant limitations of current models on embodied reasoning tasks and underscore the importance of analyzing and evaluating reasoning traces to better understand model capabilities. As a result, FoMER could serve as a testbed to identify potentially unsafe or unreliable reasoning in LLMs and agentic models before real-world deployment.
For additional details about evaluation and experimental results, please refer to our main paper. Thank you!
@misc{dissanayake2025goodfoundationmodelsstepbystep,
  title={How Good are Foundation Models in Step-by-Step Embodied Reasoning?},
  author={Dinura Dissanayake and Ahmed Heakl and Omkar Thawakar and Noor Ahsan and Ritesh Thawkar and Ketan More and Jean Lahoud and Rao Anwer and Hisham Cholakkal and Ivan Laptev and Fahad Shahbaz Khan and Salman Khan},
  year={2025},
  eprint={2509.15293},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.15293},
}