
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir1*, Muhammad Akhtar Munir1*, Akshay Dudhane1*, Muhammad Umer Sheikh1, Muhammad Haris Khan1, Paolo Fraccaro2, Juan Bernabe Moreno2, Fahad S. Khan1,3, Salman Khan1,4

1Mohamed bin Zayed University of Artificial Intelligence, 2IBM Research, 3Linköping University, 4Australian National University

ThinkGeo is a specialized benchmark designed to evaluate how language model agents handle complex remote sensing tasks through structured tool use and step-by-step reasoning. It features human-curated queries grounded in satellite and aerial imagery across diverse real-world domains such as disaster response, urban planning, and environmental monitoring. Using a ReAct-style interaction loop, ThinkGeo tests both open- and closed-source LLMs on over 400 multi-step agentic tasks. The benchmark measures not only final answer correctness but also the accuracy and consistency of tool usage throughout the process. By focusing on spatially grounded, domain-specific challenges, ThinkGeo fills a critical gap left by general-purpose evaluation frameworks.
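For illustration, the sketch below shows what a ReAct-style interaction loop of this kind can look like in Python. The policy, tool names, and registry are stand-ins invented for this sketch, not ThinkGeo's actual tools or API: the agent alternates between emitting a thought, calling a tool, and reading the observation, until it issues a final answer.

    # Minimal sketch of a ReAct-style tool-use loop. The policy below is a stub
    # standing in for an LLM; tool names and signatures are hypothetical.

    def detect_objects(image_path, category):
        # Stand-in perception tool; a real tool would run a detector on the image.
        return [{"category": category, "bbox": [120, 48, 188, 96]}]

    def count(items):
        # Stand-in logic tool.
        return len(items)

    TOOLS = {"detect_objects": detect_objects, "count": count}

    def agent_step(query, history):
        # Placeholder policy: a real agent would query an LLM with the ReAct prompt.
        if not history:
            return "Locate the objects first.", ("detect_objects",
                   {"image_path": "scene.tif", "category": "airplane"})
        if len(history) == 1:
            return "Count the detections.", ("count", {"items": history[-1][1]})
        return "Done.", ("finish", {"answer": history[-1][1]})

    def run(query, max_steps=5):
        history = []
        for _ in range(max_steps):
            thought, (tool, args) = agent_step(query, history)
            if tool == "finish":
                return args["answer"]
            observation = TOOLS[tool](**args)            # execute the chosen tool
            history.append(((tool, args), observation))  # feed the result back
        return None

    print(run("How many airplanes are parked on the apron?"))  # -> 1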


News

[May-29-2025]: 📂 The ThinkGeo benchmark is released on HuggingFace.
[May-29-2025]: 📜 The ThinkGeo technical report is released on arXiv.

🏆 Contributions

  1. A dataset comprising 436 remote sensing tasks, paired with medium- to high-resolution Earth observation imagery across domains such as urban planning, disaster response, aviation, and environmental monitoring.

  2. A set of 14 executable tools that simulate real-world RS workflows, including modules for perception, computation, logic, and visual annotation.

  3. Two evaluation modes, step-by-step and end-to-end, with detailed metrics that assess instruction adherence, argument structure, reasoning steps, and final accuracy (a scoring sketch follows this list).

  4. Benchmarking advanced LLMs (GPT-4o, Claude-3, Qwen-2.5, LLaMA-3) reveals ongoing challenges in multimodal reasoning and tool integration.
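To make the step-by-step mode concrete, the sketch below scores a predicted tool chain against a reference chain. The record fields and exact-match rules here are assumptions made for this illustration, not the benchmark's precise scoring specification.

    # Illustrative step-wise scoring of a predicted tool chain against a reference.
    # The "tool"/"args" fields and exact-match criteria are assumptions of this sketch.

    def step_metrics(predicted, reference):
        n = min(len(predicted), len(reference))
        tool_hits = sum(predicted[i]["tool"] == reference[i]["tool"] for i in range(n))
        arg_hits = sum(predicted[i]["tool"] == reference[i]["tool"] and
                       predicted[i]["args"] == reference[i]["args"] for i in range(n))
        return {"tool_selection_acc": tool_hits / len(reference),
                "argument_acc": arg_hits / len(reference)}

    reference = [{"tool": "detect_objects", "args": {"category": "ship"}},
                 {"tool": "count", "args": {}}]
    predicted = [{"tool": "detect_objects", "args": {"category": "ship"}},
                 {"tool": "area", "args": {}}]
    print(step_metrics(predicted, reference))
    # {'tool_selection_acc': 0.5, 'argument_acc': 0.5}

End-to-end evaluation, by contrast, would let the agent run its own chain and compare only the final outcome, such as the answer and any produced image annotation, against the reference.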

ThinkGeo: Dataset Examples


The figure presents a set of representative samples from the ThinkGeo benchmark, a comprehensive evaluation framework for geospatial tasks. Each row showcases a complete interaction flow, beginning with a user query grounded in remote sensing (RS) imagery. Each example then follows a ReAct-based execution chain, an approach that interleaves reasoning and action through a combination of tool calls and logical steps. The execution chains involve the dynamic selection and use of different tools, depending on the demands of the specific query.

The data samples span a wide range of application domains, underscoring the benchmark's diversity: transportation analysis, urban planning, disaster assessment and change analysis, recreational infrastructure, and environmental monitoring. Across these domains, the examples highlight multi-tool reasoning and the spatial complexity of the tasks.
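For reference, a single sample of this kind could be represented roughly as follows; the schema, keys, and tool names below are illustrative only and do not reflect the released dataset's exact format.

    # Hypothetical representation of one ThinkGeo-style sample: a query grounded in an
    # RS image, a gold ReAct-style execution chain, and the final answer. The schema
    # and tool names are illustrative, not the released dataset format.

    sample = {
        "image": "harbor_scene.tif",
        "query": "How many ships are docked along the eastern pier?",
        "chain": [
            {"thought": "Locate all ships in the scene.",
             "tool": "detect_objects", "args": {"category": "ship"},
             "observation": "4 bounding boxes returned"},
            {"thought": "Keep only detections on the eastern pier and count them.",
             "tool": "count", "args": {"region": "eastern pier"},
             "observation": "3"},
        ],
        "answer": "3",
    }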

Results

Evaluation results across models on the ThinkGeo benchmark are summarized in the table. The left side presents step-by-step execution metrics, while the right side reports end-to-end performance. Metrics include tool-type accuracy—categorized by Perception (P), Operation (O), and Logic (L)—as well as final answer accuracy (Ans.) and answer accuracy with image grounding (Ans_I).
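As an informal illustration of how such a breakdown can be aggregated from per-task logs, the sketch below groups per-step tool correctness by tool type and computes final answer accuracy. The tool-to-type mapping and record layout are assumptions made for this example, not the benchmark's actual evaluation code.

    # Sketch: aggregate per-step tool accuracy by tool type (P/O/L) and final answer
    # accuracy. The tool-to-type mapping and record layout are assumed for this example.

    from collections import defaultdict

    TOOL_TYPE = {"detect_objects": "P", "segment": "P",      # Perception
                 "calculator": "O", "draw_box": "O",         # Operation
                 "count": "L", "compare": "L"}               # Logic

    def aggregate(records):
        correct, total, ans_correct = defaultdict(int), defaultdict(int), 0
        for rec in records:
            for step in rec["steps"]:
                t = TOOL_TYPE[step["gold_tool"]]
                total[t] += 1
                correct[t] += step["pred_tool"] == step["gold_tool"]
            ans_correct += rec["pred_answer"] == rec["gold_answer"]
        per_type = {t: correct[t] / total[t] for t in total}
        return per_type, ans_correct / len(records)

    records = [{"steps": [{"gold_tool": "detect_objects", "pred_tool": "detect_objects"},
                          {"gold_tool": "count", "pred_tool": "calculator"}],
                "pred_answer": "3", "gold_answer": "3"}]
    print(aggregate(records))  # ({'P': 1.0, 'L': 0.0}, 1.0)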

BibTeX


        @misc{shabbir2025thinkgeoevaluatingtoolaugmentedagents,
              title={ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks}, 
              author={Akashah Shabbir and Muhammad Akhtar Munir and Akshay Dudhane and Muhammad Umer Sheikh and Muhammad Haris Khan and Paolo Fraccaro and Juan Bernabe Moreno and Fahad Shahbaz Khan and Salman Khan},
              year={2025},
              eprint={2505.23752},
              archivePrefix={arXiv},
              primaryClass={cs.CV},
              url={https://arxiv.org/abs/2505.23752}, 
        }

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
