GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Akashah Shabbir¹, Mohammed Zumri¹, Mohammed Bennamoun², Fahad S. Khan¹,³, Salman Khan¹,⁴

¹Mohamed bin Zayed University of Artificial Intelligence, ²The University of Western Australia, ³Linköping University, ⁴Australian National University

GeoPixel is the first end-to-end high-resolution Remote Sensing Large Multimodal Model that supports pixel-level grounding. It facilitates detailed visual analysis by seamlessly integrating interleaved mask generation into conversations. With support for resolutions up to 4K HD in any aspect ratio, GeoPixel is ideal for precise remote sensing image analysis. To power grounded conversation generation (GCG) in remote sensing imagery, we also introduce GeoPixelD, a curated dataset created through an automated pipeline that leverages set-of-marks prompting and spatial priors tailored to RS data, ensuring systematic and controlled data generation. GeoPixel sets a new benchmark in pixel-level understanding, outperforming existing multimodal models in both single-target and multi-target segmentation tasks.

🏆 Contributions

  1. GeoPixel: A specialized LMM for high-resolution remote sensing image analysis. We introduce GeoPixel, a Large Multimodal Model (LMM) designed for analyzing high-resolution remote sensing (RS) images with multi-target pixel grounding. The model dynamically partitions input images into local and global regions, which keeps encoding efficient while supporting resolutions of up to 4K.

  2. GeoPixelD: A multimodal dataset for grounded conversations. We created GeoPixelD, a multimodal grounded conversation dataset for RS image analysis. It includes 53,816 grounded phrases associated with 600,817 objects, with captions averaging 740 characters. Its hierarchically structured annotations combine detailed scene-level context with precise, localized object details. The dataset is built using a scalable, semi-automated pipeline that leverages prior-informed visual prompting and state-of-the-art LMMs. Verification and filtering steps ensure high-quality, accurately grounded descriptions for individual objects and aggregated groups.

  3. Comprehensive evaluation benchmark for RS LMMs. We introduce a detailed benchmark specifically designed to evaluate remote sensing LMMs in fine-grained visual understanding and generation tasks. This benchmark includes 5,427 manually validated pairs of referring expressions and segmentation masks, representing 61,384 annotated objects in remote sensing images. With descriptions averaging 647 characters in length, it provides a reliable platform to measure the model's ability to interpret and generate responses to complex, spatially grounded information.

GeoPixel: Architecture

The model comprises five key components: an adaptive image divider, a vision encoder, a large language model (LLM), a grounding vision encoder, and a pixel decoder. The adaptive image divider partitions input images into local and global regions, allowing efficient encoding at resolutions up to 4K. The vision encoder (scaled CLIP ViT-L/14) processes patches and global views, combining features for integration with the InternLM2 LLM, which aligns visual and textual modalities using Partial LoRA (Low-Rank Adaptation). For pixel-level grounding, GeoPixel incorporates SAM-2, a hierarchical masked autoencoder, and a lightweight pixel decoder to generate accurate segmentation masks. By leveraging pretrained encoders and modality-specific adaptations, GeoPixel achieves robust cross-modal alignment and enhanced performance for RS tasks, enabling precise segmentation and interpretation of high-resolution imagery.
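
The adaptive image divider can be illustrated with a short sketch. The code below is not GeoPixel's implementation: it shows one plausible strategy, assuming a fixed tile size and a cap on the number of local tiles. The names `plan_tiles` and `divide`, the 560-pixel tile, and the 16-tile cap are all hypothetical choices made for illustration.

```python
import math


def plan_tiles(width, height, tile=560, max_tiles=16):
    """Pick a rows x cols grid whose aspect ratio best matches the image.

    `tile` and `max_tiles` are illustrative values, not GeoPixel's settings.
    """
    target = width / height
    # Never use more tiles along an axis than the resolution warrants.
    max_cols = max(1, math.ceil(width / tile))
    max_rows = max(1, math.ceil(height / tile))
    best = None
    for rows in range(1, max_rows + 1):
        for cols in range(1, max_cols + 1):
            if rows * cols > max_tiles:
                continue
            # Log scale makes 2:1 and 1:2 equally far from 1:1.
            err = abs(math.log((cols / rows) / target))
            # Closest ratio wins; ties go to the grid with more tiles.
            key = (err, -(rows * cols))
            if best is None or key < best[0]:
                best = (key, rows, cols)
    return best[1], best[2]


def divide(width, height, tile=560, max_tiles=16):
    """Return the resized canvas, local tile boxes, and global view size."""
    rows, cols = plan_tiles(width, height, tile, max_tiles)
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return {
        "resized": (cols * tile, rows * tile),  # image is resized to this canvas
        "local": boxes,                         # crops fed to the vision encoder
        "global": (tile, tile),                 # whole image downsampled to one tile
    }


layout = divide(2240, 1120)
print(layout["resized"], len(layout["local"]))
```

Under these assumptions a 2240 x 1120 image maps to a 2 x 4 grid of local tiles plus one global view, so the encoder sees both fine detail and overall scene context.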

GeoPixelD Annotation Pipeline

The GeoPixelD annotation pipeline produces detailed multi-tier descriptions of remote sensing imagery, with object phrases aligned precisely to manually annotated masks. It begins with Holistic Image Annotation (bottom left), where an LMM generates concise scene descriptions. Individual Instance Annotation (bottom right) uses spatial ({pos}) and categorical ({category name}) priors with set-of-marks (SoM) prompting ({mark number}) to describe key objects. Cluster Annotation (top right) organizes smaller or dense objects using refined grids for precise spatial analysis.
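
As a rough illustration of how the spatial, categorical, and mark priors might be combined into an instance-level prompt, consider the sketch below. The template wording, the 3 x 3 position grid, and the function names are assumptions made for this example; the actual prompts used to build GeoPixelD are not given here.

```python
# Hypothetical prompt template; the real GeoPixelD wording may differ.
TEMPLATE = (
    "Describe the {category} labeled with mark {mark} "
    "located at the {pos} of the image."
)


def spatial_prior(box, width, height):
    """Map a box centre to a coarse 3x3 position label such as 'top left'."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Clamp to 2 so a centre on the right or bottom edge stays in the grid.
    col = min(int(3 * cx / width), 2)
    row = min(int(3 * cy / height), 2)
    if row == 1 and col == 1:
        return "center"
    vertical = ("top", "middle", "bottom")[row]
    horizontal = ("left", "center", "right")[col]
    return f"{vertical} {horizontal}"


def build_prompt(mark, category, box, width, height):
    """Fill the template with the mark number, category, and position prior."""
    pos = spatial_prior(box, width, height)
    return TEMPLATE.format(category=category, mark=mark, pos=pos)


print(build_prompt(3, "airplane", (10, 10, 110, 90), 900, 900))
```

Deriving the position from the annotation rather than asking the LMM to infer it is what makes it a prior: the model only has to describe the object, not locate it.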

Remote Sensing Grounded Conversation Generation (RS-GCG)

GeoPixel answers user queries with comprehensive descriptions while grounding the objects it mentions through interleaved pixel-level masks, demonstrating precise interpretation of high-resolution remote sensing imagery.
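
To make "interleaved" concrete, the sketch below pairs each grounded phrase in a response with the index of the mask it refers to. It assumes a response format in which a phrase is wrapped in `<p>...</p>` and followed by a `[SEG]` token, a convention used by some grounded conversation models; whether GeoPixel emits exactly this markup is an assumption, and `parse_grounded_response` is a hypothetical helper.

```python
import re

# Assumed markup: "<p>phrase</p> [SEG]"; each [SEG] stands for one mask.
PHRASE_SEG = re.compile(r"<p>\s*(.*?)\s*</p>\s*\[SEG\]")


def parse_grounded_response(text):
    """Return the plain caption and (mask_index, phrase) pairs in order."""
    phrases = [m.group(1) for m in PHRASE_SEG.finditer(text)]
    # Strip the markup but keep the phrase so the caption stays readable.
    plain = PHRASE_SEG.sub(lambda m: m.group(1), text)
    return plain, list(enumerate(phrases))


response = (
    "The image shows <p>two airplanes</p> [SEG] parked near "
    "<p>a terminal building</p> [SEG]."
)
caption, grounded = parse_grounded_response(response)
print(caption)
print(grounded)
```

The i-th phrase then corresponds to the i-th mask produced by the pixel decoder, which is how a caption and its masks can be displayed together.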

Performance comparison on the RS-GCG task. LISA† and PixelLM† denote the pretrained LISA and PixelLM models adapted for RS-GCG and fine-tuned on GeoPixelD training data. GLaMM denotes zero-shot performance, whereas GLaMM-FT is the pretrained model fine-tuned on GeoPixelD. GeoPixel outperforms the other models on every metric.

| Model | CIDEr | METEOR | Uni-Target AP50 | Uni-Target mIoU | Uni-Target Recall | Multi-Target AP50 | Multi-Target mIoU | Multi-Target Recall | Overall AP50 | Overall mIoU | Overall Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GLaMM | 0.1 | 5.8 | 1.2 | 18.1 | 14.8 | 0.5 | 16.5 | 6.3 | 0.5 | 16.9 | 7.1 |
| LISA† | 14.6 | 22.3 | 9.5 | 41.7 | 43.1 | 8.3 | 43.1 | 27.5 | 8.5 | 42.7 | 29.0 |
| PixelLM† | 18.3 | 22.5 | 13.5 | 41.2 | 44.0 | 10.4 | 42.9 | 28.1 | 10.5 | 42.4 | 29.6 |
| GLaMM-FT | 15.7 | 23.0 | 18.8 | 44.4 | 48.5 | 12.4 | 47.1 | 31.1 | 12.5 | 46.4 | 32.8 |
| GeoPixel | 21.6 | 24.0 | 25.5 | 50.8 | 55.6 | 18.0 | 52.9 | 37.0 | 19.0 | 52.3 | 38.8 |
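
For readers unfamiliar with mask-level recall, the sketch below shows a simplified version: each ground-truth mask is matched to at most one predicted mask, and a match counts when their IoU reaches a threshold. This is an illustration only. The benchmark's actual matching procedure may also consider the associated text, and the set-of-pixels mask representation and the function names are assumptions.

```python
def iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_recall(pred_masks, gt_masks, threshold=0.5):
    """Fraction of ground-truth masks matched by an unused prediction."""
    unused = list(range(len(pred_masks)))
    matched = 0
    for gt in gt_masks:
        # Greedy choice: the unused prediction that overlaps this target most.
        best = max(unused, key=lambda i: iou(pred_masks[i], gt), default=None)
        if best is not None and iou(pred_masks[best], gt) >= threshold:
            unused.remove(best)  # each prediction can match only once
            matched += 1
    return matched / len(gt_masks) if gt_masks else 0.0


gt = [{(0, 0), (0, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
print(mask_recall(pred, gt))
```

In the multi-target setting a response must ground several objects at once, so missing any one of them lowers recall even when the others are segmented well.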

Referring Remote Sensing Image Segmentation (RRSIS)

GeoPixel interprets referring expressions of varying complexity and length and generates accurate segmentation masks for them.

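| Method | Validation Set P@0.5 | Validation Set oIoU | Validation Set mIoU | Test Set P@0.5 | Test Set oIoU | Test Set mIoU |
|---|---|---|---|---|---|---|
| RRN | 51.09 | 66.53 | 46.06 | 51.07 | 66.43 | 45.64 |
| CSMA | 55.68 | 69.68 | 48.85 | 55.32 | 69.39 | 48.54 |
| LSCM | 57.12 | 69.28 | 50.36 | 56.02 | 69.05 | 49.92 |
| CMPC | 57.93 | 70.15 | 50.41 | 55.83 | 69.22 | 49.24 |
| BRINet | 58.79 | 70.73 | 51.14 | 56.90 | 69.88 | 49.65 |
| CMPC+ | 59.19 | 70.14 | 51.41 | 57.65 | 68.64 | 50.24 |
| LGCE | 68.10 | 76.68 | 60.16 | 67.65 | 76.34 | 59.37 |
| LAVT | 69.54 | 77.59 | 61.46 | 69.52 | 77.19 | 61.04 |
| RMSIN | 74.66 | 78.27 | 65.10 | 74.26 | 77.79 | 64.20 |
| GeoPixel | 80.00 | 81.77 | 67.99 | 83.33 | 84.90 | 67.30 |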
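
The three column metrics follow their usual definitions in referring segmentation: P@0.5 is the share of samples whose IoU reaches 0.5, oIoU divides the total intersection by the total union over all samples, and mIoU averages the per-sample IoU. The sketch below computes them for binary masks stored as nested lists. The function names are hypothetical, and the use of `>=` at the threshold is an assumption, since evaluation scripts differ on whether the comparison is strict.

```python
def intersection_union(pred, gt):
    """Count intersecting and united pixels of two same-shape binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def rrsis_metrics(pairs, threshold=0.5):
    """Compute P@threshold, oIoU and mIoU (in percent) over (pred, gt) pairs."""
    stats = [intersection_union(pred, gt) for pred, gt in pairs]
    ious = [i / u if u else 0.0 for i, u in stats]
    total_union = sum(u for _, u in stats)
    return {
        # Share of samples whose own IoU clears the threshold.
        f"P@{threshold}": 100 * sum(v >= threshold for v in ious) / len(ious),
        # Pixel-weighted: large objects dominate this score.
        "oIoU": 100 * sum(i for i, _ in stats) / total_union if total_union else 0.0,
        # Sample-weighted: every expression counts equally.
        "mIoU": 100 * sum(ious) / len(ious),
    }


pairs = [
    ([[1, 1], [0, 0]], [[1, 1], [0, 0]]),  # perfect prediction, IoU = 1.0
    ([[1, 0], [0, 0]], [[1, 1], [1, 1]]),  # partial prediction, IoU = 0.25
]
print(rrsis_metrics(pairs))
```

Because oIoU is weighted by pixel count and mIoU by sample count, the two can diverge when object sizes vary widely, which is common in remote sensing imagery.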

BibTeX


        @article{shabbir2025geopixel,
          title={GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing},
          author={Akashah Shabbir and Mohammed Zumri and Mohammed Bennamoun and Fahad S. Khan and Salman Khan},
          journal={ArXiv},
          year={2025},
          url={https://arxiv.org/abs/2501.13925}
        }

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
