
GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Mohamed bin Zayed University of AI, Birla Institute of Technology & Science, Australian National University, Linköping University
*Equally contributing first authors

GeoChat is the first grounded Large Vision-Language Model, specifically tailored to Remote Sensing (RS) scenarios. Unlike general-domain models, GeoChat excels in handling high-resolution RS imagery, employing region-level reasoning for comprehensive scene interpretation. Leveraging a newly created RS multimodal dataset, GeoChat is fine-tuned from the LLaVA-1.5 architecture. This results in robust zero-shot performance across various RS tasks, including image and region captioning, visual question answering, scene classification, visually grounded conversations, and referring object detection.

🏆 Contributions

  1. RS multimodal instruction-following dataset. We present a novel data generation pipeline that leverages existing object detection datasets to create short descriptions of the images, and then uses Vicuna-v1.5 to create conversations from the generated text alone (see the sketch after this list). Further, we add visual question-answering and scene classification abilities using their corresponding datasets. This results in a total of 318k instruction pairs for the RS domain.

  2. GeoChat. Leveraging our dataset, we finetune LLaVA-1.5 to create the remote sensing-domain vision-language model, GeoChat. Our LoRA fine-tuning is efficient and avoids forgetting the necessary context embedded in the fully-tuned LLaVA model, whose MLP projection is trained to align images into the word embedding space of the LLM (Vicuna-v1.5). This allows GeoChat to retain the conversation and instruction-following abilities of LLaVA and extend its domain knowledge to remote sensing tasks.

  3. Evaluation Benchmark. We also address the lack of evaluation benchmarks to assess the capability of existing VLMs on remote sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, as well as a suite of tasks to allow comparisons with future efforts in this direction. We show various supervised as well as zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering and scene classification, to demonstrate the generalisability of the GeoChat conversational VLM.
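A minimal sketch of the detection-to-conversation idea from contribution 1, assuming a generic detection annotation format and hypothetical helper names (`detections_to_description`, `make_conversation_prompt`); the actual GeoChat pipeline and prompts may differ.

```python
# Hypothetical sketch of the data generation pipeline described above.
# `annotations` is assumed to be a list of dicts like
# {"category": "plane", "bbox": [x1, y1, x2, y2]} taken from an RS detection dataset.

from collections import Counter

def detections_to_description(annotations, image_meta):
    """Turn raw detections into a short textual scene description."""
    counts = Counter(a["category"] for a in annotations)
    parts = [f"{n} {cat}(s)" for cat, n in counts.items()]
    return (f"An aerial image ({image_meta.get('resolution', 'unknown')} resolution) "
            f"containing {', '.join(parts)}.")

def make_conversation_prompt(description):
    """Prompt an instruction-tuned text-only LLM (e.g. Vicuna-v1.5) to write a
    multi-turn QA conversation grounded only in the given description."""
    return (
        "You are given a description of a remote sensing image:\n"
        f"{description}\n"
        "Write a short multi-turn conversation between a user and an assistant "
        "about this image. Only use facts present in the description."
    )

# Toy usage:
anns = [{"category": "plane", "bbox": [10, 20, 60, 80]},
        {"category": "plane", "bbox": [110, 30, 170, 90]},
        {"category": "storage tank", "bbox": [200, 200, 260, 260]}]
prompt = make_conversation_prompt(detections_to_description(anns, {"resolution": "high"}))
# `prompt` would then be sent to the text-only LLM to produce an instruction pair.
```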

GeoChat: A unified framework

GeoChat can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.

GeoChat: Architecture

An overview of GeoChat, the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision tokens to the language space suitable for input to a Large Language Model (Vicuna-v1.5). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. GeoChat can perform multiple tasks as shown above, e.g., scene classification, image/region captioning, VQA and grounded conversations.
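A simplified PyTorch-style sketch of the data flow described above; module names, dimensions and the interpolation details are illustrative assumptions, not the released implementation. The key steps are resizing the vision encoder's positional embeddings for higher-resolution inputs, projecting vision tokens into the LLM embedding space with an MLP, and prepending them to the text embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionToLLMBridge(nn.Module):
    """Illustrative sketch: CLIP-style patch tokens -> MLP projector -> LLM inputs."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector in the style of LLaVA-1.5.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    @staticmethod
    def interpolate_pos_embed(pos_embed, new_grid):
        """Resize a (1, 1 + g*g, D) positional embedding to a new patch grid so the
        encoder can consume higher-resolution images than it was trained on."""
        cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
        g = int(patch_pos.shape[1] ** 0.5)
        patch_pos = patch_pos.reshape(1, g, g, -1).permute(0, 3, 1, 2)
        patch_pos = F.interpolate(patch_pos, size=new_grid, mode="bicubic", align_corners=False)
        patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], -1)
        return torch.cat([cls_tok, patch_pos], dim=1)

    def forward(self, patch_tokens, text_embeds):
        vision_embeds = self.projector(patch_tokens)            # (B, N, llm_dim)
        return torch.cat([vision_embeds, text_embeds], dim=1)   # sequence fed to the LLM
```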

RS Multimodal Instruction Dataset

Types of annotations available in the GeoChat instruction-set. For a given RS image, we obtain object attribute and relationship information, referring expressions and region captions along with their corresponding region annotations (shown over the image). This structured information is used to create the rich instruction-set with a total of 318k image-instruction pairs.
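For concreteness, a single image-instruction pair in such a dataset could look roughly like the Python dict below; the field names, the [refer] token usage and the textual box format are illustrative assumptions, not the released schema.

```python
# Hypothetical structure of one image-instruction pair; field names, coordinate
# normalization and the box syntax are assumptions for illustration only.
sample = {
    "image": "P0007.png",  # hypothetical filename
    "conversations": [
        {"from": "human",
         "value": "[refer] Where is the large white airplane parked near the terminal?"},
        {"from": "gpt",
         "value": "The airplane is at {<32><18><47><29>}."},  # textual box, format assumed
    ],
}
```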


Qualitative Results

Qualitative results of GeoChat. (left to right) Results are shown for grounding, referring object detection, and disaster/damage detection. The user can provide task-specific tokens (e.g., [grounding]) to shape model responses according to the desired behavior. The model can generate textual responses (right), visual grounding alone (center), or text and object groundings interleaved together (left). The model can also specify object types, counts, attributes and relationships.
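The task tokens can be thought of as simple prompt prefixes. A rough sketch follows; only [grounding] and [refer] appear on this page, and treating them as plain prefixes on the user query is an assumption about how the released model is prompted.

```python
# Rough sketch of building task-conditioned prompts with the task tokens mentioned
# above. The mapping below is an assumption for illustration, not the official API.

def build_prompt(task, query):
    prefixes = {
        "grounded_description": "[grounding]",  # text and boxes interleaved
        "referring_detection": "[refer]",       # box output for a referred object
        "plain": "",                            # free-form answer, no boxes
    }
    return f"{prefixes[task]} {query}".strip()

print(build_prompt("grounded_description", "Describe the image in detail."))
print(build_prompt("referring_detection", "Locate the gray truck on the highway."))
```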

Visual Question Answering (VQA)

GeoChat is able to hold multi-turn conversations based on various types of questions, including presence, counting, complex comparisons and so on. It is also able to detect objects and hold conversations on low-resolution images.

Qualitative results of GeoChat's performance in Visual Question Answering.
| Method | Presence | Comparison | Rural/Urban | Avg. Accuracy |
|---|---|---|---|---|
| LLaVA-1.5 | 55.46 | 68.20 | 59.00 | 62.77 |
| Qwen-VL-Chat | 38.57 | 67.59 | 61.00 | 55.35 |
| MiniGPTv2 | 55.16 | 55.22 | 39.00 | 54.96 |
| RSVQA | 87.47 | 81.50 | 90.00 | 86.32 |
| EasyToHard | 90.66 | 87.49 | 91.67 | 89.94 |
| Bi-Modal | 91.06 | 91.16 | 92.66 | 91.63 |
| SHRNet | 91.03 | 90.48 | 94.00 | 91.84 |
| RSGPT | 91.17 | 91.70 | 94.00 | 92.29 |
| GeoChat | 91.09 | 90.33 | 94.00 | 90.70 |
Comparisons with general zero-shot (top rows) and RS-VQA-specialized (middle rows) models on the RSVQA-LRBEN dataset for the VQA task. LLaVA-1.5, Qwen-VL-Chat and MiniGPTv2 are evaluated in the zero-shot setting. GeoChat outperforms the other zero-shot models and performs competitively with supervised SoTA models such as RSGPT, which are finetuned on the target dataset, whereas GeoChat is a generic model not finetuned on it.

Scene Classification

For scene classification, the model is presented with a list of dataset classes in the input prompt and tasked with selecting a single class from the provided options.
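A minimal sketch of this protocol; the exact prompt wording and answer-matching rule behind the reported numbers are not given here, so the details below are assumptions.

```python
# Illustrative zero-shot scene-classification protocol: list the classes in the
# prompt, ask for exactly one, and score by matching a class name in the reply.

def classification_prompt(class_names):
    options = ", ".join(class_names)
    return (f"Classify the image into one of the following classes: {options}. "
            "Answer with a single class name only.")

def matched_class(reply, class_names):
    reply = reply.lower()
    hits = [c for c in class_names if c.lower() in reply]
    return hits[0] if len(hits) == 1 else None  # ambiguous or empty replies count as wrong

def accuracy(replies, labels, class_names):
    correct = sum(matched_class(r, class_names) == y for r, y in zip(replies, labels))
    return correct / len(labels)
```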

Qualitative results of GeoChat's performance in scene classification.
| Model | UCMerced | AID |
|---|---|---|
| Qwen-VL | 62.90 | 52.60 |
| MiniGPTv2 | 4.76 | 12.90 |
| LLaVA-1.5 | 68.00 | 51.00 |
| GeoChat | 84.43 | 72.03 |
Zero-shot scene classification accuracy comparison on the AID and UCMerced datasets. Compared to other generic VLMs, GeoChat performs favorably.

Region-Level Captioning

Given a bounding box, GeoChat is able to provide a brief description of the area or object it covers.
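One way to phrase such a query is to serialize the box into the prompt text; the coordinate normalization, the textual box syntax and the [identify] token name below are assumptions for illustration and may differ from what the released model expects.

```python
# Hypothetical region-captioning query: the box is normalized to a 0-100 range and
# serialized into the prompt. The exact box syntax expected by GeoChat may differ.

def region_caption_prompt(box_xyxy, image_w, image_h):
    x1, y1, x2, y2 = box_xyxy
    norm = [round(100 * v / s) for v, s in zip((x1, y1, x2, y2),
                                               (image_w, image_h, image_w, image_h))]
    box_str = "{<" + "><".join(str(v) for v in norm) + ">}"
    return f"[identify] Give a brief description of the region {box_str}."

print(region_caption_prompt((120, 340, 400, 620), image_w=1024, image_h=1024))
```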

Qualitative results of GeoChat's performance in region-level captioning.
| Model | ROUGE-1 | ROUGE-L | METEOR |
|---|---|---|---|
| MiniGPTv2 | 32.1 | 31.2 | 10.0 |
| GeoChat | 87.3 | 87.2 | 83.9 |
Region level captioning performance.

Grounded Description

When asked to describe the image with the special token '[grounding]', GeoChat outputs both the description of the image and bounding boxes for all the detected objects.
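Using such responses downstream typically requires pulling the boxes back out of the generated text. A hedged regex sketch follows, assuming boxes appear as groups of angle-bracketed integers; the real output syntax may differ.

```python
import re

# Toy parser for grounded responses. It assumes boxes are emitted as runs of
# angle-bracketed integers, e.g. "{<32><18><47><29>}"; adjust the pattern to
# whatever syntax the model actually emits.
BOX_PATTERN = re.compile(r"\{(?:<\d+>)+\}")

def extract_boxes(text):
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        boxes.append([int(v) for v in re.findall(r"<(\d+)>", match.group())])
    clean_text = BOX_PATTERN.sub("", text).strip()
    return clean_text, boxes

text = "There are two planes {<10><20><35><40>} and {<60><15><80><35>} on the runway."
print(extract_boxes(text))
```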

Qualitative results of GeoChat's performance in grounded description.
| Model | acc@0.5 | acc@0.25 | METEOR |
|---|---|---|---|
| MiniGPTv2 | 10.8 | 30.9 | 16.4 |
| GeoChat | 11.7 | 33.9 | 48.9 |
Results on the grounded description task.

Referring Expression

When asked about an object via a referring expression, GeoChat is able to locate the object by predicting a bounding box around it.

Qualitative results of GeoChat's performance in referring expression.
| Model | Small | Medium | Large | Single-object grounding | Multi-object grounding | [refer] | [grounding] | Overall |
|---|---|---|---|---|---|---|---|---|
| MiniGPTv2 | 1.7 | 9.9 | 21.9 | 9.1 | 3.6 | 8.2 | 2.6 | 7.6 |
| GeoChat | 2.9 | 13.6 | 21.7 | 16.0 | 4.3 | 10.5 | 11.8 | 10.6 |
Performance (acc@0.5) comparison of GeoChat on our benchmark. Small, medium and large refer to object sizes based on bounding box area. Overall, GeoChat outperforms the baseline, but there is still significant room for improvement on this complex task.
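For reference, acc@0.5 counts a prediction as correct when its box overlaps the ground truth with IoU of at least 0.5. A minimal sketch for axis-aligned boxes is shown below; if rotated boxes are used, a rotated-IoU routine would be needed instead.

```python
# Minimal acc@threshold computation for axis-aligned boxes [x1, y1, x2, y2].

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def acc_at_threshold(preds, gts, thr=0.5):
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)

print(acc_at_threshold([[10, 10, 50, 50]], [[12, 8, 48, 52]]))  # IoU ~0.83 -> counted as a hit
```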

BibTeX


    @misc{kuckreja2023geochat,
        title={GeoChat: Grounded Large Vision-Language Model for Remote Sensing}, 
        author={Kartik Kuckreja and Muhammad Sohail Danish and Muzammal Naseer and 
            Abhijit Das and Salman Khan and Fahad Shahbaz Khan},
        year={2023},
        eprint={2311.15826},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }  
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA and Vicuna for releasing their models and code as open-source contributions.
