GLaMM: Pixel Grounding Large Multimodal Model

Mohamed bin Zayed University of AI, Australian National University, Aalto University,
Carnegie Mellon University, University of California - Merced, Linköping University, Google Research
*Equally contributing first authors

Grounding Large Multimodal Model (GLaMM) is an end-to-end trained LMM that provides visual grounding capabilities with the flexibility to process both image and region inputs. This enables the new unified task of Grounded Conversation Generation, which combines phrase grounding, referring expression segmentation and vision-language conversations. Equipped with detailed region understanding, pixel-level grounding, and conversational abilities, GLaMM can interact with user-provided visual inputs at multiple levels of granularity (objects, object parts, attributes, relationships and holistic scene understanding).


  1. GLaMM Introduction. We present the Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks. Unlike existing models, GLaMM accommodates both textual and optional visual prompts, facilitating enhanced multimodal user interaction.

  2. Novel Task & Evaluation. Recognizing the lack of standardized benchmarks for visually grounded conversations, we propose a new task of Grounded Conversation Generation (GCG). Alongside, we introduce a comprehensive evaluation protocol to measure the efficacy of models in this novel setting, filling a significant gap in the literature.

  3. GranD Dataset Creation. To facilitate model training and evaluation, we create GranD - Grounding-anything Dataset, a large-scale densely annotated dataset. Developed using an automatic annotation pipeline and verification criteria, it encompasses 7.5M unique concepts grounded in 810M regions. Additionally, we produce GranD-f, a high-quality dataset explicitly designed for the GCG task, by re-purposing existing open-source datasets.

GLaMM: Grounding Large Multimodal Model

GLaMM consists of five core components to achieve visually grounded conversations: i) Global Image Encoder, ii) Region Encoder, iii) LLM, iv) Grounding Image Encoder, and v) Pixel Decoder. These components are cohesively designed to handle both textual and optional visual prompts (image level and region of interest), allowing for interaction at multiple levels of granularity, and generating grounded text responses.
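To make the data flow concrete, here is a minimal runnable sketch of how the five components hand off to one another. Every function body below is a hypothetical placeholder (the released model uses learned encoders and a LLaMA-family LLM); only the wiring reflects the description above: global and region features feed the LLM, and segmentation-token embeddings feed the pixel decoder.

```python
import numpy as np

# Toy stand-ins for GLaMM's five components. These are illustrative
# placeholders, not the released implementation.
def global_image_encoder(image):
    # Scene-level features (here: a trivial global average).
    return image.mean(axis=(0, 1))

def region_encoder(image, box):
    # Features for one user-specified region of interest.
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1].mean(axis=(0, 1))

def llm(prompt, img_feats, reg_feats):
    # Returns grounded text plus one embedding per [SEG] token
    # (the token convention is an assumption borrowed from
    # LISA-style grounding models).
    return "<p>a cat</p> [SEG] sitting on a mat.", [np.ones(8)]

def grounding_image_encoder(image):
    # High-resolution features for mask prediction.
    return image

def pixel_decoder(feats, seg_embeddings):
    # One binary mask per [SEG] embedding.
    return [np.zeros(feats.shape[:2], dtype=bool) for _ in seg_embeddings]

def glamm_forward(image, prompt, regions=()):
    img_feats = global_image_encoder(image)
    reg_feats = [region_encoder(image, r) for r in regions]
    text, seg_emb = llm(prompt, img_feats, reg_feats)
    masks = pixel_decoder(grounding_image_encoder(image), seg_emb)
    return text, masks

img = np.random.rand(32, 32, 3)
text, masks = glamm_forward(img, "Describe the image.", regions=[(2, 2, 10, 10)])
```

The point of the sketch is the interface: optional region boxes enter alongside the image, and the number of output masks is determined by the number of segmentation tokens the LLM emits.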

The figure illustrates our model architecture showcasing its ability to offer scene-level understanding, region-level interpretation, and pixel-level grounding. The bottom row shows the diverse downstream applications of GLaMM, including referring expression segmentation, region-level captioning, image-level captioning and phrase grounding.

Grounding-anything Dataset (GranD)

Detailed region-level understanding requires the laborious process of collecting large-scale annotations for image regions. To alleviate the manual labelling effort, we propose an automated pipeline to annotate the large-scale Grounding-anything Dataset. Leveraging the automated pipeline with dedicated verification steps, GranD comprises 7.5M unique concepts anchored in a total of 810M regions, each with a segmentation mask.

The figure shows the annotation pipeline of the GranD dataset. Level-1 details objects and attributes; Level-2 adds short captions and relational markers; Level-3 builds a scene graph, hierarchically organizing information from the earlier levels so an LLM can produce grounded dense captions; Level-4 provides additional historical and societal context for a richer visual understanding.
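As a concrete illustration, one image's annotation might be organized across the four pipeline levels roughly as follows. The field names and values here are hypothetical, chosen only to mirror the level structure described above; they are not the dataset's actual schema.

```python
# Hypothetical sketch of one GranD image annotation, organized by
# pipeline level (illustrative field names, not the released schema).
annotation = {
    "level_1": {  # objects with attributes and segmentation masks
        "objects": [
            {"id": 0, "label": "horse", "attributes": ["brown"], "mask": "RLE"},
            {"id": 1, "label": "beach", "attributes": ["sandy"], "mask": "RLE"},
        ]
    },
    "level_2": {  # short captions and inter-object relations
        "captions": ["a brown horse standing on a sandy beach"],
        "relations": [{"subject": 0, "predicate": "standing on", "object": 1}],
    },
    "level_3": {  # scene graph feeds an LLM for a grounded dense caption
        "dense_caption": "A brown horse stands on a sandy beach.",
    },
    "level_4": {  # additional historical and societal context
        "context": "Beaches such as this one are popular for horse riding.",
    },
}
```

The key structural property is that higher levels reference lower ones: relations and dense captions point back to object ids from Level-1, which is what makes the final captions groundable to masks.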

Below we present some examples of the GranD dataset. Our automated annotation pipeline provides multiple semantic tags and attributes for objects along with segmentation masks. The dense caption thoroughly describes the visual scene with part of the text grounded to the corresponding objects. The additional context provides a deeper understanding of the scene, going beyond what's observed.

Dataset samples from GranD dataset.

Building GranD-f for Grounded Conversation Generation

Motivated by the need for higher-quality data in the fine-tuning stage, we introduce GranD-f. Explicitly designed for the GCG task, this dataset encompasses approximately 214K image-grounded text pairs, of which 2.6K samples are reserved for validation and 5K for testing. GranD-f comprises two primary components: one subset is manually annotated, and the other is derived by re-purposing existing open-source datasets.

Grounded Conversation Generation (GCG)

The objective of the GCG task is to construct image-level captions with specific phrases directly tied to corresponding segmentation masks in the image. By introducing the GCG task, we bridge the gap between textual and visual understanding, thereby enhancing the model’s ability for fine-grained visual grounding alongside natural language captioning.
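Assuming a LISA-style output convention, in which each grounded phrase is wrapped in `<p>…</p>` tags followed by a `[SEG]` token and the model emits one mask per `[SEG]` token in order (the exact token format is an assumption here), a GCG response can be parsed into phrase-mask pairs like this:

```python
import re

def parse_grounded_caption(text, masks):
    """Pair each tagged phrase with its segmentation mask.

    Assumes the tagging scheme '<p>phrase</p> [SEG]' and one
    predicted mask per [SEG] token, in emission order.
    """
    phrases = re.findall(r"<p>(.*?)</p>\s*\[SEG\]", text)
    assert len(phrases) == len(masks), "expected one mask per [SEG] token"
    return list(zip(phrases, masks))

caption = "<p>A man</p> [SEG] rides <p>a horse</p> [SEG] on the beach."
pairs = parse_grounded_caption(caption, ["mask_0", "mask_1"])
```

This pairing step is what the GCG evaluation operates on: the caption text is scored as a caption, while each phrase-mask pair is scored against the ground-truth grounding.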

Qualitative results of GLaMM's performance in grounded conversation generation.
Comparison of GLaMM Model Performance on GCG Task: Metrics include METEOR, CIDEr, AP, mIoU, and Mask Recall for both validation and test sets in our proposed benchmark. LISA* indicates a modified LISA adapted for GCG.
Model       Validation Set                       Test Set
            METEOR  CIDEr  AP    mIoU  MaskRec   METEOR  CIDEr  AP    mIoU  MaskRec
BuboGPT     17.2    3.5    19.6  53.5  30.3      17.1    3.4    17.5  53.8  27.4
Kosmos-2    16.2    27.1   17.8  55.4  28.7      15.9    26.8   17.4  56.8  29.1
LISA*       13.3    35.6   26.2  62.1  37.6      13.1    33.0   25.1  61.6  36.1
GLaMM       13.7    35.7   27.3  62.0  38.7      13.3    34.8   26.0  62.0  36.8
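For intuition, the segmentation side of these metrics can be sketched as follows: a per-pair IoU underlying mIoU, and a greedy IoU-thresholded matching for mask recall. The benchmark's exact matching rule (which also factors in caption similarity) may differ; this is an illustrative sketch only.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def mask_recall(pred_masks, gt_masks, thr=0.5):
    """Fraction of ground-truth masks matched by some prediction with
    IoU >= thr, using greedy one-to-one matching (an approximation of
    the benchmark's matching rule)."""
    used, hits = set(), 0
    for g in gt_masks:
        best, best_iou = None, 0.0
        for i, p in enumerate(pred_masks):
            if i in used:
                continue
            iou = mask_iou(p, g)
            if iou > best_iou:
                best, best_iou = i, iou
        if best is not None and best_iou >= thr:
            used.add(best)
            hits += 1
    return hits / len(gt_masks) if gt_masks else 0.0
```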

Downstream Applications

Referring Expression Segmentation

In this task, the model receives an image along with a text-based referring expression and outputs the corresponding segmentation mask. We present quantitative results on the validation and test sets of refCOCO, refCOCO+, and refCOCOg.

Qualitative results of GLaMM's capability in referring expression segmentation.
Quantitative Assessment of GLaMM in Referring-Expression Segmentation: Performance across refCOCO, refCOCO+, and refCOCOg in generating accurate segmentation masks based on text-based referring expressions surpasses that of closely related work.
Method                     refCOCO               refCOCO+              refCOCOg
                           val   testA  testB    val   testA  testB    val(U)  test(U)
CRIS (CVPR-22)             70.5  73.2   66.1     65.3  68.1   53.7     59.9    60.4
LAVT (CVPR-22)             72.7  75.8   68.8     62.1  68.4   55.1     61.2    62.1
GRES (CVPR-23)             73.8  76.5   70.2     66.0  71.0   57.7     65.0    66.0
X-Decoder (CVPR-23)        -     -      -        -     -      -        64.6    -
SEEM (arXiv-23)            -     -      -        -     -      -        65.7    -
LISA-7B (ZS) (arXiv-23)    74.1  76.5   71.1     62.4  67.4   56.5     66.4    68.4
LISA-7B (FT) (arXiv-23)    74.9  79.1   72.3     65.1  70.8   58.1     67.9    70.6
GLaMM (ZS)                 54.7  58.1   52.2     42.5  47.1   39.5     54.8    55.6
GLaMM (FT)                 78.3  81.5   74.4     68.0  75.7   61.8     72.5    72.0
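The refCOCO family of benchmarks is commonly scored with cumulative IoU (cIoU), i.e. total intersection over total union accumulated across a whole split; assuming that is the metric reported above, it can be computed as:

```python
import numpy as np

def cumulative_iou(preds, gts):
    """Cumulative IoU over a dataset split: sum all intersections and
    all unions before dividing, so large masks weigh more than small
    ones (unlike a per-sample mean IoU)."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union if union else 0.0
```

Note the design difference from mean IoU: because numerators and denominators are pooled, a single large, well-segmented object can dominate many small failures, which is why the two metrics can rank models differently.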

Region-Level Captioning

In this task, the goal is to generate referring expressions, or region-specific captions. The model receives an image, a designated region, and accompanying text, and is tasked with answering questions about the specified region. We conduct a quantitative assessment of region-level captioning on two open-source datasets: Visual Genome and refCOCOg. The qualitative results shown below demonstrate GLaMM's ability to generate region-specific captions, translating the intricate details of designated regions into coherent textual descriptions, enriched by its training on the comprehensive GranD dataset. This capability, combined with the inherent reasoning abilities of LLMs, enables it to tackle reasoning-based visual questions about these regions.

Qualitative illustration of GLaMM's performance in region-level captioning.
Performance of GLaMM in Region-Level Captioning: Metrics include METEOR and CIDEr scores, assessed on Visual Genome and refCOCOg Datasets, exhibiting competitive results.
Model         refCOCOg           Visual Genome
              METEOR  CIDEr      METEOR  CIDEr
GRIT          15.2    71.6       17.1    142
Kosmos-2      14.1    62.3       -       -
GPT4RoI       -       -          17.4    145.2
GLaMM (ZS)    15.7    104.0      17.0    127.0
GLaMM (FT)    16.2    105.0      18.6    157.8

Image Captioning

GLaMM offers favourable performance when compared with recent models specialized in image captioning, as well as other LMMs. Qualitative results for image captioning are shown below.

Qualitative results of GLaMM on image-level captioning tasks.

Conversational Style Question Answering

We evaluate our model on conversational-style question answering. The qualitative results shown below showcase GLaMM engaging in multi-turn dialogues, providing detailed descriptions, addressing region-specific inquiries, and presenting grounded conversations. This highlights its adaptability in intricate visual-language interactions while robustly retaining the reasoning capabilities inherent to LLMs.

Multimodal conversational interactions facilitated by GLaMM.


@article{hanoona2023GLaMM,
          title={GLaMM: Pixel Grounding Large Multimodal Model},
          author={Hanoona Rasheed and Muhammad Maaz and Sahal Shaji and Abdelrahman Shaker and Salman Khan and Hisham Cholakkal and Rao M. Anwer and Eric Xing and Ming-Hsuan Yang and Fahad S. Khan},
          publisher={ArXiv 2311.03356}
}


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA, GPT4ROI, and LISA for releasing their models and code as open-source contributions.
