VideoGLaMM

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Mohamed bin Zayed University of Artificial Intelligence, Tianjin University,
Linköping University, Australian National University, Carnegie Mellon University

VideoGLaMM is a large multimodal video model capable of pixel-level visual grounding. The model responds to natural language queries from the user and intertwines spatio-temporal object masks in its generated textual responses to provide a detailed understanding of video content. VideoGLaMM seamlessly connects three key components: a Large Language Model (LLM), dual vision encoders, and a spatio-temporal pixel decoder. The dual vision encoders extract spatial and temporal features separately, which are jointly passed to the LLM to output responses rich in both spatial and temporal cues. This is facilitated by end-to-end training on our proposed benchmark Grounded Conversation Generation (GCG) dataset, featuring 38k video-QA triplets with 83k objects and 671k fine-grained masks.

🔥Highlights

The key contributions of this work are:
  1. We introduce the Video Grounded Large Multimodal Model (VideoGLaMM), a large multimodal video model capable of pixel-level visual grounding, featuring an end-to-end alignment mechanism.

  2. To achieve fine-grained spatio-temporal alignment, we introduce a benchmark Grounded Conversation Generation (GCG) dataset consisting of 38k grounded video-QA triplets, 83k objects, and roughly 671k fine-grained spatio-temporal masks.

  3. We evaluate VideoGLaMM across diverse tasks spanning grounded conversation generation, visual grounding, and referring video segmentation, where it achieves state-of-the-art performance.

VideoGLaMM: Architecture

VideoGLaMM consists of the following key components: (i) Spatio-Temporal Dual Encoder, (ii) Dual Alignment V-L Adapters for image and video features, (iii) Large Language Model (LLM), (iv) L-V Adapter, and (v) Promptable Pixel Decoder.

VideoGLaMM consists of a dual spatio-temporal encoder for encoding image- and video-level features. The spatial features capture local information, while the temporal features capture global information. The spatial and temporal tokens are passed through V-L adapters and concatenated with the text tokens before being fed to the LLM. An L-V projector is employed to align the LLM's response with the visual space of the pixel decoder. Finally, the aligned LLM features, along with the frame features obtained from a frame encoder, are jointly passed to a promptable pixel decoder to obtain the fine-grained object masks corresponding to the LLM response.
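A minimal PyTorch-style sketch of this forward pass is given below. All module names, signatures, and shapes are illustrative assumptions made for clarity; they are not the released implementation.

import torch
import torch.nn as nn

class VideoGLaMMSketch(nn.Module):
    # Hypothetical wrapper showing how the components described above connect.
    def __init__(self, spatial_encoder, temporal_encoder, vl_adapter_spatial,
                 vl_adapter_temporal, llm, lv_projector, frame_encoder, pixel_decoder):
        super().__init__()
        self.spatial_encoder = spatial_encoder          # image-level (local) features
        self.temporal_encoder = temporal_encoder        # video-level (global) features
        self.vl_adapter_spatial = vl_adapter_spatial    # V-L adapter for spatial tokens
        self.vl_adapter_temporal = vl_adapter_temporal  # V-L adapter for temporal tokens
        self.llm = llm                                  # large language model
        self.lv_projector = lv_projector                # L-V adapter into the decoder's visual space
        self.frame_encoder = frame_encoder              # frame features for the pixel decoder
        self.pixel_decoder = pixel_decoder              # promptable, outputs object masks

    def forward(self, frames, text_tokens):
        # Dual spatio-temporal encoding followed by V-L adaptation
        spatial_tokens = self.vl_adapter_spatial(self.spatial_encoder(frames))
        temporal_tokens = self.vl_adapter_temporal(self.temporal_encoder(frames))

        # Concatenate visual and text tokens and run the LLM
        llm_inputs = torch.cat([spatial_tokens, temporal_tokens, text_tokens], dim=1)
        llm_hidden = self.llm(inputs_embeds=llm_inputs).last_hidden_state

        # Project the LLM's grounding-related hidden states into the decoder's visual space
        mask_queries = self.lv_projector(llm_hidden)

        # Promptable pixel decoder yields fine-grained spatio-temporal masks
        frame_feats = self.frame_encoder(frames)
        masks = self.pixel_decoder(frame_feats, mask_queries)
        return llm_hidden, masks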

VideoGLaMM: Benchmark and Annotation Pipeline

Proposed Semi-automatic Annotation Pipeline. Our dataset for Grounded Conversation Generation (GCG) is built from three types of video datasets:
  i) Videos with masks only: Object patches are extracted from video frames using the masks and processed by the Gemini model for initial object descriptions, which are then refined to produce detailed object captions. These refined captions and masks are fed back to the Gemini model to create dense grounded captions.
  ii) Videos with bounding-box annotations and captions: Frames are first processed with a Video-LMM to generate a comprehensive caption, which is combined with the original caption and fed to GPT-4o to obtain dense grounded captions. Masks are generated by passing the frames and ground-truth bounding boxes to the SAM model.
  iii) Videos with object bounding boxes and referring expressions: Frames, bounding boxes, and referring expressions are input to GPT-4o to produce dense grounded captions, while masks are generated by feeding the frames and bounding boxes to the SAM model.
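The sketch below illustrates the three branches of this pipeline. The helper functions (crop_with_mask, refine_caption, gemini_describe, gemini_ground, video_lmm_caption, gpt4o_ground, sam_segment) are hypothetical wrappers around the respective models, introduced only to make the data flow explicit; they are not real library calls.

def build_gcg_annotation(video, anno):
    # Hypothetical driver for the semi-automatic pipeline; one branch per dataset type.
    if anno.kind == "masks_only":
        # i) Crop object patches with GT masks, caption them with Gemini, refine,
        #    then combine refined captions + masks into a dense grounded caption.
        patches = [crop_with_mask(f, m) for f, m in zip(video.frames, anno.masks)]
        object_captions = [refine_caption(gemini_describe(p)) for p in patches]
        grounded_caption = gemini_ground(video.frames, object_captions, anno.masks)
        masks = anno.masks

    elif anno.kind == "bbox_and_caption":
        # ii) Video-LMM caption + original caption -> GPT-4o dense grounded caption;
        #     masks come from SAM prompted with the GT boxes.
        dense_caption = video_lmm_caption(video.frames)
        grounded_caption = gpt4o_ground(dense_caption, anno.caption)
        masks = [sam_segment(f, b) for f, b in zip(video.frames, anno.boxes)]

    else:  # "bbox_and_referring_expression"
        # iii) Frames + boxes + referring expressions -> GPT-4o grounded caption;
        #      masks again from SAM prompted with the boxes.
        grounded_caption = gpt4o_ground(video.frames, anno.boxes, anno.expressions)
        masks = [sam_segment(f, b) for f, b in zip(video.frames, anno.boxes)]

    return grounded_caption, masks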

Grounded Conversation Generation (GCG)

Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases with pixel-level masks, demonstrating its detailed understanding of the video.
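As a toy illustration of what a grounded response can look like, the snippet below pairs grounded phrases with their predicted masks. The <p>...</p> [SEG] tagging scheme shown here is an assumed format borrowed from related grounding LMMs, not necessarily the exact tokens VideoGLaMM emits.

import re

# Hypothetical grounded response: each [SEG] token corresponds, in order,
# to one spatio-temporal mask predicted by the pixel decoder.
response = "<p>A man</p> [SEG] throws <p>a ball</p> [SEG] to <p>a dog</p> [SEG]."
masks = ["mask_0", "mask_1", "mask_2"]  # placeholders for mask tensors

phrases = re.findall(r"<p>(.*?)</p>\s*\[SEG\]", response)
for phrase, mask in zip(phrases, masks):
    print(f"{phrase!r} -> {mask}")
# 'A man' -> mask_0
# 'a ball' -> mask_1
# 'a dog' -> mask_2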

BibTeX


@article{munasinghe2024videoglamm,
  title={VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos}, 
  author={Shehan Munasinghe and Hanan Gani and Wenqi Zhu and Jiale Cao and Eric Xing and Fahad Khan and Salman Khan},
  journal={arXiv preprint arXiv:2411.04923},
  year={2024},
  url={https://arxiv.org/abs/2411.04923}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
