
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Mohamed bin Zayed University of AI, Australian National University,
Linköping University, University of Central Florida
*Equal Contribution
PG-Video-LLaVA is the first video-based Large Multimodal Model (LMM) with pixel-level grounding capabilities. 🔥🔥🔥

🔥Highlights

The key contributions of this work are:
  1. We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially ground objects in videos following user instructions.

  2. We introduce a new benchmark specifically designed to measure prompt-based object grounding performance (a minimal scoring sketch is shown after this list).

  3. By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and better suited to scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).

  4. We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks utilize open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.
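
For reference, the sketch below shows one simple way prompt-based grounding can be scored: the spatial IoU between a predicted box and the annotated box, accumulated at a fixed overlap threshold. The function names and the 0.5 threshold are illustrative assumptions and do not reproduce the benchmark's exact protocol.

# Generic illustration of a spatial-IoU grounding metric; not the benchmark's exact protocol.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """Fraction of frames where the predicted box overlaps the annotation by >= thr."""
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0

print(grounding_accuracy([(0, 0, 10, 10)], [(2, 2, 12, 12)]))  # 0.0 (IoU ~ 0.47 < 0.5)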

PG-Video-LLaVA: Architecture

Overview of the PG-Video-LLaVA architecture, showcasing the integration of a CLIP-based visual encoder with a multimodal language model for video understanding. The CLIP visual encoder extracts spatio-temporal features from videos by averaging frame-level features across temporal and spatial dimensions. These features are then projected into the LLM’s input space using a learnable Multi-Layer Perceptron (MLP). The system features a grounding module for spatially locating textual descriptions within video frames, a class-agnostic object tracker, and an entity matching module. Audio processing incorporates Voice Activity Detection, phoneme modeling, and Whisper-based audio transcription, culminating in a multimodal pipeline that facilitates robust video-question answering. The architecture is trained on a hybrid dataset of video instructions, enabling the handling of diverse conversational contexts with high accuracy.
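
As a rough illustration of the visual pathway described above, the sketch below pools CLIP frame features along the temporal and spatial dimensions and projects the result into the LLM's input space with a learnable MLP. Module names and dimensions (VisualProjector, clip_dim, llm_dim) are assumptions made for illustration, not the released implementation.

# Minimal sketch of the visual pathway (illustrative only; not the released PG-Video-LLaVA code).
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Pools CLIP frame features and projects them into the LLM input space."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Learnable MLP that maps pooled CLIP features to LLM token embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T frames, P patches, clip_dim) from a CLIP visual encoder.
        temporal = frame_feats.mean(dim=0)   # average over time  -> (P, clip_dim)
        spatial = frame_feats.mean(dim=1)    # average over space -> (T, clip_dim)
        video_tokens = torch.cat([temporal, spatial], dim=0)  # (P + T, clip_dim)
        return self.mlp(video_tokens)        # (P + T, llm_dim) tokens for the LLM

# Example: 8 frames, 256 patches per frame, CLIP width 1024.
tokens = VisualProjector()(torch.randn(8, 256, 1024))
print(tokens.shape)  # torch.Size([264, 4096])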

Qualitative Results: Video Grounding

Visual representation of PG-Video-LLaVA's spatial grounding within its advanced video-conversational capabilities. The highlighted regions in each video frame indicate the model's ability to identify and spatially locate key subjects mentioned in the textual description, such as the giraffe, the statue, and the gymnast on a balance beam.

Qualitative Results: Including Audio Modality

The figure illustrates the integrated audio processing pipeline that augments video-question answering with audio cues. It provides side-by-side comparisons showing how audio cues offer additional context, leading to a more accurate interpretation of the video content, as seen in the examples above.
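
As a minimal sketch of the audio branch, the snippet below transcribes speech with Whisper and folds the transcript into the question prompt. The Voice Activity Detection and phoneme-model filtering stages described above are omitted, and build_prompt is a hypothetical helper rather than the model's actual prompt format.

# Hedged sketch of the audio branch: Whisper transcription folded into the prompt.
import whisper  # pip install openai-whisper

def transcribe_audio(audio_path: str, model_size: str = "small") -> str:
    """Return the Whisper transcript for the given audio file."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return result["text"].strip()

def build_prompt(question: str, transcript: str) -> str:
    # Hypothetical prompt format: supply the transcript as extra context.
    return f"Audio transcript: {transcript}\nQuestion: {question}"

if __name__ == "__main__":
    transcript = transcribe_audio("news_clip.wav")
    print(build_prompt("What is the anchor reporting on?", transcript))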

Video-ChatGPT vs PG-Video-LLaVA

Qualitative analysis of video descriptions generated by Video-ChatGPT, PG-Video-LLaVA (7B), and PG-Video-LLaVA (13B) models. The evolution in model performance is evident, with enhancements in the accuracy of information, richness of descriptive detail, and alignment with the video’s context and sequence of events as we move from the baseline Video-ChatGPT to the more advanced PG-Video-LLaVA (13B) model.

BibTeX


@article{munasinghe2023PGVideoLLaVA,
  title={PG-Video-LLaVA: Pixel Grounding Large Video-Language Models}, 
  author={Shehan Munasinghe and Rusiru Thushara and Muhammad Maaz and Hanoona Abdul Rasheed and Salman Khan and Mubarak Shah and Fahad Khan},
  journal={arXiv preprint arXiv:2311.13435},
  year={2023}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
