VideoMolmo: Spatio-Temporal Grounding meets Pointing

1Mohamed Bin Zayed University of Artificial Intelligence 2University of Washington 3Allen Institute for Artificial Intelligence 4Linköping University 5Australian National University

VideoMolmo is a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module that uses an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask-fusion pipeline employs SAM2 for bidirectional point propagation, substantially enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates and then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability.
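To make the mask-fusion step concrete, the sketch below seeds SAM2 with the points predicted for one frame and propagates the resulting mask both forward and backward through the video. This is a minimal illustration using the public `sam2` package's video-predictor API (`build_sam2_video_predictor`, `init_state`, `add_new_points_or_box`, `propagate_in_video`); the function name `propagate_points_bidirectionally` and the arguments `frames_dir`, `model_cfg`, and `checkpoint` are illustrative placeholders, not the paper's released code.

```python
# Illustrative sketch: bidirectional point propagation with the public SAM2
# video predictor. Not the exact VideoMolmo mask-fusion implementation.
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

def propagate_points_bidirectionally(frames_dir, frame_idx, points, model_cfg, checkpoint):
    """Seed SAM2 with predicted points on one frame, then propagate the
    resulting object mask forward and backward across the video."""
    predictor = build_sam2_video_predictor(model_cfg, checkpoint)
    state = predictor.init_state(video_path=frames_dir)  # directory of JPEG frames

    # Register the predicted points (all treated as positive clicks) on the seed frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=0,
        points=np.asarray(points, dtype=np.float32),  # (N, 2) xy pixel coordinates
        labels=np.ones(len(points), dtype=np.int32),  # 1 = foreground
    )

    masks = {}
    # Forward pass: seed frame -> end of video.
    for f_idx, obj_ids, mask_logits in predictor.propagate_in_video(state, reverse=False):
        masks[f_idx] = (mask_logits[0] > 0).cpu().numpy()
    # Backward pass: seed frame -> start of video.
    for f_idx, obj_ids, mask_logits in predictor.propagate_in_video(state, reverse=True):
        masks[f_idx] = (mask_logits[0] > 0).cpu().numpy()
    return masks  # frame index -> binary mask
```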

🔥Highlights

Key contributions of VideoMolmo:
  1. We introduce VideoMolmo, an LMM that accepts natural-language queries and produces point-level predictions for target objects across entire video sequences, ensuring temporal consistency.

  2. We further introduce a temporal module that leverages past temporal context, and propose a novel temporal mask-fusion pipeline for enhanced temporal coherence.

  3. To achieve fine-grained spatio-temporal pointing, we introduce a comprehensive dataset of 72k video-caption pairs and 100k object points.

  4. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also assess our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks.

VideoMolmo: Architecture

VideoMolmo consists of four end-to-end trainable components: (1) a visual encoder, (2) a temporal module, (3) a visual projector, and (4) a decoder-only large language model (LLM), followed by a post-processing module.

VideoMolmo is trained end-to-end for spatio-temporal pointing conditioned on textual instructions. The visual encoder processes the video frames and outputs multi-crop features. To maintain temporal consistency, the temporal module applies cross-attention so that the current frame attends to information from preceding frames. The resulting features, together with the textual query, are passed to the LLM, which outputs points for the referred objects. Our post-processing module takes VideoMolmo's predictions and uses SAM2 to propagate the points across all video frames bidirectionally.
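As a concrete (but hypothetical) sketch of such a temporal module, the snippet below conditions the current frame's visual tokens on cached tokens from preceding frames with a single cross-attention layer and a residual fusion. Layer sizes, token counts, and the exact fusion rule are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of a temporal module: current-frame tokens attend to tokens
# from preceding frames via cross-attention. Dimensions and the residual
# fusion are illustrative assumptions, not VideoMolmo's released code.
import torch
import torch.nn as nn

class TemporalCrossAttention(nn.Module):
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, curr_tokens, past_tokens):
        # curr_tokens: (B, N, D) visual tokens of the current frame
        # past_tokens: (B, T*N, D) concatenated tokens of preceding frames
        attended, _ = self.attn(
            query=self.norm_q(curr_tokens),
            key=self.norm_kv(past_tokens),
            value=self.norm_kv(past_tokens),
        )
        # Residual fusion keeps the current frame's appearance features intact
        # while injecting temporal context from earlier frames.
        return curr_tokens + attended

# Usage: fuse the encoder output of frame t with a cache of previous frames
# before handing the tokens to the visual projector and the LLM.
temporal = TemporalCrossAttention(dim=1024)
curr = torch.randn(1, 576, 1024)      # current-frame tokens (toy shapes)
past = torch.randn(1, 2 * 576, 1024)  # tokens from two earlier frames
fused = temporal(curr, past)          # (1, 576, 1024)
```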

VideoMolmo Training Dataset: Annotation Pipeline

VideoMolmo annotation pipeline: We construct point-level supervision from frame-level masks using a semi-automatic process. For each frame, k points are sampled on the ground-truth mask and passed to SAM2 to generate candidate masks. The point whose candidate mask has the highest IoU with the ground truth is selected as the optimal annotation.
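The selection step of this pipeline could look like the sketch below: sample k candidate points inside the ground-truth mask, prompt SAM2 with each point, and keep the point whose candidate mask best overlaps the ground truth. The SAM2 calls follow the public `sam2` image-predictor API; the value of k and the helper names `mask_iou` and `best_point_for_mask` are illustrative, not the paper's exact pipeline.

```python
# Sketch of the point-selection step in the annotation pipeline.
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def best_point_for_mask(predictor, image, gt_mask, k=5, seed=0):
    """predictor: a sam2.sam2_image_predictor.SAM2ImagePredictor with weights loaded."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(gt_mask)                              # pixels inside the GT mask
    idx = rng.choice(len(xs), size=min(k, len(xs)), replace=False)

    predictor.set_image(image)                                # HxWx3 uint8 RGB frame
    best_point, best_iou = None, -1.0
    for i in idx:
        point = np.array([[xs[i], ys[i]]], dtype=np.float32)  # (x, y)
        masks, scores, _ = predictor.predict(
            point_coords=point,
            point_labels=np.array([1]),                       # positive click
            multimask_output=True,
        )
        iou = max(mask_iou(m.astype(bool), gt_mask.astype(bool)) for m in masks)
        if iou > best_iou:
            best_point, best_iou = (float(xs[i]), float(ys[i])), iou
    return best_point, best_iou
```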

VideoMolmo: Quantitative Results

We evaluate VideoMolmo on four challenging tasks: point grounding, counting, referring video object segmentation, and reasoning video object segmentation.

VideoMolmo: Qualitative Results on VPoS-Bench

VideoMolmo demonstrates robust generalization and fine-grained spatio-temporal grounding across diverse out-of-distribution scenarios from our proposed benchmark, for instance, correctly pointing to traffic lights (2nd row) in challenging driving scenes despite never encountering such scenarios during training.

BibTeX

@misc{ahmad2025videomolmospatiotemporalgroundingmeets,
      title={VideoMolmo: Spatio-Temporal Grounding Meets Pointing}, 
      author={Ghazi Shazan Ahmad and Ahmed Heakl and Hanan Gani and Abdelrahman Shaker and Zhiqiang Shen and Ranjay Krishna and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2506.05336},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}