Point to and track all the cells in the video
Point to the nearest traffic light
Point to the carnivore for the robot to pick up
Point to the ball
Point to the knife used by the person
Point to the horse running around
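Prompts like these elicit point predictions in the model's generated text. Below is a minimal parsing sketch, assuming VideoMolmo follows Molmo's convention of emitting `<point x="..." y="...">` tags with coordinates expressed as percentages of frame width and height; the tag format, the percent scaling, and the function names here are assumptions for illustration.

```python
import re

# Hypothetical example: parse a Molmo-style point tag such as
#   <point x="42.5" y="61.0" alt="ball">ball</point>
# Coordinates are assumed to be percentages of frame width/height.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')

def parse_points(generated_text: str, frame_w: int, frame_h: int):
    """Convert point tags in generated text to pixel coordinates."""
    points = []
    for x_pct, y_pct in POINT_RE.findall(generated_text):
        points.append((float(x_pct) / 100.0 * frame_w,
                       float(y_pct) / 100.0 * frame_h))
    return points

print(parse_points('<point x="42.5" y="61.0" alt="ball">ball</point>', 1280, 720))
# [(544.0, 439.2)]
```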
VideoMolmo consists of four end-to-end trainable components: (1) a visual encoder, (2) a temporal module, (3) a visual projector, and (4) a decoder-only large language model (LLM), followed by a post-processing module.
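As a rough illustration of how these pieces compose, here is a minimal PyTorch sketch; every module, dimension, and the attention-based temporal fusion below are stand-in assumptions, not VideoMolmo's actual implementation.

```python
import torch
import torch.nn as nn

class VideoPointerSketch(nn.Module):
    """Illustrative composition of the four trainable components.
    All submodules here are stand-ins, not VideoMolmo's actual layers."""
    def __init__(self, vis_dim=1024, llm_dim=4096, n_heads=8):
        super().__init__()
        self.visual_encoder = nn.Linear(3 * 14 * 14, vis_dim)  # stand-in for a ViT patch encoder
        self.temporal_module = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.visual_projector = nn.Linear(vis_dim, llm_dim)    # maps visual tokens into LLM space
        self.llm = nn.Identity()                               # placeholder for a decoder-only LLM

    def forward(self, frame_patches):                          # (T, N_patches, 3*14*14)
        tokens = self.visual_encoder(frame_patches)            # per-frame patch tokens
        # Fuse temporal context: tokens attend across all frames (an assumption).
        t, n, d = tokens.shape
        seq = tokens.reshape(1, t * n, d)
        fused, _ = self.temporal_module(seq, seq, seq)
        projected = self.visual_projector(fused)               # visual tokens for the LLM
        return self.llm(projected)                             # decoder would predict point tokens

out = VideoPointerSketch()(torch.randn(4, 16, 3 * 14 * 14))   # 4 frames, 16 patches each
```

The post-processing module operates on the decoded points rather than on features, so it sits outside this trainable stack and is omitted from the sketch.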
We evaluate VideoMolmo on four challenging tasks: point grounding, counting, referring segmentation, and reasoning video object segmentation.
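One common way to score point grounding is to check whether each predicted point falls inside the ground-truth object mask. The sketch below implements that check; the exact protocol, tolerances, and per-task metrics used in the paper's benchmarks are not reproduced here, so treat this as an assumed illustration.

```python
import numpy as np

def point_in_mask_accuracy(pred_points, gt_masks):
    """Fraction of frames where the predicted point lands inside the
    ground-truth object mask (an assumed pointing metric; the paper's
    evaluation protocol may differ)."""
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(gt_masks), 1)

# Toy check: one hit out of two frames -> 0.5
masks = [np.zeros((4, 4), bool), np.ones((4, 4), bool)]
print(point_in_mask_accuracy([(1.0, 1.0), (2.0, 2.0)], masks))  # 0.5
```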
@misc{ahmad2025videomolmospatiotemporalgroundingmeets,
title={VideoMolmo: Spatio-Temporal Grounding Meets Pointing},
author={Ghazi Shazan Ahmad and Ahmed Heakl and Hanan Gani and Abdelrahman Shaker and Zhiqiang Shen and Ranjay Krishna and Fahad Shahbaz Khan and Salman Khan},
year={2025},
eprint={2506.05336},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.05336},
}