Moving beyond keyword matching — video retrieval that understands the implied consequences of textual edits through structured visual reasoning.
Mohamed bin Zayed University of AI · University of Wisconsin-Madison · University of Chicago · Linköping University
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video given a reference video and a textual modification (the edit). Most prior work treats the edit as a set of literal keyword constraints and matches videos by surface-level overlap. However, real-world edits imply after-effects — causal, temporal, and cinematographic consequences that are never explicitly stated.
For example, the edit "change typing to frustration" implies clenched fists and closing the laptop. "Make it a close-up" implies tighter framing and shorter duration. "Turn raw ingredients into cooked" implies visible state transitions from raw to browned.
We demonstrate that successful CoVR requires reasoning about these implied consequences. To this end, we introduce CoVR-R, a reasoning-aware benchmark of 2,800 curated triplets with structured annotations, and a two-stage zero-shot framework — Reason-Then-Retrieve — powered by the Qwen3-VL multimodal model. Our approach achieves a +16% relative R@1 improvement over the best prior baseline without any task-specific fine-tuning.
Motivation
Figure 1. Examples where retrieval success depends on understanding after-effects — object state changes, temporal phase ordering, and cinematographic scale — rather than simple keyword matching. Standard retrieval systems fail on these cases.
Edits like "cook the ingredients" require modeling state changes from raw → browned, which is visually implicit and semantically rich.
Understanding the order of events in a video — not just what appears but when and how it transitions — is critical for temporal edits.
Changes in shot scale, camera motion, and framing carry implied visual consequences that keyword-based systems entirely miss.
Emotional or behavioral edits (e.g. "frustration") trigger cascaded visual effects that must be reasoned about explicitly.
Dataset
Each triplet consists of a reference video, a textual edit, and a target video — carefully curated to require reasoning, not keyword matching.
Distractors are intentionally designed to defeat keyword-level matching, forcing models to reason about implied visual consequences.
Every triplet includes a canonicalized reasoning trace across five dimensions: states, actions, scene, camera, and tempo.
Covers temporal dependency, state transitions, cinematographic consequences, and implicit cause-effect reasoning in a single unified framework.
Figure 2. Sample triplets from the CoVR-R benchmark with their associated structured reasoning traces. Each trace captures five reasoning dimensions that are necessary to identify the correct target video.
Example reasoning trace for the edit "turn raw ingredients into cooked." These five dimensions guide the retrieval model towards the correct target.
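To make the annotation format concrete, here is a minimal sketch of what a five-dimension trace for this edit could look like. The dimension names (states, actions, scene, camera, tempo) come from the benchmark description above; the example values and the `trace_to_query` helper are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structured reasoning trace for the edit
# "turn raw ingredients into cooked." Values are illustrative only.
trace = {
    "states":  "ingredients transition from raw to browned and cooked",
    "actions": "stirring and flipping replace chopping",
    "scene":   "stovetop with heat and steam instead of a prep counter",
    "camera":  "framing stays on the pan as the state change unfolds",
    "tempo":   "longer continuous take to show the transition",
}

def trace_to_query(trace: dict) -> str:
    """Flatten the five dimensions into a single hypothetical
    target description usable as a retrieval query."""
    order = ("states", "actions", "scene", "camera", "tempo")
    return ". ".join(trace[k] for k in order)

print(trace_to_query(trace))
```

A trace in this shape makes the implied consequences explicit, so a retrieval stage can match on them rather than on the edit's literal keywords.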
Methodology
A two-stage architecture powered by a frozen Qwen3-VL multimodal language model: no task-specific fine-tuning, no captions required, fully zero-shot.
Figure 3. Overview of the Reason-Then-Retrieve architecture. Given a reference video and a textual edit, Stage 1 generates a structured reasoning trace via Qwen3-VL. Stage 2 uses the trace to produce a hypothetical target description, extracts importance-weighted embeddings, and retrieves the best-matching video from the gallery.
Standard mean pooling treats all token embeddings equally. We instead assign weights to tokens based on their importance to the reasoning trace, significantly improving embedding quality for composed retrieval.
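The weighted-mean form of this pooling can be sketched in a few lines. How the per-token importance weights are actually derived (e.g. from overlap with the reasoning trace) is an assumption here; only the aggregation step is shown, and with uniform weights it reduces to standard mean pooling.

```python
import numpy as np

def importance_weighted_pool(token_embs: np.ndarray,
                             weights: np.ndarray) -> np.ndarray:
    """Pool token embeddings (T, D) into one vector (D,),
    weighting each token by its importance score (T,)."""
    w = weights / (weights.sum() + 1e-8)  # normalize to a distribution
    return (w[:, None] * token_embs).sum(axis=0)

# Sanity check: uniform weights recover standard mean pooling.
T, D = 4, 3
embs = np.arange(T * D, dtype=float).reshape(T, D)
uniform = importance_weighted_pool(embs, np.ones(T))
assert np.allclose(uniform, embs.mean(axis=0))
```

Concentrating the weight on trace-relevant tokens shifts the pooled embedding toward those tokens, which is what distinguishes this from mean, max, or last-token aggregation.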
Evaluation
| Model | R@1 | R@5 | R@10 | R@50 | Reasoning Score |
|---|---|---|---|---|---|
| Best BLIP Baseline | 37.90 | 57.67 | 64.48 | 79.47 | 6.42 |
| Ours (no explicit reasoning) | 44.32 | 61.91 | 67.33 | 79.90 | – |
| Ours + Reasoning | 49.88 | 66.99 | 72.97 | 85.14 | 8.31 |
| Model | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|
| BSE-CoVR | 48.08 | 73.36 | 81.06 | 93.78 |
| Ours | 58.19 | 80.50 | 86.92 | 97.14 |
| Ours + Reasoning | 61.21 | 83.40 | 89.39 | 97.61 |
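For readers less familiar with the R@K metric used in these tables: R@K scores a query as 1 if the correct target video appears among the top-K retrieved gallery items, and the reported number is the mean over all queries (scaled to a percentage). A minimal reference implementation, with made-up video IDs for illustration:

```python
def recall_at_k(ranked_ids: list, target_id, k: int) -> float:
    """1.0 if the target appears in the top-k ranked gallery items, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(all_rankings: list, targets: list, k: int) -> float:
    """Average R@K over a set of queries."""
    hits = sum(recall_at_k(r, t, k) for r, t in zip(all_rankings, targets))
    return hits / len(targets)

# Two toy queries: first target ranked 1st, second ranked 3rd.
rankings = [["v7", "v2", "v9"], ["v4", "v1", "v3"]]
targets = ["v7", "v3"]
assert mean_recall_at_k(rankings, targets, 1) == 0.5
assert mean_recall_at_k(rankings, targets, 5) == 1.0
```

Higher K is always at least as forgiving as lower K, which is why R@50 exceeds R@1 in every row above.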
Analysis
Performance scales with the backbone's reasoning capability.
| Backbone | R@1 | Reasoning Score |
|---|---|---|
| Qwen3-VL-4B | 43.98 | 7.95 |
| Qwen3-VL-8B | 49.88 | 8.31 |
| Qwen3-VL-72B | 55.48 | 9.05 |
Importance-weighted pooling significantly outperforms all standard aggregation strategies.
| Pooling Method | R@1 |
|---|---|
| Mean | 44.87 |
| Max | 35.95 |
| Last Token | 1.51 |
| Importance-Weighted (Ours) | 49.88 |
Impact
CoVR-R fundamentally reframes composed video retrieval from triplet matching to consequence modeling.
Real-world edits trigger visual effects far beyond the literal words. Surface-level matching systematically fails.
Understanding the before-and-after state of a scene — not just what's present — determines retrieval success.
Shot scale, camera motion, and framing are semantically meaningful and must be modeled for accurate retrieval.
General-purpose large multimodal model reasoning can successfully replace task-specific retrieval training.
Object and scene state change modeling is a core competency for next-generation video retrieval systems.
Our framework needs no labeled retrieval data — it scales with model capability and generalizes broadly.
Getting Started
Install dependencies and run evaluation on any of the provided configurations.
Reference
If you find CoVR-R useful in your research, please cite our paper: