CoVR-R: Reason-Aware
Composed Video Retrieval

CVPR 2026 Findings

Moving beyond keyword matching — video retrieval that understands the implied consequences of textual edits through structured visual reasoning.

Mohamed bin Zayed University of AI  ·  University of Wisconsin-Madison  ·  University of Chicago  ·  Linköping University

2,800
Curated Triplets
+16%
Relative R@1 Gain
0-shot
No Fine-tuning
5
Reasoning Dimensions
49.88
R@1 (CoVR-R)

Retrieval That Understands Consequences

Composed Video Retrieval (CoVR) seeks a target video given a reference video and a textual modification edit. Most prior work treats the edit as a set of literal keyword constraints and matches videos by surface-level overlap. However, real-world edits imply after-effects — causal, temporal, and cinematographic consequences that are never explicitly stated.

For example, the edit "change typing to frustration" implies clenched fists and closing the laptop. "Make it a close-up" implies tighter framing and shorter duration. "Turn raw ingredients into cooked" implies visible state transitions from raw to browned.

We demonstrate that successful CoVR requires reasoning about these implied consequences. To this end, we introduce CoVR-R, a reasoning-aware benchmark of 2,800 curated triplets with structured annotations, and a two-stage zero-shot framework — Reason-Then-Retrieve — powered by the Qwen3-VL multimodal model. Our approach achieves a +16% relative R@1 improvement over the best prior baseline without any task-specific fine-tuning.

Why Reasoning is Necessary

Figure 1: Examples showing after-effects

Figure 1. Examples where retrieval success depends on understanding after-effects — object state changes, temporal phase ordering, and cinematographic scale — rather than simple keyword matching. Standard retrieval systems fail on these cases.

🔄

Object State Transitions

Edits like "cook the ingredients" require modeling state changes from raw → browned, which is visually implicit and semantically rich.

Temporal Phase Progression

Understanding the order of events in a video — not just what appears but when and how it transitions — is critical for temporal edits.

🎥

Cinematographic Reasoning

Changes in shot scale, camera motion, and framing carry implied visual consequences that keyword-based systems entirely miss.

🧠

Cause-Effect Chains

Emotional or behavioral edits (e.g. "frustration") trigger cascaded visual effects that must be reasoned about explicitly.

The CoVR-R Benchmark

📦

2,800 Curated Triplets

Each triplet consists of a reference video, a textual edit, and a target video — carefully curated to require reasoning, not keyword matching.

⚔️

Hard Distractors

Distractors are intentionally designed to defeat keyword-level matching, forcing models to reason about implied visual consequences.

📝

Structured Reasoning Traces

Every triplet includes a canonicalized reasoning trace across five dimensions: states, actions, scene, camera, and tempo.

🎯

Multi-Dimensional Focus

Covers temporal dependency, state transitions, cinematographic consequences, and implicit cause-effect reasoning in a single unified framework.

Figure 2: CoVR-R benchmark triplets

Figure 2. Sample triplets from the CoVR-R benchmark with their associated structured reasoning traces. Each trace captures five reasoning dimensions that are necessary to identify the correct target video.

Structured Reasoning Annotation Format

R = {
  states: "raw ingredients → browned/cooked",
  actions: "stirring → plating",
  scene: "kitchen counter with visible steam",
  camera: "static overhead → slight tilt for reveal",
  tempo: "slow-paced with time-lapse acceleration"
}

Example reasoning trace for the edit "turn raw ingredients into cooked." These five dimensions guide the retrieval model towards the correct target.
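The five-dimension trace above can be represented as a small data structure. The sketch below is illustrative only — the class name `ReasoningTrace` and the `to_prompt` helper are hypothetical, showing one plausible way to canonicalize a trace and flatten it into text that could condition Stage 2's hypothetical target description.

```python
from dataclasses import dataclass, asdict

@dataclass
class ReasoningTrace:
    """Canonicalized trace over the five CoVR-R reasoning dimensions."""
    states: str
    actions: str
    scene: str
    camera: str
    tempo: str

    def to_prompt(self) -> str:
        # Flatten the trace into one string usable as conditioning text
        # (hypothetical format; the paper's exact prompt may differ).
        return "; ".join(f"{k}: {v}" for k, v in asdict(self).items())

trace = ReasoningTrace(
    states="raw ingredients -> browned/cooked",
    actions="stirring -> plating",
    scene="kitchen counter with visible steam",
    camera="static overhead -> slight tilt for reveal",
    tempo="slow-paced with time-lapse acceleration",
)
print(trace.to_prompt())
```

Keeping the dimensions as named fields (rather than free text) makes traces easy to validate and to ablate one dimension at a time.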

Reason-Then-Retrieve Framework

A two-stage zero-shot architecture powered by a frozen Qwen3-VL multimodal language model. No task-specific fine-tuning. No captions required. Fully zero-shot.

Figure 3: Reason-Then-Retrieve pipeline

Figure 3. Overview of the Reason-Then-Retrieve architecture. Given a reference video and a textual edit, Stage 1 generates a structured reasoning trace via Qwen3-VL. Stage 2 uses the trace to produce a hypothetical target description, extracts importance-weighted embeddings, and retrieves the best-matching video from the gallery.

Input

Reference + Edit

  • Reference video V_r
  • Textual modification E
  • Processed by frozen Qwen3-VL
Stage 1

After-Effect Reasoning

  • Structured trace generation
  • Object state reasoning
  • Action phase prediction
  • Scene & camera forecasting
  • Temporal pacing analysis
Stage 2

Target Description & Retrieval

  • Hypothetical post-edit description
  • Token embedding extraction
  • Importance-weighted pooling
  • Cosine similarity against gallery
Output

Retrieved Target Video

  • Top-k ranked results
  • Evaluated via R@1/5/10/50
  • Zero-shot, caption-free
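The retrieval and evaluation steps at the end of the pipeline can be sketched with cosine similarity and Recall@K. This is a minimal, self-contained illustration assuming precomputed query and gallery embeddings; the function names `retrieve` and `recall_at_k` are ours, not the repo's API.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, gallery: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank gallery videos by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every gallery video
    return np.argsort(-sims)[:k]      # indices of the top-k matches

def recall_at_k(ranked: np.ndarray, target_idx: int, k: int) -> float:
    """1.0 if the ground-truth target appears in the top-k, else 0.0."""
    return float(target_idx in ranked[:k])

# Toy gallery: the query is a slightly perturbed copy of gallery video 7,
# so it should be retrieved at rank 1.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))
target_idx = 7
query = gallery[target_idx] + 0.05 * rng.normal(size=64)
ranked = retrieve(query, gallery, k=50)
print(recall_at_k(ranked, target_idx, 1))
```

Averaging `recall_at_k` over all benchmark triplets yields the R@1/5/10/50 numbers reported below.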

Importance-Weighted Pooling

Standard mean pooling treats all token embeddings equally. We instead assign weights to tokens based on their importance to the reasoning trace, significantly improving embedding quality for composed retrieval.
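The idea can be sketched in a few lines: instead of averaging token embeddings uniformly, weight each token by a nonnegative importance score before summing. How the scores are derived from the reasoning trace is the method's contribution; here they are simply given as input, so this is an assumption-laden sketch, not the paper's implementation.

```python
import numpy as np

def importance_weighted_pool(token_embs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Pool token embeddings with per-token importance weights.

    token_embs: (T, D) token embedding matrix
    weights:    (T,) nonnegative importance scores (e.g., derived from the
                reasoning trace); normalized to sum to 1 before pooling.
    """
    w = weights / weights.sum()
    return (w[:, None] * token_embs).sum(axis=0)

# Toy example: three 2-d token embeddings, third token most important.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.2, 0.2, 0.6])
mean_pooled = tokens.mean(axis=0)                       # uniform baseline
weighted = importance_weighted_pool(tokens, weights)    # pulled toward token 3
print(weighted)
```

With uniform weights this reduces exactly to mean pooling, which makes the ablation comparison below a clean one-variable change.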

49.88
Importance-Weighted (Ours)
44.87
Mean Pooling
35.95
Max Pooling
1.51
Last Token

State-of-the-Art Results

49.88
↑ +16% vs. BLIP baseline
R@1 with Reasoning
66.99
↑ +9.32 absolute
R@5 with Reasoning
72.97
↑ +8.49 absolute
R@10 with Reasoning
8.31
vs. 6.42 baseline
Reasoning Score
Model                          R@1     R@5     R@10    R@50    Reasoning Score
Best BLIP Baseline             37.90   57.67   64.48   79.47   6.42
Ours (no explicit reasoning)   44.32   61.91   67.33   79.90   —
Ours + Reasoning (best)        49.88   66.99   72.97   85.14   8.31
61.21
↑ +13.13 absolute vs BSE-CoVR
R@1 (Ours + R)
83.40
↑ +10.04 absolute
R@5 (Ours + R)
89.39
↑ +8.33 absolute
R@10 (Ours + R)
97.61
↑ +3.83 absolute
R@50 (Ours + R)
Model             R@1     R@5     R@10    R@50
BSE-CoVR          48.08   73.36   81.06   93.78
Ours              58.19   80.50   86.92   97.14
Ours + R (best)   61.21   83.40   89.39   97.61

Key Ablations

🔹 Model Scaling (CoVR-R)

Performance scales with the backbone's reasoning capability.

Backbone        R@1     Reasoning Score
Qwen3-VL-4B     43.98   7.95
Qwen3-VL-8B     49.88   8.31
Qwen3-VL-72B    55.48   9.05

🔹 Pooling Strategy

Importance-weighted pooling significantly outperforms all standard aggregation strategies.

Pooling Method               R@1
Mean                         44.87
Max                          35.95
Last Token                   1.51
Importance-Weighted (Ours)   49.88

Why This Matters

CoVR-R fundamentally reframes composed video retrieval from triplet matching to consequence modeling.

Keyword Matching is Insufficient

Real-world edits trigger visual effects far beyond the literal words. Surface-level matching systematically fails.

Temporal Reasoning is Critical

Understanding the before-and-after state of a scene — not just what's present — determines retrieval success.

Cinematography Matters

Shot scale, camera motion, and framing are semantically meaningful and must be modeled for accurate retrieval.

LMM Reasoning Drives Retrieval

General-purpose large multimodal model reasoning can successfully replace task-specific retrieval training.

State Transitions Must Be Modeled

Object and scene state change modeling is a core competency for next-generation video retrieval systems.

Zero-Shot Scalability

Our framework needs no labeled retrieval data — it scales with model capability and generalizes broadly.

Quick Start

Install dependencies and run evaluation on any of the provided configurations.

Install

pip install -r requirements.txt

CoVR-R (Reasoning-Based Retrieval)

# Generate embeddings
python generate_embeddings.py --config configs/reasoning_webvid8m.yaml

# Evaluate
python evaluate.py --config configs/reasoning_webvid8m.yaml

# With self-consistency reasoning
python evaluate.py --config configs/reasoning_webvid8m.yaml --reasoning_strategy self_consistency

# Single-stage retrieval
python evaluate.py --config configs/reasoning_webvid8m.yaml --reasoning_strategy single_stage

WebVid8M & Dense WebVid8M

# WebVid8M
python generate_embeddings.py --config configs/webvid8m.yaml
python evaluate.py --config configs/webvid8m.yaml

# Dense WebVid8M
python generate_embeddings.py --config configs/dense_webvid8m.yaml
python evaluate.py --config configs/dense_webvid8m.yaml

Citation

If you find CoVR-R useful in your research, please cite our paper:

@inproceedings{thawakar2026covrr,
  title     = {CoVR-R: Reason-Aware Composed Video Retrieval},
  author    = {Thawakar, Omkar and Demidov, Dmitry and Potlapalli, Vaishnav and
               Bogireddy, Sai Prasanna Teja Reddy and Gajjala, Viswanatha Reddy and
               Lasheen, Alaa Mostafa and Anwer, Rao Muhammad and Khan, Fahad Shahbaz},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}