Moving beyond keyword matching — video retrieval that understands the implied consequences of textual edits through structured visual reasoning.
Mohamed bin Zayed University of AI · University of Wisconsin-Madison · University of Chicago · Linköping University
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a target video given a reference video and a textual modification (the edit). Most prior work treats the edit as a set of literal keyword constraints and matches videos by surface-level overlap. However, real-world edits imply after-effects — causal, temporal, and cinematographic consequences that are never explicitly stated.
For example, the edit "change typing to frustration" implies clenched fists and closing the laptop. "Make it a close-up" implies tighter framing and shorter duration. "Turn raw ingredients into cooked" implies visible state transitions from raw to browned.
We demonstrate that successful CoVR requires reasoning about these implied consequences. To this end, we introduce CoVR-R, a reasoning-aware benchmark of 2,800 curated triplets with structured annotations, and a two-stage zero-shot framework — Reason-Then-Retrieve — powered by the Qwen3-VL multimodal model. Our approach achieves a +16% relative R@1 improvement over the best prior baseline without any task-specific fine-tuning.
Motivation
Figure 1. Examples where retrieval success depends on understanding after-effects — object state changes, temporal phase ordering, and cinematographic scale — rather than simple keyword matching. Standard retrieval systems fail on these cases.
Edits like "cook the ingredients" require modeling state changes from raw → browned, which is visually implicit and semantically rich.
Understanding the order of events in a video — not just what appears but when and how it transitions — is critical for temporal edits.
Changes in shot scale, camera motion, and framing carry implied visual consequences that keyword-based systems entirely miss.
Emotional or behavioral edits (e.g. "frustration") trigger cascaded visual effects that must be reasoned about explicitly.
Dataset
Each triplet consists of a reference video, a textual edit, and a target video — carefully curated to require reasoning, not keyword matching.
Distractors are intentionally designed to defeat keyword-level matching, forcing models to reason about implied visual consequences.
Every triplet includes a canonicalized reasoning trace across five dimensions: states, actions, scene, camera, and tempo.
Covers temporal dependency, state transitions, cinematographic consequences, and implicit cause-effect reasoning in a single unified framework.
Figure 2. Sample triplets from the CoVR-R benchmark with their associated structured reasoning traces. Each trace captures five reasoning dimensions that are necessary to identify the correct target video.
Example reasoning trace for the edit "turn raw ingredients into cooked." These five dimensions guide the retrieval model towards the correct target.
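To make the annotation format concrete, here is a minimal sketch of what a five-dimension trace for this edit could look like. The dimension names (states, actions, scene, camera, tempo) come from the benchmark description above; the example values and the `trace_to_query` helper are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structured reasoning trace for the edit
# "turn raw ingredients into cooked." Values are illustrative only.
trace = {
    "states":  "ingredients transition from raw to browned and cooked",
    "actions": "stirring and flipping replace chopping",
    "scene":   "stovetop with heat and steam instead of a prep counter",
    "camera":  "framing stays on the pan as the state change unfolds",
    "tempo":   "longer continuous take to show the transition",
}

def trace_to_query(trace: dict) -> str:
    """Flatten the five dimensions into a single hypothetical
    target description usable as a retrieval query."""
    order = ("states", "actions", "scene", "camera", "tempo")
    return ". ".join(trace[k] for k in order)

print(trace_to_query(trace))
```

A trace in this shape makes the implied consequences explicit, so a retrieval stage can match on them rather than on the edit's literal keywords.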
Methodology
A two-stage architecture powered by a frozen Qwen3-VL multimodal language model: no task-specific fine-tuning, no captions required, fully zero-shot.
Figure 3. Overview of the Reason-Then-Retrieve architecture. Given a reference video and a textual edit, Stage 1 generates a structured reasoning trace via Qwen3-VL. Stage 2 uses the trace to produce a hypothetical target description, extracts importance-weighted embeddings, and retrieves the best-matching video from the gallery.
Standard mean pooling treats all token embeddings equally. We instead assign weights to tokens based on their importance to the reasoning trace, significantly improving embedding quality for composed retrieval.
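The weighted-mean form of this pooling can be sketched in a few lines. How the per-token importance weights are actually derived (e.g. from overlap with the reasoning trace) is an assumption here; only the aggregation step is shown, and with uniform weights it reduces to standard mean pooling.

```python
import numpy as np

def importance_weighted_pool(token_embs: np.ndarray,
                             weights: np.ndarray) -> np.ndarray:
    """Pool token embeddings (T, D) into one vector (D,),
    weighting each token by its importance score (T,)."""
    w = weights / (weights.sum() + 1e-8)  # normalize to a distribution
    return (w[:, None] * token_embs).sum(axis=0)

# Sanity check: uniform weights recover standard mean pooling.
T, D = 4, 3
embs = np.arange(T * D, dtype=float).reshape(T, D)
uniform = importance_weighted_pool(embs, np.ones(T))
assert np.allclose(uniform, embs.mean(axis=0))
```

Concentrating the weight on trace-relevant tokens shifts the pooled embedding toward those tokens, which is what distinguishes this from mean, max, or last-token aggregation.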
Evaluation
| Model | R@1 | R@5 | R@10 | R@50 | Reasoning Score |
|---|---|---|---|---|---|
| Best BLIP Baseline | 37.90 | 57.67 | 64.48 | 79.47 | 6.42 |
| Ours (no explicit reasoning) | 44.32 | 61.91 | 67.33 | 79.90 | – |
| Ours + Reasoning | 49.88 | 66.99 | 72.97 | 85.14 | 8.31 |
| Model | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|
| BSE-CoVR | 48.08 | 73.36 | 81.06 | 93.78 |
| Ours | 58.19 | 80.50 | 86.92 | 97.14 |
| Ours + Reasoning | 61.21 | 83.40 | 89.39 | 97.61 |
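For readers less familiar with the R@K metric used in these tables: R@K scores a query as 1 if the correct target video appears among the top-K retrieved gallery items, and the reported number is the mean over all queries (scaled to a percentage). A minimal reference implementation, with made-up video IDs for illustration:

```python
def recall_at_k(ranked_ids: list, target_id, k: int) -> float:
    """1.0 if the target appears in the top-k ranked gallery items, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(all_rankings: list, targets: list, k: int) -> float:
    """Average R@K over a set of queries."""
    hits = sum(recall_at_k(r, t, k) for r, t in zip(all_rankings, targets))
    return hits / len(targets)

# Two toy queries: first target ranked 1st, second ranked 3rd.
rankings = [["v7", "v2", "v9"], ["v4", "v1", "v3"]]
targets = ["v7", "v3"]
assert mean_recall_at_k(rankings, targets, 1) == 0.5
assert mean_recall_at_k(rankings, targets, 5) == 1.0
```

Higher K is always at least as forgiving as lower K, which is why R@50 exceeds R@1 in every row above.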
Analysis
Performance scales with the backbone's reasoning capability.
| Backbone | R@1 | Reasoning Score |
|---|---|---|
| Qwen3-VL-4B | 43.98 | 7.95 |
| Qwen3-VL-8B | 49.88 | 8.31 |
| Qwen3-VL-72B | 55.48 | 9.05 |
Importance-weighted pooling significantly outperforms all standard aggregation strategies.
| Pooling Method | R@1 |
|---|---|
| Mean | 44.87 |
| Max | 35.95 |
| Last Token | 1.51 |
| Importance-Weighted (Ours) | 49.88 |
Impact
CoVR-R fundamentally reframes composed video retrieval from triplet matching to consequence modeling.
Real-world edits trigger visual effects far beyond the literal words. Surface-level matching systematically fails.
Understanding the before-and-after state of a scene — not just what's present — determines retrieval success.
Shot scale, camera motion, and framing are semantically meaningful and must be modeled for accurate retrieval.
General-purpose large multimodal model reasoning can successfully replace task-specific retrieval training.
Object and scene state change modeling is a core competency for next-generation video retrieval systems.
Our framework needs no labeled retrieval data — it scales with model capability and generalizes broadly.
Getting Started
Install dependencies and run evaluation on any of the provided configurations.
Reference
If you find CoVR-R useful in your research, please cite our paper: