Sara Ghaboura¹*, Ketan More¹*, Ritesh Thawkar¹, Wafa Alghallabi¹, Omkar Thawakar¹,
Fahad S. Khan¹,², Hisham Cholakkal¹, Salman Khan¹,³, Rao M. Anwer¹,⁴

¹Mohamed bin Zayed University of AI, ²Linköping University, ³Australian National University, ⁴Aalto University

TimeTravel: A multimodal prehistoric and historic benchmark for LMMs

A visual overview of TimeTravel, a benchmark evaluating LMMs across 10 historical regions and 266 cultures, featuring a radial categorization of civilizations, dynasties, and cultural periods. On the right, sample artifacts from various cultures highlight the dataset's diversity.

Abstract

Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models (LMMs) offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and a robust evaluation framework to assess AI models’ capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools that help historians, archaeologists, researchers, and cultural tourists extract valuable insights. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery.

Quantitative Evaluation and Results

The following tables present a comprehensive evaluation of various multimodal models on the TimeTravel benchmark. The first table compares model performance across multiple metrics, while the second analyzes their ability to describe archaeological artifacts from different civilizations, highlighting variations in accuracy and descriptive depth.

Model	BLEU	METEOR	ROUGE-L	SPICE	BERTScore	LLM-Judge
GPT-4o-0806	0.1758🏅	0.2439	0.1230🏅	0.1035🏅	0.8349🏅	0.3013🏅
Gemini-2.0-Flash	0.1072	0.2456	0.0884	0.0919	0.8127	0.2630
Gemini-1.5-Pro	0.1067	0.2406	0.0848	0.0901	0.8172	0.2276
GPT-4o-mini-0718	0.1369	0.2658🏅	0.1027	0.1001	0.8283	0.2492
Llama-3.2-Vision-Inst	0.1161	0.2072	0.1027	0.0648	0.8111	0.1255
Qwen-2.5-VL	0.1155	0.2648	0.0887	0.1002	0.8198	0.1792
Llava-Next	0.1118	0.2340	0.0961	0.0799	0.8246	0.1161

Table: Performance comparison of closed-source and open-source models on our proposed TimeTravel benchmark. 🏅 marks the best score in each column.
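As a quick sanity check on the table above, the following minimal Python sketch picks the top-scoring model for each metric. The scores are copied from the table with the medal markers dropped; the names `METRICS`, `SCORES`, and `best_per_metric` are illustrative and not part of any TimeTravel release.

```python
# Column order follows the header row of the results table.
METRICS = ["BLEU", "METEOR", "ROUGE-L", "SPICE", "BERTScore", "LLM-Judge"]

# Scores copied from the table; medal markers are omitted.
SCORES = {
    "GPT-4o-0806":           [0.1758, 0.2439, 0.1230, 0.1035, 0.8349, 0.3013],
    "Gemini-2.0-Flash":      [0.1072, 0.2456, 0.0884, 0.0919, 0.8127, 0.2630],
    "Gemini-1.5-Pro":        [0.1067, 0.2406, 0.0848, 0.0901, 0.8172, 0.2276],
    "GPT-4o-mini-0718":      [0.1369, 0.2658, 0.1027, 0.1001, 0.8283, 0.2492],
    "Llama-3.2-Vision-Inst": [0.1161, 0.2072, 0.1027, 0.0648, 0.8111, 0.1255],
    "Qwen-2.5-VL":           [0.1155, 0.2648, 0.0887, 0.1002, 0.8198, 0.1792],
    "Llava-Next":            [0.1118, 0.2340, 0.0961, 0.0799, 0.8246, 0.1161],
}


def best_per_metric(scores, metrics):
    """Return {metric: model with the highest score on that metric}."""
    best = {}
    for i, metric in enumerate(metrics):
        best[metric] = max(scores, key=lambda model: scores[model][i])
    return best


if __name__ == "__main__":
    for metric, model in best_per_metric(SCORES, METRICS).items():
        print(f"{metric}: {model}")
```

With these values, GPT-4o-0806 leads on every metric except METEOR, where GPT-4o-mini-0718 scores highest, which matches the medal markers in the table.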

Model	India	Roman Emp.	China	British Isles	Iran	Iraq	Japan	Cent. America	Greece	Egypt
GPT-4o-0806	0.2491🏅	0.4463🏅	0.2491🏅	0.1899🏅	0.3522🏅	0.3545🏅	0.2228🏅	0.3144🏅	0.2757🏅	0.3649🏅
Gemini-2.0-Flash	0.1859	0.3358	0.2059	0.1556	0.3376	0.3071	0.2000	0.2677	0.2582	0.3602
Gemini-1.5-Pro	0.1118	0.2632	0.2139	0.1545	0.3320	0.2587	0.1871	0.2708	0.2088	0.2908
GPT-4o-mini-0718	0.2311	0.3612	0.2207	0.1866	0.2991	0.2632	0.2087	0.3195	0.2101	0.2501
Llama-3.2-Vision-Inst	0.0744	0.1450	0.1227	0.0777	0.2000	0.1155	0.1075	0.1553	0.1351	0.1201
Qwen-2.5-VL	0.0888	0.1578	0.1192	0.1713	0.2515	0.1576	0.1771	0.1442	0.1442	0.2660
Llava-Next	0.0788	0.0961	0.1455	0.1091	0.1464	0.1194	0.1353	0.1917	0.1111	0.0709

Table: LLM-Judge scores of various models when describing archaeological artifacts from civilizations in different geographical regions.
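To put a number on the regional variation in the table above, the sketch below computes, for each model, an unweighted mean over the ten regions and the gap between its best and worst region. The unweighted (macro) average is an assumption made for illustration: the page does not say how the overall LLM-Judge score is aggregated, so these means need not match the overall column exactly. The names `JUDGE` and `summarize` are hypothetical.

```python
from statistics import mean

# Region order follows the header row of the per-region table
# (listed for reference; the score lists below use the same order).
REGIONS = ["India", "Roman Emp.", "China", "British Isles", "Iran",
           "Iraq", "Japan", "Cent. America", "Greece", "Egypt"]

# LLM-Judge scores copied from the table; medal markers are omitted.
JUDGE = {
    "GPT-4o-0806":           [0.2491, 0.4463, 0.2491, 0.1899, 0.3522, 0.3545, 0.2228, 0.3144, 0.2757, 0.3649],
    "Gemini-2.0-Flash":      [0.1859, 0.3358, 0.2059, 0.1556, 0.3376, 0.3071, 0.2000, 0.2677, 0.2582, 0.3602],
    "Gemini-1.5-Pro":        [0.1118, 0.2632, 0.2139, 0.1545, 0.3320, 0.2587, 0.1871, 0.2708, 0.2088, 0.2908],
    "GPT-4o-mini-0718":      [0.2311, 0.3612, 0.2207, 0.1866, 0.2991, 0.2632, 0.2087, 0.3195, 0.2101, 0.2501],
    "Llama-3.2-Vision-Inst": [0.0744, 0.1450, 0.1227, 0.0777, 0.2000, 0.1155, 0.1075, 0.1553, 0.1351, 0.1201],
    "Qwen-2.5-VL":           [0.0888, 0.1578, 0.1192, 0.1713, 0.2515, 0.1576, 0.1771, 0.1442, 0.1442, 0.2660],
    "Llava-Next":            [0.0788, 0.0961, 0.1455, 0.1091, 0.1464, 0.1194, 0.1353, 0.1917, 0.1111, 0.0709],
}


def summarize(scores):
    """Return {model: (macro_average, spread)}, where spread = max - min over regions."""
    return {model: (mean(vals), max(vals) - min(vals))
            for model, vals in scores.items()}


if __name__ == "__main__":
    summary = summarize(JUDGE)
    # Print models from highest to lowest macro-average.
    for model, (avg, spread) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
        print(f"{model:24s} macro-avg={avg:.4f} spread={spread:.4f}")
```

A large spread means a model's description quality depends heavily on the region, which is the kind of variation the per-region table is meant to expose.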

TimeTravel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts