If you like our project, please star ⭐ the repository on GitHub to follow the latest updates.
## Latest Updates
- 🤗 [19 Feb 2025] The TimeTravel dataset is available on HuggingFace.
- 🔥 [20 Feb 2025] TimeTravel, the first comprehensive open-source benchmark on historical and cultural artifacts, is released.
## Overview
TimeTravel is the first comprehensive benchmark for AI-driven historical artifact analysis, designed to identify artifacts within their historical era and cultural context. Spanning 266 cultural groups across 10 regions, it prioritizes historical knowledge, contextual reasoning, and cultural preservation, unlike generic object recognition benchmarks. With over 10,000 expert-verified samples, TimeTravel sets a new standard for evaluating multimodal models in historical research, cross-civilizational analysis, and AI-powered cultural heritage preservation.
Figure 1. Left: TimeTravel Taxonomy maps artifacts from 10 civilizations, 266 cultures, and 10k+ verified samples for AI-driven historical analysis. Right: Regional dataset distribution by archaeological provenance, with Greece holding the largest share (18%) and balanced regional coverage.
## 🌟 Key Features
- **First Historical Artifact Benchmark:** The first large-scale multimodal benchmark for AI-driven historical artifact analysis.
- **Broad Coverage:** Spans 10 civilizations and 266 cultural groups.
- **Expert-Verified Samples:** Over 10k samples, including manuscripts, inscriptions, sculptures, and archaeological artifacts, manually curated by historians and archaeologists.
- **Structured Taxonomy:** Provides a hierarchical framework for artifact classification, interpretation, and cross-civilizational analysis.
- **AI Evaluation Framework:** Assesses GPT-4V, LLaVA, and other LMMs on historical knowledge, contextual reasoning, and multimodal understanding.
- **Bridging AI and Cultural Heritage:** Enables AI-driven historical research, archaeological analysis, and cultural preservation.
- **Open-Source & Standardized:** A publicly available dataset and evaluation framework to advance AI applications in history and archaeology.
## TimeTravel Creation Pipeline
The TimeTravel dataset follows a structured pipeline to ensure the accuracy, completeness, and contextual richness of historical artifacts.
Figure 2. TimeTravel Data Pipeline: A structured workflow for collecting, processing, and refining museum artifact data, integrating GPT-4o-generated descriptions with expert validation for benchmark accuracy.
Our approach consists of four key phases:
- **Data Selection:** Curated 10,250 artifacts from museum collections, spanning 266 cultural groups, with expert validation to ensure historical accuracy and diversity.
- **Data Cleaning:** Addressed missing or incomplete metadata (titles, dates, iconography) by cross-referencing museum archives and academic sources, ensuring data consistency.
- **Generation & Verification:** Used GPT-4o to generate context-aware descriptions, which were refined and validated by historians and archaeologists for authenticity.
- **Data Aggregation:** Standardized and structured the dataset into image-text pairs, making it a valuable resource for AI-driven historical analysis and cultural heritage research.
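
The cleaning and aggregation phases above can be pictured with a minimal sketch. This is not the project's pipeline code: the `ArtifactRecord` class, its field names, the `REQUIRED_FIELDS` tuple, and the sample records are all hypothetical, chosen only to show how records with incomplete metadata might be queued for cross-referencing while complete ones become image-text pairs.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: these are the core metadata fields; the real schema may differ.
REQUIRED_FIELDS = ("title", "date", "culture")


@dataclass
class ArtifactRecord:
    """Hypothetical museum record; field names are illustrative only."""
    image_path: str
    title: Optional[str] = None
    date: Optional[str] = None
    culture: Optional[str] = None
    description: Optional[str] = None  # expert-validated text, if available


def missing_fields(record):
    """Return the names of required fields that are empty or absent."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def split_for_review(records):
    """Data Cleaning: separate complete records from those needing cross-referencing."""
    complete, needs_review = [], []
    for record in records:
        (needs_review if missing_fields(record) else complete).append(record)
    return complete, needs_review


def to_image_text_pairs(records):
    """Data Aggregation: emit one image-text pair per record that has a description."""
    return [
        {"image": r.image_path, "text": r.description, "culture": r.culture}
        for r in records
        if r.description
    ]


# Made-up sample data for illustration.
raw = [
    ArtifactRecord("img/001.jpg", "Amphora", "c. 500 BCE", "Greek", "A painted storage jar."),
    ArtifactRecord("img/002.jpg", None, "c. 1200 CE", "Japanese", "A bronze bell."),
]
complete, needs_review = split_for_review(raw)
pairs = to_image_text_pairs(complete)
print(len(pairs), "pair(s) ready;", len(needs_review), "record(s) need cross-referencing")
```

In this sketch, records with gaps are set aside rather than discarded, mirroring the description of filling missing metadata from museum archives and academic sources.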
## 🎯 Quantitative Evaluation and Results
The following tables present a comprehensive evaluation of various multimodal models on the TimeTravel benchmark. The first table compares model performance across multiple metrics, while the second analyzes their ability to describe archaeological artifacts from different civilizations, highlighting variations in accuracy and descriptive depth.
<div align="center">

| Model | BLEU | METEOR | ROUGE-L | SPICE | BERTScore | LLM-Judge |
|---|---|---|---|---|---|---|
| GPT-4o-0806 | 0.1758🏅 | 0.2439 | 0.1230🏅 | 0.1035🏅 | 0.8349🏅 | 0.3013🏅 |
| Gemini-2.0-Flash | 0.1072 | 0.2456 | 0.0884 | 0.0919 | 0.8127 | 0.2630 |
| Gemini-1.5-Pro | 0.1067 | 0.2406 | 0.0848 | 0.0901 | 0.8172 | 0.2276 |
| GPT-4o-mini-0718 | 0.1369 | 0.2658🏅 | 0.1027 | 0.1001 | 0.8283 | 0.2492 |
| Llama-3.2-Vision-Inst | 0.1161 | 0.2072 | 0.1027 | 0.0648 | 0.8111 | 0.1255 |
| Qwen-2.5-VL | 0.1155 | 0.2648 | 0.0887 | 0.1002 | 0.8198 | 0.1792 |
| Llava-Next | 0.1118 | 0.2340 | 0.0961 | 0.0799 | 0.8246 | 0.1161 |

Table: Performance comparison of various closed and open-source models on our proposed TimeTravel benchmark.

| Model | India | Roman Emp. | China | British Isles | Iran | Iraq | Japan | Cent. America | Greece | Egypt |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-0806 | 0.2491🏅 | 0.4463🏅 | 0.2491🏅 | 0.1899🏅 | 0.3522🏅 | 0.3545🏅 | 0.2228🏅 | 0.3144🏅 | 0.2757🏅 | 0.3649🏅 |
| Gemini-2.0-Flash | 0.1859 | 0.3358 | 0.2059 | 0.1556 | 0.3376 | 0.3071 | 0.2000 | 0.2677 | 0.2582 | 0.3602 |
| Gemini-1.5-Pro | 0.1118 | 0.2632 | 0.2139 | 0.1545 | 0.3320 | 0.2587 | 0.1871 | 0.2708 | 0.2088 | 0.2908 |
| GPT-4o-mini-0718 | 0.2311 | 0.3612 | 0.2207 | 0.1866 | 0.2991 | 0.2632 | 0.2087 | 0.3195 | 0.2101 | 0.2501 |
| Llama-3.2-Vision-Inst | 0.0744 | 0.1450 | 0.1227 | 0.0777 | 0.2000 | 0.1155 | 0.1075 | 0.1553 | 0.1351 | 0.1201 |
| Qwen-2.5-VL | 0.0888 | 0.1578 | 0.1192 | 0.1713 | 0.2515 | 0.1576 | 0.1771 | 0.1442 | 0.1442 | 0.2660 |
| Llava-Next | 0.0788 | 0.0961 | 0.1455 | 0.1091 | 0.1464 | 0.1194 | 0.1353 | 0.1917 | 0.1111 | 0.0709 |

Table: Analysis of LLM-Judge evaluation of various models in describing archaeological artifacts across civilizations from different geographical locations.
</div>
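
As a reading aid for the per-region table, the short sketch below computes an unweighted (macro) average across regions and finds each model's strongest and weakest region. The scores are copied from the table for two models; the helper names are illustrative. The macro average weights every region equally, so it need not match the overall LLM-Judge column in the first table, which may be computed over all samples.

```python
# LLM-Judge scores copied from the per-region table (two models shown).
REGIONS = ["India", "Roman Emp.", "China", "British Isles", "Iran",
           "Iraq", "Japan", "Cent. America", "Greece", "Egypt"]
SCORES = {
    "GPT-4o-0806": [0.2491, 0.4463, 0.2491, 0.1899, 0.3522,
                    0.3545, 0.2228, 0.3144, 0.2757, 0.3649],
    "Llava-Next": [0.0788, 0.0961, 0.1455, 0.1091, 0.1464,
                   0.1194, 0.1353, 0.1917, 0.1111, 0.0709],
}


def macro_average(values):
    """Unweighted mean: every region counts equally, regardless of sample count."""
    return sum(values) / len(values)


def best_and_worst(model):
    """Return the (highest-scoring, lowest-scoring) region names for a model."""
    by_region = dict(zip(REGIONS, SCORES[model]))
    return max(by_region, key=by_region.get), min(by_region, key=by_region.get)


for model, values in SCORES.items():
    best, worst = best_and_worst(model)
    print(f"{model}: macro avg {macro_average(values):.4f}, best {best}, worst {worst}")
```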
## 🧐 TimeTravel Dataset Examples
Figures 3 and 4 showcase the cultural and material diversity of the TimeTravel dataset alongside a cross-model comparison, highlighting variations in artifact representation, historical periods, material compositions, and descriptive accuracy across different AI models.
<div style="display: flex; justify-content: center; align-items: center;">
Figure 3. Cultural and Material Diversity: TimeTravel spans civilizations from Ancient Egypt to Japan, covering prehistoric to medieval eras with artifacts in ceramics, metals, and stone, showcasing historical craftsmanship and cultural heritage.
Figure 4. Cross-Model Comparison: Variations in descriptive depth and accuracy across open- and closed-source models, highlighting interpretative differences and alignment with ground truth.
</div>
## Evaluation
Please refer to the Evaluation folder to reproduce the results.
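
For readers unfamiliar with the lexical metrics reported above, here is a minimal sketch of a sentence-level ROUGE-L F-score based on the longest common subsequence (LCS). It is not the repository's evaluation script: it assumes lowercase whitespace tokenization and no stemming, so its values will differ from scores produced by a full ROUGE implementation. The example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score (assumption: whitespace tokens, lowercased)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


reference = "a bronze ritual vessel from ancient china"
candidate = "a bronze vessel from china"
print(round(rouge_l(candidate, reference), 4))
```

Here the candidate's five tokens all appear in order in the seven-token reference, so precision is 1 and recall is 5/7, which illustrates how a shorter but accurate description is penalized on recall.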
## ⚖️ License
This project is licensed under the MIT License - see the [LICENSE](/TimeTravel/LICENSE) file for details.
## 💬 Contact us
For questions or suggestions, feel free to reach out to us on [GitHub Discussions](https://github.com/mbzuai-oryx/TimeTravel/discussions).
## 📚 Citation
If you use the TimeTravel dataset in your research, please consider citing:
```bibtex
@misc{ghaboura2025timetravelcomprehensivebenchmark,
title={Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts},
author={Sara Ghaboura and Ketan More and Ritesh Thawkar and Wafa Alghallabi and Omkar Thawakar and Fahad Shahbaz Khan and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2502.14865},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.14865},
}
```
---