TimeTravel

TimeTravel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

[Sara Ghaboura](https://huggingface.co/SLMLAH) *   [Ketan More](https://github.com/ketanmore2002) *   [Ritesh Thawkar](https://huggingface.co/SLMLAH)   [Wafa Alghallabi](https://huggingface.co/SLMLAH)   [Omkar Thawakar](https://omkarthawakar.github.io)  
[Fahad Shahbaz Khan](https://scholar.google.com/citations?hl=en&user=zvaeYnUAAAAJ)   [Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)   [Salman Khan](https://scholar.google.com/citations?hl=en&user=M59O9lkAAAAJ)   [Rao M. Anwer](https://scholar.google.com/citations?hl=en&user=_KlvMVoAAAAJ)

[![arXiv](https://img.shields.io/badge/arXiv-2502.14865-F6D769)](https://arxiv.org/abs/2502.14865) [![Our Page](https://img.shields.io/badge/Visit-Our%20Page-E7DAB7?style=flat)](https://mbzuai-oryx.github.io/TimeTravel/) [![GitHub issues](https://img.shields.io/github/issues/mbzuai-oryx/Camel-Bench?color=E5D5C1&label=issues&style=flat)](https://github.com/mbzuai-oryx/TimeTravel/issues) [![GitHub stars](https://img.shields.io/github/stars/mbzuai-oryx/TimeTravel?color=FAF1D9&style=flat)](https://github.com/mbzuai-oryx/TimeTravel/stargazers) [![GitHub license](https://img.shields.io/github/license/mbzuai-oryx/Camel-Bench?color=F1E9E3)](https://github.com/mbzuai-oryx/TimeTravel/blob/main/LICENSE)
*Equal Contribution

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.




## Latest Updates

- 🤗 [19 Feb 2025] The TimeTravel dataset is available on HuggingFace.
- 🔥 [20 Feb 2025] TimeTravel, the first comprehensive open-source benchmark on historical and cultural artifacts, is released.


## Overview

TimeTravel is the first comprehensive benchmark for AI-driven historical artifact analysis, designed to identify artifacts within their historical era and cultural context. Spanning 266 cultural groups across 10 regions, it prioritizes historical knowledge, contextual reasoning, and cultural preservation, unlike generic object recognition benchmarks. With over 10,000 expert-verified samples, TimeTravel sets a new standard for evaluating multimodal models in historical research, cross-civilizational analysis, and AI-powered cultural heritage preservation.

<div style="display: flex; justify-content: space-between;" align="center">

   Figure 1         Figure 2
</div>
Figure 1. Left: TimeTravel Taxonomy maps artifacts from 10 civilizations, 266 cultures, and 10k+ verified samples for AI-driven historical analysis. Right: Regional dataset distribution by archaeological provenance, with Greece holding the largest share (18%) and balanced regional coverage.



## 🌟 Key Features

Key Features of TimeTravel


## TimeTravel Creation Pipeline

The TimeTravel dataset follows a structured pipeline to ensure the accuracy, completeness, and contextual richness of historical artifacts.

<img src="asset/pipe_last.png" width="2600px" height="250px" alt="pipeline" style="margin-right: 2px;"/>

Figure 2. TimeTravel Data Pipeline: A structured workflow for collecting, processing, and refining museum artifact data, integrating GPT-4o-generated descriptions with expert validation for benchmark accuracy.

Our approach consists of four key phases:

- **Data Selection:** Curated 10,250 artifacts from museum collections, spanning 266 cultural groups, with expert validation to ensure historical accuracy and diversity.
- **Data Cleaning:** Addressed missing or incomplete metadata (titles, dates, iconography) by cross-referencing museum archives and academic sources, ensuring data consistency.
- **Generation & Verification:** Used GPT-4o to generate context-aware descriptions, which were refined and validated by historians and archaeologists for authenticity.
- **Data Aggregation:** Standardized and structured the dataset into image-text pairs, making it a valuable resource for AI-driven historical analysis and cultural heritage research.
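
The cleaning and aggregation phases can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the `Artifact` fields, `REQUIRED_FIELDS`, the `archive` lookup, and the `describe` callback (a stand-in for GPT-4o generation followed by expert review) are hypothetical names, and the sample records are made up.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record layout; the real TimeTravel schema may differ."""
    image_path: str
    title: Optional[str] = None
    date: Optional[str] = None
    culture: Optional[str] = None


REQUIRED_FIELDS = ("title", "date", "culture")  # assumed metadata fields


def clean(artifacts: List[Artifact], archive: Dict[str, dict]) -> List[Artifact]:
    """Fill missing metadata from a reference lookup; drop records that stay incomplete."""
    cleaned = []
    for artifact in artifacts:
        extra = archive.get(artifact.image_path, {})
        # Only fill fields that are missing; never overwrite existing metadata.
        fills = {f: extra[f] for f in REQUIRED_FIELDS
                 if getattr(artifact, f) is None and f in extra}
        filled = replace(artifact, **fills)
        if all(getattr(filled, f) for f in REQUIRED_FIELDS):
            cleaned.append(filled)
    return cleaned


def aggregate(artifacts: List[Artifact],
              describe: Callable[[Artifact], str]) -> List[dict]:
    """Pair every image with a description to form image-text pairs."""
    return [{"image": a.image_path, "text": describe(a)} for a in artifacts]


# Made-up sample data for illustration only.
raw = [
    Artifact("amphora.jpg", title="Amphora", culture="Greek"),
    Artifact("unknown.jpg"),
]
archive = {"amphora.jpg": {"date": "ca. 500 BCE"}}
pairs = aggregate(clean(raw, archive),
                  lambda a: f"{a.title} ({a.culture}, {a.date})")
```

Here the first record has its missing date filled from the lookup, while the second is dropped because none of its metadata can be recovered.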

## 🎯 Quantitative Evaluation and Results

The following tables present a comprehensive evaluation of various multimodal models on the TimeTravel benchmark. The first table compares model performance across multiple metrics, while the second analyzes their ability to describe archaeological artifacts from different civilizations, highlighting variations in accuracy and descriptive depth.

<div align="center">

| Model | BLEU | METEOR | ROUGE-L | SPICE | BERTScore | LLM-Judge |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|
| GPT-4o-0806 | 0.1758🏅 | 0.2439 | 0.1230🏅 | 0.1035🏅 | 0.8349🏅 | 0.3013🏅 |
| Gemini-2.0-Flash | 0.1072 | 0.2456 | 0.0884 | 0.0919 | 0.8127 | 0.2630 |
| Gemini-1.5-Pro | 0.1067 | 0.2406 | 0.0848 | 0.0901 | 0.8172 | 0.2276 |
| GPT-4o-mini-0718 | 0.1369 | 0.2658🏅 | 0.1027 | 0.1001 | 0.8283 | 0.2492 |
| Llama-3.2-Vision-Inst | 0.1161 | 0.2072 | 0.1027 | 0.0648 | 0.8111 | 0.1255 |
| Qwen-2.5-VL | 0.1155 | 0.2648 | 0.0887 | 0.1002 | 0.8198 | 0.1792 |
| Llava-Next | 0.1118 | 0.2340 | 0.0961 | 0.0799 | 0.8246 | 0.1161 |

Table: Performance comparison of various closed and open-source models on our proposed TimeTravel benchmark.
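
For readers unfamiliar with the overlap metrics in this table, the sketch below shows one common way to compute ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and equal weighting of precision and recall; the tokenizer and implementation used to produce the reported numbers are not specified here, so this sketch is not expected to reproduce them. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (a simplifying assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Made-up descriptions for illustration only.
score = rouge_l_f1(
    "a bronze vessel from china",
    "a bronze ritual vessel from ancient china",
)
```

In this example all five candidate tokens appear in order in the seven-token reference, so precision is 1, recall is 5/7, and the F1 score is 10/12.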

| Model | India | Roman Emp. | China | British Isles | Iran | Iraq | Japan | Cent. America | Greece | Egypt |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| GPT-4o-0806 | 0.2491🏅 | 0.4463🏅 | 0.2491🏅 | 0.1899🏅 | 0.3522🏅 | 0.3545🏅 | 0.2228🏅 | 0.3144🏅 | 0.2757🏅 | 0.3649🏅 |
| Gemini-2.0-Flash | 0.1859 | 0.3358 | 0.2059 | 0.1556 | 0.3376 | 0.3071 | 0.2000 | 0.2677 | 0.2582 | 0.3602 |
| Gemini-1.5-Pro | 0.1118 | 0.2632 | 0.2139 | 0.1545 | 0.3320 | 0.2587 | 0.1871 | 0.2708 | 0.2088 | 0.2908 |
| GPT-4o-mini-0718 | 0.2311 | 0.3612 | 0.2207 | 0.1866 | 0.2991 | 0.2632 | 0.2087 | 0.3195 | 0.2101 | 0.2501 |
| Llama-3.2-Vision-Inst | 0.0744 | 0.1450 | 0.1227 | 0.0777 | 0.2000 | 0.1155 | 0.1075 | 0.1553 | 0.1351 | 0.1201 |
| Qwen-2.5-VL | 0.0888 | 0.1578 | 0.1192 | 0.1713 | 0.2515 | 0.1576 | 0.1771 | 0.1442 | 0.1442 | 0.2660 |
| Llava-Next | 0.0788 | 0.0961 | 0.1455 | 0.1091 | 0.1464 | 0.1194 | 0.1353 | 0.1917 | 0.1111 | 0.0709 |

Table: Analysis of LLM-Judge evaluation of various models in describing archaeological artifacts across civilizations from different geographical locations.

</div>
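
As a quick way to summarize the per-region LLM-Judge table, the sketch below computes an unweighted (macro) average across the ten regions for each model and ranks the models by it. The scores are copied from the table above; `macro_average` is an illustrative helper, not part of the released evaluation code. Because each region is weighted equally here, the averages need not match the overall LLM-Judge column in the first table, whose aggregation is not described in this README.

```python
REGIONS = [
    "India", "Roman Emp.", "China", "British Isles", "Iran",
    "Iraq", "Japan", "Cent. America", "Greece", "Egypt",
]

# LLM-Judge scores per region, in the order given by REGIONS.
SCORES = {
    "GPT-4o-0806": [0.2491, 0.4463, 0.2491, 0.1899, 0.3522,
                    0.3545, 0.2228, 0.3144, 0.2757, 0.3649],
    "Gemini-2.0-Flash": [0.1859, 0.3358, 0.2059, 0.1556, 0.3376,
                         0.3071, 0.2000, 0.2677, 0.2582, 0.3602],
    "Gemini-1.5-Pro": [0.1118, 0.2632, 0.2139, 0.1545, 0.3320,
                       0.2587, 0.1871, 0.2708, 0.2088, 0.2908],
    "GPT-4o-mini-0718": [0.2311, 0.3612, 0.2207, 0.1866, 0.2991,
                         0.2632, 0.2087, 0.3195, 0.2101, 0.2501],
    "Llama-3.2-Vision-Inst": [0.0744, 0.1450, 0.1227, 0.0777, 0.2000,
                              0.1155, 0.1075, 0.1553, 0.1351, 0.1201],
    "Qwen-2.5-VL": [0.0888, 0.1578, 0.1192, 0.1713, 0.2515,
                    0.1576, 0.1771, 0.1442, 0.1442, 0.2660],
    "Llava-Next": [0.0788, 0.0961, 0.1455, 0.1091, 0.1464,
                   0.1194, 0.1353, 0.1917, 0.1111, 0.0709],
}


def macro_average(scores):
    """Unweighted mean over regions for each model."""
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


averages = macro_average(SCORES)
# Models ordered from highest to lowest macro-averaged score.
ranking = sorted(averages, key=averages.get, reverse=True)

for model in ranking:
    print(f"{model:<24}{averages[model]:.4f}")
```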

## 🧐 TimeTravel Dataset Examples

Figures 3 and 4 showcase the cultural and material diversity of the TimeTravel dataset alongside a cross-model comparison, highlighting variations in artifact representation, historical periods, material compositions, and descriptive accuracy across different AI models.

<div style="display: flex; justify-content: center; align-items: center;">

Figure 3. Cultural and Material Diversity: TimeTravel spans civilizations from Ancient Egypt to Japan, covering prehistoric to medieval eras with artifacts in ceramics, metals, and stone, showcasing historical craftsmanship and cultural heritage.

Figure 4. Cross-Model Comparison: Variations in descriptive depth and accuracy across open- and closed-source models, highlighting interpretative differences and alignment with ground truth.

</div>

## Evaluation

Please refer to the Evaluation folder to reproduce the results.

## ⚖️ License

This project is licensed under the MIT License - see the [LICENSE](/TimeTravel/LICENSE) file for details.

## 💬 Contact us

For questions or suggestions, feel free to reach out to us on [GitHub Discussions](https://github.com/mbzuai-oryx/TimeTravel/discussions).

## 📚 Citation

If you use the TimeTravel dataset in your research, please consider citing:

```bibtex
@misc{ghaboura2025timetravelcomprehensivebenchmark,
      title={Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts},
      author={Sara Ghaboura and Ketan More and Ritesh Thawkar and Wafa Alghallabi and Omkar Thawakar and Fahad Shahbaz Khan and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
      year={2025},
      eprint={2502.14865},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14865},
}
```
---