Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are English-centric. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short- and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8k samples that are manually verified by native language speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, which is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope that ViMUL-Bench, our multilingual video LMM, and the large-scale multilingual video training set will ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released.
Figure: ViMUL is designed to comprehend and generate content in 14 different languages: Arabic, Bengali, Chinese, English, French, German, Hindi, Japanese, Russian, Sinhala, Spanish, Swedish, Tamil, and Urdu, covering at least two-thirds of the global population. The model employs a vision encoder to process video frames, followed by a vision-to-language projector and an LLM. The projected features are then concatenated with the user query and fed into the LLM to generate a response.
The ViMUL architecture is derived from LLaVA-OneVision. It features a three-stage design: (1) a frozen SigLIP visual encoder, (2) a two-layer MLP vision-language projector, and (3) a decoder-only multilingual LLM based on Qwen-2.0. Input video frames are sampled at 1 FPS and passed through the SigLIP encoder. The output features are pooled and projected into the language model's embedding space, then concatenated with multilingual query tokens and fed into the LLM to generate culturally grounded responses. ViMUL is trained end-to-end with a next-token prediction loss over 1.2 million QA pairs across 14 languages, combining open-ended and multiple-choice supervision via multilingual instruction tuning. This enables robust generalization across both high- and low-resource languages.
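To make the three-stage pipeline above concrete, the following is a minimal sketch of the forward pass (frozen SigLIP encoder, two-layer MLP projector, decoder-only multilingual LLM). The class names, the mean-pooling choice, and the interfaces of the passed-in encoder and language model are illustrative assumptions, not the released implementation.

# Minimal sketch of the ViMUL forward pass described above. Class names and the
# exact pooling/concatenation details are illustrative assumptions, not the
# released implementation.
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Two-layer MLP mapping pooled vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features)

class ViMULSketch(nn.Module):
    """Frozen vision encoder -> MLP projector -> decoder-only multilingual LLM."""
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder    # frozen SigLIP encoder
        self.projector = projector              # trainable two-layer MLP
        self.language_model = language_model    # decoder-only multilingual LLM

    def forward(self, frames: torch.Tensor, query_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W), sampled from the video at 1 FPS
        with torch.no_grad():                             # vision encoder stays frozen
            patch_feats = self.vision_encoder(frames)     # (num_frames, num_patches, vision_dim)
        pooled = patch_feats.mean(dim=1)                  # pool patch features per frame
        visual_tokens = self.projector(pooled)            # (num_frames, llm_dim)
        # Prepend projected visual tokens to the multilingual query embeddings
        inputs_embeds = torch.cat([visual_tokens.unsqueeze(0), query_embeds], dim=1)
        # The LLM is trained with a next-token prediction loss over this sequence
        return self.language_model(inputs_embeds=inputs_embeds)

Training would then apply a standard cross-entropy next-token loss over the answer tokens, as in typical visual instruction tuning.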
Table: Comparison of video LMM benchmarks emphasizing multilingual and cultural understanding. Domains represent the aspects covered by each dataset for different languages. Annotation Type is categorized as follows: Human – questions were created in the local language; Human+Auto – questions were generated or translated using GPT-4/Google API and later validated by human experts; Auto – questions were generated or translated automatically without human validation. '-' indicates that information is not available.
Overview of ViMUL-Bench Dataset. ViMUL-Bench consists of 8,025 culturally grounded and linguistically diverse video QA pairs across 15 categories in 14 languages, spanning high- and low-resource languages. It incorporates video content from diverse regions and cultural settings, using 9 language scripts and covering multiple language families. ViMUL-Bench emphasizes spatio-temporal and cultural reasoning in real-world videos across the following categories:
Cultural Categories
Generic Categories
Figure: Data collection and verification pipeline. Our benchmark consists of both culture-specific video content curated from scratch (left) and generic video-QA pairs sourced from existing video LMM benchmarks (right). Cultural videos are scraped using (country, language, sub-topic) triplets and manually filtered for relevance and for private information. With the help of native speakers, we create QA pairs for each language from scratch (except English), with cultural QA pairs translated into English using GPT-4o. ViMUL-Bench features diverse question types and approximately 8K QA pairs in 14 languages; a sketch of the triplet-driven collection step follows below.
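As a rough illustration of the triplet-driven collection step (left side of the figure), the snippet below builds candidate search queries from (country, language, sub-topic) triplets. The example values, the query template, and the downstream search and filtering steps are hypothetical placeholders for the actual scraping and manual review pipeline.

# Illustrative sketch of forming search queries from (country, language, sub-topic)
# triplets for cultural video collection. The triplet values below are examples only;
# the actual scraping platform and the manual filtering steps are not reproduced here.
country_language = [
    ("Japan", "Japanese"),
    ("India", "Hindi"),
    ("Sweden", "Swedish"),
]
sub_topics = ["festivals", "food", "local landmarks"]

triplets = [
    (country, language, topic)
    for country, language in country_language
    for topic in sub_topics
]

# Each triplet would be turned into a platform search query; the returned videos are
# then manually filtered for cultural relevance and for private or identifying content.
for country, language, topic in triplets:
    query = f"{topic} in {country} ({language})"
    print(query)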
Our ViMUL-Bench dataset reflects the multilingual and cultural diversity essential for inclusive video understanding. It includes 8,025 high-quality QA pairs across 14 languages and 9 scripts, drawn from 879 videos and spanning both high- and low-resource settings. The benchmark features 15 distinct categories (generic and cultural) and supports multiple question types, including multiple-choice and short/long open-ended formats. All samples are manually verified by native-language experts to ensure linguistic accuracy, cultural relevance, and robust multilingual coverage.
Figure: Data statistics for ViMUL-Bench, displaying the distribution of QA pairs among the 14 languages (left) and the distribution of QA pairs among the 8 cultural categories (right).
We present evaluations of 6 recent state-of-the-art video LMMs along with our ViMUL, across 14 languages, for both open-ended and multiple-choice QA pairs.
In the heatmap figure below, we present results for both open-source and closed-source models on ViMUL-Bench.
Figure: Performance comparison of video LMMs across 14 languages on ViMUL-Bench. Average accuracy is reported across all question types for each language. Each box represents a model's accuracy for a specific language, with darker shades indicating higher accuracy. The results show that the closed-source model, GPT-4o, generally outperforms its open-source counterparts. In contrast to high-resource languages, methods struggle on low-resource languages (e.g., Sinhala, Urdu, Tamil). Among open-source models, our ViMUL provides a better tradeoff between high- and low-resource languages, achieving an overall gain of 2% over LLaVA-OneVision.
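For reference, the per-language average accuracy shown in the heatmap can be obtained with a simple aggregation over per-question results, as sketched below. The record fields ('language', 'question_type', 'correct') and the macro-average over question types are assumed for illustration and may differ from the released evaluation scripts.

# Sketch of aggregating per-language accuracy across question types, as used for
# the heatmap above. Field names and the averaging scheme are assumptions, not
# the released evaluation code.
from collections import defaultdict

def per_language_accuracy(records):
    """records: iterable of dicts with 'language', 'question_type', and boolean 'correct'."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # lang -> qtype -> [n_correct, n_total]
    for r in records:
        cell = totals[r["language"]][r["question_type"]]
        cell[0] += int(r["correct"])
        cell[1] += 1
    # Average accuracy over question types for each language
    return {
        lang: sum(correct / total for correct, total in qtypes.values()) / len(qtypes)
        for lang, qtypes in totals.items()
    }

example = [
    {"language": "Urdu", "question_type": "mcq", "correct": True},
    {"language": "Urdu", "question_type": "open_short", "correct": False},
    {"language": "English", "question_type": "mcq", "correct": True},
]
print(per_language_accuracy(example))  # {'Urdu': 0.5, 'English': 1.0}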
Figure: The figure on the left illustrates a performance comparison of open-source versus closed-source models, with a distinction between low-resource and high-resource languages in ViMUL-Bench. The figure on the right portrays the performance of different video LMMs across the 15 diverse categories (both generic and cultural) in ViMUL-Bench. Categories in black represent generic categories, and categories in blue represent cultural categories.
Success case predictions: We present qualitative examples of success cases by ViMUL across various languages and categories.
@misc{shafique2025culturallydiversemultilingualmultimodalvideo,
title={A Culturally-diverse Multilingual Multimodal Video Benchmark & Model},
author={Bhuiyan Sanjid Shafique and Ashmal Vayani and Muhammad Maaz and Hanoona Abdul Rasheed and Dinura Dissanayake and Mohammed Irfan Kurpath and Yahya Hmaiti and Go Inoue and Jean Lahoud and Md. Safirur Rashid and Shadid Intisar Quasem and Maheen Fatima and Franco Vidal and Mykola Maslych and Ketan Pravin More and Sanoojan Baliah and Hasindri Watawana and Yuhao Li and Fabian Farestam and Leon Schaller and Roman Tymtsiv and Simon Weber and Hisham Cholakkal and Ivan Laptev and Shin'ichi Satoh and Michael Felsberg and Mubarak Shah and Salman Khan and Fahad Shahbaz Khan},
year={2025},
eprint={2506.07032},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.07032},
}