Kaleidoscope

Evaluating Multimodal LLMs Across Language Boundaries

About Kaleidoscope

Kaleidoscope is a comprehensive evaluation suite for multimodal multiple-choice question answering (MCQA), derived from diverse examination sources across the globe. With 20,911 questions spanning 18 languages and 14 subjects, our benchmark assesses the reasoning and knowledge capabilities of multimodal LLMs in the environments where they are intended to be deployed.

Our benchmark emphasizes multimodal evaluation: 55% of questions require image understanding to be answered correctly, making Kaleidoscope the largest multilingual and multimodal MCQA benchmark to date. We include both region-specific and region-agnostic knowledge to ensure a fair and comprehensive evaluation of AI systems across different cultural contexts.
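As a concrete illustration of the MCQA setup, below is a minimal sketch of how a model could be scored for accuracy on Kaleidoscope-style items. The field names (question, options, answer, image) and the predict callable are illustrative assumptions, not the benchmark's published schema or evaluation harness.

from typing import Callable, Optional

def evaluate_mcqa(items: list[dict],
                  predict: Callable[[str, list[str], Optional[object]], int]) -> float:
    # Score a model on multiple-choice items. Each item is assumed to hold the
    # question text, a list of answer options, the index of the correct option,
    # and an optional image payload (present for the ~55% multimodal questions).
    # These field names are illustrative, not the dataset's official schema.
    correct = 0
    for item in items:
        choice = predict(item["question"], item["options"], item.get("image"))
        correct += int(choice == item["answer"])
    return correct / len(items) if items else 0.0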

Dataset Statistics

Total Questions: 20,911
Languages: 18
Subject Areas: 14
Multimodal Questions: 55%

Key Features

Multilingual Coverage

Coverage of 18 languages, including low-resource languages such as Nepali, Lithuanian, and Telugu, ensuring a comprehensive evaluation across linguistic boundaries.

Multimodal Evaluation

55% of questions require image understanding to be answered correctly, with images categorized by type, including graphs, tables, diagrams, and scientific formulas.

Diverse Knowledge Domains

Questions span 14 subjects including science, humanities, mathematics, and professional fields, covering various educational levels from high school to professional licensing exams.

Human-Crafted Questions

All questions are derived from real examinations created by domain experts, ensuring high quality and reflecting authentic assessment scenarios.

Structured Format

Questions follow a standardized multiple-choice format with detailed metadata, including subject, educational level, and image type, to enable fine-grained analysis (an illustrative record is sketched at the end of this section).

Cultural Relevance

Includes both region-specific and region-agnostic knowledge, ensuring evaluation that respects cultural and geographical diversity.
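To make the structured format described above concrete, here is a sketch of what a single record might look like. The field names and values are illustrative assumptions chosen to reflect the metadata described in this section, not the dataset's published schema.

example_item = {
    "language": "ne",                         # one of the 18 covered languages (e.g. Nepali)
    "question": "...",                        # question text in the original language
    "options": ["...", "...", "...", "..."],  # multiple-choice options
    "answer": 1,                              # index of the correct option
    "image": None,                            # image payload for the ~55% multimodal items
    "subject": "Mathematics",                 # one of 14 subject areas
    "level": "High School",                   # educational level, up to professional licensing exams
    "image_type": "diagram",                  # e.g. graph, table, diagram, scientific formula
    "region_specific": False,                 # region-specific vs. region-agnostic knowledge
}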

Language Statistics