Abstract
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a multilingual benchmark spanning 13 typologically diverse languages. Building on the challenging, reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously reviewed by language experts to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using chain-of-thought (CoT) and direct answer prompting strategies, analyzing their performance across linguistic and cultural boundaries.
MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
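As an illustration of how the benchmark can be consumed, below is a minimal sketch of loading one language split with the Hugging Face `datasets` library. The dataset id `li-lab/MMLU-ProX`, the config name `ja`, and the field names (`question`, `options`, `answer`) are assumptions modeled on the MMLU-Pro schema and may differ from the released files.

```python
# Minimal sketch: load one language split of MMLU-ProX and inspect a sample.
# NOTE: dataset id, config name, and field names below are assumptions
# based on the MMLU-Pro schema; check the released dataset card for the
# exact identifiers.
from datasets import load_dataset

ds = load_dataset("li-lab/MMLU-ProX", "ja", split="test")  # Japanese split (assumed config name)

sample = ds[0]
print(sample["question"])                        # question text in the target language
for i, option in enumerate(sample["options"]):   # up to 10 answer options, as in MMLU-Pro
    print(f"{chr(ord('A') + i)}. {option}")
print("Gold answer:", sample["answer"])          # gold option letter
```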
Languages Covered (13)
English (EN), Chinese (ZH), Japanese (JA), Korean (KO), French (FR), German (DE), Spanish (ES), Portuguese (PT), Arabic (AR), Thai (TH), Hindi (HI), Bengali (BN), and Swahili (SW)
Features
- Extensive Language Coverage: 13 typologically diverse languages from various language families
- Reasoning-Focused Design: Built upon MMLU-Pro, maintaining its challenging nature and reasoning focus
- High-Quality Translations: Semi-automatic translation process with expert verification
- Comprehensive Evaluation: Tested on 25 state-of-the-art LLMs with multiple prompting strategies (see the prompt sketch after this list)
- Open Source: Dataset and evaluation code available to the research community
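To make the two prompting strategies concrete, here is a hedged sketch of how CoT and direct-answer prompts could be built from a benchmark item. It assumes the `question` and `options` fields from the loading example above; the instruction wording used by the official evaluation code may differ.

```python
# Sketch of the two prompting strategies evaluated on MMLU-ProX.
# The exact instruction phrasing is an assumption, not the official template.
def format_question(item: dict) -> str:
    letters = "ABCDEFGHIJ"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(item["options"]))
    return f"Question: {item['question']}\nOptions:\n{opts}"

def cot_prompt(item: dict) -> str:
    # Chain-of-thought: ask the model to reason step by step before answering.
    return format_question(item) + "\nLet's think step by step, then give the final answer as a single option letter."

def direct_prompt(item: dict) -> str:
    # Direct answer: ask only for the option letter, with no intermediate reasoning.
    return format_question(item) + "\nAnswer with the letter of the correct option only."
```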
Performance Results (5-shot CoT, accuracy %)
| Models | Overall | EN | ZH | JA | KO | FR | DE | ES | PT | AR | TH | HI | BN | SW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-72B | 62.0 | 70.3 | 65.9 | 63.4 | 62.1 | 67.1 | 65.9 | 66.5 | 66.6 | 62.1 | 60.1 | 58.0 | 57.6 | 40.1 |
| QwQ-32B | 60.2 | 70.7 | 65.7 | 62.8 | 62.6 | 67.4 | 63.0 | 66.7 | 65.3 | 62.0 | 61.7 | 49.1 | 52.7 | 32.8 |
| Llama3.1-405B | 60.1 | 68.8 | 62.5 | 59.9 | 51.6 | 65.1 | 64.4 | 64.9 | 64.3 | 55.4 | 59.1 | 58.0 | 54.9 | 52.1 |
| Llama3.3-70B | 57.1 | 65.7 | 58.4 | 57.0 | 54.5 | 62.1 | 59.8 | 61.5 | 61.4 | 51.0 | 56.0 | 55.4 | 50.1 | 49.0 |
| Phi4-14B | 55.2 | 63.7 | 58.8 | 54.7 | 54.5 | 62.9 | 62.2 | 63.0 | 62.5 | 54.6 | 49.9 | 49.4 | 43.7 | 37.9 |
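For reference, per-language scores like those above can be computed by extracting the predicted option letter from each model completion and comparing it with the gold answer. The snippet below is an illustrative heuristic, not the official answer-extraction logic of the MMLU-ProX evaluation code.

```python
# Illustrative scoring sketch: take the last standalone option letter (A-J)
# mentioned in a completion, e.g. from "... the answer is (C).", and compute
# accuracy against the gold letters for one language split.
import re

def extract_choice(completion: str) -> str | None:
    matches = re.findall(r"\b([A-J])\b", completion)
    return matches[-1] if matches else None

def accuracy(completions: list[str], gold_letters: list[str]) -> float:
    correct = sum(extract_choice(c) == g for c, g in zip(completions, gold_letters))
    return 100.0 * correct / len(gold_letters)
```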
Key Findings
- Consistent performance degradation from high-resource to low-resource languages across all models
- Larger models consistently outperform smaller counterparts within the same family
- Different prompting strategies show varying effectiveness depending on language resource levels
- Reasoning-enhanced training yields inconsistent benefits across different languages
Citation
Acknowledgments
This research was supported by several organizations. JSPS KAKENHI provided funding under Grant Number 24K20832, and additional support came from JST ACT-X, Grant Number JPMJAX24CU. We also acknowledge NVIDIA's support through its Academic Grant Program and Google's support via the Gemma Academic Program.