MMLU-ProX

A Multilingual Benchmark for Advanced Large Language Model Evaluation

Evaluating language models across 29 typologically diverse languages with challenging reasoning tasks

The University of Tokyo, Japan; Duke-NUS Medical School, Singapore; Waseda University, Japan; Northwestern University, United States; Carnegie Mellon University, United States; Yale University, United States; University College Dublin, Ireland; Nanyang Technological University, Singapore; Smartor LLC, Japan; University of California, Berkeley, United States; University of New South Wales, Australia; Singapore Management University, Singapore; New York University, United States; Polytechnique Montreal, Canada; University of Geneva, Switzerland; University of Alberta, Canada

Abstract

Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it difficult to comprehensively assess LLM performance in multilingual settings. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built upon the English MMLU-Pro benchmark. Each language version contains the same 11,829 questions, enabling direct cross-linguistic comparisons. To support efficient evaluation, we also provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process in which multiple powerful LLMs produce candidate translations, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized models. The results reveal significant disparities in the multilingual capabilities of LLMs: while they perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
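
The sketch below shows one way to load a single language version of the benchmark with the Hugging Face datasets library and confirm that two language versions are parallel. The repository identifier, per-language configuration names, and the "-Lite" naming for the 658-question subset are assumptions for illustration; the released dataset card is the authoritative reference.

    # Minimal loading sketch, assuming the dataset is published on the Hugging Face Hub.
    # The repository ID, per-language configuration names, and the "-Lite" suffix for the
    # 658-question subset are placeholders, not confirmed values.
    from datasets import load_dataset

    DATASET_ID = "li-lab/MMLU-ProX"  # assumed Hub ID; check the dataset card

    def load_language(lang: str = "en", lite: bool = False):
        """Load the full (11,829-question) or lite (658-question) set for one language."""
        repo = DATASET_ID + ("-Lite" if lite else "")   # assumed lite naming
        return load_dataset(repo, lang, split="test")   # assumed: one config per language code

    en = load_language("en")
    sw = load_language("sw")
    # The benchmark is parallel: every language version carries the same questions.
    assert len(en) == len(sw)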

Languages Covered (29)

English (EN), Chinese (ZH), Japanese (JA), Korean (KO), French (FR), German (DE), Spanish (ES), Portuguese (PT), Arabic (AR), Thai (TH), Hindi (HI), Bengali (BN), Swahili (SW), Afrikaans (AF), Czech (CS), Hungarian (HU), Indonesian (ID), Italian (IT), Marathi (MR), Nepali (NE), Russian (RU), Serbian (SR), Telugu (TE), Ukrainian (UK), Urdu (UR), Vietnamese (VI), Wolof (WO), Yoruba (YO), Zulu (ZU)

Features

  • Extensive Language Coverage: 29 typologically diverse languages from various language families
  • Reasoning-Focused Design: Built upon MMLU-Pro, maintaining its challenging nature and reasoning focus
  • High-Quality Translations: Semi-automatic translation process with expert verification
  • Comprehensive Evaluation: Tested on 36 state-of-the-art LLMs with multiple prompting strategies (a 5-shot CoT prompting sketch follows this list)
  • Open Source: Dataset and evaluation code available to the research community
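
The sketch below illustrates how a 5-shot chain-of-thought prompt of the kind used for the results in the next section might be assembled. The field names (question, options, cot_content) follow the MMLU-Pro record layout and are assumptions here, as is the exact prompt wording.

    # Illustrative sketch of 5-shot chain-of-thought prompt construction.
    # Field names and prompt wording are assumptions, not the exact protocol from the paper.
    LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions carry up to ten options

    def format_question(example: dict, include_answer: bool = False) -> str:
        lines = [f"Question: {example['question']}"]
        for letter, option in zip(LETTERS, example["options"]):
            lines.append(f"{letter}. {option}")
        if include_answer:
            # Few-shot exemplars include the worked reasoning and final answer.
            lines.append(f"Answer: {example['cot_content']}")
        else:
            lines.append("Answer: Let's think step by step.")
        return "\n".join(lines)

    def build_five_shot_prompt(exemplars: list[dict], target: dict) -> str:
        shots = [format_question(ex, include_answer=True) for ex in exemplars[:5]]
        return "\n\n".join(shots + [format_question(target)])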

Performance Results (5-shot CoT accuracy, %)

Model Avg EN FR DE ES PT IT HI BN UR TE MR NE ZH JA KO VI TH ID AR AF SW WO YO ZU RU UK SR CS HU
Qwen3-235B-Think 74.9 80.7 80.6 80.4 80.7 80.5 80.9 78.7 77.8 76.1 77.9 78.5 78.1 77.4 77.1 78.3 72.6 77.1 79.9 78.7 80.6 70.8 36.9 49.3 46.4 77.0 78.8 80.2 80.5 79.8
DeepSeek-R1 75.5 79.5 81.3 76.7 80.2 78.0 79.9 77.5 66.6 76.2 71.9 70.4 78.9 78.0 76.9 76.7 76.3 78.7 81.3 76.2 80.9 75.0 58.6 57.0 67.3 76.4 76.8 80.9 76.8 79.1
GPT-4.1 72.7 79.8 75.7 76.4 77.8 77.0 78.2 74.5 72.2 68.3 65.9 72.2 74.2 75.5 75.6 75.4 76.7 75.1 75.6 74.1 77.2 71.9 43.2 53.4 65.0 71.2 76.4 76.9 77.5 76.6
DeepSeek-V3 70.5 79.6 76.3 75.1 76.9 75.7 75.9 71.6 69.8 70.3 67.6 69.8 69.3 73.9 72.9 70.7 75.4 71.2 75.8 72.4 72.9 63.4 47.3 47.7 53.7 74.9 74.2 72.9 74.7 71.4
o4-mini 69.3 73.7 72.2 73.5 74.7 74.1 73.9 71.8 70.1 72.0 69.1 70.7 71.5 72.6 71.5 73.2 73.4 72.0 73.8 72.5 73.5 66.9 24.1 54.9 61.2 62.0 73.3 72.6 73.5 72.6
Qwen3-235B 66.7 73.5 72.5 71.3 73.2 73.1 73.7 67.6 67.7 68.7 66.7 67.7 67.8 70.5 68.8 69.6 71.4 68.8 72.5 70.1 71.1 56.3 26.6 40.2 46.2 72.9 72.5 71.1 71.8 70.1
Qwen3-32B-Think 66.3 74.9 72.1 71.7 72.8 72.7 73.5 70.4 66.4 70.8 70.3 70.7 70.7 68.7 70.2 71.2 72.4 70.4 73.4 70.4 72.4 56.7 26.6 18.8 35.2 69.1 73.5 72.3 72.8 71.1
Qwen3-14B-Think 65.4 74.7 72.2 71.4 72.2 71.6 73.0 67.9 66.8 68.0 64.6 66.9 67.3 66.5 70.8 70.0 71.4 69.8 72.3 69.1 71.3 48.0 28.3 32.5 32.3 72.2 71.3 71.6 71.1 69.8
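
As a sanity check on how the Avg column relates to the per-language scores, the short sketch below recomputes the average and the largest cross-language gap for the Qwen3-235B-Think row, with values copied directly from the table.

    # Recompute the Avg column and the widest cross-language gap for one row of the table.
    # Scores are the Qwen3-235B-Think per-language accuracies copied from above.
    scores = {
        "EN": 80.7, "FR": 80.6, "DE": 80.4, "ES": 80.7, "PT": 80.5, "IT": 80.9,
        "HI": 78.7, "BN": 77.8, "UR": 76.1, "TE": 77.9, "MR": 78.5, "NE": 78.1,
        "ZH": 77.4, "JA": 77.1, "KO": 78.3, "VI": 72.6, "TH": 77.1, "ID": 79.9,
        "AR": 78.7, "AF": 80.6, "SW": 70.8, "WO": 36.9, "YO": 49.3, "ZU": 46.4,
        "RU": 77.0, "UK": 78.8, "SR": 80.2, "CS": 80.5, "HU": 79.8,
    }

    average = sum(scores.values()) / len(scores)
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    print(f"Average over 29 languages: {average:.1f}")          # 74.9, matching the Avg column
    print(f"Widest gap: {best} {scores[best]} vs {worst} {scores[worst]}")  # IT 80.9 vs WO 36.9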

Key Findings

  • Consistent performance degradation from high-resource to low-resource languages across all models, with particularly notable challenges in African languages like Wolof, Yoruba, and Zulu
  • Larger models consistently outperform smaller counterparts within the same family across all 29 languages
  • Different prompting strategies show varying effectiveness depending on language resource levels and script types
  • Reasoning-enhanced training yields inconsistent benefits across different language families and geographic regions
  • European languages generally score higher than Asian, African, and South Asian languages, highlighting disparities in resource availability

Citation

@misc{xuan2025mmluproxmultilingualbenchmarkadvanced,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Aosong Feng and Dairui Liu and Yun Xing and Junjue Wang and Fan Gao and Jinghui Lu and Yuang Jiang and Huitao Li and Xin Li and Kunyu Yu and Ruihai Dong and Shangding Gu and Yuekang Li and Xiaofei Xie and Felix Juefei-Xu and Foutse Khomh and Osamu Yoshie and Qingyu Chen and Douglas Teodoro and Nan Liu and Randy Goebel and Lei Ma and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
  year={2025},
  eprint={2503.10497},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.10497},
}

Acknowledgments

This research was supported by several organizations. The Japan Society for the Promotion of Science (JSPS) provided funding through KAKENHI Grant Number 24K20832. Additional support was received from JST ACT-X under Grant Number JPMJAX24CU. We also acknowledge the contributions of NVIDIA through their Academic Grant Program and Google via the Gemma Academic Program.