Abstract
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual benchmarks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it challenging to comprehensively assess LLMs' performance in multilingual settings. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built upon the English MMLU-Pro benchmark. Each language version consists of the same 11,829 questions, enabling direct cross-linguistic comparisons. To support efficient evaluation, we also provide a lite version containing 658 questions per language. To ensure high quality, MMLU-ProX follows a rigorous development process in which multiple powerful LLMs produce translations that are then reviewed by experts for accurate expression, consistent terminology, and cultural relevance. Building on this benchmark, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized models. The results reveal significant disparities in multilingual capability: while models perform well in high-resource languages, their performance declines markedly in low-resource languages, with gaps of up to 24.3%. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
Languages Covered (29)
English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT), Italian (IT), Hindi (HI), Bengali (BN), Urdu (UR), Telugu (TE), Marathi (MR), Nepali (NE), Chinese (ZH), Japanese (JA), Korean (KO), Vietnamese (VI), Thai (TH), Indonesian (ID), Arabic (AR), Afrikaans (AF), Swahili (SW), Wolof (WO), Yoruba (YO), Zulu (ZU), Russian (RU), Ukrainian (UK), Serbian (SR), Czech (CS), Hungarian (HU)
Features
- Extensive Language Coverage: 29 typologically diverse languages from various language families
- Reasoning-Focused Design: Built upon MMLU-Pro, maintaining its challenging nature and reasoning focus
- High-Quality Translations: Semi-automatic translation process with expert verification
- Comprehensive Evaluation: Tested on 36 state-of-the-art LLMs with multiple prompting strategies
- Open Source: Dataset and evaluation code available to the research community (see the loading sketch below)
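If the released data follows the usual Hugging Face layout, a per-language split can be loaded as sketched below. This is a minimal sketch: the repository ID `li-lab/MMLU-ProX`, the config name, and the split name are illustrative assumptions, not confirmed identifiers; check the dataset card for the exact values.

```python
# Sketch: loading one language split of MMLU-ProX with the Hugging Face `datasets` library.
# The repository ID, config name ("sw" for Swahili), and split are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset("li-lab/MMLU-ProX", "sw", split="test")  # hypothetical identifiers

print(len(dataset))       # expected: 11,829 questions per language (658 in the lite version)
print(dataset[0].keys())  # inspect the available fields
```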
Performance Results (5-shot CoT accuracy, %)
Model | Avg | EN | FR | DE | ES | PT | IT | HI | BN | UR | TE | MR | NE | ZH | JA | KO | VI | TH | ID | AR | AF | SW | WO | YO | ZU | RU | UK | SR | CS | HU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Qwen3-235B-Think | 74.9 | 80.7 | 80.6 | 80.4 | 80.7 | 80.5 | 80.9 | 78.7 | 77.8 | 76.1 | 77.9 | 78.5 | 78.1 | 77.4 | 77.1 | 78.3 | 72.6 | 77.1 | 79.9 | 78.7 | 80.6 | 70.8 | 36.9 | 49.3 | 46.4 | 77.0 | 78.8 | 80.2 | 80.5 | 79.8 |
DeepSeek-R1 | 75.5 | 79.5 | 81.3 | 76.7 | 80.2 | 78.0 | 79.9 | 77.5 | 66.6 | 76.2 | 71.9 | 70.4 | 78.9 | 78.0 | 76.9 | 76.7 | 76.3 | 78.7 | 81.3 | 76.2 | 80.9 | 75.0 | 58.6 | 57.0 | 67.3 | 76.4 | 76.8 | 80.9 | 76.8 | 79.1 |
GPT-4.1 | 72.7 | 79.8 | 75.7 | 76.4 | 77.8 | 77.0 | 78.2 | 74.5 | 72.2 | 68.3 | 65.9 | 72.2 | 74.2 | 75.5 | 75.6 | 75.4 | 76.7 | 75.1 | 75.6 | 74.1 | 77.2 | 71.9 | 43.2 | 53.4 | 65.0 | 71.2 | 76.4 | 76.9 | 77.5 | 76.6 |
DeepSeek-V3 | 70.5 | 79.6 | 76.3 | 75.1 | 76.9 | 75.7 | 75.9 | 71.6 | 69.8 | 70.3 | 67.6 | 69.8 | 69.3 | 73.9 | 72.9 | 70.7 | 75.4 | 71.2 | 75.8 | 72.4 | 72.9 | 63.4 | 47.3 | 47.7 | 53.7 | 74.9 | 74.2 | 72.9 | 74.7 | 71.4 |
o4-mini | 69.3 | 73.7 | 72.2 | 73.5 | 74.7 | 74.1 | 73.9 | 71.8 | 70.1 | 72.0 | 69.1 | 70.7 | 71.5 | 72.6 | 71.5 | 73.2 | 73.4 | 72.0 | 73.8 | 72.5 | 73.5 | 66.9 | 24.1 | 54.9 | 61.2 | 62.0 | 73.3 | 72.6 | 73.5 | 72.6 |
Qwen3-235B | 66.7 | 73.5 | 72.5 | 71.3 | 73.2 | 73.1 | 73.7 | 67.6 | 67.7 | 68.7 | 66.7 | 67.7 | 67.8 | 70.5 | 68.8 | 69.6 | 71.4 | 68.8 | 72.5 | 70.1 | 71.1 | 56.3 | 26.6 | 40.2 | 46.2 | 72.9 | 72.5 | 71.1 | 71.8 | 70.1 |
Qwen3-32B-Think | 66.3 | 74.9 | 72.1 | 71.7 | 72.8 | 72.7 | 73.5 | 70.4 | 66.4 | 70.8 | 70.3 | 70.7 | 70.7 | 68.7 | 70.2 | 71.2 | 72.4 | 70.4 | 73.4 | 70.4 | 72.4 | 56.7 | 26.6 | 18.8 | 35.2 | 69.1 | 73.5 | 72.3 | 72.8 | 71.1 |
Qwen3-14B-Think | 65.4 | 74.7 | 72.2 | 71.4 | 72.2 | 71.6 | 73.0 | 67.9 | 66.8 | 68.0 | 64.6 | 66.9 | 67.3 | 66.5 | 70.8 | 70.0 | 71.4 | 69.8 | 72.3 | 69.1 | 71.3 | 48.0 | 28.3 | 32.5 | 32.3 | 72.2 | 71.3 | 71.6 | 71.1 | 69.8 |
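The 5-shot CoT scores above come from prompts that prepend five worked examples before the test question. The sketch below shows one way such a prompt could be assembled from dataset rows; the field names (`question`, `options`, `answer`) are assumptions about the schema, and the template used in the official evaluation harness may differ.

```python
# Sketch: assembling a 5-shot chain-of-thought prompt for one MMLU-ProX question.
# Field names (question / options / answer) are assumed; the official harness may differ.
from string import ascii_uppercase


def format_question(example: dict, with_answer: bool = False) -> str:
    """Render one multiple-choice question, optionally with its reference answer."""
    lines = [f"Question: {example['question']}"]
    for letter, option in zip(ascii_uppercase, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer: Let's think step by step.")
    if with_answer:
        lines.append(f"The answer is ({example['answer']}).")
    return "\n".join(lines)


def build_prompt(few_shot_examples: list[dict], test_example: dict) -> str:
    """Concatenate five solved examples followed by the unsolved test question."""
    shots = [format_question(ex, with_answer=True) for ex in few_shot_examples[:5]]
    return "\n\n".join(shots + [format_question(test_example)])
```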
Key Findings
- Consistent performance degradation from high-resource to low-resource languages across all models, with particularly notable challenges in African languages like Wolof, Yoruba, and Zulu
- Larger models consistently outperform smaller counterparts within the same family across all 29 languages
- Different prompting strategies show varying effectiveness depending on language resource levels and script types
- Reasoning-enhanced training yields inconsistent benefits across different language families and geographic regions
- European languages generally outperform Asian, African, and South Asian languages, reflecting disparities in resource availability (an illustrative gap calculation follows this list)
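The resource-level gaps described above can be read directly off the results table. The short script below computes one such gap, English (EN) versus Wolof (WO), for a few of the listed models; the scores are copied verbatim from the table, and the choice of anchor languages is only an example.

```python
# Sketch: quantifying the high- vs. low-resource gap using two columns of the table above.
# English (EN) and Wolof (WO) scores are copied from the 5-shot CoT results.
en_wo_scores = {
    "Qwen3-235B-Think": (80.7, 36.9),
    "DeepSeek-R1": (79.5, 58.6),
    "GPT-4.1": (79.8, 43.2),
    "DeepSeek-V3": (79.6, 47.3),
    "o4-mini": (73.7, 24.1),
}

for model, (en, wo) in en_wo_scores.items():
    print(f"{model}: EN {en:.1f} vs WO {wo:.1f} -> gap {en - wo:.1f} points")
```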
Citation
Acknowledgments
This research was supported by several organizations. The Japan Society for the Promotion of Science (JSPS) provided funding through KAKENHI Grant Number 24K20832. Additional support was received from JST ACT-X, Grant Number JPMJAX24CU. We also acknowledge the contributions of NVIDIA through their Academic Grant Program and Google via the Gemma Academic Program.