MMLU-ProX

A Multilingual Benchmark for Advanced Large Language Model Evaluation

Evaluating language models across 13 typologically diverse languages with challenging reasoning tasks

Weihao Xuan1, Rui Yang2, Heli Qi3, Qingcheng Zeng4, Yunze Xiao5, Yun Xing6, Junjue Wang1, Huitao Li2, Xin Li2, Kunyu Yu2, Nan Liu2, Qingyu Chen7, Douglas Teodoro8, Edison Marrese-Taylor1, Shijian Lu6, Yusuke Iwasawa1, Yutaka Matsuo1, Irene Li1
1The University of Tokyo, 2Duke-NUS Medical School, 3Waseda University
4Northwestern University, 5Carnegie Mellon University
6Nanyang Technological University, 7Yale University, 8University of Geneva

Abstract

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a multilingual benchmark spanning 13 typologically diverse languages. Building on the challenging, reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously reviewed by language experts to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using chain-of-thought (CoT) and direct-answer prompting strategies, analyzing their performance across linguistic and cultural boundaries.

MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
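
To make the semi-automatic translation process concrete, here is a minimal sketch of an LLM-translate-then-expert-review loop. The model name, prompt wording, and review flag are illustrative assumptions, not the project's actual pipeline.

    # Sketch of a semi-automatic translation step: an LLM produces a draft
    # translation, and every draft is queued for expert review before release.
    # Model choice, prompt text, and field names are assumptions for illustration.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def translate_item(question: str, options: list[str], target_lang: str) -> dict:
        """Translate one MMLU-Pro item, asking the model to keep terminology consistent."""
        prompt = (
            f"Translate the following multiple-choice question into {target_lang}. "
            "Preserve technical terminology and do not change the meaning of any option.\n\n"
            f"Question: {question}\nOptions:\n" + "\n".join(options)
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model choice
            messages=[{"role": "user", "content": prompt}],
        )
        # Marking every draft for human review is what makes the process "semi-automatic".
        return {"translation": resp.choices[0].message.content, "needs_expert_review": True}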

Languages Covered (13)

🇬🇧 English (EN) · 🇨🇳 Chinese (ZH) · 🇯🇵 Japanese (JA) · 🇰🇷 Korean (KO) · 🇫🇷 French (FR) · 🇩🇪 German (DE) · 🇪🇸 Spanish (ES) · 🇵🇹 Portuguese (PT) · 🇸🇦 Arabic (AR) · 🇹🇭 Thai (TH) · 🇮🇳 Hindi (HI) · 🇧🇩 Bengali (BN) · 🇰🇪 Swahili (SW)

Features

  • Extensive Language Coverage: 13 typologically diverse languages from various language families
  • Reasoning-Focused Design: Built upon MMLU-Pro, maintaining its challenging nature and reasoning focus
  • High-Quality Translations: Semi-automatic translation process with expert verification
  • Comprehensive Evaluation: Tested on 25 state-of-the-art LLMs with multiple prompting strategies
  • Open Source: Dataset and evaluation code available to the research community
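
Since the dataset is public, a minimal sketch of loading one language split with the Hugging Face datasets library is shown below; the repository id, configuration names, and field names are illustrative assumptions, not a confirmed schema.

    # Load one language configuration of MMLU-ProX (repo id and fields assumed).
    from datasets import load_dataset

    swahili = load_dataset("li-lab/MMLU-ProX", "sw", split="test")  # hypothetical repo id / config

    for row in swahili.select(range(3)):
        print(row["question"])   # assumed field: question text
        print(row["options"])    # assumed field: list of answer choices
        print(row["answer"])     # assumed field: gold option letter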

Performance Results (5-shot CoT, accuracy %)

Models Overall EN ZH JA KO FR DE ES PT AR TH HI BN SW
Qwen2.5-72B 62.0 70.3 65.9 63.4 62.1 67.1 65.9 66.5 66.6 62.1 60.1 58.0 57.6 40.1
QwQ-32B 60.2 70.7 65.7 62.8 62.6 67.4 63.0 66.7 65.3 62.0 61.7 49.1 52.7 32.8
Llama3.1-405B 60.1 68.8 62.5 59.9 51.6 65.1 64.4 64.9 64.3 55.4 59.1 58.0 54.9 52.1
Llama3.3-70B 57.1 65.7 58.4 57.0 54.5 62.1 59.8 61.5 61.4 51.0 56.0 55.4 50.1 49.0
Phi4-14B 55.2 63.7 58.8 54.7 54.5 62.9 62.2 63.0 62.5 54.6 49.9 49.4 43.7 37.9
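
The scores above come from 5-shot chain-of-thought prompting. Below is a minimal sketch of how such an evaluation can be assembled and scored: five worked examples precede the target question, the model reasons step by step, and the final option letter is extracted. The prompt template, field names, and regex are our own assumptions, not the official evaluation harness.

    # Sketch of 5-shot CoT evaluation: prompt construction and answer extraction.
    import re

    def format_options(options: list[str]) -> str:
        # Label options (A), (B), ... ; MMLU-Pro items have up to ten choices.
        return "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))

    def build_cot_prompt(few_shot_examples: list[dict], item: dict) -> str:
        parts = []
        for ex in few_shot_examples:  # five solved examples with reasoning ('cot' field assumed)
            parts.append(
                f"Question: {ex['question']}\nOptions:\n{format_options(ex['options'])}\n"
                f"Answer: Let's think step by step. {ex['cot']} The answer is ({ex['answer']})."
            )
        parts.append(
            f"Question: {item['question']}\nOptions:\n{format_options(item['options'])}\n"
            "Answer: Let's think step by step."
        )
        return "\n\n".join(parts)

    def extract_choice(model_output: str) -> str | None:
        """Take the last '(X)'-style letter in the model output as its final answer."""
        matches = re.findall(r"\(([A-J])\)", model_output)
        return matches[-1] if matches else None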

Key Findings

  • Consistent performance degradation from high-resource to low-resource languages across all models
  • Larger models consistently outperform smaller counterparts within the same family
  • Different prompting strategies show varying effectiveness depending on language resource levels
  • Reasoning-enhanced training yields inconsistent benefits across different languages

Citation

@misc{mmluprox,
      title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
      author={Weihao Xuan and Rui Yang and Heli Qi and Qingcheng Zeng and Yunze Xiao and Yun Xing and Junjue Wang and Huitao Li and Xin Li and Kunyu Yu and Nan Liu and Qingyu Chen and Douglas Teodoro and Edison Marrese-Taylor and Shijian Lu and Yusuke Iwasawa and Yutaka Matsuo and Irene Li},
      year={2025},
      eprint={2503.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.10497},
}

Acknowledgments

This research was supported by several organizations. The Japan Society for the Promotion of Science (JSPS) provided funding through KAKENHI Grant Number 24K20832. Additional support was received from JST ACT-X under Grant Number JPMJAX24CU. We also acknowledge the contributions of NVIDIA through their Academic Grant Program and Google via the Gemma Academic Program.