
The new MMLU-Pro benchmark addresses several limitations of existing benchmarks, most notably the original MMLU. Firstly, it incorporates more challenging, reasoning-intensive tasks, which helps differentiate closely performing systems and better evaluates the advanced reasoning capabilities of state-of-the-art LLMs. Secondly, it increases the number of distractor options from three to nine, making the benchmark more robust and less sensitive to prompt variations. Lastly, MMLU-Pro applies a two-round expert review process to reduce dataset noise and improve question quality. The result is a broader and more discriminative evaluation spanning 14 diverse domains and more than 12,000 questions.
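The scale claims above are straightforward to check directly. The sketch below assumes the public release of MMLU-Pro on the Hugging Face Hub under TIGER-Lab/MMLU-Pro, with question, options, and category fields; if the schema differs, consult the dataset card.

```python
from collections import Counter
from datasets import load_dataset  # Hugging Face `datasets` library

# Load the public MMLU-Pro test split (dataset name and field names are
# assumptions based on the released version; check the dataset card).
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

print(f"questions: {len(mmlu_pro)}")                    # roughly 12,000
print(f"domains:   {len(set(mmlu_pro['category']))}")   # 14 disciplines
print(Counter(len(o) for o in mmlu_pro["options"]))     # mostly ten options per question
```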

MMLU-Pro enhances the challenge and complexity of tasks compared to its predecessor, MMLU, in several ways:
Increased Answer Options: MMLU-Pro significantly increases the number of answer options from four to ten per question. This makes the evaluation more realistic and challenging: it lowers the expected score from random guessing from 25% to 10% (see the short sketch after this list) and demands more discerning reasoning to select the correct answer.
Higher Difficulty: The new dataset includes more reasoning-focused problems, increasing the overall difficulty. As a result, models prompted with Chain-of-Thought (CoT) reasoning can outperform direct, perplexity-based (PPL) answering by up to 20% (see the prompting sketch after the conclusion below).
Performance Stability: Due to the increased options, model performance on MMLU-Pro is more stable. For instance, Llama-2-7B shows less than 1% performance variance with different prompts, compared to 4-5% on the original MMLU.
More Robust and Challenging Questions: MMLU-Pro incorporates more complex, reasoning-intensive tasks and raises the number of distractor options from three to nine. This yields a broader and more discriminative evaluation, better suited to assessing the advanced reasoning capabilities of state-of-the-art LLMs.
Rigorous Construction Process: MMLU-Pro's dataset is built by integrating questions from various high-quality sources and then filtering and refining them, removing overly simple or erroneous questions. This process ensures a diverse and challenging question set for evaluation.
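To make the guessing-baseline point concrete, here is a minimal sketch (plain Python, no benchmark-specific code) that estimates the expected accuracy of uniform random guessing with four versus ten options:

```python
import random

def random_guess_accuracy(num_options: int, trials: int = 100_000) -> float:
    """Estimate the accuracy of guessing uniformly at random among num_options."""
    correct = 0
    for _ in range(trials):
        answer = random.randrange(num_options)  # index of the true answer
        guess = random.randrange(num_options)   # uniform random guess
        correct += guess == answer
    return correct / trials

print(f"4 options:  ~{random_guess_accuracy(4):.3f}")   # ≈ 0.25
print(f"10 options: ~{random_guess_accuracy(10):.3f}")  # ≈ 0.10
```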
In conclusion, MMLU-Pro's enhancements in terms of answer options, difficulty, performance stability, and construction process contribute to a more challenging and robust benchmark for evaluating the capabilities of large language models.
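To illustrate the CoT-versus-direct-answering comparison from the list above, here is a hedged sketch of how a CoT prompt might be built and its final choice extracted. The prompt wording and the extraction regex are assumptions for illustration, not the official evaluation script.

```python
import re

OPTION_LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions carry up to ten options

def format_question(question: str, options: list[str]) -> str:
    lines = [question]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)

def cot_prompt(question: str, options: list[str]) -> str:
    # Chain-of-Thought: ask the model to reason step by step before committing
    # to a letter, instead of scoring each option by perplexity.
    return (format_question(question, options)
            + "\nLet's think step by step, then finish with \"The answer is (X)\".")

def extract_choice(completion: str) -> str | None:
    # Pull the final letter out of a CoT completion; this pattern is an
    # assumption, not necessarily the regex used by the official harness.
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None
```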

MMLU-Pro implements several measures to ensure the robustness and reliability of its evaluations:
Increased Number of Answer Options: MMLU-Pro increases the answer options per question from four to ten. This makes the evaluation more realistic and challenging, cutting the expected score from random guessing from 25% to 10%.
Higher Difficulty: The new dataset includes more reasoning-focused problems, increasing the overall difficulty. Consequently, Chain-of-Thought (CoT) prompting can outperform direct, perplexity-based (PPL) answering by up to 20%.
Performance Stability: Due to the increased options, model performance on MMLU-Pro is more stable. For example, Llama-2-7B shows less than 1% performance variance with different prompts, compared to 4-5% on the original MMLU.
Two-Round Expert Review Process: MMLU-Pro questions pass through a two-round expert review to reduce dataset noise and enhance question quality, which improves the reliability and robustness of the benchmark.
Diverse and Challenging Question Set: MMLU-Pro’s dataset construction involves integrating questions from various high-quality sources, including the original MMLU, STEM websites, TheoremQA, and SciBench, ensuring a diverse and challenging question set.
Evaluation Across Different Prompt Styles: The benchmark evaluates model performance across 24 different prompt styles to assess robustness and minimize the impact of prompt variability (a measurement sketch follows this answer).
These measures help make MMLU-Pro a more robust and challenging benchmark for assessing the advanced reasoning capabilities of state-of-the-art LLMs.
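Below is a minimal sketch of how that prompt-sensitivity check might be run. The evaluate callable and the prompt styles are placeholders; the real benchmark reports results over 24 prompt variants.

```python
import statistics
from typing import Callable, Sequence

def prompt_sensitivity(model, dataset, prompt_styles: Sequence[str],
                       evaluate: Callable) -> dict:
    """Score one model under several prompt styles and summarize the spread.

    `evaluate(model, dataset, style)` is a placeholder returning accuracy in [0, 1].
    """
    scores = [evaluate(model, dataset, style) for style in prompt_styles]
    return {
        "mean_accuracy": statistics.mean(scores),
        "std_dev": statistics.stdev(scores),
        "spread": max(scores) - min(scores),  # the 4-5% vs <1% gap quoted above
    }
```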