
The researchers have proposed several future directions to improve the predictability of evaluations for AI systems:
Understanding incorrect-answer probabilities: The research suggests that knowing how the probability mass placed on incorrect answers shifts with scale is crucial for accurately predicting performance on multiple-choice tests, so future work could focus on modeling and predicting these probabilities more effectively.
Designing more predictable evaluations: The findings can inform the design of future evaluations for frontier AI models that remain reliably predictable with scaling, giving researchers frameworks better suited to tracking progress in advanced AI capabilities.
Extending the work to other capabilities: While this study focused on multiple-choice question-answering tests, the framework it developed can be adapted and extended to make evaluations of other complex and important AI capabilities more scaling-predictable.
Improving the framework: Researchers can work on refining and enhancing the framework developed in this study to better understand the factors that affect downstream performance and to create more predictable evaluations.
Adapting the framework to different model families: The study computed per-sample scores across several model families and multiple-choice NLP benchmarks. Future work could further adapt the framework to other model families to better understand how downstream capabilities change with scale.
Modeling the probability mass fluctuations: Accurately predicting downstream performance requires modeling how probability mass fluctuates among particular incorrect alternatives, so future work could focus on building better models of these fluctuations (see the sketch after this list).
Understanding the relationship between correct and incorrect answer probabilities: The study highlights the importance of examining how the probabilities assigned to the correct and incorrect answers change together as compute increases. Future work could explore this relationship in more depth.
These future directions could help in creating more reliable and predictable evaluations for AI systems, particularly for complex and important capabilities.
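To make the probability-mass direction concrete, here is a minimal sketch of what modeling incorrect-answer mass against scale could look like. The compute budgets and probabilities are synthetic, illustrative numbers, not results from the study.

```python
import numpy as np

# Hypothetical (training compute, average probability mass on incorrect choices)
# observations; these numbers are illustrative, not measurements from the paper.
compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23])
p_incorrect_mass = np.array([0.72, 0.61, 0.49, 0.37, 0.28])

# Fit a power-law trend in log-log space: log p ~ slope * log C + intercept.
slope, intercept = np.polyfit(np.log(compute), np.log(p_incorrect_mass), 1)

# Extrapolate the aggregate incorrect mass to a larger compute budget.
extrapolated = np.exp(intercept) * 1e24 ** slope
print(f"fitted exponent: {slope:.3f}, predicted incorrect mass at 1e24 FLOPs: {extrapolated:.2f}")

# Caveat from the text: even a good fit for this aggregate mass says nothing about
# how it is split among specific distractors, which is what Accuracy depends on.
```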

The difficulty in using metrics like Accuracy and Brier Score to predict multiple-choice benchmark performance lies in the nature of these metrics and the specific incorrect outputs they consider. These metrics depend on a direct comparison between the correct output and a limited set of specific incorrect outputs. As a result, to accurately predict downstream performance, one needs to model how the probability mass fluctuates among particular incorrect alternatives.
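As a concrete illustration, the sketch below (not the authors' code; the function and numbers are made up for exposition) computes Probability Correct, Accuracy, and Brier Score from a model's per-choice probabilities, and shows that two questions with the same probability on the correct answer can score very differently once the specific incorrect choices are taken into account.

```python
import numpy as np

def multiple_choice_metrics(choice_probs: np.ndarray, correct_idx: int) -> dict:
    """Per-question metrics from the probabilities a model assigns to each answer
    choice (e.g., a softmax over per-choice log-likelihoods)."""
    p_correct = float(choice_probs[correct_idx])

    # Accuracy compares the correct choice against the specific incorrect ones:
    # the correct choice must outscore every listed alternative.
    accuracy = float(np.argmax(choice_probs) == correct_idx)

    # Brier Score penalizes probability placed on each particular incorrect
    # alternative, not just the aggregate (1 - p_correct).
    one_hot = np.zeros_like(choice_probs)
    one_hot[correct_idx] = 1.0
    brier = float(np.sum((choice_probs - one_hot) ** 2))

    return {"p_correct": p_correct, "accuracy": accuracy, "brier": brier}

# Same probability on the correct answer (0.4), different spread over distractors:
print(multiple_choice_metrics(np.array([0.4, 0.3, 0.2, 0.1]), correct_idx=0))    # accuracy 1.0
print(multiple_choice_metrics(np.array([0.4, 0.5, 0.05, 0.05]), correct_idx=0))  # accuracy 0.0
```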
Researchers have found that the probability mass placed on incorrect answers is a key source of unpredictability in multiple-choice tests for frontier AI models. To predict performance on these tests, it is essential to understand not only how the probability of choosing the correct answer changes with scale but also how the probability assigned to each incorrect answer changes with scale. This matters because the average probability of choosing wrong answers across many questions does not pin down the probability of choosing a specific wrong answer on a particular question.
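The simulation below makes this point with assumed numbers: two question sets with an identical average wrong-answer probability (0.60 per question in total) but very different Accuracy, because that mass is spread differently across specific distractors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_choices, p_correct = 1000, 4, 0.40

def accuracy_given_distractor_spread(concentration: float) -> float:
    """Spread the remaining 0.60 of probability over three distractors with a
    Dirichlet draw; a small `concentration` piles most of it on one distractor."""
    hits = 0
    for _ in range(n_questions):
        wrong = 0.60 * rng.dirichlet([concentration] * (n_choices - 1))
        hits += p_correct > wrong.max()
    return hits / n_questions

# Evenly spread incorrect mass (~0.20 per distractor): the correct answer wins.
print(accuracy_given_distractor_spread(concentration=50.0))  # close to 1.0
# Concentrated incorrect mass: one distractor often exceeds 0.40, so Accuracy drops,
# even though the average wrong-answer probability per question is unchanged.
print(accuracy_given_distractor_spread(concentration=0.3))
```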
In conclusion, the unpredictability in multiple-choice benchmark performance arises from the complexity of modeling the probability fluctuations among specific incorrect alternatives. This understanding can help in designing more predictable evaluations for AI systems, especially for complex and important capabilities.

The primary challenges in predicting the performance of AI systems like GPT-4, Claude, and Gemini as they scale are as follows:
Unpredictable changes in performance: Despite the well-established relation between parameters, data, compute, and pretraining loss defined by scaling laws, performance on standard NLP benchmarks can sometimes change unpredictably with scale (a sketch of this scaling relation follows the list).
Limitations of multiple-choice benchmarks: The focus on benchmarks evaluated with log-likelihood-based multiple-choice formats limits how broadly the findings apply, and predicting benchmark performance a priori remains difficult with metrics like Accuracy and Brier Score.
Dependence on specific incorrect choices: Common multiple-choice metrics, such as Accuracy, Brier Score, and Probability Correct, depend on a direct comparison between the correct output and a limited set of specific incorrect outputs. Accurately predicting downstream performance requires modeling how the probability mass fluctuates among particular incorrect alternatives.
Understanding the probability fluctuations: To accurately predict performance on multiple-choice question-answering tests, it is essential to understand how the probability of choosing the correct answer changes with scale and, just as importantly, how the probability assigned to each incorrect answer changes with it.
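For reference, the first point refers to parametric fits of the kind sketched below: pretraining loss as a smooth function of parameters and data, following the functional form popularized by Hoffmann et al. The constants here are illustrative placeholders, not published fit values.

```python
def pretraining_loss(n_params: float, n_tokens: float,
                     E: float = 1.7, A: float = 400.0, B: float = 410.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta (illustrative constants)."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Pretraining loss extrapolates smoothly with scale...
print(pretraining_loss(7e9, 1.4e12))    # roughly 7B parameters, 1.4T tokens
print(pretraining_loss(70e9, 1.4e12))   # ten times more parameters, lower loss
# ...whereas benchmark Accuracy need not follow equally smoothly, because it also
# depends on how probability mass moves among the specific incorrect answer choices.
```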
These challenges make it difficult to predict how AI systems will perform on specific tasks as they scale up, and emphasize the need for more predictable evaluations for AI systems, particularly for complex and important capabilities.