Models such as Llama2-13B and Llama3-70B scored substantially higher when evaluated with OLMES than under alternative evaluation setups. The normalization techniques recommended by OLMES, such as PMI normalization for ARC-Challenge and CommonsenseQA, effectively reduced performance variation across formulations. Some models achieved up to 25% higher accuracy under OLMES than under other evaluation methods, highlighting the standard's effectiveness in enabling fair comparisons.
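As a rough illustration of what PMI normalization does, the sketch below scores each answer choice by subtracting its unconditional log-probability (computed from a context-free prompt, often just "Answer:") from its log-probability conditioned on the question. The log-likelihood values here are hypothetical placeholders, not actual model outputs.

```python
def pmi_score(logp_conditional: float, logp_unconditional: float) -> float:
    """PMI normalization: log P(choice | question) - log P(choice | neutral context).

    Penalizes answer choices that are likely regardless of the question,
    so inherently common phrasings do not dominate the prediction.
    """
    return logp_conditional - logp_unconditional

# Hypothetical log-likelihoods for three answer choices.
conditional   = {"A": -4.2, "B": -3.1, "C": -5.0}   # log P(choice | question)
unconditional = {"A": -6.0, "B": -3.0, "C": -5.5}   # log P(choice | "Answer:")

scores = {c: pmi_score(conditional[c], unconditional[c]) for c in conditional}
prediction = max(scores, key=scores.get)
print(scores, prediction)  # choice "A" wins after normalization despite lower raw likelihood
```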
A standardized evaluation framework is crucial for LLMs: it ensures consistent performance measurement, reproducible results, and fair comparison between models. It also preserves the credibility of performance claims, helps researchers understand a model's strengths and weaknesses, and guides future development and improvement in the field.
OLMES provides specific guidelines for prompt formatting in LLM evaluations. It recommends a consistent "Question: <question>" prefix and "Answer:" suffix when formatting dataset instances. OLMES also standardizes on a fixed, curated 5-shot prompt for each task, ensuring high-quality examples that cover the label space. In addition, it specifies normalization methods, such as pointwise mutual information (PMI) normalization, for certain tasks to adjust for the inherent likelihood of answer choices.
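A minimal sketch of this prompt construction is shown below. The few-shot examples here are placeholders; in OLMES the 5-shot examples are curated per task rather than chosen arbitrarily.

```python
# Placeholder few-shot examples; OLMES uses a fixed, curated set per task.
FEW_SHOT = [
    ("What gas do plants absorb from the atmosphere?", "carbon dioxide"),
    ("Which planet is known as the Red Planet?", "Mars"),
    # ...three more curated examples would complete a 5-shot prompt
]

def build_prompt(question: str, shots=FEW_SHOT) -> str:
    """Format few-shot examples and the target question using the
    "Question:" / "Answer:" structure described above."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_prompt("What is the boiling point of water at sea level?"))
```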