Conventional benchmarks often fail to capture the full breadth of an LLM's capabilities. Common issues include grading biases, possible contamination over time, and the absence of standardized criteria for evaluating certain skills, such as an LLM's ability to follow natural-language instructions. Additionally, despite recent improvements, some models still score close to random-chance accuracy on certain benchmarks, such as MMLU, indicating substantial room for improvement.
MixEval offers several advantages over Chatbot Arena: a 0.96 model ranking correlation with Arena (attributed to its impartial query distribution and grading mechanism), fast and cheap execution (about 6% of the time and cost of running MMLU), and reproducible results. It also supports dynamic evaluation through a steady, low-effort data refresh pipeline that keeps its queries up to date and uncontaminated. This makes MixEval an efficient and reliable option for evaluating LLMs.
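To make the "model ranking correlation" concrete, the sketch below compares a benchmark's per-model scores against Chatbot Arena Elo ratings using Spearman's rank correlation. This is an illustrative assumption about how such a figure could be computed, not MixEval's published methodology; the model names and numbers are placeholders.

```python
# Minimal sketch: rank correlation between a benchmark's model scores and
# Chatbot Arena Elo ratings. All values below are illustrative placeholders.
from scipy.stats import spearmanr

benchmark_scores = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.69, "model_d": 0.55}
arena_elo        = {"model_a": 1250, "model_b": 1180, "model_c": 1150, "model_d": 1020}

models = sorted(benchmark_scores)                 # shared model set, fixed order
x = [benchmark_scores[m] for m in models]
y = [arena_elo[m] for m in models]

corr, p_value = spearmanr(x, y)                   # correlation of the two model orderings
print(f"model ranking correlation: {corr:.2f} (p={p_value:.3f})")
```

A correlation near 1.0 means the benchmark orders models almost identically to Arena's human-preference rankings, which is the property MixEval highlights.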
MixEval improves upon traditional LLM evaluation methods by bridging the gap between real-world user queries and efficient, reproducible evaluation. It does this by mining user queries from the web and matching them with similar queries from existing benchmarks, as sketched below. This approach addresses the limitations of both ground-truth-based benchmarks and LLM-as-judge benchmarks, which suffer from grading biases and limited query quantity. The result is a more comprehensive and nuanced evaluation framework that offers a cost-effective and faster alternative to user-facing evaluations such as Chatbot Arena.
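The core idea of matching web-mined queries to existing benchmark items can be illustrated with a simple embedding-similarity search. This is a rough approximation under stated assumptions (the `sentence-transformers` library, the `all-MiniLM-L6-v2` model, and made-up example strings), not MixEval's actual matching pipeline.

```python
# Sketch: match web-mined user queries to the most similar items in an
# existing ground-truth benchmark pool via cosine similarity of embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model works

web_queries = [
    "how do vaccines train the immune system",
    "why is the sky blue during the day",
]
benchmark_pool = [
    "Explain how mRNA vaccines induce an immune response.",   # items from existing benchmarks
    "What causes Rayleigh scattering in the atmosphere?",
    "Who wrote the novel Moby-Dick?",
]

web_emb  = model.encode(web_queries, convert_to_tensor=True, normalize_embeddings=True)
pool_emb = model.encode(benchmark_pool, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(web_emb, pool_emb)      # shape: (n_web_queries, n_pool_items)
for i, query in enumerate(web_queries):
    best = int(similarity[i].argmax())
    print(f"{query!r} -> matched benchmark item: {benchmark_pool[best]!r}")
```

Because the matched items come from ground-truth benchmarks, grading stays objective and cheap, while the query distribution reflects what users actually ask on the web.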