
The primary goal of DISCOVERYBENCH is to systematically evaluate how well state-of-the-art large language models (LLMs) perform automated data-driven discovery [3]. To handle the diversity of real-world discovery tasks across domains, it introduces a pragmatic formalization of data-driven discovery and provides a comprehensive benchmark for assessing and improving LLM performance in this area [3].

DISCOVERYBENCH differs from previous datasets in that it requires scientific semantic reasoning: choosing analysis techniques appropriate to a specific domain, cleaning and normalizing data, and mapping goal terms to dataset variables [4]. It also covers the entire discovery process, including ideation, semantic reasoning, and pipeline design, which makes it the first large-scale dataset to explore LLMs' capacity across that full process.