Repo · Snowflake (Arctic) · published Feb 24, 2026

Snowflake-Labs/dare-bench

Python

Language: Python

License: Apache-2.0

Stars: 11

Forks: 1

Open issues: 2

Created: 2026-02-24T22:29:57Z

Pushed: 2026-04-28T00:43:55Z

Default branch: main

Fork: no

Archived: no

README:

[ICLR 2026] DARE-Bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Figure 1: DARE-Bench task structure. Each task provides a natural-language question and structured files. An LLM agent executes code within a sandbox to generate predictions, which are compared against ground truth for automatic evaluation.

---

Authors

Fan Shu¹* · Yite Wang²*† · Ruofan Wu¹ · Boyi Liu² · Zhewei Yao² · Yuxiong He² · Feng Yan¹

¹University of Houston   ²Snowflake AI Research

---

Overview

DARE-Bench is a large-scale benchmark for evaluating LLM agents on data science tasks, featuring 6,300 tasks across classification, regression, and time series forecasting. This repository releases a subset of those tasks (4,274 training and 324 evaluation tasks). DARE-Bench provides:

  • Verifiable ground truth for objective and reproducible evaluation
  • 🎯 Process-aware instruction following tasks with deterministic outcomes
  • 📊 Large-scale training data to support supervised fine-tuning and reinforcement learning

---

Task Types

| Task Type | Description | Train | Eval |
|:--|:--|--:|--:|
| Classification-IF | Instruction Following | 807 | 68 |
| Classification-MM | ML Modeling | 807 | 68 |
| Regression-IF | Instruction Following | 649 | 42 |
| Regression-MM | ML Modeling | 649 | 42 |
| Time-series-XF | eXogenous Features | 681 | 52 |
| Time-series-CF | Canonical Forecasting | 681 | 52 |
| | Total | 4,274 | 324 |

> The table above reflects the tasks released in this repository.
>
> Note: The paper reports results/statistics on the full dataset (Train: 5,948, Eval: 352, Total: 6,300). For the full dataset statistics (as reported in the paper), see the section below.
>
> Task type is inferred from the dataset folder suffix: `*_class` → Classification, `*_reg` → Regression, `*_ts` → Time-series.
>
> See [LICENSE/LICENSE_DISTRIBUTION.md](LICENSE/LICENSE_DISTRIBUTION.md) for the license distribution of the released subset (counted using `suggested_license` when present; Train: 2,137 source datasets, Eval: 162 source datasets).
>
> We also provide the license metadata in [LICENSE/kaggle_license_train.csv](LICENSE/kaggle_license_train.csv) and [LICENSE/kaggle_license_test.csv](LICENSE/kaggle_license_test.csv).
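The folder-suffix rule in the note above can be sketched as a small helper. This is a minimal illustration, not code from the repository: the function name `infer_task_type` and the example folder names are hypothetical, and only the three suffixes come from the README.

```python
# Suffix-to-task-type mapping as described in the README note.
SUFFIX_TO_TASK_TYPE = {
    "_class": "Classification",
    "_reg": "Regression",
    "_ts": "Time-series",
}


def infer_task_type(folder_name: str) -> str:
    """Return the task type implied by a dataset folder's suffix."""
    name = folder_name.rstrip("/")
    for suffix, task_type in SUFFIX_TO_TASK_TYPE.items():
        if name.endswith(suffix):
            return task_type
    raise ValueError(f"Unrecognized dataset folder suffix: {folder_name!r}")


# Hypothetical folder names, for illustration only.
print(infer_task_type("example_dataset_class"))
print(infer_task_type("example_dataset_ts/"))
```

Raising on an unknown suffix, rather than returning a default, makes a misnamed folder visible early.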

Full Dataset (as reported in paper)

| Task Type | Description | Train | Eval |
|:--|:--|--:|--:|
| Classification-IF | Instruction Following | 1,160 | 74 |
| Classification-MM | ML Modeling | 1,160 | 74 |
| Regression-IF | Instruction Following | 899 | 45 |
| Regression-MM | ML Modeling | 899 | 45 |
| Time-series-XF | eXogenous Features | 915 | 57 |
| Time-series-CF | Canonical Forecasting | 915 | 57 |
| | Total | 5,948 | 352 |
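As a quick sanity check, the totals in the two task tables above can be recomputed from the per-type rows. The snippet below copies the counts from those tables; the variable and function names are illustrative. The released totals also come out to exactly twice the source-dataset counts quoted in the license note (2,137 and 162), which is consistent with each source dataset contributing two task variants.

```python
# (train, eval) counts per task type, copied from the two tables above.
released = {
    "Classification-IF": (807, 68),
    "Classification-MM": (807, 68),
    "Regression-IF": (649, 42),
    "Regression-MM": (649, 42),
    "Time-series-XF": (681, 52),
    "Time-series-CF": (681, 52),
}
full = {
    "Classification-IF": (1160, 74),
    "Classification-MM": (1160, 74),
    "Regression-IF": (899, 45),
    "Regression-MM": (899, 45),
    "Time-series-XF": (915, 57),
    "Time-series-CF": (915, 57),
}


def totals(table):
    """Sum the train and eval columns of a task-count table."""
    train = sum(train_count for train_count, _ in table.values())
    evaluation = sum(eval_count for _, eval_count in table.values())
    return train, evaluation


print(totals(released))  # (4274, 324)
print(totals(full))      # (5948, 352)
print(sum(totals(full))) # 6300
```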

---

Results

Performance of various LLMs on DARE-Bench evaluation tasks (score in %):

| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | TS-XF | TS-CF |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| GPT-4o | 31.86 | 41.10 | 20.63 | 39.74 | 37.09 | 5.24 |
| GPT-4.1 | 55.39 | 57.75 | 50.00 | 57.68 | 41.19 | 6.81 |
| GPT-5 | 70.10 | 43.18 | 55.56 | 55.21 | 36.78 | 7.84 |
| GPT-o4-mini | 68.14 | 59.14 | 51.59 | 57.48 | 42.15 | 8.15 |
| Claude-Sonnet-3.7 | 61.52 | 61.22 | 46.03 | 61.36 | 51.21 | 12.08 |
| Claude-Sonnet-4 | 14.71 | 17.70 | 14.29 | 10.50 | 5.27 | 0.02 |
| Qwen3-32B | 16.67 | 30.92 | 15.08 | 35.42 | 27.26 | 0.00 |
| Qwen3-4B | 3.43 | 4.99 | 0.79 | 2.28 | 7.00 | 0.00 |

> The table above reflects the results of the tasks released in this repository. For the full evaluation results (as reported in the paper), expand the section below.

Full Evaluation Results (as reported in paper)

| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | TS-XF | TS-CF |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| GPT-4o | 32.88 | 40.45 | 20.28 | 40.60 | 35.54 | 4.77 |
| GPT-4.1 | 55.82 | 57.83 | 52.17 | 58.62 | 40.78 | 6.60 |
| GPT-5 | 69.81 | 43.40 | 57.24 | 56.29 | 36.83 | 10.13 |
| GPT-o4-mini | 67.56 | 57.89 | 53.62 | 57.60 | 42.29 | 9.67 |
| Claude-Sonnet-3.7 | 61.48 | 61.03 | 46.37 | 63.20 | 49.88 | 13.70 |
| Claude-Sonnet-4 | 16.21 | 18.27 | 15.21 | 11.33 | 4.80 | 0.01 |
| Qwen3-32B | 17.11 | 30.71 | 15.21 | 35.86 | 26.96 | 0.00 |
| Qwen3-4B | 3.60 | 5.23 | 0.72 | 3.29 | 6.97 | 0.00 |

Task Variants in JSON

  • For classification / regression tasks: question_v1 = IF (Instruction Following), question_v2 = MM (ML Modeling)
  • For time_series_analysis tasks: question_v1 = XF (eXogenous Features), question_v2 = CF (Canonical Forecasting)
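The variant mapping above can be wrapped in a small lookup so callers ask for a variant by label instead of remembering which of `question_v1` / `question_v2` it lives in. This is a sketch under assumptions: the helper names are hypothetical, the example task dictionaries are made up, and any `task` value other than `"time_series_analysis"` is treated as classification or regression.

```python
def variant_keys(task: dict) -> dict:
    """Map variant labels to JSON keys for one task entry."""
    if task.get("task") == "time_series_analysis":
        return {"XF": "question_v1", "CF": "question_v2"}
    # Assumption: every other task type is classification or regression.
    return {"IF": "question_v1", "MM": "question_v2"}


def get_question(task: dict, variant: str) -> str:
    """Return the question text for a variant label such as 'IF' or 'CF'."""
    keys = variant_keys(task)
    if variant not in keys:
        raise KeyError(f"Variant {variant!r} not available; choose from {sorted(keys)}")
    return task[keys[variant]]


# Made-up task entries, for illustration only.
ts_task = {"task": "time_series_analysis", "question_v1": "xf text", "question_v2": "cf text"}
clf_task = {"task": "classification", "question_v1": "if text", "question_v2": "mm text"}

print(get_question(ts_task, "CF"))   # cf text
print(get_question(clf_task, "IF"))  # if text
```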

---

Data

| Split | Location | Count |
|:--|:--|--:|
| Evaluation | [data/eval/](data/eval/) | 324 tasks |
| Training | 🤗 HuggingFace | 4,274 tasks |
| SFT Trajectories | 🤗 HuggingFace | — |

SFT Rejection Sampling Strategies

| Strategy | Description |
|:--|:--|
| FV (Fastest-Valid) | Single fastest valid trajectory per task |
| AV (All-Valid) | All valid trajectories |
| BV (Best-Valid) | Best trajectory from diverse tasks only |
| DV (Duo-Valid) | Top-2 trajectories from diverse tasks |
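To make the first two strategies concrete, here is a rough sketch of how FV and AV filtering might look over a list of trajectory records. Everything about the record layout is an assumption for illustration: the fields `task_id`, `valid`, and `runtime_s` are hypothetical, and "fastest" is interpreted as lowest runtime. BV and DV are left out because the README does not define how "best" or "diverse" is measured.

```python
def all_valid(trajectories):
    """AV: keep every trajectory marked valid."""
    return [t for t in trajectories if t["valid"]]


def fastest_valid(trajectories):
    """FV: keep the single fastest valid trajectory for each task."""
    fastest = {}
    for t in all_valid(trajectories):
        current = fastest.get(t["task_id"])
        if current is None or t["runtime_s"] < current["runtime_s"]:
            fastest[t["task_id"]] = t
    return list(fastest.values())


# Made-up records with hypothetical fields, for illustration only.
trajectories = [
    {"task_id": "a", "valid": True, "runtime_s": 3.0},
    {"task_id": "a", "valid": True, "runtime_s": 1.5},
    {"task_id": "a", "valid": False, "runtime_s": 0.5},
    {"task_id": "b", "valid": True, "runtime_s": 2.0},
    {"task_id": "c", "valid": False, "runtime_s": 1.0},
]

print(len(all_valid(trajectories)))      # 3
print(len(fastest_valid(trajectories)))  # 2
```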

---

Quick Start

import json

# Load evaluation tasks
with open("data/eval/question_list.json", "r") as f:
    tasks = json.load(f)

# Example: Get a task (classification / regression / time_series_analysis)
task = tasks[0]
# For classification / regression tasks, v1 is IF and v2 is MM;
# for time_series_analysis tasks, v1 is XF and v2 is CF.
question_if = task["question_v1"]  # Instruction Following
question_mm = task["question_v2"]  # ML Modeling
data_path = f"data/eval/databases/{task['file_path']}"

> Time series tasks: In question_list.json, time series tasks have task = "time_series_analysis" (with variants XF/CF in v1/v2). The provided evaluation script matches predictions by row_id for all tasks (including time series), so your prediction.csv must include row_id and the target column(s) specified in verify/all_metadata.json.
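Since the evaluation script matches on `row_id`, it can help to write and check `prediction.csv` with an explicit column list. The sketch below uses only the standard library; the helper names, the `target` column, and the example values are hypothetical, and the actual target column names should be taken from `verify/all_metadata.json` (whose layout is not shown here).

```python
import csv
import os
import tempfile


def write_predictions(path, rows, target_columns):
    """Write prediction rows to CSV with row_id first, then the target column(s)."""
    fieldnames = ["row_id", *target_columns]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row[name] for name in fieldnames})


def missing_columns(path, target_columns):
    """Return the required columns that are absent from the CSV header."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return [name for name in ["row_id", *target_columns] if name not in header]


# Hypothetical target column and values, for illustration only.
targets = ["target"]
rows = [{"row_id": 0, "target": 1}, {"row_id": 1, "target": 0}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "prediction.csv")
    write_predictions(out_path, rows, targets)
    print(missing_columns(out_path, targets))  # []
```

Running `missing_columns` before invoking the evaluation script gives a cheap early warning when a required column was dropped during post-processing.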

---

Repository Structure

> data/eval/databases/ contains dataset files under their original Kaggle/source licenses. All other files are released under Apache-2.0.

dare-bench/
├── data/
│   └── eval/
│       ├── question_list.json
│       └── databases/
├── LICENSE/
├── LICENSE.txt
├── scripts/
│   ├── evaluation.py # offline…
