What does this repo signal mean?

Snowflake (Arctic) published the Python repository Snowflake-Labs/dare-bench. Repository signals like this one can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post; here the repo is an evaluation benchmark, DARE-Bench. High-signal details: repo Snowflake-Labs/dare-bench · language Python · low-traction new repo. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Snowflake (Arctic) Repo: Snowflake-Labs/dare-bench

Captured source

GitHub/github.com/Snowflake-Labs/dare-bench

Snowflake-Labs/dare-bench repository metadata

published Feb 24, 2026 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

Snowflake-Labs/dare-bench

Language: Python

License: Apache-2.0

Stars: 11

Forks: 1

Open issues: 2

Created: 2026-02-24T22:29:57Z

Pushed: 2026-04-28T00:43:55Z

Default branch: main

Fork: no

Archived: no

README:

[ICLR 2026] DARE-Bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Figure 1: DARE-Bench task structure. Each task provides a natural-language question and structured files. An LLM agent executes code within a sandbox to generate predictions, which are compared against ground truth for automatic evaluation.

---

Authors

Fan Shu¹,* · Yite Wang²,*† · Ruofan Wu¹ · Boyi Liu² · Zhewei Yao² · Yuxiong He² · Feng Yan¹

¹University of Houston · ²Snowflake AI Research

---

Overview

DARE-Bench is a large-scale benchmark for evaluating LLM agents on data science tasks, featuring 6,300 tasks across classification, regression, and time series forecasting. Here we make available a subset of the tasks. DARE-Bench provides:

✅ Verifiable ground truth for objective and reproducible evaluation
🎯 Process-aware instruction following tasks with deterministic outcomes
📊 Large-scale training data to support supervised fine-tuning and reinforcement learning

---

Task Types

| Task Type | Description | Train | Eval |
|:--|:--|--:|--:|
| Classification-IF | Instruction Following | 807 | 68 |
| Classification-MM | ML Modeling | 807 | 68 |
| Regression-IF | Instruction Following | 649 | 42 |
| Regression-MM | ML Modeling | 649 | 42 |
| Time-series-XF | eXogenous Features | 681 | 52 |
| Time-series-CF | Canonical Forecasting | 681 | 52 |
| | Total | 4,274 | 324 |

> The table above reflects the tasks released in this repository.
>
> Note: The paper reports results/statistics on the full dataset (Train: 5,948, Eval: 352, Total: 6,300). For the full dataset statistics (as reported in the paper), expand the section below.
>
> Task type is inferred from the dataset folder suffix: *_class → Classification, *_reg → Regression, *_ts → Time-series.
>
> See [LICENSE/LICENSE_DISTRIBUTION.md](LICENSE/LICENSE_DISTRIBUTION.md) for the license distribution of the released subset (counted using suggested_license when present; Train: 2,137 source datasets, Eval: 162 source datasets).
>
> We also provide the license metadata in [LICENSE/kaggle_license_train.csv](LICENSE/kaggle_license_train.csv) and [LICENSE/kaggle_license_test.csv](LICENSE/kaggle_license_test.csv).
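The suffix rule in the note above can be sketched as a small lookup. This is an illustration only, not code from the repository: the function name `infer_task_type` and the example folder name are hypothetical, and only the three suffixes come from the README.

```python
from pathlib import PurePosixPath

# Suffix-to-type mapping as stated in the README note.
SUFFIX_TO_TASK_TYPE = {
    "_class": "Classification",
    "_reg": "Regression",
    "_ts": "Time-series",
}


def infer_task_type(folder: str) -> str:
    """Return the task type implied by the suffix of a dataset folder name."""
    name = PurePosixPath(folder).name  # last path component, trailing "/" ignored
    for suffix, task_type in SUFFIX_TO_TASK_TYPE.items():
        if name.endswith(suffix):
            return task_type
    raise ValueError(f"Unrecognized dataset folder suffix: {name!r}")


# Hypothetical folder name, used only to exercise the function.
print(infer_task_type("data/eval/databases/example_dataset_class"))
```

Raising on an unknown suffix, rather than returning a default, keeps a mislabeled folder from being silently scored as the wrong task type.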

Full Dataset (as reported in paper)

| Task Type | Description | Train | Eval |
|:--|:--|--:|--:|
| Classification-IF | Instruction Following | 1,160 | 74 |
| Classification-MM | ML Modeling | 1,160 | 74 |
| Regression-IF | Instruction Following | 899 | 45 |
| Regression-MM | ML Modeling | 899 | 45 |
| Time-series-XF | eXogenous Features | 915 | 57 |
| Time-series-CF | Canonical Forecasting | 915 | 57 |
| | Total | 5,948 | 352 |
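As a quick consistency check on the two tables, the per-type counts can be summed and compared. The numbers below are copied from the tables; the variable and function names are only illustrative. By this arithmetic the released subset covers roughly 72% of the full train split and 92% of the full eval split.

```python
# (train, eval) counts per task type, copied from the two tables.
released = {
    "Classification-IF": (807, 68),
    "Classification-MM": (807, 68),
    "Regression-IF": (649, 42),
    "Regression-MM": (649, 42),
    "Time-series-XF": (681, 52),
    "Time-series-CF": (681, 52),
}
full = {
    "Classification-IF": (1160, 74),
    "Classification-MM": (1160, 74),
    "Regression-IF": (899, 45),
    "Regression-MM": (899, 45),
    "Time-series-XF": (915, 57),
    "Time-series-CF": (915, 57),
}


def totals(table):
    """Sum the train and eval columns of a {task_type: (train, eval)} table."""
    train = sum(t for t, _ in table.values())
    evals = sum(e for _, e in table.values())
    return train, evals


released_train, released_eval = totals(released)
full_train, full_eval = totals(full)

print(released_train, released_eval)                  # 4274 324
print(full_train, full_eval, full_train + full_eval)  # 5948 352 6300
print(f"{released_train / full_train:.1%} of train, "
      f"{released_eval / full_eval:.1%} of eval released")
```

Each source dataset yields two task variants, so halving the released totals gives 2,137 train and 162 eval source datasets, matching the license note.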

---

Results

Performance of various LLMs on DARE-Bench evaluation tasks (score in %):

| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | TS-XF | TS-CF |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| GPT-4o | 31.86 | 41.10 | 20.63 | 39.74 | 37.09 | 5.24 |
| GPT-4.1 | 55.39 | 57.75 | 50.00 | 57.68 | 41.19 | 6.81 |
| GPT-5 | 70.10 | 43.18 | 55.56 | 55.21 | 36.78 | 7.84 |
| GPT-o4-mini | 68.14 | 59.14 | 51.59 | 57.48 | 42.15 | 8.15 |
| Claude-Sonnet-3.7 | 61.52 | 61.22 | 46.03 | 61.36 | 51.21 | 12.08 |
| Claude-Sonnet-4 | 14.71 | 17.70 | 14.29 | 10.50 | 5.27 | 0.02 |
| Qwen3-32B | 16.67 | 30.92 | 15.08 | 35.42 | 27.26 | 0.00 |
| Qwen3-4B | 3.43 | 4.99 | 0.79 | 2.28 | 7.00 | 0.00 |

> The table above reflects the results of the tasks released in this repository. For the full evaluation results (as reported in the paper), expand the section below.

Full Evaluation Results (as reported in paper)

| Model | Class-IF | Class-MM | Reg-IF | Reg-MM | TS-XF | TS-CF |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| GPT-4o | 32.88 | 40.45 | 20.28 | 40.60 | 35.54 | 4.77 |
| GPT-4.1 | 55.82 | 57.83 | 52.17 | 58.62 | 40.78 | 6.60 |
| GPT-5 | 69.81 | 43.40 | 57.24 | 56.29 | 36.83 | 10.13 |
| GPT-o4-mini | 67.56 | 57.89 | 53.62 | 57.60 | 42.29 | 9.67 |
| Claude-Sonnet-3.7 | 61.48 | 61.03 | 46.37 | 63.20 | 49.88 | 13.70 |
| Claude-Sonnet-4 | 16.21 | 18.27 | 15.21 | 11.33 | 4.80 | 0.01 |
| Qwen3-32B | 17.11 | 30.71 | 15.21 | 35.86 | 26.96 | 0.00 |
| Qwen3-4B | 3.60 | 5.23 | 0.72 | 3.29 | 6.97 | 0.00 |

Task Variants in JSON

For classification / regression tasks: question_v1 = IF (Instruction Following), question_v2 = MM (ML Modeling)
For time_series_analysis tasks: question_v1 = XF (eXogenous Features), question_v2 = CF (Canonical Forecasting)
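The mapping above can be expressed as a lookup from a task record to its labeled questions. This is a minimal sketch under stated assumptions: the `task` field and the value `"time_series_analysis"` are quoted in the README, while the values `"classification"` and `"regression"` are assumed from the Quick Start comment, and the helper name `labeled_questions` and the toy record are hypothetical.

```python
# Variant labels keyed by the record's "task" field, then by question key.
# "classification" and "regression" are assumed field values (see lead-in).
VARIANT_LABELS = {
    "classification": {"question_v1": "IF", "question_v2": "MM"},
    "regression": {"question_v1": "IF", "question_v2": "MM"},
    "time_series_analysis": {"question_v1": "XF", "question_v2": "CF"},
}


def labeled_questions(task: dict) -> dict:
    """Map each variant label (e.g. "IF") to that task's question text."""
    labels = VARIANT_LABELS[task["task"]]
    return {label: task[key] for key, label in labels.items() if key in task}


# Toy record for illustration only; real entries come from question_list.json.
example = {
    "task": "time_series_analysis",
    "question_v1": "Forecast using the provided exogenous features.",
    "question_v2": "Forecast from the series history alone.",
}
print(labeled_questions(example))
```

Keying results by label rather than by `question_v1`/`question_v2` makes it harder to report a time-series XF score under an IF column by mistake.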

---

Data

SFT Rejection Sampling Strategies

---

Quick Start

import json

# Load evaluation tasks
with open("data/eval/question_list.json", "r") as f:
    tasks = json.load(f)

# Example: Get a task (classification / regression / time_series_analysis)
task = tasks[0]
question_if = task["question_v1"]  # Instruction Following (XF for time series tasks)
question_mm = task["question_v2"]  # ML Modeling (CF for time series tasks)
data_path = f"data/eval/databases/{task['file_path']}"

> Time series tasks: In question_list.json, time series tasks have task = "time_series_analysis" (with variants XF/CF in v1/v2). The provided evaluation script matches predictions by row_id for all tasks (including time series), so your prediction.csv must include row_id and the target column(s) specified in verify/all_metadata.json.
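The row_id requirement above suggests checking a prediction file's header before running evaluation. The sketch below is not the repository's evaluation script: the helpers `write_predictions` and `missing_columns` are hypothetical, and the target column name `label` is a stand-in, since the layout of verify/all_metadata.json is not shown in this excerpt.

```python
import csv
import os
import tempfile


def write_predictions(path, rows, target_columns):
    """Write row dicts to CSV with row_id first, then the target column(s)."""
    fieldnames = ["row_id", *target_columns]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({name: row[name] for name in fieldnames})


def missing_columns(path, target_columns):
    """Return the required columns that are absent from the CSV header."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    required = ["row_id", *target_columns]
    return [name for name in required if name not in header]


# Hypothetical data: "label" stands in for a real target column name.
out_path = os.path.join(tempfile.mkdtemp(), "prediction.csv")
rows = [{"row_id": 0, "label": "a"}, {"row_id": 1, "label": "b"}]
write_predictions(out_path, rows, ["label"])
print(missing_columns(out_path, ["label"]))  # []
```

In practice the target column names would be read from verify/all_metadata.json rather than hard-coded.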

---

Repository Structure

> data/eval/databases/ contains dataset files under their original Kaggle/source licenses. All other files are released under Apache-2.0.

dare-bench/
├── data/
│   └── eval/
│       ├── question_list.json
│       └── databases/
├── LICENSE/
├── LICENSE.txt
├── scripts/
│   ├── evaluation.py # offline...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low traction new repo