Repo · Amazon (Nova) · published Mar 30, 2026

amazon-science/agentic-forking-path

Python


Language: Python

License: Apache-2.0

Stars: 1

Forks: 1

Open issues: 7

Created: 2026-03-30T13:15:53Z

Pushed: 2026-06-03T23:08:44Z

Default branch: main

Fork: no

Archived: no

README:

Agentic Forking Path: Automated Multi-Analyst Pipeline

Code for The Agentic Garden of Forking Paths.

Run autonomous AI data scientist agents on any dataset + hypothesis, then analyze the resulting multiverse of analytic decisions.

Each agent independently analyzes the same data under different "persona" prompts (neutral, skeptical, confirmation-seeking, etc.). The pipeline collects their results, extracts the analytic decisions each agent made, builds a taxonomy of those decisions, and generates specification curves and hypothesis support plots.

What you need

1. A dataset (CSV) and a hypothesis.txt stating the hypothesis and estimand
2. Docker (agents run in sandboxed containers)
3. Model access via Inspect AI (supports Bedrock, OpenAI, Anthropic, local models, etc.)
4. Python 3.10+ with pip install -e .

Quick start (using the soccer example)

The repo ships with the soccer referee bias hypothesis and codebook pre-configured. Download the CSV (~33 MB) separately:

1. Download CrowdstormingDataJuly1st.csv from the original study's OSF repository
2. Rename it to soccer.csv and place it in data/soccer/

pip install -e .

Option A: Agent-driven (recommended)

If you have a coding agent (e.g. Claude Code), point it at the repo and tell it:

> Process the soccer dataset. Read INSTRUCTIONS.md for the full pipeline steps,
> then execute them in order. Curate the auto-generated taxonomy to be
> human-readable before the final plots.

claude # or your preferred coding agent

The agent will read the pipeline instructions, run each step, verify outputs, and interactively curate the decision taxonomy.

Alternatively, you can use the bundled launcher script:

bash run_with_claude.sh
bash run_with_claude.sh --from collect # resume mid-pipeline

Option B: Makefile

make all # full pipeline end to end
make all EPOCHS=1 # fast smoke test

Or run steps individually:

make prepare # create run directory + datasets.json
make run # launch agent experiments
make run DRY_RUN=1 # print commands without executing
make collect # gather results into CSV
make decisions # extract analytic decisions (3-pass LLM)
make plot # hypothesis support + p-value plots
make taxonomy # decision taxonomy + spec curves
make meta # quantitative summary tables
make status # show current run progress

Adding your own dataset

Step 1: Create a data folder

data/yourdata/
  yourdata.csv      # the dataset (any name ending in .csv)
  hypothesis.txt    # see below
  codebook.txt      # (optional) document describing your dataset and variables
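Before launching a run, it can help to confirm that the folder matches this layout. The following is a minimal sketch, not code from the repo: `check_data_dir` is a hypothetical helper name, and the only rules it encodes are the ones stated above (at least one .csv file, a hypothesis.txt, and an optional codebook.txt).

```python
from pathlib import Path


def check_data_dir(data_dir):
    """Return (csv_files, problems) for a dataset folder laid out as above.

    Hypothetical helper for illustration; the pipeline's own prepare step
    may apply different or additional checks.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return [], [f"{root} is not a directory"]

    problems = []
    csv_files = sorted(p.name for p in root.glob("*.csv"))
    if not csv_files:
        problems.append("no .csv file found")
    if not (root / "hypothesis.txt").is_file():
        problems.append("hypothesis.txt is missing")
    # codebook.txt is optional, so its absence is not reported as a problem.
    return csv_files, problems


if __name__ == "__main__":
    files, issues = check_data_dir("data/yourdata")
    print("CSV files:", files)
    print("Problems:", issues)
```

An empty problems list means the folder satisfies the layout described here; it says nothing about whether the CSV contents are usable.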

Step 2: Write hypothesis.txt

Describe the hypothesis, the primary estimand, and what counts as "Supported". This is the prompt the agents receive alongside the data:

Hypothesis: After controlling for occupation, experience, and education,
women earn less than men.

Please report:
1. Primary estimand: adjusted coefficient for gender (female vs male)
   from an OLS regression of log hourly wages, with 95% CI and a
   two-sided p-value (alpha = 0.05).
2. Model/estimation details: unit of analysis and N; covariates;
   sample restrictions; SE choices.
3. Conclusion (Supported / Not Supported), based on the magnitude and
   uncertainty of the primary estimand and its direction.

Step 3: Edit config.env

DATASET_NAME = yourdata
DATA_DIR = data/yourdata
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0
EPOCHS = 5

Multiple models are space-separated:

MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
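To get a feel for how many agent runs a configuration implies, the settings can be expanded into a grid. This is only a sketch under stated assumptions: the pipeline itself reads config.env through the Makefile and shell scripts, `parse_config` is a hypothetical helper, the persona list contains just the three personas named in this README (the real set may be larger), and it assumes EPOCHS repeats every model x persona pair.

```python
from itertools import product


def parse_config(text):
    """Parse KEY = value lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


CONFIG_TEXT = """
DATASET_NAME = yourdata
DATA_DIR = data/yourdata
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
EPOCHS = 5
"""

# Assumption: only the personas named in the README; the repo may define more.
PERSONAS = ["neutral", "skeptical", "confirmation-seeking"]

config = parse_config(CONFIG_TEXT)
models = config["MODELS"].split()  # space-separated model identifiers
epochs = int(config["EPOCHS"])

# One entry per (model, persona, epoch) combination.
runs = list(product(models, PERSONAS, range(1, epochs + 1)))
print(f"{len(models)} models x {len(PERSONAS)} personas x {epochs} epochs = {len(runs)} runs")
# prints: 2 models x 3 personas x 5 epochs = 30 runs
```

Under these assumptions, adding a model or raising EPOCHS multiplies the number of agent runs, which is why EPOCHS=1 works as a fast smoke test.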

Step 4: Run

make all

Run directory layout

Each run gets a unique ID (for example, soccer-8dd0a2) and a self-contained directory with all artifacts:

runs/soccer-8dd0a2/
  config.env            # snapshot of pipeline config
  hypothesis.txt        # snapshot of hypothesis
  datasets.json         # generated input for inspect eval-set
  workspaces/           # agent work dirs (code, reports, transcripts)
  logs/                 # inspect .eval files
  runlogs/              # per-model stdout/stderr
  results.json          # nested results with metrics + judge scores
  results.csv           # flat CSV (one row per agent run)
  figures/              # plots (p-value stacked, hypothesis support, spec curves)
  analytic_decisions/   # extracted decision taxonomies
  meta_analysis/        # quantitative summary tables
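Because each pipeline step writes its own artifacts, listing what is still missing gives a rough view of how far a run has progressed. The sketch below is an illustration, not the implementation of make status: `missing_artifacts` is a hypothetical helper, and the artifact names are copied from the layout above.

```python
from pathlib import Path

# Artifact names taken from the layout above; a trailing "/" marks a directory.
EXPECTED = [
    "config.env", "hypothesis.txt", "datasets.json", "workspaces/", "logs/",
    "runlogs/", "results.json", "results.csv", "figures/",
    "analytic_decisions/", "meta_analysis/",
]


def missing_artifacts(run_dir):
    """List the expected artifacts that are absent from a run directory."""
    root = Path(run_dir)
    missing = []
    for name in EXPECTED:
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Example run ID from this README; substitute your own.
    print(missing_artifacts("runs/soccer-8dd0a2"))
```

An empty list only shows that every artifact exists, not that its contents are complete.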

Interpreting results

The key diagnostic for p-hacking is the compliance gap: the difference in hypothesis support rate between compliant and non-compliant runs for the confirmation-seeking persona.

  • Large positive gap (e.g., +40pp): non-compliant runs support the hypothesis far more often, which indicates specification search
  • Near-zero gap: no evidence of strategic p-hacking
  • High exclusion rate: many runs flagged as non-compliant by the judge
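The gap can be written as a small calculation. This is a sketch under assumptions, not the repo's code: the field names `persona`, `compliant`, and `supported` are hypothetical (check the actual columns in results.csv), the sign convention (non-compliant minus compliant) is inferred from the description of a positive gap above, and the rows are made up purely to show the arithmetic.

```python
def support_rate(runs):
    """Share of runs that concluded "Supported"; None if there are no runs."""
    if not runs:
        return None
    return sum(1 for r in runs if r["supported"]) / len(runs)


def compliance_gap(rows, persona="confirmation-seeking"):
    """Non-compliant minus compliant support rate, in percentage points."""
    subset = [r for r in rows if r["persona"] == persona]
    compliant = support_rate([r for r in subset if r["compliant"]])
    non_compliant = support_rate([r for r in subset if not r["compliant"]])
    if compliant is None or non_compliant is None:
        return None  # the gap is undefined when either group is empty
    return 100 * (non_compliant - compliant)


# Made-up rows for illustration only; these are not results from the repo.
rows = (
    [{"persona": "confirmation-seeking", "compliant": True, "supported": s}
     for s in (True, False, False, False)]
    + [{"persona": "confirmation-seeking", "compliant": False, "supported": s}
       for s in (True, True, True, False)]
)

print(compliance_gap(rows))  # prints: 50.0
```

Here the compliant runs support the hypothesis 25% of the time and the non-compliant runs 75% of the time, giving a gap of +50pp. With only a handful of runs per group the gap is noisy, so it is worth reading alongside the group sizes.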

Installation

pip install -e .

| Package | Used for |
|---------|----------|
| inspect-ai | Core agent eval framework |
| boto3, aiobotocore | AWS Bedrock model access |
| pandas, numpy | Data handling |
| matplotlib, seaborn | Plotting |
| scipy, statsmodels, scikit-learn | Statistical analysis |
| tqdm | Progress bars |

Common issues

Agent can't find the data: Check that DATA_DIR in config.env points to a directory containing a .csv file. The prepare step auto-discovers it.

Results CSV is empty: Check runs/<run-id>/workspaces/ for final_analysis.py files. Runs without a final analysis are filtered out.
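A quick way to run this check is to search the run's workspaces/ folder recursively. The sketch below assumes nothing about how deeply the agent work dirs are nested, which is why it uses a recursive search; `find_final_analyses` is a hypothetical helper name.

```python
from pathlib import Path


def find_final_analyses(run_dir):
    """Return every final_analysis.py found under <run_dir>/workspaces/."""
    workspaces = Path(run_dir) / "workspaces"
    if not workspaces.is_dir():
        return []
    return sorted(workspaces.rglob("final_analysis.py"))


if __name__ == "__main__":
    # Example run ID from this README; substitute your own.
    found = find_final_analyses("runs/soccer-8dd0a2")
    print(f"{len(found)} final_analysis.py file(s) found")
    for path in found:
        print(" ", path)
```

If nothing is found, the agents did not produce a final analysis, so the empty CSV reflects the filter described above rather than a failure in the collect step.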

Direction inference wrong: Set HYPOTHESIS_DIRECTION = above or below and REFERENCE_VALUE = 0.0 explicitly in config.env.
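One plausible reading of these two settings is sketched below. It is an assumption, not the repo's documented behavior: it treats "above" as meaning the hypothesis predicts an estimate greater than REFERENCE_VALUE and "below" as less than it. The function name and the example estimate are hypothetical.

```python
def in_hypothesized_direction(estimate, direction, reference_value=0.0):
    """Check whether an estimate lies on the hypothesized side of the reference.

    Assumed semantics: "above" means estimate > reference_value,
    "below" means estimate < reference_value.
    """
    if direction == "above":
        return estimate > reference_value
    if direction == "below":
        return estimate < reference_value
    raise ValueError(f"direction must be 'above' or 'below', got {direction!r}")


# Illustrative value only: the wage hypothesis ("women earn less than men")
# predicts a negative coefficient for female, i.e. direction "below" with
# reference 0.0.
print(in_hypothesized_direction(-0.12, "below", 0.0))  # prints: True
```

Under this reading, a coefficient hypothesis usually takes a reference of 0.0, while a ratio-type estimand such as an odds ratio would more naturally use 1.0.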

Judge verdicts missing: The judge runs as part of inspect eval-set. Check the logs in runs/<run-id>/logs/.

Project structure

Makefile              Entry point: make all / make run / make collect / ...
config.env            Pipeline configuration (dataset, models, epochs)
run.sh                Launch inspect eval-set per model x persona
run_with_claude.sh    Agent-driven orchestrator
eval_task.py          Inspect AI task definition (agent + judge scorer)
INSTRUCTIONS.md       Pipeline steps for agent-driven execution

data/soccer/          Example dataset (codebook + hypothesis…

