Fork · StreamLake (Kuaishou) · published Sep 10, 2025

kwaipilot/experiments

forked from SWE-bench/experiments

Description: Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-09-10T07:18:03Z

Pushed: 2025-08-28T17:46:13Z

Default branch: main

Fork: yes

Parent repository: SWE-bench/experiments

Archived: no

README:

SWE-bench Experiments

This repository contains records of submissions to the SWE-bench leaderboard.

How is this repository organized?

experiments/
├── evaluation/
│   ├── lite/
│   ├── verified/
│   ├── multimodal/
│   ├── multilingual/
│   └── test/
│       ├── _
│       │   ├── all_preds.jsonl
│       │   ├── metadata.yaml
│       │   ├── README.md
│       │   ├── logs// (Execution Logs)
│       │   └── trajs/*.traj (Reasoning Traces)
│       └── ...
└── validation/
    ├── dev
    └── test

The top-level directories in evaluation/ are the different splits of SWE-bench (lite, verified, multilingual, test) and SWE-bench Multimodal.

  • Each subfolder is a submission to that benchmark.
  • A subfolder contains the predictions, results, execution logs, and trajectories (if applicable) for the submission.
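As a rough illustration of this layout, the sketch below walks a local checkout and groups submission folders by split. It assumes the date-plus-model folder naming used in this README's examples (e.g. 20231010_rag_claude2). The helpers `parse_submission_name` and `list_submissions` are hypothetical names introduced here, not part of this repository's analysis package.

```python
from datetime import date, datetime
from pathlib import Path


def parse_submission_name(name: str) -> tuple[date, str]:
    """Split '<YYYYMMDD>_<model>' into a date and a model name (assumed convention)."""
    stamp, sep, model = name.partition("_")
    if not sep or not model:
        raise ValueError(f"expected '<date>_<model>', got {name!r}")
    return datetime.strptime(stamp, "%Y%m%d").date(), model


def list_submissions(root: Path) -> dict[str, list[str]]:
    """Map each split under root/evaluation to its sorted submission folder names."""
    result: dict[str, list[str]] = {}
    for split_dir in sorted((root / "evaluation").iterdir()):
        if split_dir.is_dir():
            result[split_dir.name] = sorted(
                p.name for p in split_dir.iterdir() if p.is_dir()
            )
    return result
```

Calling `list_submissions(Path("."))` from the repository root would return one entry per split directory. Folder names that do not follow the assumed convention make `parse_submission_name` raise `ValueError` rather than guess.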

The validation/ folder contains the validation logs for the dev and test splits of SWE-bench. Each of these top-level folders consists of repo-level subfolders (e.g. pallets/flask is a test split repository, so there is a flask/ folder under validation/test/). The validation/test_202404 folder is a re-run of validation performed in April 2024 to ensure that the behavior of task instances has remained reproducible since SWE-bench was created in September 2023 (you can read more about the re-run here).

These logs are publicly accessible and meant to enable greater reproducibility and transparency of the experiments conducted on the SWE-bench task.

🔎 Viewing Logs, Trajectories

You can download the logs and trajectories for each submission by running the following command:

python -m analysis.download_logs evaluation//
python -m analysis.download_logs evaluation/lite/20231010_rag_claude2

Logs and trajectories are saved to a public S3 Bucket. *You need an AWS account to download the logs and trajectories*. Namely, you'll need to create an AWS account, download the AWS CLI, and configure the CLI with your credentials.
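Before running the download command, it can help to confirm that the AWS CLI is installed and that credentials are configured. The following is a minimal preflight sketch, not part of this repository: `aws_cli_ready` is a hypothetical helper that shells out to `aws sts get-caller-identity`, which succeeds only when the CLI can resolve valid credentials.

```python
import shutil
import subprocess


def aws_cli_ready(which=shutil.which, run=subprocess.run) -> bool:
    """Return True if the `aws` executable is on PATH and credentials resolve.

    `which` and `run` are injectable only so the check can be exercised
    without a real AWS installation.
    """
    if which("aws") is None:
        return False
    # `aws sts get-caller-identity` exits non-zero when credentials are
    # missing or invalid.
    proc = run(
        ["aws", "sts", "get-caller-identity"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0
```

If `aws_cli_ready()` returns False, install the AWS CLI and run `aws configure` before retrying the download.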

🏆 Leaderboard Participation

To evaluate on SWE-bench, check out the main repository for instructions. You have two options:

  • (Recommended) Use our sb-cli tool for fast evaluations on the cloud.
  • Run locally with the main repository.

Please follow these instructions carefully to ensure your submission is merged on time!

SWE-bench [Lite, Verified, Multilingual]

1. Fork this repository.
2. Under the split that you evaluate on (e.g. evaluation/lite/), create a new folder with the submission date and the model name (e.g. 20240415_sweagent_gpt4).
3. Within the folder (evaluation//), please provide the following:

📋 Required Assets

  • all_preds.jsonl or preds.json: Model predictions
  • metadata.yaml: See checklist.md
  • README.md: See checklist.md
  • trajs/: Reasoning traces reflecting how your system solved each task instance (see below for more details)
  • logs/: SWE-bench evaluation artifacts dump
      • Evaluation artifacts means 300/500/300/2294 (Lite/Verified/Multilingual/Test) folders. Each folder (e.g. astropy__astropy-1234) contains:
          • patch.diff: The model's generated prediction
          • report.json: Summary of evaluation outcomes for this instance
          • test_output.txt: The output of running eval.sh on patch.diff
          • (Optional) eval.sh: The evaluation script
          • (Optional) run_instance.log: A log of SWE-bench evaluation steps
      • NOTE: You shouldn't have to create any of these files. They should be generated automatically by SWE-bench evaluation.
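The required assets above can be checked locally before opening a pull request. The sketch below is a hypothetical helper (`check_submission` is not part of this repository) that reports missing files and folders. It treats the 300/500/300/2294 folder counts as exact, which is an assumption. The maintainers' own checks may differ.

```python
from pathlib import Path

# Expected number of per-instance folders under logs/, taken from the list above.
# Treating these as exact counts is an assumption of this sketch.
EXPECTED_INSTANCES = {"lite": 300, "verified": 500, "multilingual": 300, "test": 2294}
# Files the list above does not mark as optional inside each instance folder.
REQUIRED_INSTANCE_FILES = ("patch.diff", "report.json", "test_output.txt")


def check_submission(folder: Path, split: str) -> list[str]:
    """Return human-readable problems found in a submission folder (empty if none)."""
    problems = []
    if not any((folder / n).is_file() for n in ("all_preds.jsonl", "preds.json")):
        problems.append("missing all_preds.jsonl or preds.json")
    for name in ("metadata.yaml", "README.md"):
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    for name in ("trajs", "logs"):
        if not (folder / name).is_dir():
            problems.append(f"missing {name}/")
    logs = folder / "logs"
    instances = [p for p in logs.iterdir() if p.is_dir()] if logs.is_dir() else []
    if logs.is_dir() and len(instances) != EXPECTED_INSTANCES[split]:
        problems.append(
            f"logs/ has {len(instances)} folders, expected {EXPECTED_INSTANCES[split]}"
        )
    for inst in sorted(instances):
        for name in REQUIRED_INSTANCE_FILES:
            if not (inst / name).is_file():
                problems.append(f"{inst.name}: missing {name}")
    return problems
```

For example, `check_submission(Path("evaluation/lite/20240415_sweagent_gpt4"), "lite")` returns an empty list when nothing is missing. The check only inspects file presence and does not validate file contents.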

4. Run python -m analysis.get_results evaluation//. ⚠️ This will remove all files in the directory that aren't required for the submission.
5. Create a pull request to this repository with the new folder.

NOTE: Example of a well-formatted submission

SWE-bench Multimodal

Follow the instructions here.

NOTE:

  • SWE-bench Multimodal predictions can *only* be evaluated using sb-cli.
  • You do *not* need to submit predictions, logs/, or trajs/ for SWE-bench Multimodal.
  • Please follow the instructions for metadata.yaml and README.md given in checklist.md.

✅ Result Verification

If you are interested in receiving the "verified" checkmark on your submission, please do the following:

1. Create an issue.
2. In the issue, provide us with instructions on how to run your model on SWE-bench.
3. We will run your model on a random subset of SWE-bench and verify the results.

💭 Reasoning Traces

(7/29/2024) We have updated the SWE-bench leaderboard submission criteria to require the inclusion of *reasoning traces*. The goal of this requirement is to give the community more insight into how cutting-edge methods work without requiring a code release (although the latter is still highly encouraged!).

What is a reasoning trace?

A reasoning trace is a text-based file that describes the steps your system took to solve a task instance. It should provide a detailed account of the reasoning process that your system used to arrive at its solution.

We purposely do not define a strict format for reasoning traces.

We do have some guidelines. The reasoning trace should be...

  • Human-readable.
  • Reflective of the intermediate steps your system took that led to the final solution.
  • Generated *with* the inference process, not post-hoc.

We do not require reasoning traces to be...

  • In a specific file format (e.g. json, yaml, md)
  • Conformant to a specific problem-solving style (e.g. agentic, procedural)

A simple solution to this? When running inference, simply log the intermediate output generated by your system. For an example, see [SWE-agent + GPT 4 Turbo…
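As one possible way to log intermediate output during inference, the sketch below appends each step to a plain-text file as it happens. `TraceLogger` is a hypothetical class, and the file name and step layout are illustrative choices. Naming the file after the instance ID is an assumption. Only the trajs/ folder and the .traj extension come from the layout shown earlier.

```python
from datetime import datetime, timezone
from pathlib import Path


class TraceLogger:
    """Append human-readable steps to trajs/<instance_id>.traj as they happen."""

    def __init__(self, submission_dir: Path, instance_id: str) -> None:
        # One trace file per task instance (an assumption of this sketch).
        self.path = submission_dir / "trajs" / f"{instance_id}.traj"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log(self, kind: str, text: str) -> None:
        """Write one step immediately, so the trace is produced during inference."""
        self.step += 1
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(f"[{self.step}] {stamp} {kind.upper()}\n{text.rstrip()}\n\n")
```

A system would call something like `trace.log("thought", ...)` and `trace.log("action", ...)` at each step of its loop. Because every call appends to the file right away, the trace is generated with the inference process rather than reconstructed afterwards.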

Notability

notability 2.0/10

Routine fork, no traction or notable content