kwaipilot/experiments
forked from SWE-bench/experiments
Description: Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-09-10T07:18:03Z
Pushed: 2025-08-28T17:46:13Z
Default branch: main
Fork: yes
Parent repository: SWE-bench/experiments
Archived: no
README:
SWE-bench Experiments
This repository contains records of submissions to the SWE-bench leaderboard.
How is this repository organized?
experiments/
├── evaluation/
│   ├── lite/
│   ├── verified/
│   ├── multimodal/
│   ├── multilingual/
│   └── test/
│       ├── _
│       │   ├── all_preds.jsonl
│       │   ├── metadata.yaml
│       │   ├── README.md
│       │   ├── logs// (Execution Logs)
│       │   └── trajs/*.traj (Reasoning Traces)
│       └── ...
└── validation/
    ├── dev
    └── test
Top-level directories in evaluation/ are different splits of SWE-bench (lite, verified, multilingual, test) and SWE-bench Multimodal.
- Each subfolder is a submission to that benchmark.
- A subfolder contains the predictions, results, execution logs, and trajectories (if applicable) for the submission.
The validation/ folder contains the validation logs for the dev and test splits of SWE-bench. Each of these top-level folders consists of repo-level subfolders (e.g. pallets/flask is a test split repository, so there is a flask/ folder under validation/test/). The validation/test_202404 folder is a re-run of validation performed in April 2024 to ensure reproducibility of task instances' behavior since SWE-bench was created in September 2023 (you can read more about the re-run here).
These logs are publicly accessible and meant to enable greater reproducibility and transparency of the experiments conducted on the SWE-bench task.
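The layout above can be walked programmatically. The following is a minimal sketch, assuming a local checkout of the repository; the function name `list_submissions` is hypothetical and not part of the repository's tooling.

```python
from pathlib import Path


def list_submissions(repo_root):
    """Map each split under evaluation/ to its sorted submission folder names.

    Hypothetical helper: it only assumes the evaluation/<split>/<submission>/
    layout described in the README.
    """
    evaluation = Path(repo_root) / "evaluation"
    submissions = {}
    for split_dir in sorted(p for p in evaluation.iterdir() if p.is_dir()):
        submissions[split_dir.name] = sorted(
            child.name for child in split_dir.iterdir() if child.is_dir()
        )
    return submissions
```

Calling `list_submissions(".")` from the repository root would return a dictionary keyed by split name (for example `lite` or `verified`), each mapped to the submission folders it contains.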
🔎 Viewing Logs, Trajectories
You can download the logs and trajectories for each submission by running the following command:
python -m analysis.download_logs evaluation//
python -m analysis.download_logs evaluation/lite/20231010_rag_claude2
Logs and trajectories are saved to a public S3 Bucket. *You need an AWS account to download the logs and trajectories*. Namely, you'll need to create an AWS account, download the AWS CLI, and configure the CLI with your credentials.
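If you want to script the download, a small wrapper can assemble the command and check that the AWS CLI is installed first. This is a sketch under assumptions: the argument shape `evaluation/<split>/<submission>` is inferred from the `evaluation/lite/20231010_rag_claude2` example, and both function names are hypothetical.

```python
import shutil
import sys


def build_download_command(split, submission):
    """Return the argv list for downloading one submission's logs and trajectories.

    Assumption: the module takes a path of the form evaluation/<split>/<submission>,
    as in the README example evaluation/lite/20231010_rag_claude2.
    """
    return [
        sys.executable,
        "-m",
        "analysis.download_logs",
        f"evaluation/{split}/{submission}",
    ]


def aws_cli_available():
    """Return True if an `aws` executable is on PATH (credentials are not checked)."""
    return shutil.which("aws") is not None
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once `aws_cli_available()` returns `True` and the CLI has been configured with your credentials.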
🏆 Leaderboard Participation
To evaluate on SWE-bench, check out the main repository for instructions. You have two options:
- (Recommended) Use our sb-cli tool for fast evaluations on the cloud.
- Run locally with the main repository.
Please follow these instructions carefully to ensure your submission is merged on time!
SWE-bench [Lite, Verified, Multilingual]
1. Fork this repository.
2. Under the split that you evaluate on (e.g. evaluation/lite/), create a new folder with the submission date and the model name (e.g. 20240415_sweagent_gpt4).
3. Within the folder (evaluation//), please provide the following:
📋 Required Assets
- all_preds.jsonl or preds.json: Model predictions
- metadata.yaml: See checklist.md
- README.md: See checklist.md
- trajs/: Reasoning traces reflecting how your system solved each task instance (see below for more details)
- logs/: SWE-bench evaluation artifacts dump
  - Eval. artifacts means 300/500/300/2294 (Lite/Verified/Multilingual/Test) folders. Each folder (e.g. astropy__astropy-1234) contains:
    - patch.diff: The model's generated prediction
    - report.json: Summary of evaluation outcomes for this instance
    - test_output.txt: An output of running eval.sh on patch.diff
    - (Not necessary) eval.sh: The evaluation script
    - (Not necessary) run_instance.log: A log of SWE-bench evaluation steps
  - NOTE: You shouldn't have to create any of these files. They should automatically be generated by SWE-bench evaluation.
4. Run python -m analysis.get_results evaluation//. ⚠️ This will remove all files in the directory that aren't required for the submission.
5. Create a pull request to this repository with the new folder.
> [!NOTE]
> Example of a well-formatted submission
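Before opening a pull request, it can help to check a submission folder against the required assets listed above. The following is a rough sketch, not the repository's own validation: `check_submission` is a hypothetical name, the folder-name pattern (eight-digit date, underscore, model name) is inferred from the `20240415_sweagent_gpt4` example, and it assumes the per-instance folders sit directly under logs/.

```python
import re
from pathlib import Path

# Expected number of per-instance log folders for each split, as stated in the README.
EXPECTED_INSTANCES = {"lite": 300, "verified": 500, "multilingual": 300, "test": 2294}
REQUIRED_LOG_FILES = ("patch.diff", "report.json", "test_output.txt")
# Assumed naming pattern: 8-digit date, underscore, model name.
FOLDER_NAME = re.compile(r"^\d{8}_.+$")


def check_submission(folder, split):
    """Return a list of human-readable problems; an empty list means none were found."""
    folder = Path(folder)
    problems = []
    if not FOLDER_NAME.match(folder.name):
        problems.append(f"folder name {folder.name!r} is not <date>_<model>")
    if not any((folder / name).is_file() for name in ("all_preds.jsonl", "preds.json")):
        problems.append("missing all_preds.jsonl or preds.json")
    for name in ("metadata.yaml", "README.md"):
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    if not (folder / "trajs").is_dir():
        problems.append("missing trajs/")
    logs = folder / "logs"
    if not logs.is_dir():
        problems.append("missing logs/")
        return problems
    # Assumption: each direct child directory of logs/ is one task instance.
    instance_dirs = sorted(p for p in logs.iterdir() if p.is_dir())
    expected = EXPECTED_INSTANCES.get(split)
    if expected is not None and len(instance_dirs) != expected:
        problems.append(
            f"logs/ has {len(instance_dirs)} instance folders, expected {expected}"
        )
    for inst in instance_dirs:
        for name in REQUIRED_LOG_FILES:
            if not (inst / name).is_file():
                problems.append(f"{inst.name}: missing {name}")
    return problems
```

For example, `check_submission("evaluation/lite/20240415_sweagent_gpt4", "lite")` returns an empty list when nothing is missing. The optional eval.sh and run_instance.log files are deliberately not checked.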
SWE-bench Multimodal
Follow the instructions here.
> [!NOTE]
> * SWE-bench Multimodal predictions can *only* be evaluated using sb-cli.
> * You do *not* need to submit predictions, logs/, or trajs/ for SWE-bench Multimodal.
> * Please follow the instructions for metadata.yaml and README.md as discussed in the checklist.md
✅ Result Verification
If you are interested in receiving the "verified" checkmark on your submission, please do the following:
1. Create an issue.
2. In the issue, provide us instructions on how to run your model on SWE-bench.
3. We will run your model on a random subset of SWE-bench and verify the results.
💭 Reasoning Traces
(7/29/2024) We have updated the SWE-bench leaderboard submission criteria to require the inclusion of *reasoning traces*. The goal of this requirement is to provide the community with more insight into how cutting-edge methods work without requiring a code release (although the latter is still highly encouraged!).
What is a reasoning trace?
A reasoning trace is a text-based file that describes the steps your system took to solve a task instance. It should provide a detailed account of the reasoning process that your system used to arrive at its solution.
We purposely do not define a strict format for reasoning traces.
We do have some guidelines. The reasoning trace should be...
- Human-readable.
- Reflective of the intermediate steps your system took that led to the final solution.
- Generated *with* the inference process, not post-hoc.
We do not require reasoning traces to be...
- In a specific file format (e.g. json, yaml, md)
- Conforming to a specific problem-solving style (e.g. agentic, procedural, etc.)
A simple solution to this? When running inference, simply log the intermediate output generated by your system. For an example, see [SWE-agent + GPT 4 Turbo…
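As an illustration of logging intermediate output while inference runs, here is a minimal sketch. The `TraceLogger` class, its step-header layout, and the `kind` labels are all hypothetical; the README does not prescribe a trace format, only that traces be human-readable and generated with the inference process. The `.traj` extension follows the `trajs/*.traj` pattern in the directory layout.

```python
from pathlib import Path


class TraceLogger:
    """Append intermediate steps for one task instance to trajs/<instance_id>.traj.

    Hypothetical helper: the plain-text layout below is one possible choice,
    not a required format.
    """

    def __init__(self, trajs_dir, instance_id):
        self.path = Path(trajs_dir) / f"{instance_id}.traj"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log(self, kind, text):
        """Write one step immediately, so the trace is produced during inference."""
        self.step += 1
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(f"--- step {self.step}: {kind} ---\n{text}\n")
```

An inference loop would create one logger per task instance, for example `TraceLogger("trajs", "astropy__astropy-1234")`, and call `log("thought", ...)` or `log("action", ...)` as each intermediate output is produced rather than reconstructing the trace afterwards.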
Notability
2.0/10: Routine fork, no traction or notable content