Swe Bench Sonnet
Captured source
source ↗Claude SWE-Bench Performance \ Anthropic Engineering at Anthropic Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
Published Jan 06, 2025 SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks.
Our latest model, the upgraded Claude 3.5 Sonnet , achieved 49% on SWE-bench Verified, a software engineering evaluation, beating the previous state-of-the-art model's 45%. This post explains the "agent" we built around the model, and is intended to help developers get the best possible performance out of Claude 3.5 Sonnet. SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks. Specifically, it tests how the model can resolve GitHub issues from popular open-source Python repositories. For each task in the benchmark, the AI model is given a set up Python environment and the checkout (a local working copy) of the repository from just before the issue was resolved. The model then needs to understand, modify, and test the code before submitting its proposed solution. Each solution is graded against the real unit tests from the pull request that closed the original GitHub issue. This tests whether the AI model was able to achieve the same functionality as the original human author of the PR. SWE-bench doesn't just evaluate the AI model in isolation, but rather an entire "agent" system. In this context, an "agent" refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt. The performance of an agent on SWE-bench can vary significantly based on this scaffolding, even when using the same underlying AI model. There are many other benchmarks for the coding abilities of Large Language Models, but SWE-bench has gained in popularity for several reasons: It uses real engineering tasks from actual projects, rather than competition- or interview-style questions; It is not yet saturated—there’s plenty of room for improvement. No model has yet crossed 50% completion on SWE-bench Verified (though the updated Claude 3.5 Sonnet is, at the time of writing, at 49%); It measures an entire "agent", rather than a model in isolation. Open-source developers and startups have had great success in optimizing scaffoldings to greatly improve the performance around the same model.
Note that the original SWE-bench dataset contains some tasks that are impossible to solve without additional context outside of the GitHub issue (for example, about specific error messages to return). SWE-bench-Verified is a 500 problem subset of SWE-bench that has been reviewed by humans to make sure they are solvable, and thus provides the most clear measure of coding agents' performance. This is the benchmark to which we’ll refer in this post. Achieving state-of-the-art Tool Using Agent Our design philosophy when creating the agent scaffold optimized for updated Claude 3.5 Sonnet was to give as much control as possible to the language model itself, and keep the scaffolding minimal. The agent has a prompt, a Bash Tool for executing bash commands, and an Edit Tool, for viewing and editing files and directories. We continue to sample until the model decides that it is finished, or exceeds its 200k context length. This scaffold allows the model to use its own judgment of how to pursue the problem, rather than be hardcoded into a particular pattern or workflow. The prompt outlines a suggested approach for the model, but it’s not overly long or too detailed for this task. The model is free to choose how it moves from step to step, rather than having strict and discrete transitions. If you are not token-sensitive, it can help to explicitly encourage the model to produce a long response. The following code shows the prompt from our agent scaffold:
{location}
I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:
{pr_description}
Can you help me implement the necessary changes to the repository so that the requirements specified in the are met? I've already taken care of all changes to any of the test files described in the . This means you DON'T have to modify the testing logic or any of the tests in any way!
Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the is satisfied.
Follow these steps to resolve the issue: 1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure. 2. Create a script to reproduce the error and execute it with python using the BashTool, to confirm the error 3. Edit the sourcecode of the repo to resolve the issue 4. Rerun your reproduce script and confirm that the error is fixed! 5. Think about edgecases and make sure your fix handles them as well
Your thinking should be thorough and so it's fine if it's very long. Copy
The model's first tool executes Bash commands. The schema is simple, taking only the command to be run in the environment. However, the description of the tool carries more weight. It includes more detailed instructions for the model, including escaping inputs, lack of internet access, and how to run commands in the background. Next, we show the spec for the Bash Tool: { "name": "bash", "description": "Run commands in a bash shell\n
- When invoking this tool, the contents of the \"command\" parameter does NOT need to be XML-escaped.\n
- You don't have access to the internet via this tool.\n
- You do have access to a mirror of common linux and python packages via apt and pip.\n
- State is persistent across command calls and discussions with the user.\n
- To inspect a particular line range of a file, e.g. lines 10-25, try 'sed -n 10,25p /path/to/the/file'.\n
- Please avoid commands that may produce a very large amount of output.\n
- Please run long lived commands in the background, e.g. 'sleep 10 &' or start a server in the background.",
"input_schema": { "type": "object", "properties": { "command": { "type": "string", "description": "The bash command to run." } }, "required": ["command"] } } Copy
The model's second tool (the…
Excerpt shown — open the source for the full document.