Anthropic, published Jan 14, 2026

Property Based Testing


Frontier Red Team

Finding bugs across the Python ecosystem with Claude and property-based testing

Jan 14, 2026

Muhammad Maaz (1,2), Liam DeVoe (3), Zac Hatfield-Dodds (2), Nicholas Carlini (2)

(1) MATS, (2) Anthropic, (3) Northeastern University

We developed an agent that can efficiently identify bugs in large software projects. Our agent infers general properties of code that should be true and then applies property-based testing, a technique similar to fuzz testing, to discover bugs in top Python packages like NumPy, SciPy, and Pandas. After extensive manual validation, we are in the process of reporting these bugs to the developers, and several have already been patched.

For more information, read the full paper, take a look at the GitHub repository, or browse the bugs we found at our site.

Introduction

Ensuring that programs are bug-free is one of the most challenging aspects of software engineering. Bugs frequently continue to lurk despite the developer’s best efforts. The most common approach to testing code is with example-based tests: the developer writes out a specific example use case, and then verifies that the actual output matches the expected output. For example, a developer might verify that a sort function called on the list [2, 10, 5, 4] returns the output [2, 4, 5, 10]. However, exhaustively covering a program with tests like this is challenging: bugs frequently remain in an edge case the developer did not test. After all, if a developer does not think to test an edge case, it is also likely the developer did not consider that case in the implementation!

In contrast, property-based testing is a software testing paradigm that aims to test whether a general property of the code holds for all (or most) inputs. A developer specifies a property or invariant about the program (for example, that JSON deserialization is the inverse of serialization), as well as a description of the kinds of inputs the property accepts (e.g., any JSON-serializable object).
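The JSON round-trip property just mentioned can be written down directly. The following is a minimal sketch in Hypothesis, not code from the paper: the names json_values and test_json_round_trip are illustrative, and the input domain is an assumption that deliberately excludes NaN and infinite floats, since they are not valid JSON and NaN does not compare equal to itself.

```python
import json

from hypothesis import given, strategies as st

# Illustrative input domain: JSON-serializable values built recursively from
# scalars, lists, and string-keyed dictionaries. NaN and infinity are excluded
# on purpose (an assumption of this sketch).
json_values = st.recursive(
    st.none()
    | st.booleans()
    | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False)
    | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=10,
)


@given(json_values)
def test_json_round_trip(value):
    # Property: deserialization is the inverse of serialization.
    assert json.loads(json.dumps(value)) == value
```

Calling test_json_round_trip() directly, or collecting it with pytest, runs the property against many generated inputs instead of a handful of hand-picked ones.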
The property-based testing framework then automatically searches for a counterexample to this property by generating valid inputs as test cases, using techniques similar to fuzzing. Since the developer specifies the general input domain, and not individual test cases, property-based testing frees developers from thinking of every edge case and allows them to operate at a higher level of abstraction.

In our paper, which resulted from a MATS project and was presented at the 2025 NeurIPS Deep Learning for Code Workshop, we developed an AI agent that autonomously writes property-based tests for existing code. We directed the agent to discover properties by reading type annotations, docstrings, function names, and comments. The agent then wrote corresponding property-based tests using Hypothesis.

For this work, we focused on the problem of identifying bugs in general, and not just security vulnerabilities. Many classes of logic bugs that cause security vulnerabilities are amenable to property-based testing. For example, in a recent blog post we focused on identifying bugs in smart contracts; almost all of those vulnerabilities are the result of logic bugs. In the future, we imagine that it may be possible to apply techniques such as the one we describe here to proactively identify bugs before deployment.

We used our agent and discovered hundreds of potential bugs in popular open-source Python repositories like NumPy, SciPy, and Pandas. To disclose these bugs responsibly and avoid unnecessarily burdening maintainers, we carefully reviewed each bug. The review process we used is more laborious than we would otherwise implement for our own code reviews, but we strongly preferred to reduce the number of false positives.

Our process was as follows. First, we selected only the highest-priority bugs for review. We sent these potential bugs to be reviewed by three expert humans (for an average of one hour of review per bug).
We then discarded any bug where any of the three manual reviewers was uncertain of its validity. Finally, we (the authors of this blog post) manually reviewed each of the candidate bugs. Only if we were also confident that a bug was valid did we manually file an issue with the maintainers of the repository.

We have already filed several of these bug reports, and are in the process of filing many more. We’ve made all of our data available at this site so that maintainers can look at it, including bugs that have not yet been validated and, for completeness, even bugs that we determined to be invalid. Over the coming weeks we intend to file many of these remaining bugs (after additional validation), as well as expand our project to additional PyPI projects.
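To make the idea of inferring properties from docstrings and type annotations more concrete, here is a small sketch. Everything in it is an assumption made for illustration: the helper clamp, its docstring, and the test are hypothetical and are not taken from the paper or from any of the tested packages. The docstring states a contract, and the test turns each clause of that contract into an assertion over generated inputs.

```python
from hypothesis import given, strategies as st


def clamp(x: int, lo: int, hi: int) -> int:
    """Return x limited to the closed interval [lo, hi]. Requires lo <= hi."""
    # Hypothetical target function, used only for this illustration.
    return max(lo, min(x, hi))


# Input domain taken from the docstring: any integers with lo <= hi.
# Sorting a generated pair guarantees the precondition holds.
bounds = st.tuples(st.integers(), st.integers()).map(sorted)


@given(st.integers(), bounds)
def test_clamp_matches_docstring(x, lo_hi):
    lo, hi = lo_hi
    result = clamp(x, lo, hi)
    # "limited to the closed interval [lo, hi]"
    assert lo <= result <= hi
    # Values already inside the interval should be returned unchanged.
    if lo <= x <= hi:
        assert result == x
    # Clamping an already clamped value should change nothing (idempotence).
    assert clamp(result, lo, hi) == result
```

Encoding the precondition lo <= hi in the strategy, instead of filtering inputs inside the test, keeps every generated example valid.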

example-based test

def test_sort():
    assert my_sort([1, 3, 2]) == [1, 2, 3]
    assert my_sort([1, 0, -5]) == [-5, 0, 1]

property-based test in Hypothesis

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort(lst):
    result = my_sort(lst)
    # Every element must be <= its successor.
    for i in range(len(result) - 1):
        assert result[i] <= result[i + 1]

Figure 1. Two ways of testing code. An example-based unit test verifies specific manually specified inputs. A property-based test, in contrast, specifies a general property (e.g., the definition of a sorted list) and relies on the framework to automatically construct inputs that might fail this property.

Our Property-Based Testing Agent

Our property-based testing agent is built as a custom Claude Code command. The agent takes a single argument, pointing it towards a particular target, either a single Python file (e.g., normalizers.py), a module (e.g., numpy, scipy.signal), or a function (e.g., requests.get, json.loads). The agent then follows this process to identify potential bugs:

1. Read and understand the target, by reading code, pulling relevant documentation, and exploring how it relates to the rest of the codebase.
2. Propose properties grounded in these findings.
3. Write corresponding property-based tests in Hypothesis.
4. Run the tests and reflect: if a test failed, has it truly discovered a bug, or does the test need to be adjusted? If it succeeded, is the test testing anything worthwhile, or is it simply trivial?
5. If the agent is confident it has found a real bug, it writes out a formatted bug report.

Figure 2. The workflow of the agent.

To help the agent perform long-range multi-step reasoning, we direct it to...
