Writing · Anthropic · published Dec 18, 2025

Project Vend 2


Project Vend: Phase two
Anthropic · Policy · Frontier Red Team · Dec 18, 2025

In June, we revealed that we’d set up a small shop in our San Francisco office lunchroom, run by an AI shopkeeper. It was part of Project Vend, a free-form experiment exploring how well AIs could do on complex, real-world tasks. Alas, the shopkeeper—a modified version of Claude we named “Claudius”—did not do particularly well. It lost money over time, had a strange identity crisis where it claimed it was a human wearing a blue blazer, and was goaded by mischievous Anthropic employees into selling products (particularly, for some reason, tungsten cubes) at a substantial loss.

But the capabilities of large language models in areas like reasoning, writing, coding, and much else besides are increasing at a breathless pace. Has Claudius’s “running a shop” capability shown the same improvement?

To find out, we and our partners at Andon Labs made some adjustments for phase two of Project Vend. One major change was the upgrade from an older model (phase one used Claude Sonnet 3.7) to newer, smarter ones (phase two used Claude Sonnet 4.0 and later Sonnet 4.5). We also updated Claudius’s instructions based on what we’d learned in phase one and gave it access to new tools (though note that we still didn’t specifically train a new model to be a shopkeeper, or add in any new defenses against the kinds of things that might go wrong).[1] As we’ll see below, we also introduced Claudius to some new colleagues.

These changes did make Claudius’s shop more successful. It got a lot better at good-faith business interactions—reliably sourcing items, determining reasonable prices that maintained a profit margin, and executing sales. But the same eagerness to please that we observed in phase one still made Claudius a mark for some of the more adversarial testers among our staff.

The second phase of Project Vend contains even more lessons for developers and for anyone interested in autonomous AI at work. The idea of an AI running a business doesn’t seem as far-fetched as it once did.
But the gap between “capable” and “completely robust” remains wide.

The numbers

Compared to the first phase of Project Vend, the numbers largely speak for themselves. As you can see below, Claudius’s business—which it decided to name “Vendings and Stuff”—began to perform significantly better than its admittedly rough start in phase one. Changes to the setup of Project Vend seem to have stabilized and, eventually, improved its business performance.

CRM = Claudius given access to Customer Relationship Management software; SF2 = second vending machine in San Francisco; NYC, LON = vending machines opened in New York City and London, respectively. Note: although we refer to “phase two,” there is not a completely clean demarcation between phases; we continued to iterate on the architecture throughout.

Profits made over time in Project Vend (combined across all locations). As the second phase progressed, weeks with negative profit margin were largely eliminated.

Another important number is: three. After we realized that our employees outside of San Francisco felt left out, we responded to popular demand by having Claudius set up shop in new locations. There are now three: San Francisco (where there’s also a second vending machine), New York, and London. A cynic might argue that a business that’s only been up and running for a few months, and which cannot yet reliably make a profit on even the most in-demand items, might not quite be ready for international expansion. Not so for Claudius.

What changed?

We experimented with various different strategies, some big and some small, to improve Claudius’s performance. Below is a diagram of the setup of Project Vend (compare it to the simpler architecture in our report from phase one). Each of the additions is explained in more detail below.

The basic setup of the second phase of Project Vend.
Some elements (like the CEO and Clothius) were entirely new while others (like web search and browser use) were improvements on the previous setup.

Tools

It’s likely that Claudius struggled with its shopkeeping mission in phase one because of a lack of scaffolding. Sure, the model itself was very intelligent, but it didn’t have the right tools to run a business properly. We’ve been talking a lot on our Engineering Blog about how to set up AI agents for success, and much of it involves giving them the correct tools. Could we apply those same principles to Claudius?

For phase two, we gave Claudius access to some useful tools:

A customer relationship management (CRM) system. Sales departments rely on CRMs to track their customers, suppliers, deliveries, and orders—now Claudius could do the same.

Improved inventory management. We made some simple changes to the information Claudius had at its (metaphorical) fingertips to reduce the likelihood of it selling its stock at a loss. For example, Claudius can now always see how much it paid for the items in its inventory system.

Improved web search. In phase one, Claudius could search the web, but for phase two we expanded its access. It could now use a web browser to check prices and delivery information on websites by itself, and could do deeper research online to find and compare suppliers (we still didn’t give it access to a payment interface, to ensure it always checked with a human before making purchases).

Miscellaneous. We also gave Claudius a variety of other “quality of life” tools, including one to create and read Google forms for feedback, one to create payment links (meaning that Claudius could collect payments before ordering, reducing its risk of being bilked by unscrupulous customers), and one to set reminders for itself.
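
To make the inventory change concrete, here is a minimal Python sketch of the kind of check that becomes possible once the price paid for each item is visible next to the asking price. It is an illustration only, not the tooling Anthropic or Andon Labs actually built: the `InventoryItem` type, the `check_sale_price` function, the 10% minimum margin, and the example prices are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class InventoryItem:
    """Hypothetical inventory record; unit_cost is what the shop paid per unit."""
    name: str
    unit_cost: Decimal


def check_sale_price(
    item: InventoryItem,
    price: Decimal,
    min_margin: Decimal = Decimal("0.10"),  # assumed 10% floor, not from the article
) -> tuple[bool, str]:
    """Return (ok, reason). ok is False when price is below cost plus the margin."""
    floor = item.unit_cost * (Decimal("1") + min_margin)
    if price < item.unit_cost:
        return False, f"{item.name}: {price} is below unit cost {item.unit_cost}"
    if price < floor:
        return False, f"{item.name}: {price} is below the margin floor {floor}"
    return True, f"{item.name}: {price} clears the margin floor {floor}"


# Example prices are invented for illustration.
cube = InventoryItem("tungsten cube", Decimal("60.00"))
print(check_sale_price(cube, Decimal("45.00")))  # a sale at a loss is rejected
print(check_sale_price(cube, Decimal("70.00")))  # a sale above the floor is accepted
```

A guard like this only covers good-faith pricing mistakes. It would not by itself stop an agent that has been talked into granting a discount, which is the gap the adversarial testers kept probing.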

The CEO

In phase one, Claudius went it alone: a single AI agent ran the whole shop. This was admirable and entrepreneurial, but it didn’t work—at least in terms of the bottom line. So we thought we’d do some hiring.

First, we gave Claudius a manager: the CEO of its shopkeeping business, whom we named “Seymour Cash.” The idea of having a CEO was to give Claudius more pressure to perform. Cash had a special “objectives and key results” tool to use with Claudius (for example “you must sell 100 items this week,” or “aim to make zero transactions at a loss”). Claudius was required to report back via an agent-to-agent Slack channel we created, in which the models discussed business strategies. Cash took on the role of the CEO…
