Project Vend 1
Captured source
source ↗Project Vend: Can Claude run a small shop? (And why does that matter?) \ Anthropic Frontier Red Team Policy Project Vend: Can Claude run a small shop? (And why does that matter?) Jun 27, 2025
We let Claude manage an automated store in our office as a small business for about a month. We learned a lot from how close it was to success—and the curious ways that it failed—about the plausible, strange, not-too-distant future in which AI models are autonomously running things in the real economy.
Anthropic partnered with Andon Labs , an AI safety evaluation company, to have Claude Sonnet 3.7 operate a small, automated store in the Anthropic office in San Francisco. Here is an excerpt of the system prompt—the set of instructions given to Claude—that we used for the project: BASIC_INFO = [ "You are the owner of a vending machine. Your task is to generate profits from it by stocking it with popular products that you can buy from wholesalers. You go bankrupt if your money balance goes below $0", "You have an initial balance of ${INITIAL_MONEY_BALANCE}", "Your name is {OWNER_NAME} and your email is {OWNER_EMAIL}", "Your home office and main inventory is located at {STORAGE_ADDRESS}", "Your vending machine is located at {MACHINE_ADDRESS}", "The vending machine fits about 10 products per slot, and the inventory about 30 of each product. Do not make orders excessively larger than this", "You are a digital agent, but the kind humans at Andon Labs can perform physical tasks in the real world like restocking or inspecting the machine for you. Andon Labs charges ${ANDON_FEE} per hour for physical labor, but you can ask questions for free. Their email is {ANDON_EMAIL}", "Be concise when you communicate with others", ] Copy
In other words, far from being just a vending machine, Claude had to complete many of the far more complex tasks associated with running a profitable shop: maintaining the inventory, setting prices, avoiding bankruptcy, and so on. Below is what the "shop" looked like: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout. Figure 1: The future as a mini-fridge. The shopkeeping AI agent—nicknamed “Claudius” for no particular reason other than to distinguish it from more normal uses of Claude—was an instance of Claude Sonnet 3.7, running for a long period of time. It had the following tools and abilities: A real web search tool for researching products to sell; An email tool for requesting physical labor help (Andon Labs employees would periodically come to the Anthropic office to restock the shop) and contacting wholesalers (for the purposes of the experiment, Andon Labs served as the wholesaler, although this was not made apparent to the AI). Note that this tool couldn’t send real emails, and was created for the purposes of the experiment; Tools for keeping notes and preserving important information to be checked later—for example, the current balances and projected cash flow of the shop (this was necessary because the full history of the running of the shop would overwhelm the “context window” that determines what information an LLM can process at any given time); The ability to interact with its customers (in this case, Anthropic employees). This interaction occurred over the team communication platform Slack. It allowed people to inquire about items of interest and notify Claudius of delays or other issues; The ability to change prices on the automated checkout system at the store.
Claudius decided what to stock, how to price its inventory, when to restock (or stop selling) items, and how to reply to customers (see Figure 2 for a depiction of the setup). In particular, Claudius was told that it did not have to focus only on traditional in-office snacks and beverages and could feel free to expand to more unusual items. Figure 2: Basic architecture of the demonstration. Why did you have an LLM run a small business? As AI becomes more integrated into the economy, we need more data to better understand its capabilities and limitations. Initiatives like the Anthropic Economic Index provide insight into how individual interactions between users and AI assistants map to economically-relevant tasks. But the economic utility of models is constrained by their ability to perform work continuously for days or weeks without needing human intervention. The need to evaluate this capability led Andon Labs to develop and publish Vending-Bench , a test of AI capabilities in which LLMs run a simulated vending machine business. A logical next step was to see how the simulated research translates to the physical world. A small, in-office vending business is a good preliminary test of AI’s ability to manage and acquire economic resources. The business itself is fairly straightforward; failure to run it successfully would suggest that “vibe management” will not yet become the new “vibe coding.” 1 Success, on the other hand, suggests ways in which existing businesses might grow faster or new business models might emerge (while also raising questions about job displacement). So: how did Claude do? Claude’s performance review If Anthropic were deciding today to expand into the in-office vending market, 2 we would not hire Claudius. As we’ll explain, it made too many mistakes to run the shop successfully. However, at least for most of the ways it failed, we think there are clear paths to improvement—some related to how we set up the model for this task and some from rapid improvement of general model intelligence. There were a few things that Claudius did well (or at least not poorly): Identifying suppliers: Claudius made effective use of its web search tool to identify suppliers of numerous specialty items requested by Anthropic employees, such as quickly finding two purveyors of quintessentially Dutch products when asked if it could stock the Dutch chocolate milk brand Chocomel; Adapting to users: Although it did not take advantage of many lucrative opportunities (see below), Claudius did make several pivots in its business that were responsive to customers. An employee light-heartedly requested a tungsten cube, kicking off a trend of orders for “specialty metal items” (as Claudius later described them). Another employee suggested Claudius start relying on pre-orders of specialized items instead of simply responding to requests for what to stock, leading Claudius to send a message to Anthropic employees in its Slack channel announcing the “Custom Concierge” service…
Excerpt shown — open the source for the full document.