UpstageAI/tau2-bench
forked from sierra-research/tau2-bench
Captured source
source ↗UpstageAI/tau2-bench
Description: τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-08-07T05:50:18Z
Pushed: 2025-10-01T04:51:26Z
Default branch: main
Fork: yes
Parent repository: sierra-research/tau2-bench
Archived: no
README:
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Overview
$\tau^2$-bench implements a simulation framework for evaluating customer service agents across various domains.
Each domain specifies:
- a policy that the agent must follow
- a set of tools that the agent can use
- a set of tasks to evaluate the agent's performance
- Optionally: A set of tools that the user simulator can use
Domains are:
mockairlineretailtelecom
All the information that an agent developer needs to build an agent for a domain can be accessed through the domain's API docs. See [View domain documentation](#view-domain-documentation) for more details.
Installation
1. Clone the repository:
git clone https://github.com/sierra-research/tau2-bench cd tau2-bench
2. Create a new environment (optional)
$\tau^2$-bench requires Python 3.10 or higher. You may create and activate a new environment:
python -m venv .venv source .venv/bin/activate
3. Install tau2
pip install -e .
This will enable you to run the tau2 command.
Note: If you use pip install . (without -e), you'll need to set the TAU2_DATA_DIR environment variable to point to your data directory:
export TAU2_DATA_DIR=/path/to/your/tau2-bench/data
Check your data directory setup:
After installation, you can verify that your data directory is correctly configured by running:
tau2 check-data
This command will check if the data directory exists and print instructions if it is missing.
To remove all the generated files and the virtual environment, run:
make clean
Quick Start
Setup LLM API keys
We use LiteLLM to manage LLM APIs, so you can use any LLM provider supported by LiteLLM.
To provide your API keys, copy .env.example as .env and edit it to include your API keys.
Run agent evaluation
To run a test evaluation on only 5 tasks with 1 trial per task, run:
tau2 run \ --domain airline \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ --num-trials 1 \ --num-tasks 5
Results will be saved in data/tau2/simulations/.
Command Line Interface
The tau2 command provides a unified interface for all functionality:
Running Benchmark
tau2 run \ --domain \ --agent-llm \ --user-llm \ --num-trials \ --task-ids \ --max-concurrency \ ...
Viewing Results
tau2 view
This tool allows you to:
- Browse simulation files (in
data/tau2/simulations/) - View agent performance metrics
- View a particular simulation
- View task details
View domain documentation
tau2 domain
Visit http://127.0.0.1:8004/redoc to see the domain policy and API documentation.

Check data configuration
tau2 check-data
This command checks if your data directory is properly configured and all required files are present.
Experiments
Running Ablation Studies (No User, or Agent with Oracle Plan)
telecom domain enables running ablation studies.
1. Running an LLM in no-user mode. In this mode, the LLM is given all the tools and the information upfront. Just choose llm_agent_solo as the agent and dummy_user as the user.
tau2 run \ --domain telecom \ --agent llm_agent_solo \ --agent-llm gpt-4.1 \ --user dummy_user \ ...
2. Running an LLM in oracle-plan mode. In this mode, the LLM is given an oracle plan ahead of time alleviating the need for action planning. Just choose llm_agent_gt as the agent.
tau2 run \ --domain telecom \ --agent llm_agent_gt \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ ...
Running Telecom Domain with Workflow Policy
To test the impact of policy format, we provide an additional "workflow" policy for the telecom domain. To run using this policy, use the telecom-workflow domain.
tau2 run \ --domain telecom-workflow \ --agent-llm gpt-4.1 \ --user-llm gpt-4.1 \ ...
Domains
For all the details see the domains [README](src/tau2/domains/README.md).
Basics
- Code is located in
src/tau2/domains/ - Data is located in
data/tau2/domains/ - Each domain has its own configuration and task definitions
View domain-specific policy and API docs:
Run the following command to see the domain policy and API documentation.
tau2 env
Then visit http://127.0.0.1:8004/redoc
Environment CLI (beta)
An interactive command-line interface for directly querying and testing domain environments. Features:
- Interactive query interface with domain-specific tools
- Support for multiple domains (airline, mock, etc.)
- Session management with history
To use:
make env-cli
Available commands:
:q- quit the program:d- change domain:n- start new session (clears history)
Example usage:
$ make env-cli Welcome to the Environment CLI! Connected to airline domain. Query (:n new session, :d change domain, :q quit)> What flights are available from SF to LA tomorrow? Assistant: Let me check the flight availability for you... [Flight details will appear here]
The Environment CLI is useful for:
- Testing domain tools and queries
- Debugging environment responses
- Exploring available domain functionality
- Quick domain interaction without starting the full server stack
Run tests
To run the test suite use the command
make test
Config
To configure the framework, see the [config](src/tau2/config.py) file.
LLM Calls caching
LLM call caching is disabled by default.
To enable LLM calls caching:
- Make sure
redisis running. - Update the redis config in
config.pyif necessary. - Set
LLM_CACHE_ENABLEDtoTrueinconfig.py
Evaluate Your Own Agent
For local or remote agent evaluation, see our [agent developer guide](src/tau2/agent/README.md).
Orchestration Sequence Diagram
sequenceDiagram participant O as Orchestrator participant A as Agent participant U as UserSimulator participant E as Environment Note over O: Initialize(task) rect rgb(100, 150, 150) O->>A:…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Routine fork of a benchmark repo by the same org.