Repo: Tencent Hunyuan, published Jun 26, 2025

Tencent-Hunyuan/C3-Benchmark

Python


Tencent-Hunyuan/C3-Benchmark

Description: C^3-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

Language: Python

License: NOASSERTION

Stars: 38

Forks: 3

Open issues: 0

Created: 2025-06-26T13:37:43Z

Pushed: 2026-03-01T15:28:39Z

Default branch: main

Fork: no

Archived: no

README:

C^3-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

📖 English • 中文

🤗 Dataset • 📚 Preprint Paper

![Example](./picture/first.png)

🎆 News

  • This repo has moved to https://github.com/yupeijei1997/WildToolBench. Please check our latest progress there.
  • 2026.1.26 🎉🎉🎉 Our paper Benchmarking LLM Tool-Use in the Wild has been accepted by ICLR 2026!

📖 Overview

Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must weigh more complex factors, such as inter-tool relationships, environmental feedback, and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues but overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present C^3-Bench, an open-source, high-quality benchmark. It integrates attack concepts and applies univariate analysis to pinpoint the key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms, and reproducible evaluation methods. We conduct extensive experiments on 49 mainstream agents, encompassing general fast-thinking, slow-thinking, and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies, and frequent policy-type switching. In essence, C^3-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance.

😊 Key Materials

  • Test data location: c3_bench/data/C3-Bench.jsonl or 🤗 Dataset
  • More detailed information about C3-Bench can be found below
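
Since the test data ships as a JSONL file, a first look at it can be taken with a few lines of Python. The sketch below is not part of the repository: `iter_jsonl` and `summarize` are hypothetical helper names, and no assumption is made about the record schema beyond "one JSON object per line".

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: str | Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc


def summarize(path: str | Path) -> tuple[int, list[str]]:
    """Return the record count and the sorted union of top-level keys."""
    records = list(iter_jsonl(path))
    keys = sorted({k for r in records if isinstance(r, dict) for k in r})
    return len(records), keys
```

For example, `summarize("c3_bench/data/C3-Bench.jsonl")`, run from the repository root, returns the number of records and the top-level field names, which is a quick way to discover the schema before writing an evaluation script.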

⚡️ Quickstart

Basic Installation

# Create a new Conda environment with Python 3.10
conda create -n C3-Bench python=3.10
conda activate C3-Bench

# Clone the C3-Bench repository
git clone https://github.com/Tencent-Hunyuan/C3-Benchmark.git

# Change directory to `c3_bench` inside the cloned repository
cd C3-Benchmark/c3_bench/

# Install the package
pip install -r requirements.txt

⏳ Inference

💾 Test Data

![overall](./picture/example.png)

Address: c3_bench/data/C3-Bench.jsonl

Description: Our test data underwent five rounds of manual inspection and correction by five senior algorithm researchers with years of experience in NLP, CV, and LLMs, taking about one month in total. It is of very high quality and accuracy: the multi-round tasks are tightly connected and increase in difficulty, there is no unusable or invalid data, and the data is consistent with the human distribution. Its evaluation results and conclusions are of great reference value for subsequent optimization in the Agent direction.

Specifically, the data quality optimization work went through the following stages:

1. The initial data was generated using our proposed Multi Agent Data Generation framework, covering all possible action spaces.

2. The test data was then divided according to the four action types we define and manually inspected and corrected by four algorithm researchers. Because LLM-generated tasks tend to be too formal and not colloquial enough, and because true multi-turn tasks are difficult to generate, especially after the second task, we conducted the first round of corrections against two criteria: colloquial phrasing and true multi-turn structure. Notably, when designing the third- and fourth-round tasks, we added tasks that require long-term memory, a true multi-turn type, to increase the difficulty of the test set.

Note: In the actual construction process, the four algorithm researchers adopted a layer-by-layer approach: they first generated one layer of data with the model, then manually inspected and corrected it, before generating and correcting the next layer. If all layers were generated at once, a problem in one layer would require corrections that often affect both the previous and subsequent layers, making it hard to ensure overall correctness and maintain data coherence. Our layer-by-layer construction avoids this and ensures strong logical consistency and close relationships between layers, without any unreasonable trajectories.

3. After the first round of corrections by the four algorithm researchers, one senior expert in the Agent field commented on each piece of data, indicating whether it met the requirements and what problems existed. The four algorithm researchers then made a second round of corrections.

4. After the second round of corrections, we introduced cross-validation, in which the four algorithm researchers inspected and commented on each other's data. The four algorithm researchers and the senior expert in the Agent field then discussed the doubtful data and made a third round of corrections.

5. After the third round of corrections, the senior expert in the Agent field separately conducted a fourth round of inspection and correction on all data to ensure accuracy.

6. Finally, since human corrections might introduce errors, we used code to check for possible parameter type errors and unreasonable dependencies caused by manual operations, and the senior expert made the fifth and final round of corrections.
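
The automated check in step 6 is not shown in this README. As an illustration only, the sketch below shows what such a check could look like. It assumes a JSON-Schema-style tool definition (`parameters.properties`, `parameters.required`) and a hypothetical `depends_on` field listing the indices of earlier calls; none of these names are taken from the C3-Bench data format.

```python
from __future__ import annotations

from typing import Any

# JSON-Schema type names mapped to Python types (assumed schema style).
_JSON_TYPES: dict[str, Any] = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_argument_types(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one call's arguments."""
    errors: list[str] = []
    schema = tool.get("parameters", {})
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required parameter '{name}'")
    for name, value in arguments.items():
        if name not in props:
            errors.append(f"unknown parameter '{name}'")
            continue
        type_name = props[name].get("type")
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # unspecified or unrecognized type: nothing to check
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name in ("integer", "number"):
            ok = False
        else:
            ok = isinstance(value, expected)
        if not ok:
            errors.append(
                f"parameter '{name}' expected {type_name}, got {type(value).__name__}"
            )
    return errors


def check_dependencies(calls: list[dict[str, Any]]) -> list[str]:
    """Flag calls that depend on themselves, on later calls, or on missing calls."""
    errors: list[str] = []
    for i, call in enumerate(calls):
        for dep in call.get("depends_on", []):
            valid_index = isinstance(dep, int) and not isinstance(dep, bool)
            if not valid_index or not 0 <= dep < len(calls):
                errors.append(f"call {i} depends on non-existent call {dep!r}")
            elif dep >= i:
                errors.append(f"call {i} depends on call {dep}, which does not precede it")
    return errors
```

Both functions return messages instead of raising, so a reviewer can collect every issue in a record in one pass before making the final manual correction.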

Through these five rounds of data quality optimization, each piece of data was manually corrected and constructed by multiple algorithm experts, raising the accuracy of our test data from less than 60% initially to 100%. The combination of model generation and multiple rounds of human correction also gives our data excellent diversity and quality.
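
The layer-by-layer construction described in the note above amounts to a simple loop: each new layer is generated only from layers that have already been corrected. The sketch below is a schematic rendering of that idea, not the authors' pipeline. `build_layer_by_layer`, `generate`, and `review` are hypothetical names, with `generate` standing in for the model and `review` for the human correction pass.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


def build_layer_by_layer(
    num_layers: int,
    generate: Callable[[list[T]], T],
    review: Callable[[list[T], T], T],
) -> list[T]:
    """Build a multi-turn trajectory one layer at a time.

    generate(history) proposes the next layer from the corrected layers so far;
    review(history, draft) returns the corrected version of that draft.
    """
    history: list[T] = []
    for _ in range(num_layers):
        draft = generate(history)
        corrected = review(history, draft)
        # Later layers only ever see corrected data, so a fix in one layer
        # never forces a rewrite of the layers that follow it.
        history.append(corrected)
    return history
```

With `num_layers=4`, this corresponds to the four rounds of tasks mentioned above. Generating all layers first and reviewing afterwards would instead require re-checking every later layer whenever an earlier one changes.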

At the same time, compared to other benchmarks such as BFCL, T-EVAL, etc., our test data covers all possible action spaces, and in the second to fourth rounds of true multi-turn tasks, the coverage rate has reached two 100%, which also makes our data distribution very…

