OpenBMB/RaD-Agent
Description: The official implementation of the Rational Decision-Making Agent with Internalized Utility Judgment
Language: Python
Stars: 9
Forks: 4
Open issues: 0
Created: 2024-11-04T12:46:05Z
Pushed: 2024-11-12T02:37:48Z
Default branch: main
Fork: no
Archived: no
README:
📖 Overview
With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon the external performance measure to guide the decision-making process but the reliance on the external performance measure as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently.
In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes.
Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. RaDAgent also reduces costs (ChatGPT API calls), highlighting its effectiveness and efficiency.
Our paper is released here.
⚙️ Environment Setup
This guide explains how to run the RaDAgent method on the different downstream tasks.
First, clone the repo and install the project dependencies:
python -m pip install -e ./
Then, before you start, put your OpenAI API keys in ets_utils.py in the repo.
🚀 Running
To run on the downstream tasks we report in the paper, use the following guides.
🛒 Webshop
Start Server
Run the following commands:
cd run_webshop
python env_server.py
The IP and port can be set in env_server.py.
host = "localhost"  # your host ip
port = 12348  # your host port
Run Benchmark
cd run_webshop
bash run_task.sh
You can set the connection IP and port in run_webshop/webshopping_env_online.py.
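Before launching the benchmark, it can help to confirm that the environment server is actually listening on the configured host and port. The sketch below is a generic TCP check, not part of the repo: `server_is_reachable` is a hypothetical helper name, and the host and port are simply the example values from env_server.py.

```python
import socket


def server_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # Example values from env_server.py; change them to match your setup.
    print(server_is_reachable("localhost", 12348))
```

This only checks that something accepts connections on that port; it does not verify that the process behind it is the WebShop environment server.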
Evaluation
cd run_webshop
bash eval.sh
🌐 RestGPT
Configure the REST API environment (TMDB, Spotify) by following the instructions in RestGPT.
Start Server
Run the following commands:
cd run_restbench
python env_server.py
The IP and port can be set in env_server.py.
host = "localhost"  # your host ip
port = 12348  # your host port
Run Benchmark
cd run_restbench
bash run_task.sh
You can set the connection IP and port in run_restbench/restbench_env.py.
Evaluation
cd run_restbench
python restbench_eval.py
🔢 Game of 24
LLM-Based Baseline
We follow the setting of ToT, which uses ./Downstream_tasks/24.csv and tests on the last 100 problems, the hardest ones in the dataset. Then, you can run the experiment:
python ./test_codes/test_24.py
You need to specify some arguments in that Python file, such as the input and output directories and process_num. This code is not limited to Elo tree search: you can also select the DFS, DFSDT, or BFS method via --method.
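To make the "last 100 problems" selection concrete, here is a minimal sketch of taking the final rows of a CSV file. It is an illustration rather than the code in test_24.py: `last_n_rows` is a hypothetical helper, and no column names of 24.csv are assumed.

```python
import csv
from typing import Iterable


def last_n_rows(lines: Iterable[str], n: int = 100) -> list[dict[str, str]]:
    """Parse CSV text with a header row and return its final n data rows."""
    rows = list(csv.DictReader(lines))
    # rows[-0:] would return everything, so handle n <= 0 explicitly.
    return rows[-n:] if n > 0 else []


# Hypothetical usage against the file named above:
# with open("./Downstream_tasks/24.csv", newline="") as f:
#     test_rows = last_n_rows(f)
```

If the file has fewer than n data rows, the slice simply returns all of them.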
MCTS baseline
We also implement an MCTS baseline based on the traditional UCT tree-search method. It comes in several versions that differ in the get_reward function, and you can choose which one to use. To run the baseline, use the following command:
python ./test_codes/MCTS.py
> MCTS can also find a result for each case, but it uses 100 times more simulations than ETS. We give a comparison in our paper.
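For readers unfamiliar with UCT, the selection rule it relies on can be sketched as follows. This is the textbook UCB1 formula, not the code in MCTS.py: the function names, the `(total_reward, visits)` child representation, and the default exploration constant are assumptions made for illustration, and the repo's get_reward variants are not modeled.

```python
import math


def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """UCB1 score used by UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: list[tuple[float, int]], parent_visits: int) -> int:
    """Return the index of the (total_reward, visits) child with the highest score."""
    scores = [uct_score(r, n, parent_visits) for r, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

A larger `c` favors rarely visited children, while a smaller one favors children with a high average reward.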
🛠 ToolBench
Testing on ToolBench is a little more complex. You need to:
1. Request a ToolServer toolbench key following the guide here. Alternatively, you can build the server locally after getting a toolbench key. Specify the toolbench_key in ets_utils.py.
2. Add the dependency: we use an early version of ToolBench, so you need to unzip assets.zip to extract the dependency.
After setting up the ToolBench environment, you can run the command:
python answer_generation.py
In our experiment, we use this split: ./assetstoolbench_test_data_0925/test_query_ids. You can try different methods via --method, such as DFS, BFS, and ETS. The hyperparameters are explained in the section below.
> Because gpt-3.5-turbo-0613 can no longer be requested from OpenAI, the main experiment was performed in 2023.07, and many RapidAPI servers no longer exist, the scores may not be reproducible today. However, we have kept the original ToolBench test results from our main experiment. If you have any problems reproducing the ToolBench experiment, you can contact yeyn2001@gmail.com
📊 Experimental Results
| Model | Game of 24 | WebShop | ToolBench |
|-----------|------------|---------|-----------|
| CoT | 6.00 | 56.23 | 16.60 |
| CoT@3 | 7.00 | 56.45 | 31.20 |
| Reflexion | 7.00 | 57.21 | 26.60 |
| ToT-BFS | 11.00 | 50.20 | 38.00 |
| ToT-DFS | 14.00 | 55.60 | 45.58 |
| DFSDT | 29.00 | 57.25 | 50.20 |
| RaDAgent | 43.00 | 59.36 | 61.92 |
| Model | TMDB | Spotify |
|---------------------|-------|---------|
| Offline | 33.0 | 36.4 |
| DEPS | 43.0 | 43.8 |
| ReAct | 57.0 | 49.1 |
| Reflexion | 59.0 | 61.4 |
| RestGPT | 79.0 | 74.5 |
| RestGPT(ChatGPT) | 65.0 | 72.3 |
| RaDAgent | 84.0 | 80.7 |
🔎 Hyperparameter Explanation
The method is configured through its name: we parse a formatted method name such as ETS_all-100_annealing_k50_sqrt_s100_f1_t173.72_p0.9_c15_m3_rn1_rg3, mainly by splitting it on "_". The logic is in ./test_codes/test_24.py, lines 340-380.
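As a rough illustration of this naming scheme, the sketch below splits a method name on "_" and separates letter-plus-number tokens (such as k50) from plain flags (such as annealing). It is not the parser in test_24.py: `parse_method_name`, the regular expression, and the returned dictionary layout are all assumptions, and no meaning is attached to the individual keys.

```python
import re

# A token like "k50" or "t173.72": letters followed by an integer or decimal.
_KEY_VALUE = re.compile(r"^([A-Za-z]+)(\d+(?:\.\d+)?)$")


def parse_method_name(name: str) -> dict:
    """Split a method name on "_" into a method, numeric params, and flags."""
    method, *tokens = name.split("_")
    params: dict[str, float] = {}
    flags: list[str] = []
    for token in tokens:
        match = _KEY_VALUE.match(token)
        if match:
            key, value = match.groups()
            params[key] = float(value)
        else:
            flags.append(token)  # e.g. "all-100", "annealing", "sqrt"
    return {"method": method, "params": params, "flags": flags}
```

With the example name above, this yields the method "ETS", a `k` parameter of 50.0, and the flags "all-100", "annealing", and "sqrt".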
For ETS:
- ETS means Elo-based Tree Search, the main method of RaD-Agent
- k50 sets how strongly the Elo score is updated in the update equation: a bigger k updates the Elo score faster. We set this to 50…
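To show how k scales an Elo update, here is a minimal sketch using the standard Elo rating formula with the conventional scale of 400. Whether RaD-Agent uses exactly this form is an assumption: the function names and the `scale` parameter are illustrative and not taken from the repo.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 50.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)  # larger k means a larger step
    return rating_a + delta, rating_b - delta


# Equal ratings, A wins, k=50 -> (25.0, -25.0)
print(elo_update(0.0, 0.0, 1.0))
```

With k=10 the same comparison moves each rating by only 5 points, which is why a larger k makes the scores react faster to each comparison.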