What does this repo signal mean?

OpenBMB (MiniCPM) published OpenBMB/RaD-Agent (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo OpenBMB/RaD-Agent · language Python · Very low stars, routine repo. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

OpenBMB (MiniCPM) Repo: OpenBMB/RaD-Agent

Captured source

GitHub/github.com/OpenBMB/RaD-Agent

OpenBMB/RaD-Agent repository metadata

Source ↗

published Nov 4, 2024 · seen 5d · captured 11h · http 200 · method plain

OpenBMB/RaD-Agent

Description: The official implementation of the Rational Decision-Making Agent with Internalized Utility Judgment

Language: Python

Stars: 9

Forks: 4

Open issues: 0

Created: 2024-11-04T12:46:05Z

Pushed: 2024-11-12T02:37:48Z

Default branch: main

Fork: no

Archived: no

README:

📖 Overview

With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon the external performance measure to guide the decision-making process but the reliance on the external performance measure as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently.

In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes.
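The pairwise comparison idea can be illustrated with a minimal sketch of a textbook Elo update. This is not taken from the repository: the function names, the base-10 logistic with scale 400, and the starting ratings are assumptions chosen for illustration, and K = 50 mirrors the value mentioned in the hyperparameter notes.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that decision A is preferred over decision B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 50.0):
    """Return the updated (rating_a, rating_b) after one pairwise comparison.

    Hypothetical helper: the comparison outcome (a_wins) would come from
    whatever judge compares the two decision steps.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_wins else 0.0
    delta = k * (actual_a - expected_a)
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two decision steps start with equal ratings; A wins the comparison.
    print(elo_update(0.0, 0.0, a_wins=True))
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by K/2 in opposite directions. A larger K makes each comparison move the scores further.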

Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent also can reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.

Our paper is released here.

⚙️ Environment Setup

This guide explains how to run the RaDAgent method on the different downstream tasks.

First, clone the repo and install the project dependencies:

python -m pip install -e ./

Then, before you start, put your OpenAI API keys in ets_utils.py in the repo.

🚀 Running

To run on the downstream tasks we report in the paper, use the following guides.

🛒 Webshop

Start Server

Run the following commands:

cd run_webshop
python env_server.py

The IP and port can be set in env_server.py.

host = "localhost" # your host ip
port = 12348 # your host port
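Before launching the benchmark, it can help to confirm that the environment server is actually listening on the configured host and port. The sketch below is not part of the repository; `server_is_reachable` is a hypothetical helper that uses only the Python standard library, with defaults copied from the values shown above.

```python
import socket


def server_is_reachable(host: str = "localhost", port: int = 12348, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("env server reachable:", server_is_reachable())
```

This only checks that something accepts TCP connections on the port; it does not verify that the process behind it is the environment server.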

Run Benchmark

cd run_webshop
bash run_task.sh

You can set the connection IP and port in run_webshop\webshopping_env_online.py.

Evaluation

cd run_webshop
bash eval.sh

🌐 RestGPT

Configure the REST API environment (TMDB, Spotify) by following the instructions in RestGPT.

Start Server

Run the following commands:

cd run_restbench
python env_server.py

The IP and port can be set in env_server.py.

host = "localhost" # your host ip
port = 12348 # your host port

Run Benchmark

cd run_restbench
bash run_task.sh

You can set the connection IP and port in run_restbench\restbench_env.py.

Evaluation

cd run_restbench
python restbench_eval.py

🔢 Game of 24

LLM-Based Baseline

We follow the setting of ToT, which uses ./Downstream_tasks/24.csv, and test on the last 100 problems, which are the hardest in the dataset. Then, you can run the experiment:

python ./test_codes/test_24.py

You need to specify some arguments in that Python file, such as the input and output directories and process_num. This code is not only for Elo tree search; you can also select the DFS, DFSDT, or BFS method through the --method argument.
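To make the task concrete: a Game of 24 answer is correct when the expression uses each of the four given numbers exactly once, combines them with +, -, * and /, and evaluates to 24. The checker below is a minimal sketch written for illustration, not the evaluation code from the repository; `solves_24` and its helper are hypothetical names. It uses exact fractions so that answers relying on division are not lost to floating-point rounding.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Only the four arithmetic operators allowed in the game.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording every integer literal it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def solves_24(numbers, expression):
    """Return True if the expression uses exactly the given numbers and equals 24."""
    used = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (ValueError, ZeroDivisionError, SyntaxError):
        return False
    return Counter(used) == Counter(numbers) and value == 24


if __name__ == "__main__":
    print(solves_24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))
    print(solves_24([3, 3, 8, 8], "8 / (3 - 8 / 3)"))
```

Parsing with `ast` instead of calling `eval` keeps the checker restricted to plain arithmetic on integer literals.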

MCTS baseline

We also implement an MCTS baseline based on the traditional UCT tree-search method. It has several versions that differ in the get_reward function, and you can specify which one to use. To run the baseline, use the following command:

python ./test_codes/MCTS.py

> MCTS can also find a result for each case, but it uses 100 times more simulations than ETS. We give a comparison in our paper.
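For readers unfamiliar with UCT, the selection rule it is usually built on (UCB1) can be sketched as follows. This is the textbook formula, not code from MCTS.py; the function names, the tuple layout for child statistics, and the default exploration constant are assumptions made for illustration.

```python
import math


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score: average reward plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits: int, c: float = math.sqrt(2)) -> int:
    """Return the index of the child with the highest UCT score.

    children is a list of (total_reward, visits) tuples (a hypothetical layout).
    """
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )


if __name__ == "__main__":
    stats = [(9.0, 10), (1.0, 10), (0.0, 0)]
    print(select_child(stats, parent_visits=20))
```

Each simulation needs a reward to back up the tree, which is where a get_reward function such as the one mentioned above would plug in.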

🛠 ToolBench

Testing on ToolBench is a little more complex. You need to:

1. Request a ToolServer toolbench key following the guide here; you can also build it locally. After getting a toolbench key, specify the toolbench_key in ets_utils.py.
2. Add the dependency: we use an early version of ToolBench, so you need to unzip assets.zip to get the dependency.

After setting up toolbench env, you can run the commands:

python answer_generation.py

In our experiment, we use this split: ./assetstoolbench_test_data_0925/test_query_ids. You can try different methods through the --method argument, such as DFS, BFS, and ETS. The hyperparameters are explained in the following part.

> Because gpt-3.5-turbo-0613 can no longer be requested from OpenAI, the main experiment was performed in 2023.07, and many Rapid-API servers no longer exist, the scores may not be reproducible today. However, we have kept the original ToolBench test results from our main experiment. If you have any problems reproducing the ToolBench experiment, you can contact yeyn2001@gmail.com

📊 Experimental Results

| Model | Game of 24 | WebShop | ToolBench |
|-----------|------------|---------|-----------|
| CoT | 6.00 | 56.23 | 16.60 |
| CoT@3 | 7.00 | 56.45 | 31.20 |
| Reflexion | 7.00 | 57.21 | 26.60 |
| ToT-BFS | 11.00 | 50.20 | 38.00 |
| ToT-DFS | 14.00 | 55.60 | 45.58 |
| DFSDT | 29.00 | 57.25 | 50.20 |
| RaDAgent | 43.00 | 59.36 | 61.92 |

| Model | TMDB | Spotify |
|---------------------|-------|---------|
| Offline | 33.0 | 36.4 |
| DEPS | 43.0 | 43.8 |
| ReAct | 57.0 | 49.1 |
| Reflexion | 59.0 | 61.4 |
| RestGPT | 79.0 | 74.5 |
| RestGPT(ChatGPT) | 65.0 | 72.3 |
| RaDAgent | 84.0 | 80.7 |

🔎 Hyperparameter Explanation

The hyperparameters are encoded in the method name, which the code parses. An ETS method name looks like ETS_all-100_annealing_k50_sqrt_s100_f1_t173.72_p0.9_c15_m3_rn1_rg3. The name is mainly split by "_"; the logic is in ./test_codes/test_24.py, lines 340-380.
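Splitting a name by "_" can be sketched as below. This is a generic illustration, not the logic from test_24.py: `parse_method_name` is a hypothetical helper, and it only separates tokens of the form letters-plus-number from everything else, without assigning any meaning to the individual keys.

```python
import re

# A parameter token is a run of letters followed by an integer or decimal number, e.g. "k50" or "t173.72".
_PARAM = re.compile(r"^([A-Za-z]+)(\d+(?:\.\d+)?)$")


def parse_method_name(name: str) -> dict:
    """Split a method name on "_" into the method, numeric parameters, and remaining flags."""
    method, *tokens = name.split("_")
    params, flags = {}, []
    for token in tokens:
        match = _PARAM.match(token)
        if match:
            params[match.group(1)] = float(match.group(2))
        else:
            # Tokens such as "annealing" or "all-100" are kept verbatim.
            flags.append(token)
    return {"method": method, "params": params, "flags": flags}


if __name__ == "__main__":
    example = "ETS_all-100_annealing_k50_sqrt_s100_f1_t173.72_p0.9_c15_m3_rn1_rg3"
    print(parse_method_name(example))
```

For the example name, this yields the method ETS, the flags all-100, annealing and sqrt, and numeric values for k, s, f, t, p, c, m, rn and rg.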

For ETS:

ETS stands for Elo-based Tree Search, the main method of RaD-Agent.
K50: K controls how the Elo score is updated in the equation. A bigger K updates the Elo score faster. We set this to 50…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Very low stars, routine repo