Repo: SambaNova Systems, published May 19, 2023

sambanova/toolbench

Python

Description: ToolBench, an evaluation suite for LLM tool manipulation capabilities.

Language: Python

License: Apache-2.0

Stars: 179

Forks: 11

Open issues: 1

Created: 2023-05-19T16:23:35Z

Pushed: 2024-02-28T20:07:35Z

Default branch: main

Fork: no

Archived: no

README:

ToolBench

Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs (e.g. OpenAI), because there is a significant accuracy gap between those closed models and open-source LLMs. To study the root cause of this gap and to facilitate the development of open-source LLMs, especially their tool manipulation capabilities, we created ToolBench, a benchmark consisting of diverse software tools for real-world tasks. This repository also provides easy-to-use infrastructure to directly evaluate the execution success rate of each model. Contributions to this repo are very welcome! We are excited to see new action generation algorithms and new testing tasks.

Table of contents

  • [Prerequisites](#prerequisites)
  • [Installation](#installation)
  • [Usage](#usage)
  • [Tasks](#tasks)
  • [Available Checkpoints](#checkpoints)

Prerequisites

Credentials

  • Create an OpenAI account and register an API key.
  • Follow this guide to create a Google Cloud service account and create credentials for the account. Enable Google Sheets API and Google Drive API for the credentials.
  • Create an account for OpenWeather and register an API key.
  • Register an API key for The Cat API.

After registration, update your credentials in credential.sh and load them into your shell:

source credential.sh
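
Before launching an evaluation, it can help to confirm that the credentials were actually exported into the current environment. The following is a minimal sketch, not part of the repository. The variable names in REQUIRED_VARS are hypothetical placeholders; check credential.sh for the names it really exports.

```python
import os

# Hypothetical names for illustration only; replace them with the
# variables that credential.sh actually exports.
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENWEATHER_API_KEY"]


def missing_credentials(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials(REQUIRED_VARS)
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All listed credentials are set.")
```

Run it in the same shell where credential.sh was sourced, since exported variables are only visible to processes started from that shell.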

Software

Check that Java and conda are installed:

java -version
conda --version
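
The same check can be scripted. This is a small sketch that assumes both tools are expected to be discoverable as executables on PATH; a conda that is available only as a shell function would not be found this way. The function name missing_tools is made up for this example.

```python
import shutil


def missing_tools(tools=("java", "conda"), which=shutil.which):
    """Return the tools that cannot be located as executables on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("java and conda were both found.")
```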

Installation

  • Create and activate a virtual environment
conda create --prefix ./venv python=3.8.13
conda activate ./venv
  • Download resources
sh download_resources.sh

Press Enter on all the questions. This process may take about 15 minutes.

  • Install the package in editable mode
pip install -e .
  • Make sure everything's good!
pytest tests

Installation FAQ

  • Permission denied: '/tmp/tika.log'
# If you are sharing your machine with someone else, please set
mkdir -p "/tmp/$USER" && export TIKA_LOG_PATH="/tmp/$USER"
  • Unable to find libjvm.so
export JAVA_HOME=
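
The value to use for JAVA_HOME depends on where Java is installed on your machine. As a rough aid, the sketch below searches a directory tree for libjvm.so so you can see which Java installation contains it. The helper name find_libjvm and the search root are assumptions of this example, and the location of libjvm.so relative to JAVA_HOME differs between Java versions, so treat the result as a hint only.

```python
from pathlib import Path


def find_libjvm(root):
    """Return all paths named libjvm.so found beneath root, sorted."""
    return sorted(Path(root).rglob("libjvm.so"))


if __name__ == "__main__":
    # /usr/lib/jvm is a common install location on Linux, used here
    # only as an illustrative default.
    for path in find_libjvm("/usr/lib/jvm"):
        print(path)
```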

Usage

This repository evaluates the API function call success rate on the following tools from the ToolBench:

  1. OpenWeather
  2. The Cat API
  3. Home Search (similar to Redfin)
  4. Trip Booking (similar to booking.com)
  5. Google Sheet
  6. VirtualHome
  7. Webshop
  8. Tabletop

One can kick off an evaluation job with test.py on any combination of tools, models, number of APIs to retrieve, and number of demonstration examples to place in the prompt. Here is an example of evaluating text-davinci-003 on the open_weather task.

python test.py \
--task 'open_weather' --version 'v0' \
--top_k_api 10 --top_k_example 3 --num_test_samples -1 \
--client_name "openai" --model_name 'text-davinci-003' --max_output_token 128
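
The --top_k_api and --top_k_example flags set how many API descriptions and demonstration examples are retrieved into the prompt. How the repository ranks candidates is not described here, so the sketch below uses plain word overlap as a stand-in scoring function. All function names, the prompt layout, and the choice to treat a value of 0 as "include none" are assumptions made for illustration.

```python
def overlap_score(query, text):
    """Count distinct lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_k(query, candidates, k):
    """Return the k candidates with the highest overlap score.

    Ties keep their original order; k <= 0 returns an empty list.
    """
    if k <= 0:
        return []
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:k]


def build_prompt(query, api_docs, examples, top_k_api, top_k_example):
    """Assemble a prompt from retrieved API docs, retrieved examples, and the query."""
    parts = top_k(query, api_docs, top_k_api) + top_k(query, examples, top_k_example)
    parts.append(f"Task: {query}\nAction:")
    return "\n\n".join(parts)
```

A stronger retriever can replace overlap_score without changing the rest of the sketch.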

All results are written to --out_dir, which defaults to out/. A cache is also created for each client_name and model_name combination, stored as a SQLite database. When the same LM is queried again with a prompt it has already seen, the answer is retrieved directly from the cache without running LM inference again.
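
To illustrate the caching idea, here is a minimal sketch of a prompt-keyed cache backed by SQLite. It is not the repository's implementation: the class name PromptCache, the table schema, and the cached_completion helper are all assumptions made for this example.

```python
import sqlite3


class PromptCache:
    """A tiny prompt -> response store; one instance per client/model pair."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, prompt):
        """Return the cached response for prompt, or None if absent."""
        row = self.conn.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, response):
        """Store or overwrite the response for prompt."""
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()


def cached_completion(cache, prompt, run_model):
    """Return the cached answer if present; otherwise call run_model and store it."""
    hit = cache.get(prompt)
    if hit is not None:
        return hit
    response = run_model(prompt)
    cache.put(prompt, response)
    return response
```

Keeping a separate database per client and model, as the README describes, prevents one model's answers from being returned for another model that receives the same prompt.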

More examples can be found below:

Evaluation of OpenAI Models

python test.py --task 'open_weather' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'the_cat_api' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 3 --top_k_example 3 --num_test_samples -1
python test.py --task 'virtual_home' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'home_search' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 15 --top_k_example 3 --num_test_samples -1
python test.py --task 'booking' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 300 --top_k_api 15 --top_k_example 3 --num_test_samples -1
python test.py --task 'google_sheets' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 256 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'web_shop' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'web_shop' --version 'v1' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'code_as_policies_tabletop' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 256 --top_k_api 0 --top_k_example 0 --num_test_samples -1
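
The commands above differ only in the task, version, output token limit, and retrieval settings, so they can be generated from a small table. The sketch below does that. The settings are copied from the commands listed above, while the names RUNS and build_command are made up for this example.

```python
import shlex

# (task, version, max_output_token, top_k_api, top_k_example),
# copied from the OpenAI evaluation commands in this README.
RUNS = [
    ("open_weather", "v0", 128, 10, 3),
    ("the_cat_api", "v0", 128, 3, 3),
    ("virtual_home", "v0", 128, 10, 3),
    ("home_search", "v0", 128, 15, 3),
    ("booking", "v0", 300, 15, 3),
    ("google_sheets", "v0", 256, 0, 3),
    ("web_shop", "v0", 128, 0, 3),
    ("web_shop", "v1", 128, 0, 3),
    ("code_as_policies_tabletop", "v0", 256, 0, 0),
]


def build_command(task, version, max_output_token, top_k_api, top_k_example,
                  client_name="openai", model_name="text-davinci-003",
                  num_test_samples=-1):
    """Build the argument list for one test.py invocation."""
    return [
        "python", "test.py",
        "--task", task,
        "--version", version,
        "--client_name", client_name,
        "--model_name", model_name,
        "--max_output_token", str(max_output_token),
        "--top_k_api", str(top_k_api),
        "--top_k_example", str(top_k_example),
        "--num_test_samples", str(num_test_samples),
    ]


if __name__ == "__main__":
    # Print the commands only; nothing is executed here.
    for run in RUNS:
        print(shlex.join(build_command(*run)))
```

Each argument list can also be passed to subprocess.run(cmd, check=True) to execute the jobs one after another.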

Evaluation of HuggingFace Models Locally

  • To host a model on a server, independently of this repo, follow the manifest instructions.
  • Find the IP address and port in the server's startup output, and plug them into the following commands.
python test.py --task 'open_weather' --version 'v0' --client_name "huggingface" --model 'facebook/opt-iml-30b' --client_connection 'http://10.10.1.98:5000' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'the_cat_api' --version 'v0' --client_name "huggingface" --model 'facebook/opt-iml-30b' --client_connection 'http://10.10.1.98:5000' --max_output_token 128 --top_k_api 3 --top_k_example 3 --num_test_samples -1
python test.py --task 'virtual_home' --version 'v0'…
