What does this repo signal mean?

SambaNova Systems published sambanova/toolbench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo sambanova/toolbench · language Python · New tool/benchmark repo with moderate traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

SambaNova Systems Repo: sambanova/toolbench

Captured source

GitHub/github.com/sambanova/toolbench

sambanova/toolbench repository metadata

Source ↗

published May 19, 2023 · seen Jun 5 · captured Jun 11 · http 200 · method plain

sambanova/toolbench

Description: ToolBench, an evaluation suite for LLM tool manipulation capabilities.

Language: Python

License: Apache-2.0

Stars: 179

Forks: 11

Open issues: 1

Created: 2023-05-19T16:23:35Z

Pushed: 2024-02-28T20:07:35Z

Default branch: main

Fork: no

Archived: no

README:

ToolBench

Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs (e.g. OpenAI), as there is a significant accuracy gap between those closed models and open-source LLMs. To study the root cause of the gap and further facilitate the development of open-source LLMs, especially their tool manipulation capabilities, we created ToolBench. ToolBench is a benchmark consisting of diverse software tools for real-world tasks. We also provide easy-to-use infrastructure in this repository to directly evaluate the execution success rate of each model. Contributions to this repo are very welcome! We are excited to see new action generation algorithms and new testing tasks.

[Prerequisites](#prerequisites)
[Installation](#installation)
[Usage](#usage)
[Tasks](#tasks)
[Available Checkpoints](#checkpoints)

Prerequisites

Credentials

Create an OpenAI account and register an API key.
Follow this guide to create a Google Cloud service account and create credentials for the account. Enable Google Sheets API and Google Drive API for the credentials.
Create an account for OpenWeather and register an API key.
Register an API key for The Cat API.

After registration, update your credentials in credential.sh, then load them into your shell:

source credential.sh

Software

Conda (anaconda)
Java >= 11.0.13

java -version
conda --version
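The version check above can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `parse_java_version`, `java_is_new_enough`, and `check_installed_java` are hypothetical, and the parser assumes the usual banner format in which the version appears in double quotes after the word `version`.

```python
import re
import subprocess

# Minimum Java version listed in the prerequisites above.
MIN_JAVA = (11, 0, 13)


def parse_java_version(banner):
    """Extract (major, minor, patch) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("no version string found in banner")
    # Missing components (e.g. a bare "17") are treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def java_is_new_enough(banner, minimum=MIN_JAVA):
    """Compare the parsed version against the minimum, component by component."""
    return parse_java_version(banner) >= minimum


def check_installed_java():
    """Run `java -version` and report whether it meets the minimum.

    The JVM writes this banner to stderr rather than stdout.
    """
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_is_new_enough(result.stderr)
```

Calling `check_installed_java()` in the activated environment returns `True` when the installed Java meets the requirement. Legacy banners such as `1.8.0_292` parse as major version 1 and are therefore reported as too old.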

Installation

Activate a virtual environment

conda create --prefix ./venv python=3.8.13
conda activate ./venv

Download resources

sh download_resources.sh

Press Enter on all the questions. This process may take about 15 minutes.

Installation

pip install -e .

Make sure everything's good!

pytest tests

Installation FAQ

Permission denied: '/tmp/tika.log'

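# If you are sharing your machine with someone else, use a per-user log directory.
# -p keeps mkdir from failing (and skipping the export) when the directory already exists.
mkdir -p "/tmp/$USER" && export TIKA_LOG_PATH="/tmp/$USER"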

Unable to find libjvm.so

export JAVA_HOME=

Usage

This repository evaluates the API function call success rate on the following tools from the ToolBench:

1. OpenWeather
2. The Cat API
3. Home Search (similar to Redfin)
4. Trip Booking (similar to booking.com)
5. Google Sheet
6. VirtualHome
7. Webshop
8. Tabletop

You can launch an evaluation job with test.py on any combination of tools, models, number of APIs to retrieve, and number of demonstration examples to place in the prompt. Here is an example of evaluating text-davinci-003 on the open_weather task.

python test.py \
--task 'open_weather' --version 'v0' \
--top_k_api 10 --top_k_example 3 --num_test_samples -1 \
--client_name "openai" --model_name 'text-davinci-003' --max_output_token 128
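To make the roles of `--top_k_api` and `--top_k_example` more concrete, here is a rough sketch of retrieving the k most relevant API descriptions and demonstrations and placing them in a prompt. It is an illustration only: the excerpt does not show ToolBench's actual retriever or prompt template, so the token-overlap scoring, the section labels, the function names, and the treatment of `k = 0` as "include nothing" are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def top_k(query, candidates, k):
    """Return the k candidates sharing the most tokens with the query.

    Assumption: k <= 0 means nothing is retrieved. Ties keep input order
    because Python's sort is stable, including with reverse=True.
    """
    if k <= 0:
        return []
    query_tokens = tokenize(query)
    ranked = sorted(
        candidates,
        key=lambda text: len(query_tokens & tokenize(text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, api_docs, examples, top_k_api, top_k_example):
    """Assemble a prompt from retrieved API docs, demonstrations, and the task."""
    sections = []
    apis = top_k(query, api_docs, top_k_api)
    demos = top_k(query, examples, top_k_example)
    if apis:
        sections.append("APIs:\n" + "\n".join(apis))
    if demos:
        sections.append("Examples:\n" + "\n".join(demos))
    sections.append("Task: " + query)
    return "\n\n".join(sections)
```

With this shape, raising `top_k_api` trades prompt length for a better chance that the needed API description is visible to the model.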

All results are logged to the directory given by --out_dir, which defaults to out/. A cache is also created for each client_name and model_name combination as a SQLite database. When you query a given LM with a prompt it has already seen, the answer is retrieved directly from the cache without running LM inference again.
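The caching behavior described here can be sketched with Python's built-in `sqlite3` module. This is not the repository's implementation: the `PromptCache` class, the table layout, and the `cache_path` file naming are hypothetical and only illustrate the idea of one database per client and model pair, keyed by prompt.

```python
import sqlite3


def cache_path(client_name, model_name):
    """Hypothetical file name: one SQLite database per client/model pair."""
    return f"{client_name}_{model_name.replace('/', '_')}.sqlite"


class PromptCache:
    """Store model responses keyed by the exact prompt text."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get_or_compute(self, prompt, run_model):
        """Return the cached response, calling run_model only on a miss."""
        row = self.conn.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row is not None:
            return row[0]
        response = run_model(prompt)
        self.conn.execute(
            "INSERT INTO cache (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()
        return response
```

A repeated call such as `cache.get_or_compute(prompt, run_model)` with the same prompt returns the stored answer without invoking `run_model` again, which is what makes re-running an evaluation cheap.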

More examples can be found below:

Evaluation of OpenAI Models

python test.py --task 'open_weather' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'the_cat_api' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 3 --top_k_example 3 --num_test_samples -1
python test.py --task 'virtual_home' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'home_search' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 15 --top_k_example 3 --num_test_samples -1
python test.py --task 'booking' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 300 --top_k_api 15 --top_k_example 3 --num_test_samples -1
python test.py --task 'google_sheets' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 256 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'web_shop' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'web_shop' --version 'v1' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 128 --top_k_api 0 --top_k_example 3 --num_test_samples -1
python test.py --task 'code_as_policies_tabletop' --version 'v0' --client_name "openai" --model_name 'text-davinci-003' --max_output_token 256 --top_k_api 0 --top_k_example 0 --num_test_samples -1
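The commands above differ only in a few per-task settings, so they can be generated from a small table. The sketch below is a convenience script, not something shipped with the repository: the `TASKS` table is copied from the commands listed above, while `build_command` and `all_commands` are hypothetical helper names. It uses only flags that appear in those commands.

```python
import shlex

# (task, version, max_output_token, top_k_api, top_k_example),
# copied from the OpenAI evaluation commands listed above.
TASKS = [
    ("open_weather", "v0", 128, 10, 3),
    ("the_cat_api", "v0", 128, 3, 3),
    ("virtual_home", "v0", 128, 10, 3),
    ("home_search", "v0", 128, 15, 3),
    ("booking", "v0", 300, 15, 3),
    ("google_sheets", "v0", 256, 0, 3),
    ("web_shop", "v0", 128, 0, 3),
    ("web_shop", "v1", 128, 0, 3),
    ("code_as_policies_tabletop", "v0", 256, 0, 0),
]


def build_command(task, version, max_output_token, top_k_api, top_k_example,
                  client_name="openai", model_name="text-davinci-003"):
    """Return the argument list for one test.py invocation."""
    return [
        "python", "test.py",
        "--task", task,
        "--version", version,
        "--client_name", client_name,
        "--model_name", model_name,
        "--max_output_token", str(max_output_token),
        "--top_k_api", str(top_k_api),
        "--top_k_example", str(top_k_example),
        "--num_test_samples", "-1",
    ]


def all_commands(client_name="openai", model_name="text-davinci-003"):
    """Return one shell-quoted command string per row of TASKS."""
    return [
        shlex.join(build_command(*row, client_name=client_name,
                                 model_name=model_name))
        for row in TASKS
    ]
```

Printing each entry of `all_commands()` gives a list that can be pasted into a shell. The argument lists from `build_command` can also be passed directly to `subprocess.run` from the repository root.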

Evaluation of HuggingFace Models Locally

To host a model on a server, independently of this repo, follow the manifest instructions.
Find the IP address and port in the output of the manifest server commands, and plug them into the following commands.

python test.py --task 'open_weather' --version 'v0' --client_name "huggingface" --model 'facebook/opt-iml-30b' --client_connection 'http://10.10.1.98:5000' --max_output_token 128 --top_k_api 10 --top_k_example 3 --num_test_samples -1
python test.py --task 'the_cat_api' --version 'v0' --client_name "huggingface" --model 'facebook/opt-iml-30b' --client_connection 'http://10.10.1.98:5000' --max_output_token 128 --top_k_api 3 --top_k_example 3 --num_test_samples -1
python test.py --task 'virtual_home' --version 'v0'...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New tool/benchmark repo with moderate traction.