Fork by Arcee AI, published Oct 24, 2024

arcee-ai/optillm-upstream

forked from algorithmicsuperintelligence/optillm


Description: Optimizing inference proxy for LLMs

Language: Python

License: Apache-2.0

Stars: 2

Forks: 0

Open issues: 0

Created: 2024-10-24T23:39:59Z

Pushed: 2024-10-25T06:53:30Z

Default branch: main

Fork: yes

Parent repository: algorithmicsuperintelligence/optillm

Archived: no

README:

optillm

optillm is an OpenAI API-compatible optimizing inference proxy that implements several state-of-the-art techniques for improving the accuracy and performance of LLMs. The current focus is on techniques that improve reasoning over coding, logical and mathematical queries. By spending additional compute at inference time, these techniques make it possible to beat the frontier models across diverse tasks.

[Open in Spaces](https://huggingface.co/spaces/codelion/optillm) [Open In Colab](https://colab.research.google.com/drive/1SpuUb8d9xAoTh32M-9wJsB50AOH54EaH?usp=sharing)

Installation

Using pip

pip install optillm
optillm
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

Install from source

Clone the repository with git and use pip install to set up the dependencies.

git clone https://github.com/codelion/optillm.git
cd optillm
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set up credentials in one of the following ways:

  • For OpenAI, set the OPENAI_API_KEY environment variable.
  • For Azure OpenAI, set the AZURE_OPENAI_API_KEY, AZURE_API_VERSION and AZURE_API_BASE environment variables.
  • For Azure OpenAI with managed identity, set the AZURE_API_VERSION and AZURE_API_BASE environment variables and log in using az login (see here).
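
The credential options above can be checked before launching the proxy. The following is a minimal sketch, not part of optillm: the function name `detect_provider`, its return labels, and the order in which the options are checked are all assumptions made for illustration. It only inspects the environment variable names listed above.

```python
import os
from typing import Mapping, Optional


def detect_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Report which credential setup the environment satisfies.

    Hypothetical helper for illustration; optillm's own startup
    logic may check these variables differently.
    """
    env = os.environ if env is None else env
    has_azure_endpoint = bool(env.get("AZURE_API_VERSION")) and bool(
        env.get("AZURE_API_BASE")
    )
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if has_azure_endpoint and env.get("AZURE_OPENAI_API_KEY"):
        return "azure-key"
    if has_azure_endpoint:
        # No key present: this path assumes `az login` was run beforehand.
        return "azure-managed-identity"
    return "none"


if __name__ == "__main__":
    print(detect_provider())
```

Passing a plain dictionary instead of reading `os.environ` makes the check easy to exercise in isolation.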

You can then run the optillm proxy as follows.

python optillm.py
2024-09-06 07:57:14,191 - INFO - Starting server with approach: auto
2024-09-06 07:57:14,191 - INFO - Server configuration: {'approach': 'auto', 'mcts_simulations': 2, 'mcts_exploration': 0.2, 'mcts_depth': 1, 'best_of_n': 3, 'model': 'gpt-4o-mini', 'rstar_max_depth': 3, 'rstar_num_rollouts': 5, 'rstar_c': 1.4, 'base_url': ''}
* Serving Flask app 'optillm'
* Debug mode: off
2024-09-06 07:57:14,212 - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:8000
* Running on http://192.168.10.48:8000
2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit

Starting the optillm proxy for a local server (e.g. llama.cpp)

  • Set the OPENAI_API_KEY env variable to a placeholder value
  • e.g. export OPENAI_API_KEY="no_key"
  • Run ./llama-server -c 4096 -m path_to_model to start the server with the specified model and a context length of 4096 tokens
  • Run python3 optillm.py --base_url base_url to start the proxy
  • e.g. for llama.cpp, run python3 optillm.py --base_url http://localhost:8080/v1

> [!WARNING]
> Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following: cot_reflection, leap, plansearch, rstar, rto, self_consistency, re2, and z3.

> [!NOTE]
> You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
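
When the proxy fronts llama-server, it can help to validate an approach slug against the list in the warning above before sending a request. This is a minimal sketch that assumes the {slug}-model-name convention described under Usage. The names `LLAMA_SERVER_APPROACHES` and `model_name_for` are hypothetical, and `local-model` is a placeholder, since any model name is accepted in this setup.

```python
# Approaches listed in the warning above as usable with llama-server.
LLAMA_SERVER_APPROACHES = frozenset(
    {
        "cot_reflection",
        "leap",
        "plansearch",
        "rstar",
        "rto",
        "self_consistency",
        "re2",
        "z3",
    }
)


def model_name_for(approach: str, model: str = "local-model") -> str:
    """Return '{approach}-{model}', rejecting unsupported approaches.

    Hypothetical helper; the model name itself is a placeholder.
    """
    if approach not in LLAMA_SERVER_APPROACHES:
        raise ValueError(
            f"approach {approach!r} is not in the llama-server list"
        )
    return f"{approach}-{model}"
```

The returned string can then be passed as the model argument of the OpenAI client call.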

Usage

Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the base_url to http://localhost:8000/v1.

import os
from openai import OpenAI

# Point the standard OpenAI client at the local optillm proxy.
OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_BASE_URL = "http://localhost:8000/v1"
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)

response = client.chat.completions.create(
    # The "moa-" prefix selects the mixture of agents approach.
    model="moa-gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Write a Python program to build an RL model to recite text from any position that the user provides, using only numpy."
        }
    ],
    temperature=0.2
)

print(response)

The code above applies to both OpenAI and Azure OpenAI; just remember to populate the OPENAI_API_KEY env variable with the proper key. There are multiple ways to control the optimization techniques. They are applied in the following order of preference:

  • You can select the optimization technique by prepending its slug to the model name, in the form {slug}-model-name. For example, the code above uses moa (mixture of agents) as the optimization approach. The proxy logs then show that moa is being used with gpt-4o-mini as the base model:
2024-09-06 08:35:32,597 - INFO - Using approach moa, with gpt-4o-mini
2024-09-06 08:35:35,358 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:39,553 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:44,795 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:44,797 - INFO - 127.0.0.1 - - [06/Sep/2024 08:35:44] "POST /v1/chat/completions HTTP/1.1" 200 -
  • Or, you can pass the slug in the optillm_approach field in the extra_body.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{ "role": "user","content": "" }],
    temperature=0.2,
    extra_body={"optillm_approach": "bon|moa|mcts"}
)
  • Or, you can just mention the approach in either your system or user prompt, within <optillm_approach> </optillm_approach> tags.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{ "role": "user","content": "<optillm_approach>re2</optillm_approach> How many r's are there in strawberry?" }],
    temperature=0.2
)
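
The three selection methods in the list above can be compared side by side by building the request arguments without sending them. This is a rough sketch: `build_request` and its `how` parameter are hypothetical names, and the tag form assumes the prompt tag is named after the optillm_approach field used in extra_body.

```python
from typing import Any, Dict


def build_request(
    model: str, prompt: str, approach: str, how: str = "prefix"
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper for illustration only.
    """
    request: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    if how == "prefix":
        # Method 1: prepend the slug to the model name.
        request["model"] = f"{approach}-{model}"
    elif how == "extra_body":
        # Method 2: pass the slug in the optillm_approach field.
        request["extra_body"] = {"optillm_approach": approach}
    elif how == "tag":
        # Method 3: mention the approach in the prompt (assumed tag name).
        tagged = f"<optillm_approach>{approach}</optillm_approach> {prompt}"
        request["messages"] = [{"role": "user", "content": tagged}]
    else:
        raise ValueError(f"unknown selection method: {how!r}")
    return request
```

With a client configured as shown earlier, the result can be unpacked directly, for example `client.chat.completions.create(**build_request("gpt-4o-mini", "How many r's are there in strawberry?", "re2", how="tag"))`.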

> [!TIP]
> You can also combine techniques using the symbols & and |. With &, the techniques are processed from left to right as a pipeline, with the response from each stage used as the request to the next. With |, all the requests run in parallel and the multiple responses are returned as a list.
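
To make the two operators concrete, the sketch below splits a combined approach string into a mode and a list of slugs. It is an illustration of the convention described in the tip, not optillm's actual parser. The function name `parse_combined_approach` and the mode labels are hypothetical, and rejecting strings that mix both symbols is an assumption of this sketch.

```python
from typing import List, Tuple


def parse_combined_approach(spec: str) -> Tuple[str, List[str]]:
    """Split a combined approach string into (mode, slugs).

    Hypothetical helper: '&' means a left-to-right pipeline,
    '|' means parallel execution, per the tip above.
    """
    if "&" in spec and "|" in spec:
        # Mixed strings are out of scope for this sketch.
        raise ValueError("mixing '&' and '|' is not handled here")
    if "&" in spec:
        return "pipeline", [part.strip() for part in spec.split("&")]
    if "|" in spec:
        return "parallel", [part.strip() for part in spec.split("|")]
    return "single", [spec.strip()]


if __name__ == "__main__":
    print(parse_combined_approach("bon|moa|mcts"))
    print(parse_combined_approach("cot_reflection&moa"))
```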

Please note that the convention described above works only when the optillm server has been started with inference approach set to auto. Otherwise, the model attribute in the…
