arcee-ai/optillm-upstream
forked from algorithmicsuperintelligence/optillm
Description: Optimizing inference proxy for LLMs
Language: Python
License: Apache-2.0
Stars: 2
Forks: 0
Open issues: 0
Created: 2024-10-24T23:39:59Z
Pushed: 2024-10-25T06:53:30Z
Default branch: main
Fork: yes
Parent repository: algorithmicsuperintelligence/optillm
Archived: no
README:
optillm
optillm is an OpenAI API-compatible optimizing inference proxy that implements several state-of-the-art techniques for improving the accuracy and performance of LLMs. The current focus is on techniques that improve reasoning over coding, logical and mathematical queries. By spending additional compute at inference time, these techniques make it possible to beat frontier models across diverse tasks.
 
Installation
Using pip
pip install optillm
optillm
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto
Install from source
Clone the repository with git and use pip install to set up the dependencies.
git clone https://github.com/codelion/optillm.git
cd optillm
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Set up credentials in one of three ways: the OPENAI_API_KEY environment variable (for OpenAI); the AZURE_OPENAI_API_KEY, AZURE_API_VERSION and AZURE_API_BASE environment variables (for Azure OpenAI); or the AZURE_API_VERSION and AZURE_API_BASE environment variables plus a login via az login (for Azure OpenAI with managed identity).
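The three credential setups can be told apart by which variables are present. The following is a minimal sketch of such a check, not optillm's own startup logic: the function name detect_credentials and the returned labels are hypothetical, and the order in which the cases are tested is an assumption.

```python
import os
from typing import Mapping, Optional


def detect_credentials(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a label for the credential setup found in env, or None.

    Hypothetical helper for illustration; the precedence is an assumption.
    """
    if env.get("OPENAI_API_KEY"):
        return "openai"
    azure_base = env.get("AZURE_API_VERSION") and env.get("AZURE_API_BASE")
    if azure_base and env.get("AZURE_OPENAI_API_KEY"):
        return "azure_api_key"
    if azure_base:
        # No key set: credentials are expected to come from `az login`.
        return "azure_managed_identity"
    return None
```

Calling detect_credentials() with no arguments inspects the current process environment; passing a plain dict makes the check easy to exercise without touching real variables.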
You can then run the optillm proxy as follows.
python optillm.py
2024-09-06 07:57:14,191 - INFO - Starting server with approach: auto
2024-09-06 07:57:14,191 - INFO - Server configuration: {'approach': 'auto', 'mcts_simulations': 2, 'mcts_exploration': 0.2, 'mcts_depth': 1, 'best_of_n': 3, 'model': 'gpt-4o-mini', 'rstar_max_depth': 3, 'rstar_num_rollouts': 5, 'rstar_c': 1.4, 'base_url': ''}
* Serving Flask app 'optillm'
* Debug mode: off
2024-09-06 07:57:14,212 - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:8000
* Running on http://192.168.10.48:8000
2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit
Starting the optillm proxy for a local server (e.g. llama.cpp)
- Set the OPENAI_API_KEY env variable to a placeholder value, e.g. export OPENAI_API_KEY="no_key"
- Run ./llama-server -c 4096 -m path_to_model to start the server with the specified model and a context length of 4096 tokens
- Run python3 optillm.py --base_url base_url to start the proxy, e.g. for llama.cpp, run python3 optillm.py --base_url http://localhost:8080/v1
> [!WARNING]
> Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following: cot_reflection, leap, plansearch, rstar, rto, self_consistency, re2, and z3.

> [!NOTE]
> You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
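Because llama-server restricts the usable approaches, it can help to validate the slug before sending a request. This is a minimal sketch, assuming the {slug}-model-name convention described under Usage; LLAMA_SERVER_APPROACHES, local_model_name, and the default name "local-model" are hypothetical and not part of optillm.

```python
# Approaches the warning lists as usable with llama-server.
LLAMA_SERVER_APPROACHES = frozenset({
    "cot_reflection", "leap", "plansearch", "rstar",
    "rto", "self_consistency", "re2", "z3",
})


def local_model_name(approach: str, model: str = "local-model") -> str:
    """Build a '{slug}-model-name' string, rejecting unsupported slugs."""
    if approach not in LLAMA_SERVER_APPROACHES:
        raise ValueError(
            f"{approach!r} is not listed as supported with llama-server"
        )
    return f"{approach}-{model}"
```

The returned string can be passed as the model argument of the OpenAI client; the part after the slug is arbitrary because llama-server was started with a single model.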
Usage
Once the proxy is running, you can use it as a drop-in replacement for an OpenAI client by setting the base_url to http://localhost:8000/v1.
import os
from openai import OpenAI
OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_BASE_URL = "http://localhost:8000/v1"
client = OpenAI(api_key=OPENAI_KEY, base_url=OPENAI_BASE_URL)
response = client.chat.completions.create(
    # "moa-" prefix selects mixture of agents; gpt-4o-mini is the base model
    model="moa-gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Write a Python program to build an RL model to recite text from any position that the user provides, using only numpy."
        }
    ],
    temperature=0.2
)
print(response)
The code above applies to both OpenAI and Azure OpenAI; just remember to populate the OPENAI_API_KEY env variable with the proper key. There are multiple ways to control the optimization techniques; they are applied in the following order of preference:
- You can control the technique you use for optimization by prepending the slug to the model name, as in {slug}-model-name. E.g. in the code above we are using moa, or mixture of agents, as the optimization approach. In the proxy logs you will see the following, showing that moa is being used with gpt-4o-mini as the base model.
2024-09-06 08:35:32,597 - INFO - Using approach moa, with gpt-4o-mini
2024-09-06 08:35:35,358 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:39,553 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:44,795 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-09-06 08:35:44,797 - INFO - 127.0.0.1 - - [06/Sep/2024 08:35:44] "POST /v1/chat/completions HTTP/1.1" 200 -
- Or, you can pass the slug in the optillm_approach field in the extra_body.
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{ "role": "user","content": "" }],
temperature=0.2,
extra_body={"optillm_approach": "bon|moa|mcts"}
)
- Or, you can just mention the approach in either your system or user prompt, within <optillm_approach> </optillm_approach> tags.
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{ "role": "user","content": "<optillm_approach>re2</optillm_approach> How many r's are there in strawberry?" }],
temperature=0.2
)
> [!TIP]
> You can also combine different techniques by using the symbols & and |. When you use &, the techniques are processed in order from left to right in a pipeline, with the response from the previous stage used as the request to the next. With |, all the requests run in parallel and generate multiple responses that are returned as a list.
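To make the & and | semantics concrete, here is a minimal sketch of how an approach string could be split and how the two modes differ. It is not optillm's implementation: parse_approaches, run_combined, and the stand-in tag function are hypothetical, and a string that mixes & and | is simply rejected here.

```python
from typing import Callable, List, Tuple, Union


def parse_approaches(spec: str) -> Tuple[str, List[str]]:
    """Split 'a&b' or 'a|b' into (mode, slugs); mode is 'single', 'and' or 'or'."""
    if "&" in spec and "|" in spec:
        raise ValueError("this sketch does not handle mixed & and |")
    if "&" in spec:
        return "and", [s.strip() for s in spec.split("&")]
    if "|" in spec:
        return "or", [s.strip() for s in spec.split("|")]
    return "single", [spec.strip()]


def run_combined(spec: str, query: str,
                 apply: Callable[[str, str], str]) -> Union[str, List[str]]:
    """Apply the slugs in spec to query via the caller-supplied apply(slug, text)."""
    mode, slugs = parse_approaches(spec)
    if mode == "or":
        # Every approach sees the original query; results come back as a list.
        # Run sequentially here for simplicity rather than in parallel.
        return [apply(slug, query) for slug in slugs]
    # 'and' (and 'single'): each stage's response is the next stage's request.
    text = query
    for slug in slugs:
        text = apply(slug, text)
    return text


def tag(slug: str, text: str) -> str:
    """Stand-in for a real approach: wraps the text so the order is visible."""
    return f"{slug}({text})"


print(run_combined("bon&moa", "q", tag))       # moa(bon(q))
print(run_combined("bon|moa|mcts", "q", tag))  # ['bon(q)', 'moa(q)', 'mcts(q)']
```

The nested output for & shows the left-to-right pipeline, while | returns one independent result per approach.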
Please note that the convention described above works only when the optillm server has been started with inference approach set to auto. Otherwise, the model attribute in the…
Notability: 1.0/10 (routine fork, low traction)