friendliai/LLMServingPerfEvaluator
Language: Python
License: Apache-2.0
Stars: 48
Forks: 0
Open issues: 1
Created: 2024-05-21T07:48:20Z
Pushed: 2024-09-07T02:40:03Z
Default branch: main
Fork: no
Archived: yes
README:
LLMServingPerfEvaluator
A tool for evaluating the performance of LLM inference serving engines.
Prerequisites
Make sure you have:
- Docker Compose installed
- A basic understanding of Docker and Docker Compose
Performance Evaluation Overview
The evaluation workload sends concurrent inference requests whose arrivals follow a Poisson process with a given request rate (λ). The number of requests in a one-second interval has an expected value of λ, and the time between consecutive requests follows an exponential distribution with a mean of 1/λ. A greater λ therefore means that more requests per second are sent to the engine.
A workload is generated from a given workload_config.yaml. The workload configuration format is described [here](./src/workload/README.md).
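The arrival process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluator's actual implementation: the function name `poisson_arrival_times` and its parameters are hypothetical. It draws the gaps between requests from an exponential distribution with mean 1/λ and accumulates them into send times.

```python
import random


def poisson_arrival_times(request_rate, duration, seed=None):
    """Return request send times (in seconds) for a Poisson arrival process.

    Hypothetical helper for illustration; not part of LLMServingPerfEvaluator.
    request_rate is lambda (requests per second), duration is in seconds.
    """
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    rng = random.Random(seed)
    times = []
    # Each gap between consecutive requests has mean 1 / request_rate.
    t = rng.expovariate(request_rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(request_rate)
    return times


if __name__ == "__main__":
    arrivals = poisson_arrival_times(request_rate=5.0, duration=300.0, seed=0)
    print(f"{len(arrivals)} requests, about {len(arrivals) / 300.0:.2f} per second")
```

Over a long enough duration, the observed rate should be close to the requested λ, while individual gaps vary widely. That burstiness is what makes the workload more realistic than a fixed-interval sender.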
The following [metrics](#1-end-to-end-performance-metrics) are measured in the performance evaluation:
- Throughput (requests per second, input tokens per second, output tokens per second)
- Latency (seconds) / Token Latency (milliseconds)
- Time to first token (milliseconds, measured only in stream mode)
- Time per output tokens (milliseconds, measured only in stream mode)
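To make the streaming metrics concrete, here is a minimal sketch of how they could be derived from per-token arrival timestamps for a single request. The definitions are the commonly used ones and are an assumption here, since this README does not spell out the formulas. The function `streaming_metrics` and its dictionary keys are hypothetical names.

```python
def streaming_metrics(send_time, token_times):
    """Compute per-request metrics from token arrival timestamps.

    Hypothetical helper; the definitions below are assumed, not taken from
    the evaluator. Inputs are in seconds. Time to first token and time per
    output token are returned in milliseconds, latency in seconds.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    latency_s = last - send_time               # end-to-end latency
    ttft_ms = (first - send_time) * 1000.0     # time to first token
    # Average gap between consecutive output tokens after the first one.
    tpot_ms = (last - first) / (n - 1) * 1000.0 if n > 1 else 0.0
    return {"latency_s": latency_s, "ttft_ms": ttft_ms, "tpot_ms": tpot_ms}


if __name__ == "__main__":
    # A request sent at t=0 whose four tokens arrive every 250 ms.
    print(streaming_metrics(0.0, [0.25, 0.5, 0.75, 1.0]))
```

Both streaming metrics need the arrival time of each token, which is why they are only available when the engine streams its output.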
You can run the performance evaluation with the docker-compose.yml file in this repo. The docker-compose.yml file already includes the grafana and prometheus containers, so you can open http://localhost:3000 to monitor live performance. Additionally, the end-to-end metrics are saved as a .csv file.

---
Quick Start with Friendli Engine
In this quick start, we will evaluate the performance of the Friendli Engine. The experiment will be conducted with the following configurations:
- Served Model: meta-llama/Meta-Llama-3-8B-Instruct
- Hardware Requirement: NVIDIA Ampere or higher GPU with more than 32GB of GPU memory
STEP 0. Clone the Repository
git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator
STEP 1. Prepare working directory
mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config
mkdir -p workspace/grafana
mkdir -p workspace/prometheus
STEP 2. Set up the configuration files
In this quick start, we use the following configuration files in this repository:
- the workload config file in [examples/workload_config/dummy.yaml](./examples/workload_config/dummy.yaml)
- the request config file in [examples/request_config/friendli.yaml](./examples/request_config/friendli.yaml)
Also, we use the grafana and prometheus configuration files in this repository. Please copy [grafana](./grafana/) and [prometheus](./prometheus) directories in this repository to the workspace directory. (grafana/provisioning/compare_engines_dashboard.json does not need to be copied.)
cp examples/workload_config/dummy.yaml workspace/config/workload_config/
cp examples/request_config/friendli.yaml workspace/config/request_config/
cp -r grafana/ workspace/
cp -r prometheus/ workspace/
The workspace directory structure should look like this:
$ tree workspace/ -a
workspace
├── config
│   ├── request_config
│   │   └── friendli.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── datshboard.yaml
│       │   └── single_engine_dashboard.json
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
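If `tree` is not available, the layout can also be checked with a short script. The following is a hedged sketch: the helper `missing_workspace_files` is a hypothetical name, and the list covers only a subset of the files from the listing above.

```python
from pathlib import Path

# A subset of the relative paths from the workspace listing.
EXPECTED_FILES = [
    "config/request_config/friendli.yaml",
    "config/workload_config/dummy.yaml",
    "grafana/provisioning/dashboards/single_engine_dashboard.json",
    "grafana/provisioning/datasources/datasource.yaml",
    "prometheus/config/prometheus.yml",
]


def missing_workspace_files(workspace="workspace"):
    """Return the expected files that are absent under the workspace directory.

    Hypothetical helper for illustration; not part of LLMServingPerfEvaluator.
    """
    root = Path(workspace)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_workspace_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All checked workspace files are present.")
```

Run it from the repository root after the copy commands. An empty result means the checked files are where the later steps expect them.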
> [!NOTE]
> workload_config.yaml: Describes how to generate input-output pairs of requests in the experiment.
>
> Currently, two types of workloads are supported:
> - dummy: A synthetic input-output pair with a given length range.
> - hf: An input-output pair from a Hugging Face dataset.
>
> For more details, please refer to the [document on generating workloads](./src/workload/README.md).
> [!NOTE]
> request_config.yaml: Builds the request body with the engine-specific format.
>
> In this quick start, we use the Friendli Engine request config below:
> ```yaml
> # request_config.yaml for Friendli Engine
> stream: True
> ```
>
> If stream is set to True, the engine will generate the output tokens one by one, and the Time to first token and Time per output tokens metrics are measured. If stream is set to False, the metrics are not measured.
> [!NOTE]
> You can use the request_config examples in examples/request_config for other engines.
>
> If you want to use another request body format, you can add your RequestConfig class in src/engine_client and write your own request config file.
> For more details, please refer to the [section about adding a custom EngineClient](#adding-a-custom-engine-client).
STEP 3. Set up the docker-compose.yml and the .env file (or environment variables)
Copy the docker-compose.yml file in this repository to the workspace directory. Also, copy the examples/docker_compose/.friendli_env file in this repository to the workspace directory.
cp docker-compose.yml workspace/
cp examples/docker_compose/.friendli_env workspace/.env
The .env file defines the environment variables for the experiment. Replace each value written in braces with the actual value.
# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics
CLIENT_TYPE=http # Use http or grpc. Only friendli engine supports grpc client.
GRPC_MODE=false # Use true for grpc, false for http. Only friendli engine supports grpc mode.
# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}