What does this repo signal mean?

FriendliAI published friendliai/LLMServingPerfEvaluator (Python). Repository signals like this one surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo friendliai/LLMServingPerfEvaluator · language Python · New LLM serving perf evaluator with 48 stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

FriendliAI Repo: friendliai/LLMServingPerfEvaluator

Captured source

GitHub: github.com/friendliai/LLMServingPerfEvaluator

friendliai/LLMServingPerfEvaluator repository metadata

Published May 21, 2024 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

friendliai/LLMServingPerfEvaluator

Language: Python

License: Apache-2.0

Stars: 48

Forks: 0

Open issues: 1

Created: 2024-05-21T07:48:20Z

Pushed: 2024-09-07T02:40:03Z

Default branch: main

Fork: no

Archived: yes

README:

LLMServingPerfEvaluator

A tool for evaluating the performance of LLM inference serving engines.

Prerequisites

Make sure you have:

Docker Compose installed
A basic understanding of Docker and Docker Compose

Performance Evaluation Overview

The evaluation workload sends concurrent inference requests whose arrivals follow a Poisson process with a given request rate (λ). In a Poisson process, λ is the expected number of requests per unit of time, so the interval between consecutive requests follows an exponential distribution with a mean of 1/λ. A larger λ therefore means requests are sent to the engine more frequently.
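The arrival pattern described above can be sketched in a few lines of Python. This is only an illustration of Poisson arrivals, not the evaluator's own request generator; the function name `poisson_arrival_times` and its parameters are hypothetical.

```python
import random

def poisson_arrival_times(rate, duration, seed=None):
    """Return request send times (in seconds) for a Poisson arrival process.

    rate: average number of requests per second (lambda).
    duration: length of the sending window in seconds.
    """
    rng = random.Random(seed)
    times = []
    # Each gap between requests is exponentially distributed with mean 1/rate.
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

# Example: lambda = 5 requests/second over a 300-second window.
arrivals = poisson_arrival_times(rate=5.0, duration=300.0, seed=0)
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print("requests sent:", len(arrivals))
print("mean gap (s):", sum(gaps) / len(gaps))
```

With λ = 5, the number of requests should land near 5 × 300 = 1500 and the mean gap near 1/5 = 0.2 seconds, with some random variation from run to run.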

A workload is generated from a given workload_config.yaml. The workload configuration format is described [here](./src/workload/README.md).

The following [metrics](#1-end-to-end-performance-metrics) are measured in the performance evaluation:

Throughput (requests per second, input tokens per second, output tokens per second)
Latency (seconds) / Token Latency (milliseconds)
Time to first token (milliseconds, enabled with stream mode)
Time per output token (milliseconds, enabled with stream mode)
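To make the per-request metrics concrete, here is a minimal sketch that derives them from the time a request was sent and the times at which streamed tokens arrived. The formulas are common definitions and are assumptions here: the excerpt does not state exactly how the evaluator computes token latency or time per output token, and `stream_metrics` is a hypothetical helper.

```python
def stream_metrics(send_time, token_times):
    """Compute per-request metrics from timestamps given in seconds.

    send_time: when the request was sent.
    token_times: arrival time of each streamed output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    latency = token_times[-1] - send_time   # end-to-end latency
    ttft = token_times[0] - send_time       # time to first token
    # Assumed definition: average gap between tokens after the first one.
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    return {
        "latency_s": latency,
        # Assumed definition: end-to-end latency divided by output tokens.
        "token_latency_ms": latency / n * 1000.0,
        "ttft_ms": ttft * 1000.0,
        "tpot_ms": tpot * 1000.0,
    }

# Five tokens: the first arrives after 250 ms, then one every 50 ms.
m = stream_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(m)
```

Without streaming, only the final response time is observable, which is why the last two metrics depend on stream mode.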

You can run the performance evaluation with the docker-compose.yml file in this repo. The docker-compose.yml file already includes the grafana and prometheus containers, so you can open http://localhost:3000 to monitor live performance. Additionally, the end-to-end metrics are saved as a .csv file.

![image](./assets/grafana_dashboard.png)

---

Quick Start with Friendli Engine

In this quick start, we will evaluate the performance of the Friendli Engine. The experiment will be conducted with the following configurations:

Served Model: meta-llama/Meta-Llama-3-8B-Instruct
Hardware Requirement: NVIDIA Ampere or higher GPU with more than 32GB of GPU memory

STEP 0. Clone the Repository

git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator

STEP 1. Prepare working directory

mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config
mkdir -p workspace/grafana
mkdir -p workspace/prometheus

STEP 2. Set up the configuration files

In this quick start, we use the following configuration files in this repository:

the workload config file in [examples/workload_config/dummy.yaml](./examples/workload_config/dummy.yaml)
the request config file in [examples/request_config/friendli.yaml](./examples/request_config/friendli.yaml).

Also, we use the grafana and prometheus configuration files in this repository. Copy the [grafana](./grafana/) and [prometheus](./prometheus) directories from this repository to the workspace directory. (grafana/provisioning/compare_engines_dashboard.json does not need to be copied.)

cp examples/workload_config/dummy.yaml workspace/config/workload_config/
cp examples/request_config/friendli.yaml workspace/config/request_config/
cp -r grafana/ workspace/
cp -r prometheus/ workspace/

The workspace directory structure should look like this:

$ tree workspace/ -a
workspace
├── config
│   ├── request_config
│   │   └── friendli.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── datshboard.yaml
│       │   └── single_engine_dashboard.json
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
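If `tree` is not available, a short Python check can confirm the layout before moving on. This is a convenience sketch, not part of the repository; `missing_files` is a hypothetical helper, and the list covers only a subset of the files shown in the tree above.

```python
from pathlib import Path

# A subset of the files shown in the tree above, relative to the workspace.
EXPECTED = [
    "config/request_config/friendli.yaml",
    "config/workload_config/dummy.yaml",
    "grafana/provisioning/dashboards/single_engine_dashboard.json",
    "grafana/provisioning/datasources/datasource.yaml",
    "prometheus/config/prometheus.yml",
]

def missing_files(workspace):
    """Return the expected relative paths that are not regular files."""
    root = Path(workspace)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_files("workspace")
    for rel in missing:
        print("missing:", rel)
    if not missing:
        print("workspace layout looks complete")
```

Run it from the repository root, where the workspace directory was created in STEP 1.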

> [!NOTE]
> workload_config.yaml: Describes how to generate input-output pairs of requests in the experiment.
>
> Currently, two types of workloads are supported:
> - dummy: A synthetic input-output pair with a given length range.
> - hf: An input-output pair from a Hugging Face dataset.
>
> For more details, please refer to the [document on generating workloads](./src/workload/README.md).

> [!NOTE]
> request_config.yaml: Builds the request body with the engine-specific format.
>
> In this quick start, we use the Friendli Engine request config below:
> ```yaml
> # request_config.yaml for Friendli Engine
> stream: True
> ```
>
> If `stream` is set to `True`, the engine will generate the output tokens one by one, and the Time to first token and Time per output token metrics are measured. If `stream` is set to `False`, these metrics are not measured.

> [!NOTE]
> You can use the request_config examples in examples/request_config for other engines.
>
> If you want to use another request body format, you can add your RequestConfig class in src/engine_client and write your own request config file.
> For more details, please refer to the [section about adding a custom EngineClient](#adding-a-custom-engine-client).

STEP 3. Set up the `docker-compose.yml` and the `.env` file (or environment variables)

Copy the docker-compose.yml file in this repository to the workspace directory. Also, copy the examples/docker_compose/.friendli_env file to the workspace directory as .env.

cp docker-compose.yml workspace/
cp examples/docker_compose/.friendli_env workspace/.env

The .env file contains the environment variables for the experiment. Replace any values written in braces with actual values.

# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics
CLIENT_TYPE=http # Use http or grpc. Only friendli engine supports grpc client.
GRPC_MODE=false # Use true for grpc, false for http. Only friendli engine supports grpc mode.

# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}
#...
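As a rough sanity check on these settings, the sketch below parses the two load-related variables and estimates how many requests each rate would generate. It assumes each rate in REQUEST_RATES is run for the full DURATION, which the excerpt does not confirm, and `parse_env` is a hypothetical helper with simplified comment handling rather than Docker Compose's own parser.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, dropping blank lines and '#' comments.

    Simplification: everything after the first '#' on a line is discarded.
    """
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env

ENV_TEXT = """\
REQUEST_RATES=1,3,5,7 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
"""

env = parse_env(ENV_TEXT)
rates = [float(r) for r in env["REQUEST_RATES"].split(",")]
duration = int(env["DURATION"])

# Expected request count per rate, assuming each rate runs for DURATION seconds.
expected = {rate: rate * duration for rate in rates}
print(expected)
```

Under that assumption, the four rates correspond to roughly 300, 900, 1500, and 2100 requests on average.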

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New LLM serving perf evaluator with 48 stars