Fork · Together AI · published Oct 13, 2025

togethercomputer/genai-bench

forked from sgl-project/genai-bench

Description: Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-10-13T23:23:23Z

Pushed: 2026-05-22T18:15:39Z

Default branch: main

Fork: yes

Parent repository: sgl-project/genai-bench

Archived: no

README:

| User Guide | Contribution Guideline |

Introduction

Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.

It provides detailed insights into model serving performance, offering both a user-friendly CLI and a live UI for real-time progress monitoring.

Features

  • 🛠️ CLI Tool: Validates user inputs and initiates benchmarks seamlessly.
  • 📊 Live UI Dashboard: Displays current progress, logs, and real-time metrics.
  • 📝 Rich Logs: Automatically flushed to both terminal and file upon experiment completion.
  • 📈 Experiment Analyzer: Generates comprehensive Excel reports with pricing and raw metrics data, plus flexible plot configurations (default 2x4 grid) that visualize key performance metrics including throughput, latency (TTFT, E2E, TPOT), error rates, and RPS across different traffic scenarios and concurrency levels. Supports custom plot layouts and multi-line comparisons.

How to Start

Please check User Guide and CONTRIBUTING.md for how to install and use genai-bench.

Benchmark Metrics Definition

This section collects the standard metrics required for LLM serving performance analysis. We classify metrics into two types: single-request level metrics, which are collected from one request, and aggregated level metrics, which summarize the single-request metrics from one run (with a specific traffic scenario and num concurrency).

NOTE:

  • Each single-request metric includes standard statistics: percentile, min, max, stddev, and mean.
  • The following metrics cover input, output, and end-to-end (e2e) stages. For *chat* tasks, all stages are relevant for evaluation. For *embedding* tasks, where there is no output stage, output metrics will be set to 0. For details about output metrics collection, please check out OUTPUT_METRICS_FIELDS in [metrics.py](genai_bench/metrics/metrics.py).
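The statistics listed in the note above can be sketched with the Python standard library. This is a minimal illustration, not genai-bench's implementation: the function name `summarize`, the chosen percentiles, the inclusive interpolation method, and the use of the sample standard deviation are all assumptions.

```python
import statistics


def summarize(values, percentiles=(50, 90, 99)):
    """Summarize one single-request metric across many requests.

    Hypothetical helper: the percentile set and interpolation method are
    assumptions, not taken from genai-bench.
    """
    if len(values) < 2:
        raise ValueError("need at least two samples to compute stddev and percentiles")

    # quantiles(n=100) returns 99 cut points; cut point p-1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")

    summary = {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (assumption)
    }
    for p in percentiles:
        summary[f"p{p}"] = cuts[p - 1]
    return summary


if __name__ == "__main__":
    ttft_samples = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up TTFT values in seconds
    print(summarize(ttft_samples))
```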

Single Request Level Metrics

The following metrics capture token-level performance for a single request, providing insights into server efficiency for each individual request.

| Glossary | Meaning | Calculation Formula | Units |
|----------|---------|---------------------|-------|
| TTFT | Time to First Token. Initial response time when the first output token is generated. This is also known as the latency of the input stage. | TTFT = time_at_first_token - start_time | seconds |
| End-to-End Latency | End-to-End latency. This metric indicates how long it takes from submitting a query to receiving the full response, including network latencies. | e2e_latency = end_time - start_time | seconds |
| TPOT | Time Per Output Token. The average time between two subsequent generated tokens. | TPOT = (e2e_latency - TTFT) / (num_output_tokens - 1) | seconds |
| Output Latency | Output latency. This metric indicates how long it takes to receive the full response after the first token is generated. | output_latency = e2e_latency - TTFT | seconds |
| Output Inference Speed | The number of tokens the model can generate per second for a single request. | inference_speed = 1 / TPOT | tokens/second |
| Num of Input Tokens | Number of prompt tokens. | num_input_tokens = len(tokenizer.encode(prompt)) | tokens |
| Num of Output Tokens | Number of output tokens. | num_output_tokens = num_completion_tokens | tokens |
| Num of Request Tokens | Total number of tokens processed in one request. | num_request_tokens = num_input_tokens + num_output_tokens | tokens |
| Input Throughput | The overall throughput of the input stage. | input_throughput = num_input_tokens / TTFT | tokens/second |
| Output Throughput | The throughput of output generation for a single request. | output_throughput = (num_output_tokens - 1) / output_latency | tokens/second |
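The formulas in the table above can be worked through for a single request as follows. This is a sketch under stated assumptions rather than genai-bench's own code: `RequestTiming` and `single_request_metrics` are hypothetical names, and the fallback to 0 when there is at most one output token is an assumption modeled on the note that output metrics are set to 0 for embedding tasks.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical container for one request's raw measurements."""
    start_time: float            # seconds, when the request was sent
    time_at_first_token: float   # seconds, when the first output token arrived
    end_time: float              # seconds, when the full response arrived
    num_input_tokens: int
    num_output_tokens: int


def single_request_metrics(t: RequestTiming) -> dict:
    ttft = t.time_at_first_token - t.start_time
    e2e_latency = t.end_time - t.start_time
    output_latency = e2e_latency - ttft

    # The first token is covered by TTFT, so only the remaining tokens
    # count toward per-token output timing.
    remaining_tokens = t.num_output_tokens - 1
    if remaining_tokens > 0 and output_latency > 0:
        tpot = output_latency / remaining_tokens
        inference_speed = 1 / tpot
        output_throughput = remaining_tokens / output_latency
    else:
        # Assumption: report 0 when there is no output stage to measure.
        tpot = inference_speed = output_throughput = 0.0

    return {
        "ttft": ttft,
        "e2e_latency": e2e_latency,
        "output_latency": output_latency,
        "tpot": tpot,
        "inference_speed": inference_speed,
        "num_request_tokens": t.num_input_tokens + t.num_output_tokens,
        "input_throughput": t.num_input_tokens / ttft if ttft > 0 else 0.0,
        "output_throughput": output_throughput,
    }


if __name__ == "__main__":
    # Made-up numbers: 100 prompt tokens, 21 output tokens, 2.5 s end to end.
    timing = RequestTiming(0.0, 0.5, 2.5, num_input_tokens=100, num_output_tokens=21)
    for name, value in single_request_metrics(timing).items():
        print(f"{name}: {value}")
```

With these formulas, `inference_speed` and `output_throughput` are the same quantity, since `1 / TPOT` equals `(num_output_tokens - 1) / output_latency`.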

Aggregated Metrics

These metrics summarize performance under a specific traffic load pattern, defined by the traffic scenario and the num concurrency. They provide insight into server capacity and performance under pressure.

| Glossary | Meaning | Calculation Formula | Units |
|----------|---------|---------------------|-------|
| Mean Input Throughput | The average throughput of how many input tokens can be processed by the model in one run with multiple concurrent requests. | mean_input_throughput = sum(input_tokens_for_all_requests) / run_duration | tokens/second |
| Mean Output Throughput | The average throughput of how many output tokens can be processed by the model in one run with multiple concurrent requests. | mean_output_throughput = sum(output_tokens_for_all_requests) / run_duration | tokens/second |
| Total Tokens Throughput | The average throughput of how many tokens can be processed by the model, including both input and output tokens. | mean_total_tokens_throughput = all_requests["total_tokens"]["sum"] / run_duration | tokens/second |
| Total Chars Per Hour[^1] | The average number of characters the model can process per hour. | total_chars_per_hour = total_tokens_throughput * dataset_chars_to_token_ratio * 3600 | Characters |
| Requests Per Minute | The number of requests processed by the model per minute. | num_completed_requests_per_min = num_completed_requests / (end_time - start_time) * 60 | Requests |
| Error Codes to Frequency | A map that shows the returned error status code to its frequency. | | |
| Error Rate | The rate of error requests over total…
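The aggregated formulas that are fully visible in the table above can be sketched as follows. This is an illustration only: the function name `aggregate_metrics` and the record layout (dicts with `input_tokens`, `output_tokens`, and `error_code`) are hypothetical, `run_duration` is assumed to equal `end_time - start_time`, and counting tokens only from completed requests is also an assumption. The error rate is left out because its formula is cut off in the excerpt.

```python
from collections import Counter


def aggregate_metrics(requests, start_time, end_time, dataset_chars_to_token_ratio):
    """Aggregate per-request records from one run.

    `requests` is a list of dicts with hypothetical keys "input_tokens",
    "output_tokens", and "error_code" (None for a completed request).
    """
    run_duration = end_time - start_time  # assumed definition of run_duration
    if run_duration <= 0:
        raise ValueError("end_time must be later than start_time")

    # Assumption: only completed (non-error) requests contribute tokens.
    completed = [r for r in requests if r["error_code"] is None]
    input_tokens = sum(r["input_tokens"] for r in completed)
    output_tokens = sum(r["output_tokens"] for r in completed)
    total_tokens_throughput = (input_tokens + output_tokens) / run_duration

    error_codes = Counter(
        r["error_code"] for r in requests if r["error_code"] is not None
    )

    return {
        "mean_input_throughput": input_tokens / run_duration,
        "mean_output_throughput": output_tokens / run_duration,
        "mean_total_tokens_throughput": total_tokens_throughput,
        "total_chars_per_hour": total_tokens_throughput
        * dataset_chars_to_token_ratio
        * 3600,
        "num_completed_requests_per_min": len(completed) / run_duration * 60,
        "error_codes_frequency": dict(error_codes),
    }


if __name__ == "__main__":
    # Made-up run: three completed requests and three errors over 10 seconds.
    run = [
        {"input_tokens": 100, "output_tokens": 50, "error_code": None},
        {"input_tokens": 200, "output_tokens": 50, "error_code": None},
        {"input_tokens": 300, "output_tokens": 100, "error_code": None},
        {"input_tokens": 0, "output_tokens": 0, "error_code": 429},
        {"input_tokens": 0, "output_tokens": 0, "error_code": 429},
        {"input_tokens": 0, "output_tokens": 0, "error_code": 500},
    ]
    print(aggregate_metrics(run, 0.0, 10.0, dataset_chars_to_token_ratio=4.0))
```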
