RepoReplicateReplicatepublished Jan 18, 2024seen 5d

replicate/cog-triton

Python

Open original ↗

Captured source

source ↗
published Jan 18, 2024seen 5dcaptured 8hhttp 200method plain

replicate/cog-triton

Description: A cog implementation of Nvidia's Triton server

Language: Python

License: Apache-2.0

Stars: 18

Forks: 0

Open issues: 4

Created: 2024-01-18T15:22:36Z

Pushed: 2024-10-23T05:04:23Z

Default branch: main

Fork: no

Archived: no

README:

cog-triton

A cog implementation of Nvidia's Triton server

Error codes

We are using "E[Category][Subcategory][Sequence] [Short Error Name]: [Description]

Universal user errors:

Category 1 (user error), subcategory 0 (framework-agnostic user errors).

  • E1000 GenericError: Generic user error (reserved)
  • E1001 PromptRequired: A prompt is required, but your formatted prompt is blank
  • E1002 PromptTooLong: Prompt length exceeds maximum input length.
  • E1003 BadPromptTemplate: You have submitted both a prompt and a prompt template that doesn't include '{prompt}'.
  • E1004 PromptTemplateError: Prompt template must be a valid python format spec

Triton user errors:

Category 1 (user error), subcategory 1 (triton-specific user errors)

  • E1101 InvalidArgumentMinTokens: Can't set both min_tokens ({min_tokens}) and min_new_tokens ({min_new_tokens})
  • E1102 InvalidArgumentMaxTokens: Can't set both max_tokens ({max_tokens}) and max_new_tokens ({max_new_tokens})

Triton errors:

Category 2 (framework error), subcategory 1 (triton system error)

  • E2100 TritonUnknownError: Unknown error
  • E2101 TritonTimeout: Triton timed out after {TRITON_TIMEOUT}s: httpx.ReadTimeout.
  • E2102 TritonTokenizerError: Tokenizer error: ... the first token of the stop sequence IDs was not '!', which suggests there is a problem with the tokenizer that you are using.
  • E2103 TritonMalformedJSON: Triton returned malformed JSON
  • E2104 TritonMalformedEvent: Triton returned malformed event (no output_ids or error key)

Other frameworks like vLLM might start their error numbering from E2200.

Create a Replicate Model with cog-triton

Currently, we use yolo, a CLI tool we've built to help with non-standard Replicate workflows. To get started, install yolo:

sudo curl -o /usr/local/bin/yolo -L "https://github.com/replicate/yolo/releases/latest/download/yolo_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/yolo

Once you have yolo installed, follow these steps:

1. Compile a TensorRT engine with cog-triton

2. If it doesn't exist already, you'll need to create the Replicate Model to which you'll push your cog-triton model

You can create a new Replicate Model via web or our API. To keep things simple, we'll use the latter method.

First, set a Replicate API token.

export REPLICATE_API_TOKEN=
curl -s -X POST -H "Authorization: Token $REPLICATE_API_TOKEN" \
-d '{"owner": "my-username", "name": "my-new-model", "visibility": "private", "hardware": "gpu-a40-large"}' \
https://api.replicate.com/v1/models

We'll call our model staging-gpt2-triton-trt-llm

2. Instantiate a cog-triton model with your TRT-LLM engine

staging-gpt2-triton-trt-llm

yolo push \
--base r8.im/replicate-internal/cog-triton@sha256:5d784bf5f449a0578ceb903265bb756dae146a267fc075b4c77021babedc6637 \
--dest r8.im/replicate-internal/staging-gpt2-triton-trt-llm \
-e COG_WEIGHTS=https://replicate.delivery/pbxt/CUDp32x5hO6GMBWprN8o24vWOLZbnYm7AAoRTxLfe0CUfglkA/engine.tar

Run cog-triton locally

To run cog-triton locally, you must either pull the cog-triton Replicate image or build your own image.

Preparation to run cog-triton locally with Replicate image

Pull and tag the cog-triton image

Go here and pick the version you want to run locally. For our purposes, we'll set the version ID as an environment variable so that the code chunks below won't get stale.

export COG_TRITON_VERSION=

Then, click the version hash. We need to set our Replicate API Token and you can do that manually, or navigate to the HTTP tab in your browser and copy the export command.

export REPLICATE_API_TOKEN=

Next, navigate to the Docker tab under Input. This will display a code chunk like with a Docker run command like:

docker run -d -p 5000:5000 --gpus=all r8.im/replicate-internal/cog-triton@sha256:2db2b5c2e199975fef07ed9045608ed7adc7796744041fa54d3ae9d13db6c3cf

We'll use the image reference to write a pull command:

docker pull r8.im/replicate-internal/cog-triton@sha256:${COG_TRITON_VERSION}

After the image has been pulled, you should tag it so that all the docker commands in this README will work. First, find the IMAGE ID for the image you just pulled, e.g. via docker images. Then run the command below after replacing `` with your image id.

docker tag cog-triton:latest

Run an engine built with cog-trt-llm

Copy all model artifacts from cog-trt-llm/engine_outputs/ to triton_model_repo/tensorrt_llm/1/:

cp -r ../cog-trt-llm/engine_outputs/* triton_model_repo/tensorrt_llm/1/

Run the cog-triton image:

docker run --rm -it -p 5000:5000 --gpus=all --workdir /src --net=host --volume $(pwd)/.:/src/. --ulimit memlock=-1 --shm-size=20g cog-triton /bin/bash
python -m cog.server.http

Make a request:

curl -s -X POST \
-H "Content-Type: application/json" \
-d $'{
"input": {
"prompt": "What is machine learning?"
}
}' \
http://localhost:5000/predictions

Performance tests with test_perf.py

time python3 scripts/test_perf.py --target cog-triton --rate 8 --unit rps --duration 30 --n_input_tokens 100 --n_output_tokens 100

Development

This repository builds 4 different images:

  • cog-triton-builder, which builds TRT-LLM engines.
  • cog-triton-runner-80, suitable to run engines built on, and for, nvidia A100's
  • cog-triton-runner-86, suitable for A40
  • cog-triton-runner-90, suitable for H100 and H200.

Here's a full GPU compatibility list.

End-to-end build process

Cog-triton is pre-release and not stable. This build process currently requires nix to be installed (with the config setting experimental-features = nix-command flakes). We recommend the DeterminateSystems Nix installer, which will set this setting for you.

1. Install nix:

$ curl…

Excerpt shown — open the source for the full document.