replicate/cog-triton
Python
Description: A cog implementation of Nvidia's Triton server
Language: Python
License: Apache-2.0
Stars: 18
Forks: 0
Open issues: 4
Created: 2024-01-18T15:22:36Z
Pushed: 2024-10-23T05:04:23Z
Default branch: main
Fork: no
Archived: no
README:
cog-triton
A cog implementation of Nvidia's Triton server
Error codes
We use the format "E[Category][Subcategory][Sequence] [Short Error Name]: [Description]".
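To illustrate the scheme, a code can be split into its parts with a small helper. This is a minimal sketch, not code from the repository: `ErrorCode` and `parse_error_code` are hypothetical names, and the sketch assumes one digit each for category and subcategory and two digits for the sequence, as the codes listed in this README suggest.

```python
import re
from typing import NamedTuple


class ErrorCode(NamedTuple):
    """Hypothetical container for the parts of a code such as 'E2101'."""

    category: int     # 1 = user error, 2 = framework error
    subcategory: int  # e.g. 0 = framework-agnostic, 1 = Triton
    sequence: int     # position within the subcategory


# Assumption: one digit for category, one for subcategory, two for sequence.
_CODE_RE = re.compile(r"^E(\d)(\d)(\d{2})$")


def parse_error_code(code: str) -> ErrorCode:
    """Split an error code string into category, subcategory and sequence."""
    match = _CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not a valid error code: {code!r}")
    category, subcategory, sequence = (int(group) for group in match.groups())
    return ErrorCode(category, subcategory, sequence)


print(parse_error_code("E1002"))
print(parse_error_code("E2101"))
```

Under these assumptions, E1002 reads as category 1, subcategory 0, sequence 2, and a hypothetical E2200 would fall into category 2, subcategory 2.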
Universal user errors:
Category 1 (user error), subcategory 0 (framework-agnostic user errors).
- E1000 GenericError: Generic user error (reserved)
- E1001 PromptRequired: A prompt is required, but your formatted prompt is blank
- E1002 PromptTooLong: Prompt length exceeds maximum input length.
- E1003 BadPromptTemplate: You have submitted both a prompt and a prompt template that doesn't include '{prompt}'.
- E1004 PromptTemplateError: Prompt template must be a valid python format spec
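The prompt-related errors above could arise from a formatting step along the following lines. This is a sketch under stated assumptions rather than the repository's implementation: `UserError` and `format_prompt` are hypothetical names, the template is assumed to be applied with `str.format`, and E1002 is left out because checking it requires a tokenizer.

```python
class UserError(Exception):
    """Hypothetical exception type carrying one of the E10xx messages."""


def format_prompt(prompt: str, prompt_template: str = "{prompt}") -> str:
    """Apply a prompt template and raise the matching user error on failure."""
    # E1003: a prompt was given, but the template has no slot for it.
    if prompt and "{prompt}" not in prompt_template:
        raise UserError(
            "E1003 BadPromptTemplate: You have submitted both a prompt and a "
            "prompt template that doesn't include '{prompt}'."
        )
    # E1004: the template is not a usable Python format string.
    try:
        formatted = prompt_template.format(prompt=prompt)
    except (KeyError, IndexError, ValueError, AttributeError) as exc:
        raise UserError(
            "E1004 PromptTemplateError: Prompt template must be a valid "
            "python format spec"
        ) from exc
    # E1001: nothing is left to send after formatting.
    if not formatted.strip():
        raise UserError(
            "E1001 PromptRequired: A prompt is required, but your formatted "
            "prompt is blank"
        )
    return formatted


print(format_prompt("What is machine learning?", "Question: {prompt}\nAnswer:"))
```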
Triton user errors:
Category 1 (user error), subcategory 1 (triton-specific user errors)
- E1101 InvalidArgumentMinTokens: Can't set both min_tokens ({min_tokens}) and min_new_tokens ({min_new_tokens})
- E1102 InvalidArgumentMaxTokens: Can't set both max_tokens ({max_tokens}) and max_new_tokens ({max_new_tokens})
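A check that produces these two messages might look like the sketch below. The function name `check_token_args` is hypothetical, and the sketch assumes that an unset argument is represented as `None`; the repository may use a different sentinel.

```python
from typing import Optional


def check_token_args(
    min_tokens: Optional[int] = None,
    min_new_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> None:
    """Reject requests that set both spellings of the same limit."""
    # Assumption: None means "not provided by the user".
    if min_tokens is not None and min_new_tokens is not None:
        raise ValueError(
            f"E1101 InvalidArgumentMinTokens: Can't set both min_tokens "
            f"({min_tokens}) and min_new_tokens ({min_new_tokens})"
        )
    if max_tokens is not None and max_new_tokens is not None:
        raise ValueError(
            f"E1102 InvalidArgumentMaxTokens: Can't set both max_tokens "
            f"({max_tokens}) and max_new_tokens ({max_new_tokens})"
        )


check_token_args(min_tokens=1, max_new_tokens=128)  # passes silently
```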
Triton errors:
Category 2 (framework error), subcategory 1 (triton system error)
- E2100 TritonUnknownError: Unknown error
- E2101 TritonTimeout: Triton timed out after {TRITON_TIMEOUT}s: httpx.ReadTimeout.
- E2102 TritonTokenizerError: Tokenizer error: ... the first token of the stop sequence IDs was not '!', which suggests there is a problem with the tokenizer that you are using.
- E2103 TritonMalformedJSON: Triton returned malformed JSON
- E2104 TritonMalformedEvent: Triton returned malformed event (no output_ids or error key)
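The distinction between E2103 and E2104 can be sketched as a two-stage check on each event. This is illustrative only: `TritonError` and `parse_event` are hypothetical names, and the sketch assumes each event arrives as a JSON object in a string, with any transport framing already stripped.

```python
import json


class TritonError(Exception):
    """Hypothetical exception type carrying one of the E21xx messages."""


def parse_event(raw: str) -> dict:
    """Decode one event and verify it carries output_ids or an error."""
    # Stage 1: the payload must be valid JSON at all (E2103 otherwise).
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise TritonError(
            "E2103 TritonMalformedJSON: Triton returned malformed JSON"
        ) from exc
    # Stage 2: valid JSON must still contain one of the expected keys.
    if not isinstance(event, dict) or (
        "output_ids" not in event and "error" not in event
    ):
        raise TritonError(
            "E2104 TritonMalformedEvent: Triton returned malformed event "
            "(no output_ids or error key)"
        )
    return event


print(parse_event('{"output_ids": [101, 102]}'))
```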
Other frameworks like vLLM might start their error numbering from E2200.
Create a Replicate Model with cog-triton
Currently, we use yolo, a CLI tool we've built to help with non-standard Replicate workflows. To get started, install yolo:
sudo curl -o /usr/local/bin/yolo -L "https://github.com/replicate/yolo/releases/latest/download/yolo_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/yolo
Once you have yolo installed, follow these steps:
1. Compile a TensorRT engine with cog-triton
2. If it doesn't exist already, you'll need to create the Replicate Model to which you'll push your cog-triton model
You can create a new Replicate Model via web or our API. To keep things simple, we'll use the latter method.
First, set a Replicate API token.
export REPLICATE_API_TOKEN=
curl -s -X POST -H "Authorization: Token $REPLICATE_API_TOKEN" \
-d '{"owner": "my-username", "name": "my-new-model", "visibility": "private", "hardware": "gpu-a40-large"}' \
https://api.replicate.com/v1/models
We'll call our model staging-gpt2-triton-trt-llm.
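For scripted setups, the same model-creation call can be sketched in Python with only the standard library. The endpoint, header, and JSON fields are taken from the curl command above; the helper names `build_create_model_request` and `send` are hypothetical, and the explicit `Content-Type` header is an addition that the curl command does not set.

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/models"


def build_create_model_request(
    token: str,
    owner: str,
    name: str,
    visibility: str = "private",
    hardware: str = "gpu-a40-large",
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a model."""
    body = json.dumps(
        {
            "owner": owner,
            "name": name,
            "visibility": visibility,
            "hardware": hardware,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            # Assumption: not present in the curl command, added for clarity.
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


request = build_create_model_request("<token>", "my-username", "my-new-model")
print(request.get_method(), request.full_url)
```

To perform the call, pass a request built with the value of `REPLICATE_API_TOKEN` to `send`.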
3. Instantiate a cog-triton model with your TRT-LLM engine
staging-gpt2-triton-trt-llm
yolo push \
--base r8.im/replicate-internal/cog-triton@sha256:5d784bf5f449a0578ceb903265bb756dae146a267fc075b4c77021babedc6637 \
--dest r8.im/replicate-internal/staging-gpt2-triton-trt-llm \
-e COG_WEIGHTS=https://replicate.delivery/pbxt/CUDp32x5hO6GMBWprN8o24vWOLZbnYm7AAoRTxLfe0CUfglkA/engine.tar
Run cog-triton locally
To run cog-triton locally, you must either pull the cog-triton Replicate image or build your own image.
Preparation to run cog-triton locally with Replicate image
Pull and tag the cog-triton image
Go here and pick the version you want to run locally. For our purposes, we'll set the version ID as an environment variable so that the code chunks below won't get stale.
export COG_TRITON_VERSION=
Then, click the version hash. Next, set your Replicate API token. You can do that manually, or navigate to the HTTP tab in your browser and copy the export command.
export REPLICATE_API_TOKEN=
Next, navigate to the Docker tab under Input. This displays a code chunk with a docker run command like:
docker run -d -p 5000:5000 --gpus=all r8.im/replicate-internal/cog-triton@sha256:2db2b5c2e199975fef07ed9045608ed7adc7796744041fa54d3ae9d13db6c3cf
We'll use the image reference to write a pull command:
docker pull r8.im/replicate-internal/cog-triton@sha256:${COG_TRITON_VERSION}
After the image has been pulled, you should tag it so that all the docker commands in this README will work. First, find the IMAGE ID for the image you just pulled, e.g. via docker images. Then run the command below after replacing `` with your image ID.
docker tag cog-triton:latest
Run an engine built with cog-trt-llm
Copy all model artifacts from cog-trt-llm/engine_outputs/ to triton_model_repo/tensorrt_llm/1/:
cp -r ../cog-trt-llm/engine_outputs/* triton_model_repo/tensorrt_llm/1/
Run the cog-triton image:
docker run --rm -it -p 5000:5000 --gpus=all --workdir /src --net=host --volume $(pwd)/.:/src/. --ulimit memlock=-1 --shm-size=20g cog-triton /bin/bash
python -m cog.server.http
Make a request:
curl -s -X POST \
-H "Content-Type: application/json" \
-d $'{
"input": {
"prompt": "What is machine learning?"
}
}' \
http://localhost:5000/predictions
Performance tests with test_perf.py
time python3 scripts/test_perf.py --target cog-triton --rate 8 --unit rps --duration 30 --n_input_tokens 100 --n_output_tokens 100
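As a rough sanity check on what this command asks of the server, the planned load can be worked out from the flags. The sketch below assumes that `--unit rps` means a constant rate in requests per second and that every request uses exactly the stated token counts; `planned_load` is a hypothetical helper, not part of test_perf.py.

```python
def planned_load(
    rate_rps: int,
    duration_s: int,
    n_input_tokens: int,
    n_output_tokens: int,
) -> dict:
    """Estimate total requests and token volume for a fixed-rate test."""
    n_requests = rate_rps * duration_s
    return {
        "requests": n_requests,
        "input_tokens": n_requests * n_input_tokens,
        "output_tokens": n_requests * n_output_tokens,
        # Sustained generation rate the server must reach to keep up.
        "output_tokens_per_s": rate_rps * n_output_tokens,
    }


print(planned_load(rate_rps=8, duration_s=30, n_input_tokens=100, n_output_tokens=100))
```

Under these assumptions, the command above sends 240 requests and needs roughly 800 generated tokens per second to avoid falling behind.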
Development
This repository builds 4 different images:
- cog-triton-builder, which builds TRT-LLM engines.
- cog-triton-runner-80, suitable for running engines built on, and for, NVIDIA A100s.
- cog-triton-runner-86, suitable for A40.
- cog-triton-runner-90, suitable for H100 and H200.
Here's a full GPU compatibility list.
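The mapping from GPU model to runner image can be sketched as a lookup. The image names and GPU pairings come from the list above; the helper `runner_image_for` is hypothetical, and matching on the GPU name string is an assumption about how one might automate the choice. The numeric suffixes appear to correspond to CUDA compute capabilities 8.0, 8.6, and 9.0.

```python
import re

# GPU model -> runner image, as listed in this README.
RUNNER_BY_GPU = {
    "A100": "cog-triton-runner-80",
    "A40": "cog-triton-runner-86",
    "H100": "cog-triton-runner-90",
    "H200": "cog-triton-runner-90",
}


def runner_image_for(gpu_name: str) -> str:
    """Return the runner image for a GPU name such as 'NVIDIA A100-SXM4-80GB'."""
    for model, image in RUNNER_BY_GPU.items():
        # Word boundaries keep 'A40' from matching names like 'A4000'.
        if re.search(rf"\b{model}\b", gpu_name, flags=re.IGNORECASE):
            return image
    raise ValueError(f"no runner image listed for GPU: {gpu_name!r}")


print(runner_image_for("NVIDIA A100-SXM4-80GB"))
```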
End-to-end build process
Cog-triton is pre-release and not stable. This build process currently requires nix to be installed (with the config setting experimental-features = nix-command flakes). We recommend the DeterminateSystems Nix installer, which will set this setting for you.
1. Install nix:
$ curl…