What does this fork signal mean?

Together AI forked NVIDIA/TensorRT-LLM as togethercomputer/TensorRT-LLM. A fork signal like this points to upstream code the lab may be inspecting, patching, or building on. Key details: repo togethercomputer/TensorRT-LLM · parent NVIDIA/TensorRT-LLM. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Together AI Fork: togethercomputer/TensorRT-LLM

Captured source

GitHub: github.com/togethercomputer/TensorRT-LLM

togethercomputer/TensorRT-LLM repository metadata

Published Apr 3, 2024 · seen 5d · captured 9h · HTTP 200 · method plain

togethercomputer/TensorRT-LLM

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 3

Created: 2024-04-03T23:17:18Z

Pushed: 2024-07-25T11:47:58Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/TensorRT-LLM

Archived: no
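The Created and Pushed timestamps above bound the period in which the fork is known to have received commits. As a minimal sketch using only Python's standard library, the gap between them can be computed as follows. The helper name `parse_github_timestamp` is hypothetical and chosen for illustration; the two timestamp strings are copied from the metadata above.

```python
from datetime import datetime, timezone


def parse_github_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp in the form used above, e.g. '2024-04-03T23:17:18Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the captured repository metadata.
created = parse_github_timestamp("2024-04-03T23:17:18Z")
pushed = parse_github_timestamp("2024-07-25T11:47:58Z")

# Difference between fork creation and the most recent push.
active_span = pushed - created

print(f"Fork created: {created.date()}, last push: {pushed.date()}")
print(f"Whole days between creation and last push: {active_span.days}")
```

A push well after the creation date suggests the fork was updated after it was made, though the metadata alone does not say whether those pushes carry original changes or only sync with upstream.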

README:

TensorRT-LLM
===========================
A TensorRT Toolbox for Optimized Large Language Model Inference

[Architecture](./docs/source/architecture.md) | [Results](./docs/source/performance.md) | [Examples](./examples/) | [Documentation](./docs/source/)

---

Latest News

[*Weekly*] Check out [@NVIDIAAIDev](https://twitter.com/nvidiaaidev?lang=en) & [NVIDIA AI](https://www.linkedin.com/showcase/nvidia-ai/) LinkedIn for the latest updates!
[2024/02/06] [🚀 Speed up inference with SOTA quantization techniques in TRT-LLM](./docs/source/blogs/quantization-in-TRT-LLM.md)
[2024/01/30] [New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](./docs/source/blogs/XQA-kernel.md)
[2023/12/04] [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](./docs/source/blogs/Falcon180B-H200.md)
[2023/11/27] SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version
[2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
[2023/10/22] 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙
[2023/10/19] Getting Started Guide - [Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available](https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/)
[2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)

[TensorRT-LLM](#tensorrt-llm)
[Latest News](#latest-news)
[Table of Contents](#table-of-contents)
[TensorRT-LLM Overview](#tensorrt-llm-overview)
[Installation](#installation)
[Quick Start](#quick-start)
[Support Matrix](#support-matrix)
[Devices](#devices)
[Precision](#precision)
[Key Features](#key-features)
[Models](#models)
[Performance](#performance)
[Advanced Topics](#advanced-topics)
[Quantization](#quantization)
[In-flight Batching](#in-flight-batching)
[Attention](#attention)
[Graph Rewriting](#graph-rewriting)
[Benchmark](#benchmark)
[Troubleshooting](#troubleshooting)
[Release notes](#release-notes)
[Change Log](#change-log)
[Versions 0.8.0](#versions-080)
[For history change log, please see CHANGELOG.md.](#for-history-change-log-please-see-changelogmd)
[Known Issues](#known-issues)
[Report Issues](#report-issues)

TensorRT-LLM Overview

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
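To make the multi-GPU configurations more concrete, the sketch below shows one common way to assign process ranks when Tensor Parallelism and Pipeline Parallelism are combined: the total number of ranks is the product of the two sizes, and consecutive ranks share a pipeline stage. This is an illustrative layout written in plain Python, not TensorRT-LLM's API. The names `RankPlacement` and `place_ranks` are hypothetical, and the ordering of ranks is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankPlacement:
    """Where one process sits in a combined TP x PP layout (hypothetical type)."""
    rank: int      # global process index
    tp_rank: int   # position inside its tensor-parallel group
    pp_rank: int   # pipeline stage the process belongs to


def place_ranks(tp_size: int, pp_size: int) -> list[RankPlacement]:
    """Assign every global rank a tensor-parallel and pipeline-parallel index.

    Assumed layout: ranks 0..tp_size-1 form stage 0, the next tp_size
    ranks form stage 1, and so on.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must both be at least 1")
    world_size = tp_size * pp_size
    return [
        RankPlacement(rank=r, tp_rank=r % tp_size, pp_rank=r // tp_size)
        for r in range(world_size)
    ]


# Example: 8 GPUs split into 2 pipeline stages of 4 tensor-parallel ranks each.
for placement in place_ranks(tp_size=4, pp_size=2):
    print(placement)
```

With `tp_size=1` and `pp_size=1` the same function describes the single-GPU case, which is why the two sizes are usually treated as independent knobs.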

The Python API of TensorRT-LLM is designed to look similar to the PyTorch API. It provides users with a [functional](./tensorrt_llm/functional.py) module containing functions like einsum, softmax, matmul or view. The [layers](./tensorrt_llm/layers) module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP or the entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the [models](./tensorrt_llm/models) module.

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. See below for a list of supported [models](#Models).

To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see [examples/gpt](./examples/gpt) for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
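As a rough illustration of what "INT8 weight-only" means, the sketch below quantizes a small list of weights to 8-bit integers with a single scale factor and then dequantizes them back to floating point. This is a generic symmetric per-tensor scheme in plain Python, shown only to convey the idea. It is not TensorRT-LLM's implementation, the function names are hypothetical, and production schemes typically use finer-grained scales. SmoothQuant, which also quantizes activations, is not covered here.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights before they meet the activations."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Storing each weight in 8 bits instead of 16 halves the weight memory, and 4 bits halves it again, which is the memory-footprint reduction the paragraph above refers to.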

For a more detailed presentation of the software architecture and the key concepts used in TensorRT-LLM, we recommend reading the following [document](./docs/source/architecture.md).

Installation

After installing the NVIDIA Container Toolkit, x86_64 users can install TensorRT-LLM by running the following commands.

```bash
# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

# Check installation…
```
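The excerpt is cut off before the installation check, so the original command is not reproduced here. Independently of it, a generic way to confirm that the interpreter matches the stated Python 3.10 requirement and that a package is visible to it can be sketched with the standard library. The function names are hypothetical, and the import name `tensorrt_llm` is assumed to match the pip package name used above.

```python
import importlib.util
import sys


def python_matches(required: tuple[int, int] = (3, 10)) -> bool:
    """Return True when the running interpreter is the required major.minor."""
    return sys.version_info[:2] == required


def package_is_installed(name: str) -> bool:
    """Return True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: the import name is the same as the pip package name.
    module_name = "tensorrt_llm"
    print(f"Python 3.10 interpreter: {python_matches()}")
    print(f"{module_name} found: {package_is_installed(module_name)}")
```

Locating the module does not prove that it loads correctly with the local GPU driver and CUDA libraries; it only confirms that pip placed the package where this interpreter looks for it.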

Excerpt shown — open the source for the full document.
