Repo · NVIDIA · published Aug 16, 2023 · seen 5d

NVIDIA/TensorRT-LLM

Python

Description: TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Language: Python

License: NOASSERTION

Stars: 13850

Forks: 2462

Open issues: 1368

Created: 2023-08-16T17:14:27Z

Pushed: 2026-06-11T03:24:37Z

Default branch: main

Fork: no

Archived: no

README:

TensorRT LLM
===========================

TensorRT LLM optimizes inference for LLMs and Visual Gen models with specialized kernels for common operations, an efficient runtime, and a Pythonic framework that enables you to customize and extend the system.
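As a rough illustration of the Python API mentioned above, the following is a minimal sketch of offline generation, not an excerpt from this README. It assumes the `tensorrt_llm` package is installed on a machine with a supported NVIDIA GPU, and that the `LLM` and `SamplingParams` classes behave as in the project's quick-start documentation. The model name, the sampling values, and the helpers `generate` and `format_results` are illustrative assumptions.

```python
# Hypothetical prompts used only for illustration.
PROMPTS = ["Hello, my name is", "The capital of France is"]


def generate(prompts, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Run offline generation and return (prompt, generated_text) pairs.

    The model name is an assumption; any checkpoint supported by the
    installed TensorRT LLM version could be substituted.
    """
    # Imported lazily so this module can be loaded on machines that do not
    # have tensorrt_llm or an NVIDIA GPU available.
    from tensorrt_llm import LLM, SamplingParams

    # Sampling values are arbitrary examples, not recommended settings.
    sampling = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model=model)
    outputs = llm.generate(prompts, sampling)
    return [(out.prompt, out.outputs[0].text) for out in outputs]


def format_results(pairs):
    """Format (prompt, text) pairs as printable lines."""
    return [f"{prompt!r} -> {text!r}" for prompt, text in pairs]
```

On a suitably configured machine, `format_results(generate(PROMPTS))` would yield one line per prompt. The lazy import keeps the formatting helper usable without the GPU stack installed.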

[Ask DeepWiki](https://deepwiki.com/NVIDIA/TensorRT-LLM)

Architecture | Performance | Examples | Documentation | Roadmap

---

Tech Blogs

  • [05/15] Joint Optimization of Agent Applications and TensorRT-LLM ➡️ link
  • [04/03] Tuning CUDA Graph Batch Sizes for Higher Output Throughput ➡️ link
  • [04/03] DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72 ➡️ link
  • [03/16] Optimizing MoE Communication with One-Sided AlltoAll Over NVLink ➡️ link
  • [03/04] Sparse Attention in TensorRT LLM ➡️ link
  • [02/06] Accelerating Long-Context Inference with Skip Softmax Attention ➡️ link
  • [01/09] Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs ➡️ link

Previous Blogs

  • [10/13] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) ➡️ link
  • [09/26] Inference Time Compute Implementation in TensorRT LLM ➡️ link
  • [09/19] Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly ➡️ link
  • [08/29] ADP Balance Strategy ➡️ link
  • [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT LLM ➡️ link
  • [08/01] Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization) ➡️ link
  • [07/26] N-Gram Speculative Decoding in TensorRT LLM ➡️ link
  • [06/19] Disaggregated Serving in TensorRT LLM ➡️ link
  • [06/05] Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP) ➡️ link
  • [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers ➡️ link
  • [05/23] DeepSeek R1 MTP Implementation and Optimization ➡️ link
  • [05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs ➡️ link

Latest News

  • [04/03] 🎨 TensorRT LLM now supports diffusion models for visual generation ➡️ link

Previous News

  • [08/05] 🌟 TensorRT LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B ➡️ link and GPT-OSS-20B ➡️ link
  • [07/15] 🌟 TensorRT LLM delivers Day-0 support for LG AI Research's latest model, EXAONE 4.0 ➡️ link
  • [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick ➡️ link
  • [04/10] TensorRT LLM DeepSeek R1 performance benchmarking best practices now published. ➡️ link
  • [04/05] TensorRT LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!…
