friendliai/TensorRT-LLM
forked from NVIDIA/TensorRT-LLM
Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
Language: C++
License: Apache-2.0
Stars: 1
Forks: 0
Open issues: 0
Created: 2025-05-23T09:13:35Z
Pushed: 2025-06-23T06:58:26Z
Default branch: main
Fork: yes
Parent repository: NVIDIA/TensorRT-LLM
Archived: no
README:
TensorRT-LLM
===========================
A TensorRT Toolbox for Optimized Large Language Model Inference
[Architecture](./docs/source/torch/arch_overview.md) | [Performance](./docs/source/performance/perf-overview.md) | Examples | [Documentation](./docs/source/) | Roadmap
---
Tech Blogs
- [06/19] Disaggregated Serving in TensorRT-LLM
✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
- [06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
- [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
✨ [➡️ link](./docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md)
- [05/23] DeepSeek R1 MTP Implementation and Optimization
✨ [➡️ link](./docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md)
- [05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)
Latest News
- [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ ➡️ link
- [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
✨ ➡️ link
- [04/10] TensorRT-LLM DeepSeek R1 performance benchmarking best practices now published.
✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
- [04/05] TensorRT-LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!

- [03/22] TensorRT-LLM is now fully open-source, with developments moved to GitHub!
- [03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT-LLM ➡️ Link
- [02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT-LLM ➡️ Link
- [02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ Link
- [02/20] Explore the complete guide to achieve great accuracy, high throughput, and low latency at the lowest cost for your business here.
- [02/18] Unlock #LLM inference with auto-scaling on @AWS EKS ✨ ➡️ link
- [02/12] 🦸⚡ Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
- [02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI
- [01/25] Nvidia moves AI focus to inference cost, efficiency ➡️ link
- [01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions ➡️ link
- [01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI ➡️ link
- [01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM ➡️ link
- [01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM ➡️ link
- [01/04] ⚡Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding
Previous News
- [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview
- [2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x. We now support speculative decoding and tripling token throughput with our NVIDIA TensorRT-LLM. Perfect for your generative AI apps. ⚡Learn how in this technical deep dive
- [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try our NVIDIA Nsight Deep Learning Designer ⚡ A user-friendly GUI and tight integration with NVIDIA TensorRT that offers:
✅ Intuitive visualization of ONNX model graphs
✅ Quick tweaking of model architecture and parameters
✅ Detailed performance profiling with either ORT or TensorRT
✅ Easy building of TensorRT engines [➡️…
Notability: 1.0/10. Routine fork, minimal traction.