Fork · FriendliAI · published May 23, 2025

friendliai/TensorRT-LLM

forked from NVIDIA/TensorRT-LLM

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Language: C++

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 0

Created: 2025-05-23T09:13:35Z

Pushed: 2025-06-23T06:58:26Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/TensorRT-LLM

Archived: no

README:

TensorRT-LLM
===========================

A TensorRT Toolbox for Optimized Large Language Model Inference

[Architecture](./docs/source/torch/arch_overview.md) | [Performance](./docs/source/performance/perf-overview.md) | Examples | [Documentation](./docs/source/) | Roadmap

---

Tech Blogs

  • [06/19] Disaggregated Serving in TensorRT-LLM ✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
  • [06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) ✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
  • [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers ✨ [➡️ link](./docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md)
  • [05/23] DeepSeek R1 MTP Implementation and Optimization ✨ [➡️ link](./docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md)
  • [05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs ✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)

Latest News

  • [06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ ➡️ link
  • [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick ➡️ link
  • [04/10] TensorRT-LLM DeepSeek R1 performance benchmarking best practices now published. ✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)

  • [04/05] TensorRT-LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!

![L4_perf](./docs/source/media/l4_launch_perf.png)

  • [03/22] TensorRT-LLM is now fully open-source, with developments moved to GitHub!
  • [03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT-LLM ➡️ Link
  • [02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT-LLM ➡️ Link
  • [02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ Link
  • [02/20] Explore the complete guide to achieve great accuracy, high throughput, and low latency at the lowest cost for your business here.
  • [02/18] Unlock #LLM inference with auto-scaling on @AWS EKS ✨ ➡️ link
  • [02/12] 🦸⚡ Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling ➡️ link
  • [02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI ➡️ link

  • [01/25] Nvidia moves AI focus to inference cost, efficiency ➡️ link
  • [01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions ➡️ link
  • [01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI ➡️ link
  • [01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM ➡️ link
  • [01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM ➡️ link
  • [01/04] ⚡ Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding ➡️ link

Previous News

  • [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview ➡️ link
  • [2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x. We now support speculative decoding and tripling token throughput with our NVIDIA TensorRT-LLM. Perfect for your generative AI apps. ⚡ Learn how in this technical deep dive ➡️ link
  • [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try our NVIDIA Nsight Deep Learning Designer ⚡ A user-friendly GUI and tight integration with NVIDIA TensorRT that offers:

✅ Intuitive visualization of ONNX model graphs
✅ Quick tweaking of model architecture and parameters
✅ Detailed performance profiling with either ORT or TensorRT
✅ Easy building of TensorRT engines [➡️…

Notability

notability 1.0/10

Routine fork, minimal traction.