What does this fork signal mean?

FriendliAI forked NVIDIA/TensorRT-LLM as friendliai/TensorRT-LLM. This fork signal points to upstream code the lab may be inspecting, patching, or building on. Key details: repo friendliai/TensorRT-LLM · parent NVIDIA/TensorRT-LLM · routine fork, minimal traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

FriendliAI Fork: friendliai/TensorRT-LLM

Captured source

GitHub · github.com/friendliai/TensorRT-LLM

friendliai/TensorRT-LLM repository metadata

Published May 23, 2025 · seen Jun 5 · captured Jun 11 · HTTP 200 · method: plain

friendliai/TensorRT-LLM

Description: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

Language: C++

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 0

Created: 2025-05-23T09:13:35Z

Pushed: 2025-06-23T06:58:26Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/TensorRT-LLM

Archived: no

README:

TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference

[Architecture](./docs/source/torch/arch_overview.md) | [Performance](./docs/source/performance/perf-overview.md) | Examples | [Documentation](./docs/source/) | Roadmap

---

Tech Blogs

[06/19] Disaggregated Serving in TensorRT-LLM

✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)

[06/05] Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)

✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)

[05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers

✨ [➡️ link](./docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md)

[05/23] DeepSeek R1 MTP Implementation and Optimization

✨ [➡️ link](./docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md)

[05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs

✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)

Latest News

[06/17] Join NVIDIA and DeepInfra for a developer meetup on June 26 ✨ ➡️ link

[05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick

✨ ➡️ link

[04/10] TensorRT-LLM DeepSeek R1 performance benchmarking best practices now published.

✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)

[04/05] TensorRT-LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!

![L4_perf](./docs/source/media/l4_launch_perf.png)

[03/22] TensorRT-LLM is now fully open-source, with development moved to GitHub!

[03/18] 🚀🚀 NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance with TensorRT-LLM ➡️ Link

[02/28] 🌟 NAVER Place Optimizes SLM-Based Vertical Services with TensorRT-LLM ➡️ Link

[02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ Link

[02/20] Explore the complete guide to achieving great accuracy, high throughput, and low latency at the lowest cost for your business here.

[02/18] Unlock LLM inference with auto-scaling on AWS EKS ✨ ➡️ link

[02/12] 🦸⚡ Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

➡️ link

[02/12] 🌟 How Scaling Laws Drive Smarter, More Powerful AI

➡️ link

[01/25] Nvidia moves AI focus to inference cost, efficiency ➡️ link

[01/24] 🏎️ Optimize AI Inference Performance with NVIDIA Full-Stack Solutions ➡️ link

[01/23] 🚀 Fast, Low-Cost Inference Offers Key to Profitable AI ➡️ link

[01/16] Introducing New KV Cache Reuse Optimizations in TensorRT-LLM ➡️ link

[01/14] 📣 Bing's Transition to LLM/SLM Models: Optimizing Search with TensorRT-LLM ➡️ link

[01/04] ⚡ Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding

➡️ link

Previous News

[2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview

➡️ link

[2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x. NVIDIA TensorRT-LLM now supports speculative decoding, tripling token throughput for your generative AI apps. ⚡ Learn how in this technical deep dive

➡️ link

[2024/12/02] Working on deploying ONNX models for performance-critical applications? Try NVIDIA Nsight Deep Learning Designer ⚡, a user-friendly GUI with tight NVIDIA TensorRT integration that offers:

✅ Intuitive visualization of ONNX model graphs
✅ Quick tweaking of model architecture and parameters
✅ Detailed performance profiling with either ORT or TensorRT
✅ Easy building of TensorRT engines [➡️...

Notability

Score: 1.0/10

Routine fork, minimal traction.