Fork by Baseten, published Mar 11, 2025

basetenlabs/TensorRT-Model-Optimizer

forked from NVIDIA/Model-Optimizer


Captured source


Description: A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Language: Python

License: Apache-2.0

Stars: 1

Forks: 0

Open issues: 4

Created: 2025-03-11T16:30:08Z

Pushed: 2025-12-15T22:23:42Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/Model-Optimizer

Archived: no

README:


NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization, distillation, pruning, speculative decoding, and sparsity, to accelerate models.
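To make the quantization technique listed above concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 "fake" quantization with max calibration. It illustrates the general technique only: the helper names `calibrate_amax` and `fake_quantize_int8` are hypothetical, and ModelOpt's own quantization APIs, which operate on PyTorch modules with calibration data, are not reproduced here.

```python
# Hypothetical illustration of symmetric per-tensor INT8 fake quantization.
# Not ModelOpt's implementation; names are for illustration only.

def calibrate_amax(values):
    """Max calibration: the scale is derived from the largest absolute value seen."""
    return max(abs(v) for v in values)

def fake_quantize_int8(values, amax):
    """Quantize to signed INT8 in [-127, 127], then dequantize back to float."""
    if amax == 0:
        return [0.0 for _ in values]
    scale = amax / 127.0
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-127, min(127, q))  # clamp to the symmetric INT8 range
        out.append(q * scale)
    return out

weights = [0.02, -1.27, 0.634, 0.5]
amax = calibrate_amax(weights)
deq = fake_quantize_int8(weights, amax)
# Without clamping, the per-element rounding error is at most scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, deq))
print(amax, deq, max_err)
```

Fake quantization keeps values in floating point while restricting them to the 255 levels an INT8 tensor can represent, which is how the accuracy impact of quantization is commonly simulated before deployment.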

[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.

[Optimize] Model Optimizer provides Python APIs that let users easily compose the above model optimization techniques and export an optimized, quantized checkpoint. Model Optimizer is also integrated with NVIDIA NeMo, Megatron-LM, and Hugging Face Accelerate to support inference optimization techniques that require training.
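The idea of composing several optimization techniques into one pipeline can be sketched in plain Python. The example below chains 2:4 magnitude sparsity with symmetric INT8 fake quantization on a single weight vector. All function names (`sparsify_2_4`, `quantize_int8_step`, `compose`) are hypothetical and are not ModelOpt APIs; the real library operates on PyTorch modules rather than Python lists.

```python
# Hypothetical sketch: composing two optimization steps on one weight vector.
# Names are illustrative only and do not correspond to ModelOpt APIs.

def sparsify_2_4(values):
    """Keep the 2 largest-magnitude entries in every group of 4; zero the rest."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(group[j] if j in keep else 0.0 for j in range(4))
    return out

def quantize_int8_step(values):
    """Symmetric per-tensor INT8 fake quantization with max calibration."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return [0.0 for _ in values]
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def compose(*steps):
    """Chain optimization steps left to right into one callable."""
    def run(values):
        for step in steps:
            values = step(values)
        return values
    return run

pipeline = compose(sparsify_2_4, quantize_int8_step)
w = [0.9, -0.1, 0.05, -0.8, 0.3, 0.2, -0.6, 0.01]
w_opt = pipeline(w)
print(w_opt)
```

In this sketch sparsity runs first so that the quantization scale is calibrated only on the weights that survive pruning; pruned entries stay exactly zero after quantization.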

[Export for deployment] Because Model Optimizer is integrated with the NVIDIA AI software ecosystem, the quantized checkpoints it generates are ready for deployment in downstream inference frameworks such as SGLang, TensorRT-LLM, TensorRT, or vLLM.



Notability

notability 1.0/10

Fork with 1 star, trivial traction