Repo · NVIDIA · published Apr 23, 2024

NVIDIA/Model-Optimizer

Description: A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.

Language: Python

License: Apache-2.0

Stars: 2902

Forks: 433

Open issues: 235

Created: 2024-04-23T19:00:54Z

Pushed: 2026-06-11T03:53:59Z

Default branch: main

Fork: no

Archived: no

README:

______________________________________________________________________

NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, pruning, Neural Architecture Search (NAS), distillation, speculative decoding and sparsity to accelerate models.
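To make the first technique in that list concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It only illustrates the general idea (map floats to small integers plus one scale factor) and is not Model Optimizer's implementation; the function names `quantize_int8` and `dequantize_int8` are hypothetical and chosen for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative only).

    Returns the integer codes and the single scale shared by all values.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-value error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Small values such as `0.003` collapse to zero here, which is why practical quantization workflows calibrate scales on representative data and often use finer-grained (per-channel or per-block) scales.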

[Input] Model Optimizer currently accepts Hugging Face, PyTorch, or ONNX models as input.

[Optimize] Model Optimizer provides Python APIs that let users compose the above model optimization techniques and export an optimized, quantized checkpoint. Model Optimizer is also integrated with NVIDIA Megatron-Bridge, Megatron-LM, and Hugging Face Accelerate for inference optimization techniques that require training.

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM. The unified Hugging Face export API now supports both transformers and diffusers models.
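The Input, Optimize, and Export steps above can be sketched as a single helper. This is a hedged sketch, not the project's reference workflow: it assumes the `nvidia-modelopt` and `torch` packages are installed, that `mtq.quantize`, `mtq.FP8_DEFAULT_CFG`, and `export_hf_checkpoint` behave as described in the ModelOpt documentation for your installed version, and that calibration batches are dicts of keyword arguments for the model. The helper name `quantize_and_export` and its parameters are hypothetical.

```python
def quantize_and_export(model, calib_batches, export_dir):
    """Post-training quantize `model` and export a Hugging Face-style checkpoint.

    Assumptions (check against your installed ModelOpt version):
      - `model` is a Hugging Face / PyTorch model already on the target device.
      - `calib_batches` is an iterable of dicts usable as `model(**batch)`.
    """
    # Imported lazily so this module can be loaded without the GPU stack.
    import torch
    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    def forward_loop(m):
        # Run calibration data through the model so quantizer ranges
        # can be collected before weights are quantized.
        for batch in calib_batches:
            m(**batch)

    # FP8 is one of the predefined configs; other formats use other configs.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Write the quantized checkpoint for a downstream inference framework.
    with torch.inference_mode():
        export_hf_checkpoint(model, export_dir=export_dir)
    return model
```

Whether the exported directory loads directly in SGLang, TensorRT-LLM, or vLLM depends on the quantization format and framework version, so treat the export step as the starting point for a deployment check rather than a guarantee.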
