What does this repo signal mean?

NVIDIA published NVIDIA/Model-Optimizer, a Python repository. A repo signal like this one surfaces tooling, eval, infrastructure, or model-adjacent work, sometimes before it appears in a launch post. Key details: repo NVIDIA/Model-Optimizer · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals, and maps it to two categories in the data-business radar: Infrastructure, and Product and customer.

NVIDIA Repo: NVIDIA/Model-Optimizer

Captured source

GitHub · github.com/NVIDIA/Model-Optimizer

NVIDIA/Model-Optimizer repository metadata

published Apr 23, 2024 · seen 5d · captured 8h · http 200 · method plain

NVIDIA/Model-Optimizer

Description: A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, etc. to optimize inference speed.

Language: Python

License: Apache-2.0

Stars: 2902

Forks: 433

Open issues: 235

Created: 2024-04-23T19:00:54Z

Pushed: 2026-06-11T03:53:59Z

Default branch: main

Fork: no

Archived: no

README:

NVIDIA Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization [techniques](#techniques) including quantization, pruning, Neural Architecture Search (NAS), distillation, speculative decoding and sparsity to accelerate models.

[Input] Model Optimizer currently supports inputs of a Hugging Face, PyTorch or ONNX model.

[Optimize] Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint. Model Optimizer is also integrated with NVIDIA Megatron-Bridge, Megatron-LM and Hugging Face Accelerate for inference optimization techniques that require training.

[Export for deployment] Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like SGLang, TensorRT-LLM, TensorRT, or vLLM. The unified Hugging Face export API now supports both transformers and diffusers models.
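To make the quantization step mentioned above concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not ModelOpt's API and does not reproduce its FP8 or NVFP4 formats: the int8 range is a simplifying assumption, and the function names `quantize_int8` and `dequantize_int8` are hypothetical, chosen only for illustration.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude onto 127; use 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


# Illustrative weights only; not taken from any real model.
weights = [0.02, -1.27, 0.5, 0.731, -0.004]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-value error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The stored model shrinks because each weight becomes a small integer plus one shared scale, at the cost of a bounded rounding error. Production libraries typically add calibration data, per-channel or per-block scales, and hardware-specific number formats on top of this basic idea.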

Latest News

[2026/05/27] [End-to-end optimization tutorial for Nemotron-3-Nano-30B-A3B](./examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16): Pruning + distillation (with long context extension) + FP8 quantization achieving 2.6× vLLM throughput and 2.6× memory reduction.
[2026/05/13] [Puzzletron](./examples/puzzletron): A new algorithm for heterogeneous pruning & NAS of LLM and VLM models.
[2026/04/15] Customer story: Domyn compresses Colosseum-355B → 260B using ModelOpt's Minitron pruning + distillation
[2026/03/17] Customer story: Bielik.AI builds Bielik Minitron 7B (33% smaller, 50% faster, 90% quality retained) using ModelOpt's Minitron pruning + distillation
[2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: FP8, NVFP4. Learn more in the Nemotron 3 Super release blog. Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md)
[2026/03/11] NeMo Megatron Bridge now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the Quantization (PTQ and QAT) guide for FP8/NVFP4 quantization and HF export instructions.
[2025/12/11] BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference
[2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
[2025/10/07] BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer
[2025/09/17] BLOG: An Introduction to Speculative Decoding for Reducing Latency in AI Inference
[2025/09/11] BLOG: How Quantization Aware Training Enables Low-Precision Accuracy Recovery
[2025/08/29] BLOG: Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training
[2025/08/01] BLOG: Optimizing LLMs for Performance and Accuracy with Post-Training Quantization
[2025/06/24] BLOG: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
[2025/05/14] NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs
[2025/04/21] Adobe optimized deployment using Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership
[2025/04/05] NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick. Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
[2025/03/18] World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell
[2025/02/25]…

Excerpt shown — open the source for the full document.

Notability

Scored, but no written rationale attached yet.

This NVIDIA repo signal matches two radar categories: Infrastructure, and Product and customer.

Infrastructure · Product and customer