nvidia/Harmonizer
Harmonizer | Model Card
**Paper** | **Project Page** | **Code** | **Model** | **Data**
Description
Harmonizer is a single-step image diffusion model trained as an online generative enhancer for neural-reconstruction image and video renderings. It transforms imperfect novel-view renderings produced by Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) reconstructions into temporally consistent outputs that are closer to real captures, while correcting illumination, shadow, and reconstruction-artifact issues that arise when dynamic objects are composited into reconstructed scenes.
Harmonizer supports two operation modes:
- Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, then distill them back into 3D. This enhances underconstrained regions and improves overall 3D representation quality.
- Online mode: Acts as a single-step neural enhancer during simulation or inference. It harmonizes color and lighting, reconstructs missing or inconsistent shadows for inserted dynamic objects, and removes residual reconstruction artifacts that stem from imperfect 3D supervision and the limited capacity of current reconstruction models.
Harmonizer is designed as a single model compatible with both NeRF and 3DGS representations. The model was trained on data curated with 3DGUT-based reconstructions and is adaptable to Gaussian Splatting scenes.
License/Terms of Use
Governing Terms
Use of this model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography: Global
Release Management
The model artifacts are released in this repository. Training and inference code is available from the Harmonizer GitHub repository. The associated dataset is available from nvidia/Harmonizer-Dataset.
Use Case
Harmonizer is intended for Physical AI developers looking to enhance and harmonize neural-reconstruction pipelines for autonomous-vehicle simulation. The model takes an image or image sequence as input and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.
Benchmark Results
Benchmarks were evaluated on 864 images from NDAS MLMCF and ParkNet training sessions. PSNR is higher-is-better; LPIPS and FID are lower-is-better.
| Model | PSNR | LPIPS | FID |
| :---- | ----: | ----: | ----: |
| Difix3D+ | 28.33 | 0.16 | 54.20 |
| Fixer: cosmos_3dgut | 30.99 | 0.16 | 41.87 |
| Harmonizer: non-temporal mode (fastest runtime; `--enable-harmonizer` in NuRec gRPC). Inference enabled through the following checkpoints: `harmonizer_nontemporal.pt`, or `diffusion_harmonizer.pkl` with the `--nontemporal` flag | 30.48 | 0.16 | 32.05 |
| Harmonizer: temporal mode (highest quality output). Inference enabled through the following checkpoint: `diffusion_harmonizer.pkl` | 31.06 | 0.15 | 27.40 |
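For readers unfamiliar with the PSNR column, the following is a minimal sketch of how the metric is commonly computed. The model card does not describe its evaluation protocol, so the 8-bit pixel range (`max_value=255.0`) and the flattened-list input are assumptions, and the function name `psnr` is illustrative rather than part of the Harmonizer code.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumes flattened pixel values in [0, max_value]; higher is better.
    """
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic, the gap between 28.33 dB and 31.06 dB in the table corresponds to roughly a halving of mean squared error under this definition.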
Release Date
V1: June 2026
Reference(s)
- DiffusionHarmonizer paper
- DiffusionHarmonizer project page
- Harmonizer training and inference code
- Harmonizer dataset
Model Architecture
Architecture Type: Diffusion Transformer
Network Architecture: Diffusion Transformer, based on Cosmos Predict2 0.6B, post-trained as a single-step, temporally conditioned image-to-image enhancer for neural-reconstruction renderings.
The project page describes the backbone as the CosmosPredict2 0.6B text-to-image model fine-tuned on real-world and simulation training pairs from scalable data-curation pipelines for color and lighting harmonization, shadow correction, and artifact correction.
Model Input
Input Type(s): Image / Image sequence
Input Format: Red, Green, Blue (RGB)
Input Parameters: Two-Dimensional (2D)
Other Properties Related to Input: Specific resolution: 576 px x 1024 px
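Since the model expects 576 x 1024 inputs, renderings at other sizes need to be brought to that resolution first. The card does not say how, so the sketch below shows one common option (scale to cover, then center-crop) purely as an assumption; `fit_to_target` is a hypothetical helper, not part of the Harmonizer code.

```python
TARGET_H, TARGET_W = 576, 1024  # resolution stated in the model card


def fit_to_target(height, width, target_h=TARGET_H, target_w=TARGET_W):
    """Return (scale, resized_h, resized_w, top, left) for scale-to-cover + center crop.

    The frame is scaled so it fully covers the target, then the crop box
    starting at (top, left) with size (target_h, target_w) is taken.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Use the larger ratio so both dimensions reach at least the target size.
    scale = max(target_h / height, target_w / width)
    resized_h = max(target_h, round(height * scale))
    resized_w = max(target_w, round(width * scale))
    top = (resized_h - target_h) // 2
    left = (resized_w - target_w) // 2
    return scale, resized_h, resized_w, top, left
```

A 1080 x 1920 frame already has the target aspect ratio, so it only needs resizing; a square frame would additionally lose rows above and below the crop.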
Model Output
Output Type(s): Image
Output Format: Red, Green, Blue (RGB)
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: Specific resolution: 576 px x 1024 px
Software Integration
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Blackwell
Preferred/Supported Operating System(s): Linux
NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and software frameworks such as CUDA libraries, the model can achieve faster training and inference times compared to CPU-only systems.
Model Version
Harmonizer-cosmos-0.6B
We release two checkpoints, described below.
1. `diffusion_harmonizer.pkl`: The temporally conditioned Harmonizer checkpoint reported in the DiffusionHarmonizer paper. Recommended when temporal coherence across consecutive rendered frames is required (e.g., video-style novel-view simulation). The model also supports a faster, non-temporally conditioned inference mode via the `--nontemporal` flag.
Inference speed on H100:
- full model (default): 212 ms / 576 × 1024 px image
- `--nontemporal` mode: 28 ms / 576 × 1024 px image
2. `harmonizer_nontemporal.pt`: Exported JIT model for non-temporal, per-image inference. This checkpoint does not support conditioning on previous frames and corresponds to `diffusion_harmonizer.pkl` run with the `--nontemporal` flag. Recommended for per-image enhancement use cases where neighboring-frame context is unavailable or unnecessary, or where speed is critical.
Inference speed on H100: 28 ms / 576 × 1024 px image.
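To put the two latency figures in perspective, the sketch below converts them to approximate frame rates and picks a mode for a given per-frame time budget. It assumes frames are processed one at a time on an H100 with no batching or pipelining; `LATENCY_MS`, `frames_per_second`, and `pick_mode` are illustrative names, not part of the Harmonizer code.

```python
# Per-image latencies on H100 as reported in the model card (576 x 1024 px).
LATENCY_MS = {"temporal": 212.0, "nontemporal": 28.0}


def frames_per_second(latency_ms):
    """Approximate sequential throughput for a given per-frame latency."""
    return 1000.0 / latency_ms


def pick_mode(frame_budget_ms):
    """Prefer the higher-quality temporal mode when it fits the budget.

    Returns "temporal", "nontemporal", or None if neither mode fits.
    """
    for mode in ("temporal", "nontemporal"):
        if LATENCY_MS[mode] <= frame_budget_ms:
            return mode
    return None
```

Under these assumptions the temporal mode runs at about 4.7 frames per second and the non-temporal mode at about 35.7, so only the latter fits a 30 Hz simulation loop (a budget of roughly 33 ms per frame).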
Pretrained checkpoints are hosted on Hugging Face under nvidia/Harmonizer. To download all released checkpoints into a local models/ directory:
hf download nvidia/Harmonizer --local-dir models
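The same download can likely be done from Python with the `huggingface_hub` package, which provides `snapshot_download`. The sketch below is an assumption-laden equivalent of the command above: the helper name `download_checkpoints` is hypothetical, and because the repository layout is not described here, the checkpoint files are located by searching the downloaded directory rather than by a fixed path.

```python
from pathlib import Path

REPO_ID = "nvidia/Harmonizer"
# Checkpoint file names as listed in the model card.
CHECKPOINTS = ("diffusion_harmonizer.pkl", "harmonizer_nontemporal.pt")


def download_checkpoints(local_dir="models"):
    """Download the repository into local_dir and locate the released checkpoints.

    Returns a dict mapping each checkpoint name to its Path, or None if the
    file was not found in the downloaded snapshot.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
    # The location inside the repository is an assumption, so search recursively.
    return {name: next(root.rglob(name), None) for name in CHECKPOINTS}


# Example (requires network access and the huggingface_hub package):
# paths = download_checkpoints("models")
```

Accepting the model license on Hugging Face or authenticating first may be required; the card does not say either way.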
Refer to the [code…