Model · NVIDIA · published May 19, 2026

nvidia/Harmonizer

task: image-to-image · license: other · library: pytorch · downloads: 171 · likes: 13

Harmonizer | Model Card

**Paper** | **Project Page** | **Code** | **Model** | **Data**

Description

Harmonizer is a single-step image diffusion model trained as an online generative enhancer for neural-reconstruction image and video renderings. It transforms imperfect novel-view renderings produced by Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) reconstructions into temporally consistent outputs that are closer to real captures, while correcting illumination, shadow, and reconstruction-artifact issues that arise when dynamic objects are composited into reconstructed scenes.

Harmonizer supports two operation modes:

  • Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, which are then distilled back into the 3D representation. This enhances underconstrained regions and improves overall 3D representation quality.
  • Online mode: Acts as a single-step neural enhancer during simulation or inference. It harmonizes color and lighting, reconstructs missing or inconsistent shadows for inserted dynamic objects, and removes residual reconstruction artifacts caused by imperfect 3D supervision and the capacity limits of current reconstruction models.

Harmonizer is designed as a single model compatible with both NeRF and 3DGS representations. The model was trained on data curated with 3DGUT-based reconstructions and is adaptable to Gaussian Splatting scenes.

License/Terms of Use

Governing Terms

Use of this model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography: Global

Release Management

The model artifacts are released in this repository. Training and inference code is available from the Harmonizer GitHub repository. The associated dataset is available from nvidia/Harmonizer-Dataset.

Use Case

Harmonizer is intended for Physical AI developers looking to enhance and harmonize neural-reconstruction pipelines for autonomous-vehicle simulation. The model takes an image or image sequence as input and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.

Benchmark Results

Benchmarks were evaluated on 864 images from NDAS MLMCF and ParkNet training sessions. PSNR is higher-is-better; LPIPS and FID are lower-is-better.

| Model | PSNR | LPIPS | FID |
| :---- | ----: | ----: | ----: |
| Difix3D+ | 28.33 | 0.16 | 54.20 |
| Fixer: cosmos_3dgut | 30.99 | 0.16 | 41.87 |
| Harmonizer: non-temporal mode (fastest runtime; `--enable-harmonizer` in NuRec gRPC). Inference enabled through the following checkpoints: `harmonizer_nontemporal.pt`, `diffusion_harmonizer.pkl` with `--nontemporal` flag | 30.48 | 0.16 | 32.05 |
| Harmonizer: temporal mode (highest quality output). Inference enabled through the following checkpoint: `diffusion_harmonizer.pkl` | 31.06 | 0.15 | 27.40 |
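For readers unfamiliar with the PSNR column, the following is a minimal sketch of how the metric is commonly computed from the mean squared error between a reference image and an enhanced image. It is not the evaluation code used for the table above; the model card does not specify details such as color space or how per-image scores are averaged. The function name `psnr` and the toy pixel values are illustrative assumptions.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Both inputs are flat sequences of pixel intensities; max_value is the
    largest possible intensity (255 for 8-bit images).
    """
    if len(reference) != len(enhanced):
        raise ValueError("images must have the same number of samples")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all samples.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy values for illustration only, not data from the benchmark.
reference_pixels = [10, 20, 30, 40]
enhanced_pixels = [12, 18, 33, 40]
print(f"PSNR: {psnr(reference_pixels, enhanced_pixels):.2f} dB")
```

Because PSNR is a logarithm of the inverse error, a smaller pixel-wise error yields a larger value, which is why the table treats it as higher-is-better.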

Release Date

V1: June 2026

Reference(s)

Model Architecture

Architecture Type: Diffusion Transformer

Network Architecture: Diffusion Transformer, based on Cosmos Predict2 0.6B, post-trained as a single-step, temporally conditioned image-to-image enhancer for neural-reconstruction renderings.

The project page describes the backbone as the CosmosPredict2 0.6B text-to-image model, fine-tuned on real-world and simulation training pairs produced by scalable data-curation pipelines for color and lighting harmonization, shadow correction, and artifact correction.

Model Input

Input Type(s): Image / Image sequence

Input Format: Red, Green, Blue (RGB)

Input Parameters: Two-Dimensional (2D)

Other Properties Related to Input: Specific resolution: 576 px x 1024 px
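Renderings that are not already 576 × 1024 px need to be brought to that size before inference. The sketch below computes one possible plan: scale the image so it covers the target, then center-crop. This is an assumed preprocessing strategy, not something the model card prescribes, and the Harmonizer inference code may handle resizing differently. The helper name `resize_and_center_crop_plan` is hypothetical.

```python
TARGET_H, TARGET_W = 576, 1024  # resolution stated in the model card


def resize_and_center_crop_plan(src_h, src_w, target_h=TARGET_H, target_w=TARGET_W):
    """Return (resized_h, resized_w, top, left) for a cover-resize plus center crop.

    The image is scaled uniformly until it covers the target in both
    dimensions; (top, left) is the offset of the centered crop window.
    """
    if src_h <= 0 or src_w <= 0:
        raise ValueError("source dimensions must be positive")
    # Use the larger scale factor so that neither dimension falls short.
    scale = max(target_h / src_h, target_w / src_w)
    resized_h = max(target_h, round(src_h * scale))
    resized_w = max(target_w, round(src_w * scale))
    top = (resized_h - target_h) // 2
    left = (resized_w - target_w) // 2
    return resized_h, resized_w, top, left


# A 1080 x 1920 frame has the same 9:16 aspect ratio, so no cropping is needed.
print(resize_and_center_crop_plan(1080, 1920))
# A 1280 x 1920 frame is taller, so rows are cropped from the top and bottom.
print(resize_and_center_crop_plan(1280, 1920))
```

The returned values can be passed to any image library's resize and crop functions.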

Model Output

Output Type(s): Image

Output Format: Red, Green, Blue (RGB)

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: Specific resolution: 576 px x 1024 px

Software Integration

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
  • NVIDIA Blackwell

Preferred/Supported Operating System(s): Linux

NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and software frameworks such as CUDA libraries, the model can achieve faster training and inference times compared to CPU-only systems.

Model Version

Harmonizer-cosmos-0.6B

We release two checkpoints, described below.

1. `diffusion_harmonizer.pkl`: The temporally conditioned Harmonizer checkpoint reported in the DiffusionHarmonizer paper. Recommended when temporal coherence across consecutive rendered frames is required (e.g., video-style novel-view simulation). The model also supports a faster, non-temporally conditioned inference mode via the `--nontemporal` flag.

Inference speed on H100:

  • full model (default): 212 ms / 576 × 1024 px image
  • --nontemporal mode: 28 ms / 576 × 1024 px image

2. `harmonizer_nontemporal.pt`: Exported JIT model for non-temporal, per-image inference. The checkpoint does not support conditioning on previous frames and corresponds to `diffusion_harmonizer.pkl` run with the `--nontemporal` flag. Recommended for per-image enhancement use cases where neighboring-frame context is unavailable or unnecessary, or where speed is critical.

Inference speed on H100: 28 ms / 576 × 1024 px image.
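To put the reported latencies in perspective, the short calculation below converts them into approximate frames per second and checks them against a target frame rate. It assumes frames are processed one at a time with no batching and ignores data loading and rendering time, so real pipelines will differ. The 30 fps target is a hypothetical example, not a figure from the model card.

```python
FULL_MODEL_MS = 212.0   # temporal mode latency reported on H100
NONTEMPORAL_MS = 28.0   # non-temporal mode latency reported on H100


def frames_per_second(latency_ms):
    """Frames per second when images are processed sequentially."""
    return 1000.0 / latency_ms


def meets_target(latency_ms, target_fps):
    """True if sequential processing reaches at least target_fps."""
    return frames_per_second(latency_ms) >= target_fps


TARGET_FPS = 30.0  # hypothetical simulation frame rate

for name, latency in (("temporal", FULL_MODEL_MS), ("non-temporal", NONTEMPORAL_MS)):
    fps = frames_per_second(latency)
    verdict = "meets" if meets_target(latency, TARGET_FPS) else "misses"
    print(f"{name}: {fps:.1f} fps, {verdict} {TARGET_FPS:.0f} fps target")

print(f"speedup: {FULL_MODEL_MS / NONTEMPORAL_MS:.1f}x")
```

Under these assumptions the non-temporal path is roughly 7.6 times faster, which is the trade-off against the higher benchmark quality of the temporal mode.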

Pretrained checkpoints are hosted on Hugging Face under nvidia/Harmonizer. To download all released checkpoints into a local models/ directory:

hf download nvidia/Harmonizer --local-dir models
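The same download can be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package, which it assumes is installed; the wrapper name `download_checkpoints` is illustrative. Whether the repository requires authentication or license acceptance before download is not stated here, so treat this as a starting point.

```python
REPO_ID = "nvidia/Harmonizer"  # repository named in the model card
LOCAL_DIR = "models"           # same target directory as the CLI command


def download_checkpoints(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download all files of the repository and return the local folder path."""
    # Imported lazily so this module can be imported without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (requires network access):
#   path = download_checkpoints()
#   print(path)
```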

Refer to the [code…
