nvidia/Cosmos-AnomalyGen-Glass-2B
Model Overview
Description:
Cosmos AnomalyGen — Mobile Phone Screen (UC3) generates synthetic mobile-phone-screen anomaly images by inpainting a user-supplied binary mask onto a clean reference screen image, conditioned on one of three trained defect types (oil, scratch, stain). The release ships only the few-shot-finetuned modules — a set of anomaly-token embeddings and a 2-layer MLP adapter — which plug into the frozen Cosmos-Predict2 2B Text-to-Image diffusion backbone (also using a frozen NV-DINOv2 mask encoder and a frozen T5 text encoder) at inference time. Cosmos AnomalyGen — UC3 v1.0.0 was developed by NVIDIA as part of the Cosmos AnomalyGen pipeline. This model is ready for commercial use.
License/Terms of Use:
Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement.
Deployment Geography:
Global
Use Case:
Intended for industrial visual-inspection teams responsible for mobile-phone-screen QA who have very few (≤5 per defect type) real anomaly examples. The model produces large-scale synthetic anomaly datasets (clean screen + binary mask → realistic oil / scratch / stain image) for training downstream defect-detection or segmentation models, including downstream TAO Toolkit consumers via the DAFT v3.0 export path.
Release Date:
GitHub 06/02/2026 via https://github.com/NVIDIA/paidf-anomalygen
Reference(s):
- Anomaly Diffusion (AAAI 2024) — paper: https://arxiv.org/abs/2312.05767, code: https://github.com/sjtuplayer/anomalydiffusion
- Cosmos-Predict2 — https://github.com/nvidia-cosmos/cosmos-predict2
- NV-DINOv2 classification model — https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/nv_dinov2_classification_model
Model Architecture:
Architecture Type: Transformer (diffusion DiT backbone with learnable conditioning modules)
Network Architecture:
- anomaly_embedding (trainable, included in this release): token embeddings (256 tokens per pair); three pairs are trained for UC3: Phone+oil, Phone+scratch, Phone+stain.
- adapter (trainable, included in this release): 2-layer MLP with GELU activations (input / output hidden size = 1024), projecting the mask encoder output into the diffusion DiT conditioning space.
- mask_encoder (frozen, not redistributed in this release): NV-DINOv2 (ViT-L) backbone with adaptive pool (kernel = 7); weights are loaded from the separately downloaded NV-DINOv2 classification checkpoint at inference time.
- text_encoder (frozen, not redistributed in this release): google-t5/t5-large.
- These modules condition the frozen Cosmos-Predict2 2B T2I DiT denoiser at inference time.
This model was developed based on Cosmos-Predict2-2B-Text2Image.
Number of model parameters: Approximately 2.9×10^6 (2.9 million) trainable parameters in the released modules: anomaly_embedding ≈ 0.79M (256 tokens × 1024 hidden × 3 pairs) plus the 2-layer MLP adapter ≈ 2.1M (1024→1024 with GELU). The trainable modules are distributed as the model/iter_000009000.pt checkpoint file. The frozen Cosmos-Predict2 2B base contributes ~2.0×10^9 (2 billion) parameters that are used at inference time but not redistributed in this release.
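The trainable-parameter estimate can be reproduced with a few lines of arithmetic. This is a sketch that assumes the adapter consists of two dense 1024→1024 layers with bias vectors; the card states only the layer count and hidden size, so the exact layer layout is an assumption.

```python
# Trainable-parameter estimate for the released modules.
TOKENS_PER_PAIR = 256
HIDDEN_SIZE = 1024
NUM_PAIRS = 3  # Phone+oil, Phone+scratch, Phone+stain

# One embedding vector of size HIDDEN_SIZE per token, per trained pair.
embedding_params = TOKENS_PER_PAIR * HIDDEN_SIZE * NUM_PAIRS

# Assumption: each adapter layer is a dense 1024 -> 1024 projection plus a bias vector.
params_per_layer = HIDDEN_SIZE * HIDDEN_SIZE + HIDDEN_SIZE
adapter_params = 2 * params_per_layer

total_params = embedding_params + adapter_params

print(f"anomaly_embedding: {embedding_params / 1e6:.2f}M")
print(f"adapter:           {adapter_params / 1e6:.2f}M")
print(f"total:             {total_params / 1e6:.2f}M")
```

Under these assumptions the totals come to 786,432 embedding parameters and 2,099,200 adapter parameters, which is consistent with the stated figure of approximately 2.9 million.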
Input(s):
Input Type(s): Image, Binary Mask, Text
Input Format(s):
- Image: PNG / JPG, Red, Green, Blue (RGB)
- Binary Mask: PNG / JPG, single-channel binary (0 = background, 255 = anomaly region; binarized at threshold 127)
- Text: anomaly-type string made of two parts joined by "+" (one of Phone+oil, Phone+scratch, Phone+stain)
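The mask convention above (0 = background, 255 = anomaly region, threshold 127) can be illustrated with NumPy. This is a minimal sketch, not the pipeline's own loader: the helper name `binarize_mask` is hypothetical, and it assumes that values strictly greater than the threshold are treated as anomaly, which the card does not spell out.

```python
import numpy as np


def binarize_mask(mask: np.ndarray, threshold: int = 127) -> np.ndarray:
    """Map a single-channel grayscale mask to the {0, 255} convention."""
    if mask.ndim != 2:
        raise ValueError("expected a single-channel 2D mask")
    # Assumption: pixels strictly above the threshold count as anomaly.
    return np.where(mask > threshold, 255, 0).astype(np.uint8)


# A tiny grayscale mask with values on both sides of the threshold.
gray = np.array([[0, 127], [128, 255]], dtype=np.uint8)
binary = binarize_mask(gray)
```

Binarizing before use matters for JPG masks in particular, since lossy compression can introduce intermediate gray values along region edges.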
Input Parameters:
- Image: Two-Dimensional (2D)
- Mask: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Other Properties Related to Input: Input clean image and paired mask must have the same dimensions; the model was trained at 512×512 and inference is run at the same resolution. anomaly_type must exactly match one of the three pairs trained for this UC3 checkpoint — passing an unsupported defect string is rejected by scripts/anomaly_gen/sdg-inference/validate_jsonl.py against this checkpoint's ag_config.yaml → dataloader_train.dataset.anomaly_types. The mask should ideally cover a contiguous defect region that resembles the trained mask distribution; the optional Automatic Mask Placement (AMP) tool can constrain placement to legal ROIs.
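The two input constraints described here (matching image and mask dimensions, and an anomaly type drawn from the trained pairs) can be checked before a sample is submitted. The sketch below is illustrative only: `check_inputs` is a hypothetical helper and does not reproduce the behavior or interface of validate_jsonl.py, and the supported list is hard-coded from this card rather than read from ag_config.yaml.

```python
import numpy as np

# The three pairs listed in this card for the UC3 checkpoint.
SUPPORTED_ANOMALY_TYPES = ("Phone+oil", "Phone+scratch", "Phone+stain")


def check_inputs(image: np.ndarray, mask: np.ndarray, anomaly_type: str) -> list:
    """Return a list of human-readable problems; an empty list means the sample passes."""
    errors = []
    # Compare height and width only; the image carries an extra RGB channel axis.
    if image.shape[:2] != mask.shape[:2]:
        errors.append(
            f"image size {image.shape[:2]} does not match mask size {mask.shape[:2]}"
        )
    # The match is exact and case-sensitive, as the card requires an exact match.
    if anomaly_type not in SUPPORTED_ANOMALY_TYPES:
        errors.append(f"unsupported anomaly_type: {anomaly_type!r}")
    return errors


ok = check_inputs(np.zeros((512, 512, 3)), np.zeros((512, 512)), "Phone+oil")
bad = check_inputs(np.zeros((512, 512, 3)), np.zeros((256, 256)), "Phone+dust")
```

Collecting all problems in a list, rather than raising on the first one, makes it easier to report every issue in a batch of samples at once.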
Output(s):
Output Type(s): Image
Output Format(s): PNG; Red, Green, Blue (RGB)
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: 512×512 RGB synthetic anomaly image. Anomaly content is generated inside the user-supplied mask region; in the default crop_and_paste=True flow the inpainted patch is pasted back onto the clean reference image so that non-masked pixels remain identical to the input. Poisson blending can optionally be enabled. Generation metadata (per-sample guidance, crop_ratio, seed, etc.) is written to SDG_result.csv alongside the images.
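The paste-back property, where pixels outside the mask stay identical to the clean input, can be sketched as a masked composite. This is a simplified illustration and not the repository's implementation: `paste_back` is a hypothetical helper, it works on full-size arrays rather than a cropped patch, and it ignores Poisson blending.

```python
import numpy as np


def paste_back(clean: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Take generated pixels inside the mask and clean pixels everywhere else."""
    if clean.shape != generated.shape or clean.shape[:2] != mask.shape:
        raise ValueError("clean, generated, and mask must share height and width")
    # Add a trailing axis so the 2D mask broadcasts across the RGB channels.
    inside = (mask == 255)[..., None]
    return np.where(inside, generated, clean)


# Toy 4x4 example: a uniform clean image, a uniform generated image,
# and a 2x2 anomaly region in the center of the mask.
clean = np.full((4, 4, 3), 200, dtype=np.uint8)
generated = np.full((4, 4, 3), 50, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 255

result = paste_back(clean, generated, mask)
```

A hard composite like this leaves a sharp seam at the mask boundary, which is presumably the motivation for the optional Poisson blending mentioned above.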
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- PyTorch (via the Cosmos-Predict2 2B T2I pipeline)
- Cosmos AnomalyGen scripts (scripts.anomaly_gen.synthetic_dataset_generation, torchrun-based)
- NVIDIA TAO Toolkit: interop via DAFT v3.0 export (scripts.anomaly_gen.convert_to_daft_format)
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere (A100)
- NVIDIA Hopper (H100)
- NVIDIA RTX 6000
Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
v1.0.0 — uc3-phone-screen-2b-512-iter9000 (trained 9,000 iterations; released artifact is the model/iter_000009000.pt checkpoint file containing finetuned modules only).
Training, Testing, and Evaluation Datasets:
Dataset Overview
- Total Number of Datasets: 2 (Roboflow…