Model · NVIDIA · published May 15, 2026

nvidia/nemotron-climb-proxy-models


Model Overview

Description:

Nemotron-CLIMB Proxy Base Models (62M and 350M) are two small decoder-only transformer language models pre-trained from scratch by NVIDIA on 10 trillion tokens using the Megatron-LM codebase. They are designed as proxy models for scaling law research — enabling practitioners to forecast the behavior of much larger models prior to committing full-scale compute resources. Both models use a WSD (Warmup-Stable-Decay) learning rate schedule and share the same 32-layer architecture, differing only in hidden dimension. These models are ready for commercial/non-commercial use.

License/Terms of Use:

Released under the NVIDIA Open Model License.

Deployment Geography:

Global

Use Case:

These proxy models are intended for ML researchers and engineers working on:

  • Scaling law experiments — predicting loss, downstream accuracy, or emergent behavior of larger models from small-model trends.
  • Recipe transfer — validating hyperparameter choices (learning rate, batch size, data mix) at low cost before scaling up.
  • Proxy-tuning research — studying how fine-tuning dynamics (SFT, RLHF, DPO) transfer across model scales.
  • Reward model proxy training — training lightweight reward models for alignment research.
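
The first use case, extrapolating from small-model trends, is commonly approached by fitting a power law to loss versus parameter count. The following is a minimal sketch under stated assumptions: it assumes a pure power law `loss = a * size ** (-b)` with no irreducible-loss term, the function names are hypothetical, and the loss values are synthetic (generated from made-up coefficients), not measurements from these models. Sizes other than 62M and 350M are hypothetical extra proxies added so the fit has more than two points.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Evaluate the fitted power law at a (possibly much larger) model size."""
    return a * size ** (-b)


# Synthetic data: made-up coefficients, NOT results from the released models.
TRUE_A, TRUE_B = 20.0, 0.1
proxy_sizes = [62e6, 120e6, 350e6, 1e9]  # 120M and 1B are hypothetical
proxy_losses = [TRUE_A * n ** (-TRUE_B) for n in proxy_sizes]

fitted_a, fitted_b = fit_power_law(proxy_sizes, proxy_losses)
extrapolated = predict_loss(fitted_a, fitted_b, 8e9)  # hypothetical 8B target
print(f"a={fitted_a:.3f}, b={fitted_b:.3f}, predicted loss at 8B={extrapolated:.3f}")
```

Real loss curves usually need an additive irreducible-loss term and noisy data, so a fit like this is a starting point rather than a forecast.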

Reference(s):

Model Architecture:

Architecture Type: Transformer (decoder-only)

Network Architecture: Decoder-only transformer with RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE).

Number of model parameters:

| Variant | Parameters | Layers | Checkpoint Size |
|---|---|---|---|
| 62M | 62 million | 32 | ~837 MB |
| 350M | 350 million | 32 | ~4.5 GB |

Note: Checkpoint sizes include optimizer state and RNG state, suitable for continued pre-training.
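
As a quick sanity check on the note above, the checkpoint sizes can be divided by the nominal parameter counts to see how many bytes are stored per parameter. This is a rough sketch: it assumes decimal units (1 MB = 10^6 bytes), treats "62M" and "350M" as exact counts, and compares against a hypothetical 16-bit weights-only export, a precision the card does not state.

```python
# Figures copied from the table above; parameter counts are nominal.
CHECKPOINTS = {
    "62M": {"params": 62e6, "checkpoint_bytes": 837e6},    # ~837 MB
    "350M": {"params": 350e6, "checkpoint_bytes": 4.5e9},  # ~4.5 GB
}
BYTES_PER_16BIT_WEIGHT = 2  # assumption: hypothetical weights-only export


def bytes_per_param(entry):
    """Bytes stored in the checkpoint per model parameter."""
    return entry["checkpoint_bytes"] / entry["params"]


for name, entry in CHECKPOINTS.items():
    ratio = bytes_per_param(entry)
    weights_only_mb = entry["params"] * BYTES_PER_16BIT_WEIGHT / 1e6
    print(f"{name}: {ratio:.1f} bytes/param in checkpoint, "
          f"~{weights_only_mb:.0f} MB if only 16-bit weights were kept")
```

Both variants come out at roughly 13 bytes per parameter, several times what 16-bit weights alone would need, which is consistent with the checkpoints carrying optimizer state.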

Design Choices: Both models were trained from scratch using the Megatron-LM distributed training framework with the following key design decisions:

  1. Deep-and-narrow architecture. Both variants use 32 transformer layers, unusually deep for their parameter count, to better approximate the layer-wise dynamics of billion-scale models, improving proxy fidelity for scaling law extrapolation.
  2. WSD learning rate schedule. A Warmup-Stable-Decay schedule was used for stable long-horizon training over 10T tokens.
  3. Single tensor-parallel rank. Both models were trained with TP=1 to simplify checkpoint distribution and downstream usage.
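
To illustrate the shape of a Warmup-Stable-Decay schedule, here is a minimal sketch. The card does not give the warmup length, decay length, peak learning rate, or the form of the decay, so everything below is an assumption: linear warmup, a constant plateau, and linear decay to `min_lr`, with a hypothetical function name and toy step counts.

```python
def wsd_learning_rate(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a warmup / stable / decay schedule.

    Assumed shapes (not taken from the model card): linear warmup,
    constant plateau, linear decay to `min_lr` over the last `decay_steps`.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start or decay_steps == 0:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Toy configuration with hypothetical values.
schedule = [
    wsd_learning_rate(s, total_steps=1000, peak_lr=3e-4, warmup_steps=100, decay_steps=200)
    for s in (0, 500, 900, 1000)
]
print(schedule)
```

One practical property of this shape is that the stable phase does not depend on the total step count, so a run can be extended or branched into a decay phase from an intermediate checkpoint.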

Input(s):

Input Type(s): Text

Input Format(s):

  • Text: Token IDs (integer sequences)

Input Parameters:

  • Text: One-Dimensional (1D) sequence of token IDs

Other Properties Related to Input: These are base (pre-trained) language models. Input is tokenized text. The models accept standard causal-LM input and are not instruction-tuned.

Output(s)

Output Type(s): Text

Output Format(s):

  • Text: Next-token logits over vocabulary at each position

Output Parameters:

  • Text: Two-Dimensional (2D): sequence length × vocabulary size

Other Properties Related to Output: As base models, outputs are raw next-token probability distributions. The models are not aligned or instruction-tuned and may produce unfiltered text.
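
Since the models return raw logits rather than text, a caller has to turn the last row of the (sequence length × vocabulary size) output into a probability distribution before sampling or picking a token. The sketch below shows that step in plain Python; the logits are a made-up toy matrix with a 4-entry vocabulary, and the function names are hypothetical rather than part of any released API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def next_token_distribution(logits_2d):
    """Distribution over the token that follows the last input position."""
    return softmax(logits_2d[-1])


# Toy logits: 3 positions, vocabulary of 4 (made-up numbers).
toy_logits = [
    [0.1, 0.2, 0.3, 0.4],
    [2.0, 1.0, 0.0, -1.0],
    [0.0, 0.0, 3.0, 0.0],
]
probs = next_token_distribution(toy_logits)
greedy_token_id = max(range(len(probs)), key=probs.__getitem__)
print(greedy_token_id, [round(p, 3) for p in probs])
```

Greedy selection is used here only for brevity; sampling with a temperature is an equally valid choice for a base model.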

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the models achieve faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine:

  • Can be converted to HuggingFace Transformers format for inference

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere (A100)
  • NVIDIA Hopper (H100, H200)
  • NVIDIA Lovelace (L40S)
  • CPU inference is feasible given the small model size

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

| Variant | Training Iterations | Training Tokens | Training Nodes | Checkpoint |
|---|---|---|---|---|
| 62M | 2,499,000 | 10T | 8 | iter_2499000/mp_rank_00/model_optim_rng.pt |
| 350M | 2,384,053 | 10T | 16 | iter_2384053/mp_rank_00/model_optim_rng.pt |
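
The iteration counts and token totals in the table imply an approximate number of tokens consumed per training iteration. The sketch below does that arithmetic. It treats "10T" as exactly 10^13 tokens, which is an assumption since the figure is rounded, and it does not attempt to split the result into sequence length and global batch size, which the card does not state.

```python
NOMINAL_TOKENS = 10e12  # "10T" from the table, assumed to be exact

# Training iterations per variant, copied from the table above.
ITERATIONS = {"62M": 2_499_000, "350M": 2_384_053}


def tokens_per_iteration(total_tokens, iterations):
    """Average tokens processed per optimizer step."""
    return total_tokens / iterations


tokens_per_iter = {
    name: tokens_per_iteration(NOMINAL_TOKENS, iters)
    for name, iters in ITERATIONS.items()
}
for name, value in tokens_per_iter.items():
    print(f"{name}: ~{value / 1e6:.2f}M tokens per iteration")
```

Under these assumptions the two runs work out to about 4.0 million and 4.2 million tokens per iteration respectively.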

Both are v1.0 releases.

Training, Testing, and Evaluation Datasets:

Training Dataset:

Data Modality:

  • Text

Training Data Size:

Text Training Data Size: 10 trillion tokens

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Not Applicable

Properties: 10 trillion tokens. Content is English-language web text. The data may include publicly available web content of various types (articles, blogs, forums, etc.).

Testing Dataset:

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Not Applicable

Properties: 10 billion tokens. Same source distribution as training data.

Evaluation Dataset:

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Not Applicable

Properties: 10 billion tokens. Same source distribution as training data.

Inference:

Acceleration Engine: Megatron-LM or HuggingFace Transformers (after conversion)

Test Hardware:

  • NVIDIA A100 / H100 GPU (also runnable on CPU given small size)

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of…
