nvidia/nemotron-climb-proxy-models
Model Overview
Description:
Nemotron-CLIMB Proxy Base Models (62M and 350M) are two small decoder-only transformer language models pre-trained from scratch by NVIDIA on 10 trillion tokens using the Megatron-LM codebase. They are designed as proxy models for scaling law research — enabling practitioners to forecast the behavior of much larger models prior to committing full-scale compute resources. Both models use a WSD (Warmup-Stable-Decay) learning rate schedule and share the same 32-layer architecture, differing only in hidden dimension. These models are ready for commercial/non-commercial use.
License/Terms of Use:
Released under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
These proxy models are intended for ML researchers and engineers working on:
- Scaling law experiments — predicting loss, downstream accuracy, or emergent behavior of larger models from small-model trends.
- Recipe transfer — validating hyperparameter choices (learning rate, batch size, data mix) at low cost before scaling up.
- Proxy-tuning research — studying how fine-tuning dynamics (SFT, RLHF, DPO) transfer across model scales.
- Reward model proxy training — training lightweight reward models for alignment research.
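As a rough illustration of the scaling law use case, the sketch below fits a pure power law `loss = a * N**(-b)` through two (parameter count, loss) points and extrapolates to a larger size. The loss values are synthetic, generated from a known power law so the fit can be checked; they are not measurements from the released models. The function names, the 8B target size, and the choice of a power law without an irreducible-loss term are all assumptions made for the example.

```python
import math

def fit_power_law(n1, loss1, n2, loss2):
    """Fit loss = a * n**(-b) exactly through two (parameter count, loss) points."""
    # loss1 / loss2 = (n2 / n1) ** b, so b follows from the log ratio.
    b = math.log(loss1 / loss2) / math.log(n2 / n1)
    a = loss1 * n1 ** b
    return a, b

def predict_loss(a, b, n):
    """Evaluate the fitted power law at parameter count n."""
    return a * n ** (-b)

# Synthetic illustration only: these losses come from a made-up power law,
# not from evaluating the 62M or 350M checkpoints.
true_a, true_b = 20.0, 0.1
n_small, n_large = 62e6, 350e6
loss_small = true_a * n_small ** (-true_b)
loss_large = true_a * n_large ** (-true_b)

a, b = fit_power_law(n_small, loss_small, n_large, loss_large)
extrapolated = predict_loss(a, b, 8e9)  # hypothetical 8B-parameter target
print(f"fitted a={a:.3f}, b={b:.3f}, extrapolated loss at 8B: {extrapolated:.3f}")
```

With only two model sizes the fit is exactly determined and carries no error estimate, so in practice one would likely add more proxy sizes or intermediate checkpoints before trusting an extrapolation.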
Reference(s):
Model Architecture:
Architecture Type: Transformer (decoder-only)
Network Architecture: Decoder-only transformer with RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE).
Number of model parameters:
| Variant | Parameters | Layers | Checkpoint Size |
|---|---|---|---|
| 62M | 62 million | 32 | ~837 MB |
| 350M | 350 million | 32 | ~4.5 GB |
Note: Checkpoint sizes include optimizer state and RNG state, suitable for continued pre-training.
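To see why these checkpoints are larger than the weights alone, the short calculation below derives the implied bytes per parameter from the figures in the table. It is a back-of-the-envelope sketch: it treats the nominal parameter counts as exact and reads MB and GB as decimal units, both of which are assumptions.

```python
# Figures copied from the parameter table; sizes read as decimal MB/GB (assumption).
CHECKPOINTS = {
    "62M": {"params": 62e6, "bytes": 837e6},
    "350M": {"params": 350e6, "bytes": 4.5e9},
}

def bytes_per_param(params, size_bytes):
    """Average number of checkpoint bytes stored per model parameter."""
    return size_bytes / params

for name, info in CHECKPOINTS.items():
    ratio = bytes_per_param(info["params"], info["bytes"])
    print(f"{name}: ~{ratio:.1f} bytes per parameter")
```

Both variants come out at roughly 13 bytes per parameter, well above the 2 or 4 bytes per parameter that 16-bit or 32-bit weights alone would need, which is consistent with the optimizer state being stored alongside the weights.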
Design Choices: Both models were trained from scratch using the Megatron-LM distributed training framework with the following key design decisions:
1. Deep-and-narrow architecture. Both variants use 32 transformer layers, unusually deep for their parameter count, to better approximate the layer-wise dynamics of billion-scale models and improve proxy fidelity for scaling law extrapolation.
2. WSD learning rate schedule. A Warmup-Stable-Decay schedule was used for stable long-horizon training over 10T tokens.
3. Single tensor-parallel rank. Both models were trained with TP=1 to simplify checkpoint distribution and downstream usage.
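The general shape of a Warmup-Stable-Decay schedule can be sketched as follows. This is a generic illustration, not the schedule used for these models: the card does not state the peak learning rate, warmup length, decay length, or decay shape, so every value below except the 62M iteration count is a hypothetical placeholder, and linear warmup and linear decay are assumed for simplicity.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule.

    Linear warmup to peak_lr, a constant plateau, then a linear decay to
    min_lr over the final `decay_steps` steps (the decay shape is an assumption).
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical hyperparameters; only total_steps is taken from the card (62M run).
total_steps = 2_499_000
for step in (0, 10_000, 1_000_000, 2_400_000, 2_499_000):
    lr = wsd_lr(step, total_steps, peak_lr=3e-4, warmup_steps=10_000, decay_steps=250_000)
    print(f"step {step:>9,}: lr = {lr:.2e}")
```

One reason such schedules are convenient for proxy research is that the stable phase has no dependence on the total step count, so a run can be extended or branched into a decay phase from an intermediate checkpoint.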
Input(s):
Input Type(s): Text
Input Format(s):
- Text: Token IDs (integer sequences)
Input Parameters:
- Text: One-Dimensional (1D) sequence of token IDs
Other Properties Related to Input: These are base (pre-trained) language models. Input is tokenized text. The models accept standard causal-LM input and are not instruction-tuned.
Output(s):
Output Type(s): Text
Output Format(s):
- Text: Next-token logits over vocabulary at each position
Output Parameters:
- Text: Two-Dimensional (2D) — sequence length x vocabulary size
Other Properties Related to Output: As base models, they output raw next-token logits, which a softmax converts into a probability distribution over the vocabulary. The models are not aligned or instruction-tuned and may produce unfiltered text.
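A minimal sketch of how a 2D logits array (sequence length x vocabulary size) is typically turned into a next-token distribution is shown below. The logits are made-up toy numbers with a 4-entry vocabulary, and the helper names are illustrative; no model is loaded here.

```python
import math

def softmax(logits_row):
    """Numerically stable softmax over one row of logits."""
    m = max(logits_row)
    exps = [math.exp(x - m) for x in logits_row]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(logits):
    """Distribution over the token that follows the sequence.

    `logits` has shape (sequence length, vocabulary size); the last row
    scores the token that comes after the final input position.
    """
    return softmax(logits[-1])

# Toy logits for a 3-token input and a 4-entry vocabulary (made-up numbers).
toy_logits = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.0, -1.0, 0.5],
    [2.0, 1.0, 0.0, -1.0],
]
probs = next_token_distribution(toy_logits)
greedy_token_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, greedy_token_id)
```

Greedy decoding simply picks the highest-probability token ID; sampling strategies would draw from `probs` instead.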
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine:
- Megatron-LM (native checkpoint format)
- Can be converted to HuggingFace Transformers format for inference
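The native checkpoint paths listed under Model Version(s) follow an `iter_<iteration>/mp_rank_<rank>/model_optim_rng.pt` layout. The helper below rebuilds such a path from an iteration number; the zero-padding widths are inferred from the listed paths, and the root directory name is a hypothetical placeholder.

```python
from pathlib import PurePosixPath

def checkpoint_path(root, iteration, tp_rank=0):
    """Build the path to a native checkpoint file.

    The layout mirrors the paths in the model card; padding widths
    (7 digits for the iteration, 2 for the rank) are an inference from them.
    """
    return (
        PurePosixPath(root)
        / f"iter_{iteration:07d}"
        / f"mp_rank_{tp_rank:02d}"
        / "model_optim_rng.pt"
    )

# "checkpoints/62M" is a hypothetical local directory, not a documented location.
print(checkpoint_path("checkpoints/62M", 2_499_000))
```

Because both models were trained with TP=1, only rank `00` is expected to exist for each variant.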
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere (A100)
- NVIDIA Hopper (H100, H200)
- NVIDIA Lovelace (L40S)
- CPU inference is feasible given the small model size
Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s):
| Variant | Training Iterations | Training Tokens | Training Nodes | Checkpoint |
|---|---|---|---|---|
| 62M | 2,499,000 | 10T | 8 | iter_2499000/mp_rank_00/model_optim_rng.pt |
| 350M | 2,384,053 | 10T | 16 | iter_2384053/mp_rank_00/model_optim_rng.pt |
Both are v1.0 releases.
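The iteration and token counts in the version table imply an approximate number of tokens consumed per training iteration. The calculation below is only an estimate: it treats the rounded 10T figure as exact, and the card does not state the actual global batch size or sequence length.

```python
# Values copied from the version table; "10T" is treated as exactly 10e12 (assumption).
RUNS = {
    "62M": {"iterations": 2_499_000, "tokens": 10e12},
    "350M": {"iterations": 2_384_053, "tokens": 10e12},
}

def tokens_per_iteration(tokens, iterations):
    """Average number of training tokens consumed per optimizer iteration."""
    return tokens / iterations

for name, run in RUNS.items():
    tpi = tokens_per_iteration(run["tokens"], run["iterations"])
    print(f"{name}: ~{tpi / 1e6:.2f}M tokens per iteration")
```

Both runs land at roughly 4 million tokens per iteration, which gives a sense of the global batch size in tokens when planning continued pre-training from these checkpoints.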
Training, Testing, and Evaluation Datasets:
Training Dataset:
Data Modality:
- Text
Training Data Size:
Text Training Data Size: 10 trillion tokens
Data Collection Method by dataset:
- Automated
Labeling Method by dataset:
- Not Applicable
Properties: 10 trillion tokens. Content is English-language web text. The data may include publicly available web content of various types (articles, blogs, forums, etc.).
Testing Dataset:
Data Collection Method by dataset:
- Automated
Labeling Method by dataset:
- Not Applicable
Properties: 10 billion tokens. Same source distribution as training data.
Evaluation Dataset:
Data Collection Method by dataset:
- Automated
Labeling Method by dataset:
- Not Applicable
Properties: 10 billion tokens. Same source distribution as training data.
Inference:
Acceleration Engine: Megatron-LM or HuggingFace Transformers (after conversion)
Test Hardware:
- NVIDIA A100 / H100 GPU (also runnable on CPU given small size)
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of…