Model: NVIDIA, published Jun 2, 2026

nvidia/4D-RGPT-8B

task: video-text-to-text; license: cc-by-nc-4.0; library: transformers; downloads: 154; likes: 15

Model Overview

Description:

4D-RGPT is a specialized multimodal large language model that improves region-level 4D (i.e., 3D + time) video understanding by distilling latent and explicit 4D perceptual signals (for example, depth and optical flow) from a frozen expert model into an NVILA-based student model. 4D-RGPT was developed by NVIDIA as part of the NVILA visual-language model family and introduces Perceptual 4D Distillation (P4D), Timestamp Positional Encoding (TPE), and the companion R4D-Bench benchmark for region-level 4D VQA.

This model is for research and development only.

License/Terms of Use:

Use of this model is governed by the CC-BY-NC-4.0 License.

Deployment Geography:

Global

Use Case:

Expected users are multimodal AI researchers, applied research teams, and developers studying video understanding, region grounding, 3D/4D reasoning, and physical AI. Representative use cases include region-level video question answering, model benchmarking, research on depth-and-time-aware MLLMs, and prototyping for domains such as robotics, autonomous driving, and industrial inspection.

Release Date:

Hugging Face [06/01/2026] via https://huggingface.co/nvidia/4D-RGPT-8B.

Reference(s):

  • Paper: https://arxiv.org/abs/2512.17012
  • GitHub: https://github.com/NVlabs/4D-RGPT
  • Project page: https://www.ca-joe-yang.com/resource/projects/4D_RGPT/
  • R4D-Bench: https://huggingface.co/datasets/nvidia/R4D-Bench

Model Architecture:

Architecture Type: Transformer

Network Architecture: NVILA-Lite-based MLLM using a SigLIP vision encoder, multimodal projector, and language model.

This model was developed based on: NVILA-Lite-based MLLM

Number of model parameters: 8.0 × 10^9 for 4D-RGPT-8B

Describe design choices related to initialization techniques, hyperparameter tuning, regularization techniques, model optimization, damping, and training parameters: 4D-RGPT adds a lightweight, training-only MLP 4D perception decoder (hidden size 2,560) with GELU activations, Xavier weight initialization, and zero bias initialization. Training begins from pretrained NVILA weights. The total loss combines SFT, latent distillation, and explicit distillation terms. Timestamp Positional Encoding uses T=10,000.
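The card states only that Timestamp Positional Encoding uses T=10,000 and does not give the formula. As a rough sketch, assuming TPE follows the standard sinusoidal positional encoding with T as the frequency base and a frame timestamp (in seconds) in place of the integer position, the encoding might look like the following. The function name `timestamp_encoding`, the `dim` parameter, and the example timestamps are hypothetical and not taken from the model card or the NVILA code.

```python
import math


def timestamp_encoding(t_seconds: float, dim: int, base: float = 10_000.0) -> list[float]:
    """Sinusoidal encoding of a continuous timestamp.

    ASSUMPTION: standard Transformer-style sin/cos encoding with `base`
    playing the role of T=10,000; the actual TPE formula may differ.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even so sin/cos pairs line up")
    encoding = []
    for i in range(dim // 2):
        # Frequencies decay geometrically from 1 toward 1/base.
        freq = base ** (-2 * i / dim)
        encoding.append(math.sin(t_seconds * freq))
        encoding.append(math.cos(t_seconds * freq))
    return encoding


# Hypothetical timestamps (seconds) of four sampled frames.
timestamps = [0.0, 0.5, 1.0, 1.5]
encodings = [timestamp_encoding(t, dim=8) for t in timestamps]
```

Under this assumed form, frames sampled at irregular intervals receive encodings that reflect their actual time offsets rather than their index in the sampled sequence.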

Input(s):

Input Type(s): Image, Text, Video

Input Format(s):

  • Image: RGB
  • Text: String
  • Video: .mp4

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)
  • Video: Three-Dimensional (3D)

Other Properties Related to Input: The model is designed for video-question answering with explicit temporal cues encoded through timestamps for sampled frames. The paper uses sampled frame timestamps for TPE and, for fair comparison on R4D-Bench, evaluates open-source models using 16 sampled frames. Region-level evaluation uses region prompts represented through Set-of-Marks (SoM) or region masks in benchmark workflows.
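To illustrate the "16 sampled frames with timestamps" setup, here is a minimal sketch of frame sampling. The card does not say how frames are selected, so uniform sampling at segment centers is an assumption, and `sample_frames` is a hypothetical helper rather than part of the 4D-RGPT or NVILA code.

```python
def sample_frames(total_frames: int, fps: float, num_samples: int = 16):
    """Pick evenly spaced frame indices and their timestamps in seconds.

    ASSUMPTION: uniform sampling at the center of each of `num_samples`
    equal segments; the released pipeline may sample differently.
    """
    if total_frames <= 0 or fps <= 0:
        raise ValueError("total_frames and fps must be positive")
    n = min(num_samples, total_frames)  # short clips yield fewer frames
    indices = [int((k + 0.5) * total_frames / n) for k in range(n)]
    timestamps = [idx / fps for idx in indices]
    return indices, timestamps


# Hypothetical clip: 320 frames at 30 fps.
indices, timestamps = sample_frames(total_frames=320, fps=30.0)
```

The returned timestamps are what a timestamp-based positional encoding would consume, paired with the frames at the returned indices.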

Output(s)

Output Type(s): Text

Output Format(s):

  • Text: String

Output Parameters:

  • Text: One-Dimensional (1D)

Other Properties Related to Output: Outputs are text answers for 3D/4D VQA tasks, commonly multiple-choice selections, short phrases, or short numeric answers. The paper focuses on accuracy benchmarks rather than production API formatting. This model is designed to run on NVIDIA GPU-accelerated systems; the public training setup uses NVIDIA A100-SXM4-80GB GPUs.
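Because outputs are free-form text, accuracy on multiple-choice questions depends on how an option letter is extracted from each answer. The sketch below shows one simple way to do this; it is not the scoring code used for R4D-Bench or the other benchmarks. The helpers `extract_choice` and `accuracy`, the rule of accepting only a leading option letter, and the sample predictions are all hypothetical.

```python
import re


def extract_choice(answer: str, letters: str = "ABCD"):
    """Return the option letter that opens the answer, or None.

    ASSUMPTION: only a leading letter such as "B", "(C) ..." or "A. ..."
    counts; real benchmark scripts may use more permissive matching.
    """
    match = re.match(rf"^\s*\(?([{letters}])\)?(?:[.:\s]|$)", answer)
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Fraction of predictions whose extracted letter equals the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical model outputs and gold labels.
preds = ["B", "(C) 3 meters", "A. left", "Answer: D"]
gold = ["B", "C", "B", "D"]
score = accuracy(preds, gold)  # 0.5: the last answer does not open with a letter
```

A stricter or looser extraction rule changes the reported score, so the rule should be fixed before comparing models.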

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Not Applicable (N/A); inference runs through NVILA

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere — A100-SXM4-80GB

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

  • 4D-RGPT-8B — main paper model based on NVILA-Lite-8B; this is the primary reported configuration in the main results tables.

Training, Testing, and Evaluation Datasets:

Dataset Overview

Total Size: Approximately 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture, based on the paper-reported counts. This corresponds to approximately 2.06e5 unique visual items (about 190k images plus about 16.2k videos).

Total Number of Datasets: 4 training datasets.

General description of data processing: The training mixture comprises VSTI-Bench training data, the NuScenes portion of Wolf, RoboFAC, and SAT. For evaluation, this release reports results on the companion R4D-Bench benchmark, VLM4D-real, and VSTI-Bench.

Public Datasets

Training datasets:

  • VSTI-Bench (training split): ~1.2k unique videos and ~130k QA pairs. Source videos are from ScanNet and ScanNet++.
  • Wolf (NuScenes portion): ~5k unique videos and ~15k QA pairs derived from dense captions.
  • RoboFAC: ~10k unique videos and ~65k conversations; simulated robotic-arm videos.
  • SAT (training split): ~190k unique simulated images and ~170k QA pairs.
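The per-dataset counts above can be checked against the totals quoted in the Dataset Overview. The short script below sums the approximate figures from the list; the dictionary layout and variable names are illustrative only.

```python
# Approximate counts copied from the training dataset list.
training_mixture = {
    "VSTI-Bench (train)": {"videos": 1_200, "images": 0, "examples": 130_000},
    "Wolf (NuScenes)": {"videos": 5_000, "images": 0, "examples": 15_000},
    "RoboFAC": {"videos": 10_000, "images": 0, "examples": 65_000},
    "SAT (train)": {"videos": 0, "images": 190_000, "examples": 170_000},
}

total_examples = sum(d["examples"] for d in training_mixture.values())
total_videos = sum(d["videos"] for d in training_mixture.values())
total_images = sum(d["images"] for d in training_mixture.values())
total_visual_items = total_videos + total_images

print(total_examples)      # 380000, i.e. about 3.8e5 supervision examples
print(total_videos)        # 16200, i.e. about 16.2k videos
print(total_visual_items)  # 206200, i.e. about 2.06e5 unique visual items
```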

Evaluation datasets:

  • R4D-Bench
  • VLM4D-real
  • VSTI-Bench

Training Dataset:

Data Modality:

  • Image
  • Text
  • Video

Training Data Size:

Approx. 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture.

Image Training Data Size

  • Less than a Million Images

Text Training Data Size

  • Less than a Billion Tokens

Video Training Data Size

  • Less than 10,000 Hours

Data Collection Method by dataset

  • Hybrid: Automatic/Sensors, Human, Synthetic

Labeling Method by dataset

  • Hybrid: Human, Automated, Synthetic…

