What does this model signal mean?

NVIDIA published nvidia/GR00T-H-N1.7. This model signal records what NVIDIA shipped on Hugging Face and how the release is positioned. High-signal details: license other (NVIDIA Open Model License per the model card) · 229 HF downloads · a post-trained variant of NVIDIA Isaac GR00T N1.7 for surgical robots. onlylabs links this event to 1 captured evidence page and 6 related model signals.

NVIDIA Model: nvidia/GR00T-H-N1.7

Captured source

Hugging Face: huggingface.co/nvidia/GR00T-H-N1.7

nvidia/GR00T-H-N1.7 model card

published May 30, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · license other · params 2.9B · downloads 229 · likes 6

GR00T-H-N1.7

Model Overview

Description:

GR00T-H-N1.7 is a post-trained variant of NVIDIA Isaac GR00T N1.7 for surgical robots. It builds on the GR00T N1.7 VLA foundation and adapts it using the Open-H embodiment dataset.

This model is ready for commercial use.

The neural network architecture is inherited from the GR00T N1.7 series of models, combining a vision-language foundation model with a diffusion transformer head that denoises continuous actions.

License/Terms of Use:

NVIDIA Open Model License

You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

Deployment Geography:

Global

Use Case:

Researchers, Academics, Open-Source Community: Healthcare-focused robotics research and algorithm development.

Intended Use

GR00T-H-N1.7 is intended for use in robotics R&D, including exploration of surgical robotics and robotic ultrasound policies, benchmarking, and method development. It is not intended for clinical deployment, patient care, or medical decision-making.

Reference(s):

Open-H Paper: Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Base Model: GR00T-N1.7-3B
GR00T Website: NVIDIA Isaac GR00T
GR00T N1 White Paper: https://arxiv.org/abs/2503.14734
Cosmos-Reason2: NVIDIA. "Cosmos-Reason2: An Open, Customizable, Reasoning Vision Language Model." NVIDIA Documentation (2026).

Liu, Xingchao, and Chengyue Gong. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." The Eleventh International Conference on Learning Representations.

Flow Matching Policy:

Black, Kevin, et al. "pi0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

Model Architecture:

Architecture Type: Vision Transformer, Multilayer Perceptron, Flow matching Transformer

This model was developed based on GR00T N1.7.

Number of model parameters: 3B

GR00T-H-N1.7 uses Cosmos-Reason2-2B to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.
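The sequence layout described above can be sketched in a few lines. This is only an illustration of the concatenation order (image tokens from every frame first, language tokens last); the function name `build_sequence`, the token counts, and the embedding width are assumptions, not values from the model card.

```python
from typing import List

Token = List[float]  # one embedding vector


def build_sequence(frame_tokens: List[List[Token]], text_tokens: List[Token]) -> List[Token]:
    """Concatenate image tokens from every frame, then append language tokens."""
    sequence: List[Token] = []
    for tokens in frame_tokens:  # one entry per camera view / frame
        sequence.extend(tokens)
    sequence.extend(text_tokens)
    return sequence


# Hypothetical sizes: 2 views with 3 image tokens each, 2 text tokens, width 4.
views = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
text = [[1.0] * 4 for _ in range(2)]
seq = build_sequence(views, text)
print(len(seq))  # 8
```

Because the views are simply appended, an embodiment with one camera and an embodiment with three cameras produce sequences of different lengths but the same layout.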

To model proprioception and a sequence of actions conditioned on observations, GR00T-H-N1.7 uses a flow matching transformer. The flow matching transformer interleaves self-attention over proprioception and actions with cross-attention to the Cosmos-Reason2-2B vision and language embeddings. During training, the input actions are corrupted by randomly interpolating between the clean action vector and a Gaussian noise vector. At inference time, the policy first samples a Gaussian noise vector and iteratively reconstructs a continuous-value action using its velocity prediction.
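The training-time corruption and the inference-time loop described above can be illustrated with a minimal sketch. It assumes the common rectified-flow convention (noise at tau = 0, clean action at tau = 1, velocity target equal to action minus noise) and plain Euler integration; the model card does not state which convention or solver GR00T-H-N1.7 uses. `oracle_velocity` is a hypothetical stand-in for the trained network and simply points at a fixed target action.

```python
import random
from typing import Callable, List

Vector = List[float]


def corrupt(action: Vector, noise: Vector, tau: float) -> Vector:
    """Training input: interpolate between noise (tau=0) and the clean action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for a, n in zip(action, noise)]


def velocity_target(action: Vector, noise: Vector) -> Vector:
    """Regression target under the assumed convention: d(corrupt)/d(tau)."""
    return [a - n for a, n in zip(action, noise)]


def sample_action(velocity_fn: Callable[[Vector, float], Vector],
                  dim: int, steps: int, rng: random.Random) -> Vector:
    """Inference: start from Gaussian noise and take Euler steps along the velocity."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 2.0]  # hypothetical clean action


def oracle_velocity(x: Vector, tau: float) -> Vector:
    """Stand-in for the network: exact velocity toward a single known action."""
    return [(t - xi) / (1.0 - tau) for t, xi in zip(target, x)]


result = sample_action(oracle_velocity, dim=3, steps=4, rng=random.Random(0))
```

With this oracle the final Euler step lands on `target` up to floating-point error; a learned velocity network would only approximate that behaviour.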

Network Architecture: [Figure: schematic diagram of the network architecture.] Red, Green, Blue (RGB) camera frames are processed through a pre-trained vision transformer (SigLIP2). Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable maximum length before being fed into the MLP. Actions are encoded, and velocity predictions decoded, by an MLP, one per unique embodiment. The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).
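As a rough illustration of the padding and embodiment-indexed encoding described above, the sketch below pads each state vector to a fixed length and then selects encoder weights by embodiment ID. It is not the model's implementation: `MAX_STATE_DIM`, the weight values, and the use of a single linear layer with ReLU in place of a full MLP are all assumptions made for brevity.

```python
from typing import Dict, List, Tuple

MAX_STATE_DIM = 4  # hypothetical configurable maximum state length

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def pad_state(state: List[float], max_dim: int = MAX_STATE_DIM) -> List[float]:
    """Zero-pad a variable-length proprioception vector to max_dim."""
    if len(state) > max_dim:
        raise ValueError("state is longer than the configured maximum")
    return list(state) + [0.0] * (max_dim - len(state))


def encode_state(layers: Dict[int, Layer], embodiment_id: int, state: List[float]) -> List[float]:
    """Apply the encoder selected by embodiment ID (one linear layer + ReLU here)."""
    weights, bias = layers[embodiment_id]
    x = pad_state(state)
    hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, h) for h in hidden]


# Hypothetical weights for two embodiments with different state sizes.
layers: Dict[int, Layer] = {
    0: ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0]),
    1: ([[0.5, 0.5, 0.5, 0.5], [-1.0, 0.0, 0.0, 0.0]], [0.0, 0.1]),
}
emb_a = encode_state(layers, 0, [0.3, 0.7])       # 2-dimensional state
emb_b = encode_state(layers, 1, [1.0, 1.0, 1.0])  # 3-dimensional state
```

Both calls return vectors of the same width, which is what lets states from different robots feed a shared transformer.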

Input(s):

Input Type(s):

Vision: Image Frames

State: Robot Proprioception

Language Instruction: Text

Input Format(s):

Vision: Variable number of image frames from robot cameras

State: Floating Point

Language Instruction: String

Input Parameters:

Vision: Two-Dimensional (2D) - Red, Green, Blue (RGB) image, any resolution

State: One-Dimensional (1D) - Floating-point vector

Language Instruction: One-Dimensional (1D) - String

Output(s):

Output Type(s): Actions

Output Format: Continuous-value vectors

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: Continuous-value vectors correspond to different motor controls on a robot; their dimensionality depends on the degrees of freedom of the robot embodiment.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): PyTorch, TensorRT

Supported Hardware Microarchitecture Compatibility: All of the below:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Hopper
NVIDIA Jetson
NVIDIA Lovelace

Supported Operating System:

Ubuntu

Model Version(s):

GR00T-H-N1.7, post-trained from GR00T N1.7

Training, Testing, and Evaluation Datasets:

Dataset Overview:

Full Open-H-Embodiment Dataset: 770 hours; 124,019 episodes; 119 datasets; 20 robot platforms; 50+ institutions

Post-Training Subset: 601 hours (real-world surgical tasks only); ~63,930 episodes; 58 datasets; 7 robot platforms

Dataset partition: Training 98%, Testing N/A (real-world robot evaluation only), Validation 2%

Training Data Summary

GR00T-H-N1.7 is adapted from the upstream GR00T N1.7 foundation model using an Open-H post-training phase. The full Open-H-Embodiment dataset contains 770 hours of paired video and kinematic data across 124,019 episodes with synchronized streams such as video, kinematics, force/torque, ultrasound, and domain-specific sensors. For post-training, a 601-hour subset of the full 770-hour corpus is used, restricted to real-world surgical datasets; ultrasound, endoscopy, and simulation data are left for future work. The Versius-500 contribution is capped at 20% of training steps to prevent any single embodiment from dominating the loss signal; remaining datasets are sampled proportionally to their size.
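One way to read the sampling rule above is as a capped proportional mixture: the capped dataset gets at most 20% of the sampling probability, and the remainder is split across the other datasets by size. The sketch below shows that reading. The function name, the dataset names other than Versius-500, and all sizes are hypothetical; the model card does not publish per-dataset sizes or the exact sampler.

```python
from typing import Dict


def sampling_weights(sizes: Dict[str, float], capped: str, cap: float = 0.20) -> Dict[str, float]:
    """Cap one dataset's share and spread the rest proportionally to size."""
    total = sum(sizes.values())
    rest_total = total - sizes[capped]
    if rest_total == 0:
        return {capped: 1.0}  # nothing else to sample from
    capped_share = min(cap, sizes[capped] / total)
    weights = {capped: capped_share}
    for name, size in sizes.items():
        if name != capped:
            weights[name] = (1.0 - capped_share) * size / rest_total
    return weights


# Hypothetical sizes (e.g. hours of data), chosen only to make the cap bind.
sizes = {"Versius-500": 500.0, "dataset_a": 60.0, "dataset_b": 40.0}
weights = sampling_weights(sizes, capped="Versius-500")
for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

In this example the capped dataset would take about 83% of samples under purely proportional sampling; the cap reduces it to 20%, and the other two datasets share the remaining 80% in a 60:40 ratio.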

GR00T-H-N1.7 was trained on 7 robot platforms across 58 datasets: CMR Versius, dVRK, dVRK-Si, Rob Surgical...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Low downloads, minor release