Repo: NVIDIA, published Dec 30, 2024

NVIDIA/cosmos

Description: NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.

Language: Jupyter Notebook

License: NOASSERTION

Stars: 9842

Forks: 637

Open issues: 15

Created: 2024-12-30T17:21:14Z

Pushed: 2026-06-10T00:26:22Z

Default branch: main

Fork: no

Archived: no

README:

Cosmos

Website | Framework | Models

Table of Contents

  • [Introduction](#introduction)
  • [Cosmos 3](#cosmos-3)
  • [Key Capabilities](#key-capabilities)
  • [Model Architecture](#model-architecture)
  • [Model Family](#model-family)
  • [Supported Generation Settings](#supported-generation-settings)
  • [Input and Output](#input-and-output)
  • [Use Cases](#use-cases)
  • [Generator](#generator)
  • [Reasoner](#reasoner)
  • [Quickstart](#quickstart)
  • [Generator with Diffusers](#generator-with-diffusers)
  • [Generator with vLLM-Omni](#generator-with-vllm-omni)
  • [Reasoner with Transformers](#reasoner-with-transformers)
  • [Reasoner with vLLM](#reasoner-with-vllm)
  • [Reasoner with NIM](#reasoner-with-nim)
  • [Troubleshooting](#troubleshooting)
  • [Which CUDA version should I use?](#which-cuda-version-should-i-use)
  • [Which base container should I use?](#which-base-container-should-i-use)
  • [torch.cuda.is_available() is False](#torchcudais_available-is-false-the-nvidia-driver-on-your-system-is-too-old)
  • [Import fails with libxcb.so.1: cannot open shared object file](#import-fails-with-libxcbso1-cannot-open-shared-object-file)
  • [uv errors on install or sync](#uv-errors-on-install-or-sync)
  • [Choosing an Integration](#choosing-an-integration)
  • [Examples](#examples)
  • [Inference Benchmarks](#inference-benchmarks)
  • [Finetune](#finetune)
  • [Limitations](#limitations)
  • [Ecosystem](#ecosystem)
  • [News](#news)
  • [License and Contact](#license-and-contact)

Introduction

NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.

Cosmos 3

Cosmos 3 is our newest model family [[Models]]() [[Report]](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf) [[Website]](https://research.nvidia.com/labs/cosmos-lab/cosmos3/). It is a suite of omnimodal world models that jointly process and generate language, images, video, audio, and action sequences within a unified Mixture-of-Transformers architecture. Its flexible input-output configurations unify the key modalities for Physical AI, subsuming vision-language models, video generators, world simulators, and world-action models in a single framework.

Cosmos 3 exposes two runtime surfaces:

| Surface | Inputs | Outputs | Use Cases |
|----------|----------|----------|----------|
| Reasoner | Text, vision | Text | World understanding, grounding, physical reasoning, task planning, action forecasting, embodied agent reasoning, and autonomous system decision making |
| Generator | Text, vision, sound, action | Vision, sound, action | World generation, world simulation, future prediction, synthetic data generation, policy learning, and robot training |
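The surface table can be read as a lookup from requested modalities to a runtime surface. The following is a minimal sketch of that reading only; `SURFACES` and `pick_surface` are hypothetical names for illustration and are not part of any Cosmos API.

```python
# Hypothetical helper: encodes the Reasoner/Generator table as plain data.
SURFACES = {
    "reasoner": {
        "inputs": {"text", "vision"},
        "outputs": {"text"},
    },
    "generator": {
        "inputs": {"text", "vision", "sound", "action"},
        "outputs": {"vision", "sound", "action"},
    },
}


def pick_surface(inputs, outputs):
    """Return the surfaces whose listed modalities cover the request."""
    inputs, outputs = set(inputs), set(outputs)
    return [
        name
        for name, spec in SURFACES.items()
        if inputs <= spec["inputs"] and outputs <= spec["outputs"]
    ]


# A captioning-style request maps to the Reasoner,
# a text-to-video-with-sound request maps to the Generator.
print(pick_surface(["text", "vision"], ["text"]))
print(pick_surface(["text"], ["vision", "sound"]))
```

A request that no surface covers (for example sound in, text out) returns an empty list, which matches the table: the Reasoner does not list sound as an input and the Generator does not list text as an output.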

Key Capabilities

  • World understanding: Analyze videos and images for captions, temporal events, next actions, spatial grounding, physical plausibility, and causal outcomes.
  • World generation: Produce images, videos, synchronized sound, and action-conditioned rollouts from text, image, video, or action inputs.
  • Action modeling: Predict policy actions, inverse dynamics, and forward dynamics for robotics, camera motion, egocentric motion, and autonomous-driving settings.
  • Research and production paths: Use Diffusers and Transformers for Python-first development, then vLLM-Omni and vLLM for OpenAI-compatible serving.
  • Post-training recipes: Adapt vision, action, and reasoner workflows with Cosmos Framework training recipes and task-specific evaluation [Coming Soon].

Model Architecture

![Cosmos 3 model architecture](cookbooks/cosmos3/cosmos3-model-architecture.png)

Cosmos 3 is an omnimodal world model built on a unified Mixture-of-Transformers (MoT) architecture that combines an autoregressive (AR) transformer for reasoning with a diffusion transformer (DM) for multimodal generation. In Reasoner Mode, language and visual understanding tokens are processed through causal self-attention, enabling next-token prediction for tasks such as perception, planning, and world reasoning. In Generator Mode, noisy image, video, audio, and action tokens are denoised through full attention, allowing the model to jointly generate coherent multimodal outputs. Both modes share the same transformer architecture and multimodal attention layers. They also share a unified 3D multi-dimensional rotary position embedding (mRoPE) representation that encodes spatial and temporal structure across modalities, which keeps reasoning consistent over images, videos, audio streams, and action trajectories.
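The difference between the two attention patterns named above can be shown with a toy example. This is a sketch of generic causal versus full attention masking in plain Python, not the Cosmos 3 implementation; the function names and the uniform score matrix are assumptions chosen to make the effect easy to see.

```python
import math


def causal_mask(n):
    """Token i may attend to tokens 0..i only (autoregressive pattern)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every token may attend to every token (denoising pattern)."""
    return [[True] * n for _ in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out positions get zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Uniform scores isolate the effect of the mask itself.
scores = [[0.0] * 4 for _ in range(4)]
reasoner_weights = masked_softmax(scores, causal_mask(4))
generator_weights = masked_softmax(scores, full_mask(4))

print(reasoner_weights[1])   # [0.5, 0.5, 0.0, 0.0]
print(generator_weights[1])  # [0.25, 0.25, 0.25, 0.25]
```

Under the causal mask the second token spreads its attention over itself and the first token only, while under the full mask it attends equally to all four positions.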

Model Family

| Model | Size | Primary Capability |
|---------|---------:|---------|
| [Cosmos3-Nano](https://huggingface.co/nvidia/Cosmos3-Nano) | 16B | Compact omnimodal world model for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI. |
| [Cosmos3-Super](https://huggingface.co/nvidia/Cosmos3-Super) | 64B | Frontier-scale omnimodal world model for advanced multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI. |
| [Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image) | 64B | High-fidelity text-to-image generation. |
| [Cosmos3-Super-Image2Video](https://huggingface.co/nvidia/Cosmos3-Super-Image2Video) | 64B | Temporally coherent image-to-video generation. |
| [Cosmos3-Nano-Policy-DROID](https://huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID) | 16B | Vision-language robot policy for DROID manipulation and control. |
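When choosing between the 16B and 64B variants, a rough weight-only memory estimate is a useful first check. The sketch below assumes the "Size" column is the total parameter count and that weights are stored in BF16 (2 bytes per parameter). It ignores activations, caches, and framework overhead, so actual requirements will be higher. `weight_memory_gb` is a hypothetical helper, not a Cosmos utility.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 uses 16 bits per value

# Sizes copied from the model table, in billions of parameters (assumed total).
MODEL_PARAMS_BILLIONS = {
    "nvidia/Cosmos3-Nano": 16,
    "nvidia/Cosmos3-Super": 64,
    "nvidia/Cosmos3-Super-Text2Image": 64,
    "nvidia/Cosmos3-Super-Image2Video": 64,
    "nvidia/Cosmos3-Nano-Policy-DROID": 16,
}


def weight_memory_gb(params_billions, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Approximate weight-only footprint in GB (1 GB = 10**9 bytes)."""
    return params_billions * bytes_per_param


for model_id, size in MODEL_PARAMS_BILLIONS.items():
    print(f"{model_id}: ~{weight_memory_gb(size)} GB of BF16 weights")
```

Under these assumptions the 16B models need about 32 GB for weights alone and the 64B models about 128 GB, which suggests the larger variants would need multiple GPUs or offloading on most hardware.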

Supported Generation Settings

| Setting | Supported values |
| ------------------| --------------------------------------- |
| Resolution tiers | 256p, 480p, 720p, default=480p |
| Aspect ratios | 16:9, 4:3, 1:1, 3:4, 9:16, default=16:9 |
| Frame rates | 10, 16, 24, and 30 FPS, default=24 |
| Frame count | 5 to 300 frames, default=189 |
| Precision | BF16 tested |
| Operating system | Linux |
| GPU architectures | NVIDIA Ampere,…
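The resolution, aspect ratio, frame rate, and frame count rows above can be checked before a generation request is submitted. The following is a minimal sketch of such a check; `GenerationSettings` and its field names are hypothetical and do not correspond to a documented Cosmos or Diffusers parameter set.

```python
from dataclasses import dataclass

# Values copied from the settings table.
RESOLUTION_TIERS = ("256p", "480p", "720p")
ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
FRAME_RATES = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container; defaults mirror the table's defaults."""
    resolution: str = "480p"
    aspect_ratio: str = "16:9"
    fps: int = 24
    num_frames: int = 189

    def __post_init__(self):
        if self.resolution not in RESOLUTION_TIERS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.fps not in FRAME_RATES:
            raise ValueError(f"unsupported frame rate: {self.fps}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            raise ValueError(f"frame count out of range: {self.num_frames}")

    @property
    def duration_seconds(self):
        """Clip length implied by frame count and frame rate."""
        return self.num_frames / self.fps


print(GenerationSettings().duration_seconds)  # 7.875
```

With the default values, 189 frames at 24 FPS yields a clip of just under 8 seconds. At the maximum of 300 frames and the lowest frame rate of 10 FPS the clip would run 30 seconds.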
