Fork · DeepInfra · published Jan 26, 2026

deepinfra/dynamo

forked from ai-dynamo/dynamo


deepinfra/dynamo

Description: A Datacenter Scale Distributed Inference Serving Framework

Language: Rust

License: NOASSERTION

Stars: 1

Forks: 0

Open issues: 0

Created: 2026-01-26T23:19:52Z

Pushed: 2026-05-21T00:06:59Z

Default branch: main

Fork: yes

Parent repository: ai-dynamo/dynamo

Archived: no

README:

![Dynamo banner](./docs/assets/img/dynamo-frontpage-banner.png)

[Ask DeepWiki](https://deepwiki.com/ai-dynamo/dynamo) | [Discord](https://discord.gg/D92uqZRjCZ)

| [Docs](https://docs.nvidia.com/dynamo/) | [Roadmap](https://github.com/ai-dynamo/dynamo/issues/5506) | [Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) | [Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples) | [Prebuilt Containers](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) | [Digest](docs/digest/index.mdx) | [Design Proposals](https://github.com/ai-dynamo/enhancements) | [How to Contribute](#community-and-contributing) |

Dynamo

> [!NOTE]
> Day-0 DeepSeek-V4 recipes available. Tested Kubernetes deployment paths for [DeepSeek-V4-Pro](recipes/deepseek-v4/deepseek-v4-pro/) and [DeepSeek-V4-Flash](recipes/deepseek-v4/deepseek-v4-flash/) are merged to main on both vLLM and SGLang, with a prebuilt SGLang container image published on NGC.

The open-source, datacenter-scale inference stack. Dynamo is the orchestration layer above inference engines: it doesn't replace SGLang, TensorRT-LLM, or vLLM; it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.

Built in Rust for performance, Python for extensibility.

When to use Dynamo

  • You're serving LLMs across multiple GPUs or nodes and need to coordinate them
  • You want KV-aware routing to avoid redundant prefill computation
  • You need to independently scale prefill and decode (disaggregated serving)
  • You want automatic scaling that meets latency SLAs at minimum total cost of ownership (TCO)
  • You need fast cold-starts when spinning up new replicas

If you're running a single model on a single GPU, your inference engine alone is probably sufficient.
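As a rough illustration of why scaling prefill and decode independently can matter, the sketch below sizes each pool from its own load and per-replica capacity. It is not Dynamo's Planner: the function `replicas_needed`, the load and capacity figures, and the `headroom` factor are all hypothetical values chosen for the example.

```python
import math


def replicas_needed(load_tokens_per_s: float,
                    capacity_tokens_per_s: float,
                    headroom: float = 0.8) -> int:
    """Smallest replica count keeping each replica at or below `headroom` utilization.

    Every parameter here is an illustrative assumption, not a Dynamo setting.
    """
    if capacity_tokens_per_s <= 0 or not 0 < headroom <= 1:
        raise ValueError("capacity must be positive and headroom in (0, 1]")
    usable = capacity_tokens_per_s * headroom
    # Always keep at least one replica so the pool can accept traffic.
    return max(1, math.ceil(load_tokens_per_s / usable))


# Hypothetical workload: long prompts, short answers.
prefill_load = 110_000  # prompt tokens per second arriving
decode_load = 9_000     # output tokens per second generated

# Hypothetical per-replica capacities for each phase.
prefill_replicas = replicas_needed(prefill_load, capacity_tokens_per_s=50_000)
decode_replicas = replicas_needed(decode_load, capacity_tokens_per_s=2_500)

print(f"prefill pool: {prefill_replicas}, decode pool: {decode_replicas}")
```

With these made-up numbers the two pools end up different sizes, which a single combined pool could not express. A real autoscaler would also have to account for latency targets, queueing, and cost, which this sketch ignores.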

Feature support at a glance:

| | SGLang | TensorRT-LLM | vLLM |
|---|:----:|:----------:|:--:|
| **Disaggregated Serving** | ✅ | ✅ | ✅ |
| **KV-Aware Routing** | ✅ | ✅ | ✅ |
| **SLA-Based Planner** | ✅ | ✅ | ✅ |
| **KVBM** | 🚧 | ✅ | ✅ |
| **Multimodal** | ✅ | ✅ | ✅ |
| [Tool Calling](docs/tool-calling/README.md) | ✅ | ✅ | ✅ |

> [Full Feature Matrix →](https://docs.nvidia.com/dynamo/resources/feature-matrix) — LoRA, request migration, speculative decoding, and feature interactions.

Key Results

| Result | Context |
|--------|---------|
| 7x higher throughput per GPU | DeepSeek R1 on GB200 NVL72 w/ Dynamo vs B200 without (InferenceX) |
| 7x faster model startup | ModelExpress weight streaming (DeepSeek-V3 on H200) |
| 2x faster time to first token | KV-aware routing, Qwen3-Coder 480B (Baseten benchmark) |
| 80% fewer SLA breaches | Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025 @ 2:50:00) |
| 750x higher throughput | DeepSeek-R1 on GB300 NVL72 (InferenceXv2) |

What Dynamo Does

Most inference engines optimize a single GPU or a single node. Dynamo is the orchestration layer above them — it turns a cluster of GPUs into a coordinated inference system.

[Architecture Deep Dive →](https://docs.nvidia.com/dynamo/design-docs/overall-architecture)

Core Capabilities

| Capability | What it does | Why it matters |
|------------|-------------|----------------|
| **Disaggregated Prefill/Decode** | Separates prefill and decode into independently scalable GPU pools | Maximizes GPU utilization; each phase runs on hardware tuned for its workload |
| **KV-Aware Routing** | Routes requests based on worker load and KV cache overlap | Eliminates redundant prefill computation; 2x faster TTFT |
| **KV Block Manager (KVBM)** | Offloads KV cache across GPU → CPU → SSD → remote storage | Extends effective context length beyond GPU memory |
| **ModelExpress** | Streams model weights GPU-to-GPU via NIXL/NVLink | 7x faster cold-start for new replicas |
| **Planner** | SLA-driven autoscaler that profiles workloads and right-sizes pools | Meets latency targets at minimum total cost of ownership (TCO) |
| **Grove** | K8s operator for topology-aware gang scheduling (NVL72) | Places workloads optimally across racks, hosts, and NUMA nodes |
| **AIConfigurator** | Simulates 10K+ deployment configs in seconds | Finds optimal serving config without burning GPU-hours |
| **Fault Tolerance** | Canary health checks + in-flight request migration | Workers fail; user requests don't |
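The KV-Aware Routing row describes routing on two signals, worker load and KV cache overlap. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not Dynamo's router: the `Worker` class, the block size of 4 tokens, the chained block hashing, and the `overlap - load_weight * load` score are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    active_requests: int      # stand-in for worker load (assumption)
    cached_blocks: frozenset  # hashes of KV blocks already resident on the worker


def prefix_block_hashes(tokens: list[int], block_size: int = 4) -> list[int]:
    """Hash each full block chained with its parent hash, so two equal hashes
    at position i imply the same token prefix up to block i."""
    hashes, parent = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        hashes.append(parent)
    return hashes


def overlap(worker: Worker, request_hashes: list[int]) -> int:
    """Count the leading request blocks the worker already has cached."""
    matched = 0
    for block_hash in request_hashes:
        if block_hash not in worker.cached_blocks:
            break
        matched += 1
    return matched


def pick_worker(workers: list[Worker], tokens: list[int],
                load_weight: float = 1.0) -> Worker:
    """Pick the worker with the highest (cache overlap - weighted load) score."""
    request_hashes = prefix_block_hashes(tokens)
    return max(
        workers,
        key=lambda w: overlap(w, request_hashes) - load_weight * w.active_requests,
    )


# A shared 16-token system prompt followed by 4 request-specific tokens.
system_prompt = list(range(16))
request = system_prompt + [100, 101, 102, 103]

warm = Worker("warm", 2, frozenset(prefix_block_hashes(system_prompt)))
cold = Worker("cold", 0, frozenset())

chosen = pick_worker([warm, cold], request)
print(chosen.name)
```

Here the warm worker wins despite carrying more load, because it already holds four of the request's five blocks and would only need to prefill the last one. If its load were raised far enough, the same score would send the request to the cold worker instead, which is the trade-off the table row summarizes.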

New in 1.0

  • Zero-config deploy ([DGDR](https://docs.nvidia.com/dynamo/kubernetes-deployment/deployment-guide/dgdr-reference)) *(beta):* Specify model, HW, and SLA in one YAML — AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys
  • Agentic inference: Per-request hints for latency priority, expected…


Notability

notability 1.0/10

Routine fork with negligible traction.