NVIDIA/warp v1.13.0

Repository: NVIDIA/warp

Tag: v1.13.0

Published: 2026-05-04T04:52:26Z

Prerelease: no

Release notes:

Warp v1.13.0

Warp v1.13 introduces experimental graph capture serialization with CPU replay, letting captured simulations roundtrip through a portable .wrp file and load from standalone C++ on either GPU or CPU. It also adds an experimental cuBQL BVH backend for wp.Mesh that accelerates ray-heavy mesh queries, the wp.bfloat16 scalar type, a pluggable CUDA allocator interface with built-in RAPIDS Memory Manager (RMM) integration, scoped memory tracking with C++-layer call-site attribution, and a batch of new tile primitives (tile_dot, tile_axpy, tile_stack, scatter helpers).

New features

Graph capture serialization and CPU replay

> [!IMPORTANT]
> This is an experimental feature. The API may change without a formal deprecation cycle.

Warp v1.13 introduces a portable serialized-graph format. Operations recorded during wp.capture_begin(apic=True) / wp.capture_end() can be saved to a .wrp file with wp.capture_save() and replayed from either Python or standalone C++ via wp.capture_load(), enabling cross-process and cross-language graph reuse (#1349). CPU graph capture is also new in this release: the same wp.Graph object now replays on CPU through wp.capture_launch(), and the underlying APIC operation log is what gets serialized. A new wp.handle (a uint64 alias) carries wp.Mesh handles across save and load so kernels can keep referencing meshes after deserialization.

import warp as wp

with wp.ScopedDevice("cpu"):
    a = wp.zeros(64, dtype=float)
    b = wp.zeros(64, dtype=float)

    wp.capture_begin(apic=True)
    wp.copy(b, a)
    graph = wp.capture_end()

    wp.capture_save(graph, "demo", inputs={"a": a}, outputs={"b": b})

# Later (in the same process or a fresh one): replay from disk.
with wp.ScopedDevice("cpu"):
    loaded = wp.capture_load("demo")
    loaded.set_param("a", wp.array([1.0] * 64, dtype=float))
    wp.capture_launch(loaded)

Loading and replaying from standalone C++ (CPU device shown). The full example also walks the _modules/ directory, loads each .o via wp_load_obj, resolves kernel symbols, and registers them with wp_apic_register_loaded_cpu_kernel before the first replay. The snippet below elides that boilerplate:

#include "apic.h"
#include "warp.h"

wp_init(nullptr);
APICGraph graph = wp_apic_load_graph(nullptr, "demo.wrp", 1); // 1 = CPU device

// (Walk demo_modules/, load each .o, and register kernels. See linked example.)

wp_apic_set_param(graph, "a", a_buffer, a_size);
wp_apic_cpu_replay_graph(graph); // For CUDA: cudaGraphLaunch(wp_apic_get_cuda_graph_exec(graph), stream)
wp_apic_get_param(graph, "b", b_buffer, b_size);

wp_apic_destroy_graph(graph);

See `warp/examples/cpp/02_apic_visualization` (CUDA replay) and `warp/examples/cpp/03_apic_visualization_cpu` (CPU replay) for end-to-end demos with OpenGL visualization.

What gets written:

demo.wrp              # operation byte stream + region snapshots + metadata
demo_modules/
    .cubin / .meta    # one per CUDA kernel module, arch-pinned
    .o                # one per CPU kernel module (CPU capture)
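
As a rough illustration of the layout above, the sketch below gathers the files that belong to one saved capture so they can be copied or packaged together. It uses only the Python standard library. The helper name `capture_artifacts` is hypothetical, and the `<base>.wrp` / `<base>_modules/` naming is inferred from the listing above rather than taken from a documented contract.

```python
from pathlib import Path


def capture_artifacts(base: str) -> list[Path]:
    """Return the files that make up one saved capture.

    Hypothetical helper, not part of Warp. Assumes the layout shown in the
    release notes: `<base>.wrp` next to a `<base>_modules/` directory.
    """
    base_path = Path(base)
    graph_file = base_path.with_name(base_path.name + ".wrp")
    modules_dir = base_path.with_name(base_path.name + "_modules")

    if not graph_file.is_file():
        raise FileNotFoundError(f"missing serialized graph: {graph_file}")

    files = [graph_file]
    if modules_dir.is_dir():
        # Kernel modules: .cubin/.meta for CUDA, .o for CPU captures.
        files.extend(sorted(p for p in modules_dir.iterdir() if p.is_file()))
    return files
```

Because the CUDA modules are described as arch-pinned, a packaging step built on this would likely need to ship the whole modules directory alongside the `.wrp` file rather than the `.wrp` file alone.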

Key capabilities:

  • `wp.capture_save(graph, path, inputs=..., outputs=...)` registers named bindings so the consumer side can swap in fresh inputs and read outputs by name without touching the graph topology.
  • `wp.capture_load()` + `wp.capture_launch()` support replay on both CPU and CUDA. Loaded graphs expose set_param, get_param, and get_param_ptr for each registered binding, plus params and is_loaded properties on wp.Graph.
  • `wp.handle` scalar type and `wp.Mesh` remap let kernels accept mesh handles whose underlying objects are reconstructed on load. APIC walks @wp.struct fields recursively to find handle pointers and remap them.

Stability and known gaps:

API Capture is experimental, and we plan to keep adding capabilities and closing gaps over future releases (tracker: #1388). For now, regenerate .wrp artifacts when upgrading Warp. The current operation set, handle types, and platform constraints are documented in the Graphs section of the user guide.
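
One way to act on the "regenerate when upgrading" advice is to record the Warp version next to each artifact and compare it before loading. The sketch below is an assumption-laden illustration, not a Warp feature: the `.stamp.json` sidecar file and the helper names are hypothetical, and the version string is passed in as a plain argument so the logic stays independent of Warp itself.

```python
import json
from pathlib import Path


def stamp_path(base: str) -> Path:
    """Location of the hypothetical version sidecar for a saved capture."""
    return Path(base + ".stamp.json")


def write_stamp(base: str, warp_version: str) -> None:
    """Record which Warp version produced the artifacts at `base`."""
    stamp_path(base).write_text(json.dumps({"warp_version": warp_version}))


def needs_regeneration(base: str, warp_version: str) -> bool:
    """Return True when the stamp is missing, unreadable, or from another version."""
    stamp = stamp_path(base)
    if not stamp.is_file():
        return True
    try:
        recorded = json.loads(stamp.read_text()).get("warp_version")
    except (json.JSONDecodeError, AttributeError):
        return True
    return recorded != warp_version
```

In practice the version string would come from the installed package (for example `wp.config.version`), with `write_stamp` called right after `wp.capture_save()` and `needs_regeneration` checked before `wp.capture_load()`.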

cuBQL BVH backend for wp.Mesh

> [!IMPORTANT]
> This is an experimental feature. The API may change without a formal deprecation cycle.

wp.Mesh now accepts bvh_constructor="cubql" to build its acceleration structure with cuBQL, an Apache 2.0-licensed header-only CUDA library for fast BVH construction and traversal (#1286). For ray-heavy workloads on dense static meshes, where the existing SAH builder's exhaustive construction dominates setup time and where ray traversal sits on the simulation hot path, cuBQL typically delivers faster ray queries alongside consistently lower build times than the SAH, median, and LBVH builders. As one specific data point, a Warp-based renderer benchmark on an RTX 4090 (Franka Emika Panda visual mesh, 8192 parallel worlds) saw simulation time drop from 1.41 s to 0.98 s after switching the constructor with no other changes. Speedups depend heavily on mesh size, query mix, and how much of the frame the mesh queries occupy, so benchmark on your own scene before relying on a particular win.
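
To put the quoted data point in relative terms, the short calculation below converts the two timings into a speedup factor and a percentage reduction. The helper names are illustrative only, and the numbers are the ones reported above for that single benchmark, not a general prediction.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of old to new wall time (values above 1.0 mean faster)."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("timings must be positive")
    return before_s / after_s


def time_reduction_pct(before_s: float, after_s: float) -> float:
    """Percentage of the original wall time that was saved."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * (before_s - after_s) / before_s


# Timings from the RTX 4090 renderer benchmark quoted above.
print(f"{speedup(1.41, 0.98):.2f}x")            # prints 1.44x
print(f"{time_reduction_pct(1.41, 0.98):.1f}%")  # prints 30.5%
```

The same two helpers can be applied to timings measured on your own scene to check whether the switch pays off there.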

The "cubql" backend currently only routes wp.mesh_query_ray() through cuBQL's traversal kernels. Extending it to point queries, AABB queries, grouped queries, and winding-number support is future work. Today, passing groups=... or support_winding_number=True to a cuBQL wp.Mesh raises a RuntimeError at construction. Calling wp.mesh_query_point_* or wp.mesh_query_aabb_* against a cuBQL mesh silently returns no results. Stick with the default SAH/median/LBVH builders for kernels that mix query types or aren't ray-bound.

import warp as wp

mesh = wp.Mesh(
    points=points,            # wp.array of wp.vec3
    indices=tri_indices,      # wp.array of wp.int32, shape (num_tris * 3,)
    bvh_constructor="cubql",
)
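
Since non-ray queries against a cuBQL mesh are described above as silently returning no results, it may be worth making the constructor choice explicit in application code. The following is a minimal sketch of that decision, assuming the restrictions listed in this section. `choose_bvh_constructor` is a hypothetical helper, not a Warp API, and it returns `None` to mean "leave Warp's default builder in place".

```python
from typing import Optional


def choose_bvh_constructor(
    uses_point_queries: bool = False,
    uses_aabb_queries: bool = False,
    uses_groups: bool = False,
    needs_winding_number: bool = False,
) -> Optional[str]:
    """Return "cubql" for ray-only workloads, otherwise None.

    Hypothetical helper encoding the limitations described above:
    the cuBQL backend currently serves only ray queries.
    """
    ray_only = not (
        uses_point_queries
        or uses_aabb_queries
        or uses_groups
        or needs_winding_number
    )
    return "cubql" if ray_only else None


# Usage sketch: only pass bvh_constructor when cuBQL is appropriate.
constructor = choose_bvh_constructor(uses_point_queries=True)
extra_kwargs = {} if constructor is None else {"bvh_constructor": constructor}
# mesh = wp.Mesh(points=points, indices=tri_indices, **extra_kwargs)
```

Keeping the decision in one place makes it easier to revisit once the backend gains point, AABB, grouped, or winding-number support.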

Pluggable CUDA allocator and RMM integration

CUDA device-memory allocations can now be routed through any object implementing the wp.Allocator protocol via wp.set_cuda_allocator(), wp.set_device_allocator(), or scoped…

