What does this repo signal mean?

NVIDIA published the Python repository NVIDIA/recsys-examples. A repository signal like this can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. Key details: repo NVIDIA/recsys-examples · language Python · a solid recommender-systems examples repo with a moderate star count. onlylabs links this event to 1 captured evidence page and 6 related repo signals, and maps it to Infrastructure in the data-business radar.

NVIDIA Repo: NVIDIA/recsys-examples

Captured source

GitHub: github.com/NVIDIA/recsys-examples

NVIDIA/recsys-examples repository metadata

published Apr 18, 2025 · seen 1w · captured 1w · http 200 · method plain

NVIDIA/recsys-examples

Description: Examples for Recommenders - easy to train and deploy on accelerated infrastructure.

Language: Python

License: NOASSERTION

Stars: 279

Forks: 71

Open issues: 44

Created: 2025-04-18T20:57:50Z

Pushed: 2026-06-16T04:23:02Z

Default branch: main

Fork: no

Archived: no

README:

NVIDIA RecSys Examples

Overview

NVIDIA RecSys Examples is a collection of optimized recommender models and components.

The project includes:

Examples for large-scale HSTU ranking and retrieval training through TorchRec and Megatron-Core integration
HSTU inference with paged KV cache, Triton Inference Server integration, CUDA graph usage, and C++ deployment with AOTInductor ([guide](./examples/hstu/inference/README.md))
Examples for semantic-id based retrieval model through TorchRec and Megatron-Core integration
DynamicEmb for model-parallel dynamic embedding tables with zero-collision hashing, eviction, admission control, table fusion, and TorchRec integration ([documentation](./corelib/dynamicemb/README.md))

What's New

[2026/5/20] 🎉v26.04 released!
Refactors the previous async KV-cache manager into a standalone [RecSys KVCache Manager package](corelib/recsys_kvcache_manager/), and adds a new FlexKV backend for multi-node/multi-tier KV storage, LLM-style KV APIs, and updated HSTU inference examples.
Introduces a new [beam-search decode attention kernel](./corelib/gr_decode_atten/) and CuTe kernels plus a generate_beam_decode() entry point, enabling more efficient KV-cache-based beam generation for the SID-GR model with vectorized masking utilities.
[2026/4/14] 🎉v26.03 released!
We added Torch export and AOTInductor packaging for end-to-end HSTU C++ inference. See the [HSTU inference overview](./examples/hstu/inference/README.md) and the [C++ inference guide](./examples/hstu/inference/GUIDE_TO_RUN_CPP_INFERENCE_DEMO.md).
We improved DynamicEmb with table fusion and expansion, relaxed embedding-table alignment (no longer power-of-two), and capacity sizing aligned to bucket_capacity. See [DynamicEmb](./corelib/dynamicemb/README.md).
We added an HSTU end-to-end training benchmark suite with progressive optimizations. See the [HSTU training benchmark](./examples/hstu/training/benchmark/README.md) and [E2E benchmark notes](./examples/hstu/training/benchmark/E2E_BENCHMARK.md).
We published HSTU inference benchmark results on B200 in the [HSTU inference benchmark](./examples/hstu/inference/benchmark/README.md).
We migrated HSTU attention to fbgemm_gpu_hstu, removed the legacy compatibility layer, and improved the training stack (fewer device-to-host syncs in jagged tensor handling, balancer tuning, and debug logging). See [HSTU training setup](./examples/hstu/training/README.md).
[2026/2/13] 🎉v26.01 released!
We optimized the HSTU KVCacheManager, moving Python-based KV cache management to an optimized C++ implementation with asynchronous onload/offload operations and compression support. Benchmarks show that onload and offload latency can be fully hidden under HSTU inference.
We introduced an HSTU training optimization with workload-balanced batch shuffling for data-parallel training.
We added caching and prefetching support for EmbeddingBagCollection.
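
The release note above does not describe how the workload-balanced batch shuffling works, so the following is only a minimal sketch of the general idea, not the repository's implementation. It assumes that per-sample workload can be approximated by sequence length and uses a greedy longest-first heuristic. The function name `balance_batch` and its parameters are hypothetical.

```python
import heapq


def balance_batch(seq_lengths, num_ranks):
    """Assign sample indices to ranks so per-rank total length is roughly equal.

    Greedy longest-first heuristic. Workload is approximated by sequence
    length, which is an assumption made for this sketch.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Place the longest sequences first, each onto the least-loaded rank.
    for idx in sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment


lengths = [900, 100, 500, 400, 300, 200]
parts = balance_batch(lengths, num_ranks=2)
loads = [sum(lengths[i] for i in part) for part in parts]
print(parts, loads)
```

With variable-length user histories, a plain random split can leave one rank with far more tokens than the others, and every other rank then waits at the gradient synchronization point. Balancing by an estimated cost is one way to reduce that idle time.
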
[2026/1/13] 🎉v25.12 released!
Added Triton Inference Server support for HSTU inference. Follow [the HSTU inference Triton example](./examples/hstu/inference/README.md#example-hstu-model-inference-with-triton-inference-server) to try it out.
We introduced our first semantic-id retrieval model example. Follow the semantic‑id retrieval (sid_gr) documentation to run it.

[2025/12/10] 🎉v25.11 released!
DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. By controlling admission, the system can prevent very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
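
To make the admission idea concrete, here is a minimal sketch that admits a feature ID only after it has been seen a minimum number of times. It is not the DynamicEmb API: the class `FrequencyAdmission`, the `threshold` parameter, and the `lookup` method are hypothetical names, and a count threshold is just one possible admission policy.

```python
from collections import defaultdict
import random


class FrequencyAdmission:
    """Toy embedding table that admits an ID only after `threshold` sightings.

    Hypothetical illustration; not the DynamicEmb interface.
    """

    def __init__(self, dim, threshold=3, seed=0):
        self.dim = dim
        self.threshold = threshold
        self.counts = defaultdict(int)  # sightings per not-yet-admitted ID
        self.table = {}                 # admitted ID -> embedding vector
        self.rng = random.Random(seed)

    def lookup(self, feature_id):
        """Return the embedding for feature_id, or zeros if it is not admitted."""
        if feature_id in self.table:
            return self.table[feature_id]
        self.counts[feature_id] += 1
        if self.counts[feature_id] >= self.threshold:
            # Admit: parameters are allocated only at this point.
            del self.counts[feature_id]
            self.table[feature_id] = [
                self.rng.gauss(0.0, 0.01) for _ in range(self.dim)
            ]
            return self.table[feature_id]
        return [0.0] * self.dim


emb = FrequencyAdmission(dim=4, threshold=3)
for fid in [7, 7, 42, 7]:
    emb.lookup(fid)
print(sorted(emb.table))  # prints [7]
```

In this sketch, ID 42 appears once and never receives an embedding row, while ID 7 is admitted on its third appearance. A real system would also need to bound the size of the counter structure itself.
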

[2025/11/11] 🎉v25.10 released!
HSTU training example supports sequence parallelism.
DynamicEmb supports LRU score checkpointing and gradient clipping.
Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention, and extended HSTU support to the SM89 GPU architecture for training.

[2025/10/20] 🎉v25.09 released!
Integrated prefetching and caching into the HSTU training example.
DynamicEmb now supports distributed embedding dumping and memory scaling.
Added kernel fusion in the HSTU block for inference, including KVCache fixes.
HSTU attention now supports FP8 quantization.

[2025/9/8] 🎉v25.08 released!
Added cache support for DynamicEmb, enabling seamless hot embedding migration between cache and storage.
Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
Enabled evaluation mode support for DynamicEmb.

[2025/8/1] 🎉v25.07 released!
Released HSTU inference benchmark, including a paged KV-cache HSTU kernel, a KV-cache manager based on TensorRT-LLM, CUDA graph, and other optimizations.
Added support for Tensor Parallelism in the HSTU layer.
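
For readers unfamiliar with paged KV caches, the sketch below shows the bookkeeping idea in isolation: each sequence owns a list of fixed-size pages rather than one contiguous buffer. It is a simplified illustration, not the repository's KV-cache manager or the TensorRT-LLM one. `PagedKVAllocator` and its methods are hypothetical names, and no tensors are stored.

```python
class PagedKVAllocator:
    """Tracks which fixed-size pages belong to which sequence (hypothetical)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of page ids
        self.seq_len = {}     # seq_id -> number of tokens stored

    def append(self, seq_id, num_tokens):
        """Reserve room for num_tokens more tokens; return the sequence's pages."""
        pages = self.page_table.setdefault(seq_id, [])
        new_len = self.seq_len.get(seq_id, 0) + num_tokens
        needed = -(-new_len // self.page_size) - len(pages)  # ceiling division
        if needed > len(self.free_pages):
            raise MemoryError("out of KV pages")
        for _ in range(needed):
            pages.append(self.free_pages.pop())
        self.seq_len[seq_id] = new_len
        return pages

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
first = list(alloc.append("user_a", 20))   # 20 tokens need 2 pages
second = list(alloc.append("user_a", 10))  # 30 tokens still fit in 2 pages
alloc.append("user_b", 16)                 # 1 page
alloc.release("user_a")
print(len(first), len(second), len(alloc.free_pages))
```

Because pages are allocated on demand, a sequence that grows by a few tokens usually reuses the slack in its last page, and releasing a sequence frees whole pages without fragmenting the pool.
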

[2025/7/4] 🎉v25.06 released!
DynamicEmb lookup module performance improvements and LFU eviction support.
Pipeline support for HSTU example, recompute support for HSTU layer, and customized CUDA ops for jagged tensor concat.
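
As background on the LFU eviction mentioned for DynamicEmb in this release, here is a minimal sketch of a fixed-capacity table that evicts the least frequently used key. It illustrates the policy only and says nothing about how DynamicEmb implements it. `LFUTable` is a hypothetical name, and the linear scan for the victim is a simplification.

```python
class LFUTable:
    """Fixed-capacity key -> value store with least-frequently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}
        self.freq = {}  # key -> access count

    def get(self, key):
        if key not in self.values:
            return None
        self.freq[key] += 1
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # O(n) scan for the lowest count; acceptable for a sketch.
            victim = min(self.freq, key=self.freq.get)
            del self.values[victim]
            del self.freq[victim]
        self.values[key] = value
        self.freq[key] = self.freq.get(key, 0) + 1


table = LFUTable(capacity=2)
table.put("a", 1)
table.put("b", 2)
table.get("a")      # "a" is now used more often than "b"
table.put("c", 3)   # table is full, so "b" is evicted
print(sorted(table.values))  # prints ['a', 'c']
```

Compared with LRU, this policy keeps IDs that are accessed often even if they have not appeared recently, which can suit heavy-tailed feature distributions.
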

[2025/5/29] 🎉v25.05 released!
Enhancements to DynamicEmb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
Fusion of operations like layernorm and dropout in the HSTU layer, resulting in about 1.2x end-to-end speedup.
Fixed convergence issues on the Kuairand dataset.

For more detailed release notes, please refer to our [releases][releases].

Get Started

Supported examples:

[HSTU recommender examples](./examples/hstu/README.md)
[HSTU inference](./examples/hstu/inference/README.md) — KV cache, Triton Inference...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Solid recsys examples repo with moderate stars