What does this repo signal mean?

NVIDIA published NVIDIA/dl-lowlat-infer (Cuda), a repository for low-latency inference on sliding-window LSTMs. Repository signals like this one expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo NVIDIA/dl-lowlat-infer · language Cuda · new repo with low stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Infrastructure in the data-business radar.

NVIDIA Repo: NVIDIA/dl-lowlat-infer

Captured source

GitHub · github.com/NVIDIA/dl-lowlat-infer

NVIDIA/dl-lowlat-infer repository metadata

published Feb 16, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

NVIDIA/dl-lowlat-infer

Description: Low Latency inference for sliding window LSTMs

Language: Cuda

License: Apache-2.0

Stars: 14

Forks: 2

Open issues: 0

Created: 2026-02-16T11:11:41Z

Pushed: 2026-04-10T11:33:01Z

Default branch: main

Fork: no

Archived: no

README:

Low Latency inference for sliding window LSTMs

Obtaining the Source Code

Note that this project stores some files in Git Large File Storage (LFS). Make sure a Git LFS client is installed before cloning this repository. On Ubuntu, you can install the client by running

sudo apt-get install git-lfs
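
If the repository is cloned without a Git LFS client, LFS-tracked files arrive as small text pointer files instead of the real data. The following is a minimal C++ sketch of a check for that case. It relies only on the documented pointer format, whose first line starts with "version https://git-lfs.github.com/spec/v1". The helper name is_lfs_pointer is hypothetical and not part of this repository.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of the repository): returns true when the file
// at `path` looks like an unfetched Git LFS pointer instead of real content.
// Git LFS pointer files are small text files whose first line begins with
// "version https://git-lfs.github.com/spec/v1".
bool is_lfs_pointer(const std::string& path) {
    static const std::string kPrefix = "version https://git-lfs.github.com/spec/v1";
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // unreadable or missing file: not treated as a pointer
    }
    std::string first_line;
    std::getline(in, first_line);
    // True only if the first line starts with the pointer prefix.
    return first_line.compare(0, kPrefix.size(), kPrefix) == 0;
}
```

If a data file is flagged by a check like this, running git lfs pull inside the clone fetches the real content.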

Building with CMake

The project requires the following build tools:

a recent version of gcc supporting C++20
CMake >= 3.20
CUDA Toolkit >= 13.0

Inside your code directory, configure and kick off a build as follows:

cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -B build
cmake --build build --parallel

Build Options

Key CMake options:

CMAKE_BUILD_TYPE: Build type (Release or Debug)
CPU_ARCH: Target CPU architecture (default: icelake-server, use native for current CPU)
CMAKE_CUDA_ARCHITECTURES: Target CUDA architecture (e.g., "90a-real", "100a-real", "120a-real")
CMAKE_INSTALL_PREFIX: Installation directory for the nvLstmInf binary
TBC_USE_MBARRIER: Use mbarrier primitives for thread block cluster synchronization (default: OFF)

To enable mbarrier optimization, add -DTBC_USE_MBARRIER=ON to the cmake configure command.

Testing with CMake

Tests can be run with ctest by using the generated CTestTestfile.cmake with

cd build/ && ctest

Note that some tests are timing related and may fail when, for example, the project is built with -DCMAKE_BUILD_TYPE=Debug.
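
Timing-related tests usually compare a measured latency statistic against a budget, which is why an unoptimized Debug build can push them over the limit. The sketch below illustrates that general pattern with std::chrono. The function names and the nearest-rank percentile are assumptions chosen for illustration, not the repository's actual test code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: names and structure are assumptions, not the project's tests.

// Runs `fn` `iterations` times and returns each call's latency in microseconds.
template <typename Fn>
std::vector<double> measure_latencies_us(Fn&& fn, std::size_t iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    std::vector<double> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        fn();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return samples;
}

// Nearest-rank percentile of a non-empty sample set, with p in (0, 100].
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    // rank = ceil(p / 100 * N), clamped to [1, N]
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    if (rank < 1) rank = 1;
    if (rank > samples.size()) rank = samples.size();
    return samples[rank - 1];
}
```

A test built on these helpers would assert that, say, percentile(samples, 99.0) stays below a chosen budget. That threshold is exactly the part that depends on optimization level and machine load.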

Installation

You may also specify an installation directory to CMake by passing the CMAKE_INSTALL_PREFIX option like so:

cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -DCMAKE_INSTALL_PREFIX=../../nvLstmInf -B build

Running cd build && make install -j then installs the nvLstmInf executable and all relevant configuration files into the directory ../../nvLstmInf.

Source Code Structure

The source code is contained in the following directories:

src/cpp/nvLstmInf: CUDA based implementation of LSTM Inference
src/python: Python benchmark orchestration scripts
thirdparty: open source libraries used by the benchmark (libnpy)
test: unit tests and benchmarking tools
docs: some additional documentation

Structure within the Implementation

The CUDA based implementation is organized as follows:

main.cu: The main entry point which parses command line parameters and runs the Benchmark.
Benchmark.cu, Benchmark.cuh: This class executes the benchmark. It instantiates one or several ModelInstances.
ModelInstance.cu, ModelInstance.cuh: An instantiation of an independent model instance in the Benchmark. It obtains new input data from a DataIterator and collects InferenceResults. Model implementations are expected to comply with the ModelInterface.
DataIterator.h: Provides access to new input data.
InferenceResult.cpp, InferenceResult.h: Collects inference results with timestamps.
ModelInterface.h: Defines the interface of virtual methods of a model implementation expected in ModelInstance.
ModelSingleStepBase.cu, ModelSingleStepBase.cuh: Base class for single-step LSTM implementations.
ModelSingleStep.cu, ModelSingleStep.cuh: Default single-step LSTM implementation using cuBLAS for matrix operations.
ModelConfig.cuh: Internal representation of model configuration parameters.
ModelDef.cpp, ModelDef.h: An internal representation of model weights and the network topology.
CudaEnv.cu, CudaEnv.cuh: A class that exposes basic information on the CUDA environment.
cuda_helpers.cu, cuda_helpers.cuh: A collection of helpers which wrap low level CUDA into C++17 constructs.
cuda_copy.cuh: Helper routines for copying data efficiently between the host and the GPU.
util.cpp, util.h: Various helper routines for data loading and saving, implementation of the random sequence generator, etc.
ss_gemv_impl/: Persistent timing kernel implementations for LSTM and ping pong models
ModelPersistentBase.cuh: Base class for all persistent timing models.
PingPongModelPersistent.cuh: Persistent timing implementation of the ping pong model for baseline measurements.
ModelSingleStepGemv.cuh, ModelSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl/.
ModelAsSingleStepGemv.cuh, ModelAsSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl/ with more advanced signaling. The latency-critical kernels are launched once at the beginning of the benchmark.
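
The single-step models listed above advance the LSTM by one time step per call, and the matrix-vector products (GEMV) make up the bulk of that step. As a plain C++ reference for the math only, the sketch below implements the standard LSTM cell equations on the CPU. The weight layout (gates stacked in i, f, g, o order, row-major) and all struct and function names are assumptions made for this example. The repository's CUDA kernels and cuBLAS path are not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout for this sketch: the four gates are stacked row-wise in the
// order input (i), forget (f), cell candidate (g), output (o).
struct LstmWeights {
    std::size_t input_size;
    std::size_t hidden_size;
    std::vector<float> W;  // [4*hidden_size x input_size], row-major
    std::vector<float> U;  // [4*hidden_size x hidden_size], row-major
    std::vector<float> b;  // [4*hidden_size]
};

struct LstmState {
    std::vector<float> h;  // hidden state, [hidden_size]
    std::vector<float> c;  // cell state, [hidden_size]
};

inline float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// y += A * x for a row-major [rows x cols] matrix A (a plain GEMV).
void gemv_accumulate(const std::vector<float>& A, const std::vector<float>& x,
                     std::size_t rows, std::size_t cols, std::vector<float>& y) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < cols; ++k) acc += A[r * cols + k] * x[k];
        y[r] += acc;
    }
}

// Advances `state` by one time step given the input vector `x`.
void lstm_step(const LstmWeights& w, const std::vector<float>& x, LstmState& state) {
    const std::size_t H = w.hidden_size;
    assert(x.size() == w.input_size);
    assert(state.h.size() == H && state.c.size() == H);

    std::vector<float> gates = w.b;                       // start from the bias
    gemv_accumulate(w.W, x, 4 * H, w.input_size, gates);  // + W * x_t
    gemv_accumulate(w.U, state.h, 4 * H, H, gates);       // + U * h_{t-1}

    for (std::size_t j = 0; j < H; ++j) {
        const float i = sigmoid(gates[j]);
        const float f = sigmoid(gates[H + j]);
        const float g = std::tanh(gates[2 * H + j]);
        const float o = sigmoid(gates[3 * H + j]);
        state.c[j] = f * state.c[j] + i * g;     // c_t = f * c_{t-1} + i * g
        state.h[j] = o * std::tanh(state.c[j]);  // h_t = o * tanh(c_t)
    }
}
```

For a sliding window, one option is to reset the state and call lstm_step once per element of the window. Whether the repository recomputes the full window or carries state between windows is not stated in this excerpt.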

Performance Tuning

Some useful commands:

# This benchmark is sensitive to CPU clock
sudo cpupower frequency-set --governor performance

# Set max clocks on GH200
sudo nvidia-smi -lgc 1980

# Isolate core IDs through grub
# E.g. to isolate 2 cores from the system
# Edit /etc/default/grub with
GRUB_CMDLINE_LINUX="isolcpus=18,19,54,55"
# Where 54 and 55 are the hyperthread (SMT) siblings, if multithreading is enabled
sudo update-grub
# Reboot for the change to take effect, then point the benchmark to those core IDs

# Run with numactl on a multi-socket system
numactl --membind=0 -- $NVLSTM_BIN_DIR/nvLstmInf ...
# NUMA topology:
nvidia-smi topo -m
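
Isolating cores with isolcpus only helps if the latency-critical thread actually runs on them. How nvLstmInf accepts core IDs is not shown in this excerpt. As a generic illustration, the Linux-only sketch below pins the calling thread to a single core with sched_setaffinity. The helper name pin_current_thread_to_core is hypothetical.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes the CPU_* macros with glibc
#endif
#include <sched.h>

// Hypothetical helper (Linux only, not part of the repository): restricts the
// calling thread to `core_id`. Returns true on success; on failure, errno is
// set by sched_setaffinity.
bool pin_current_thread_to_core(int core_id) {
    if (core_id < 0 || core_id >= CPU_SETSIZE) {
        return false;  // outside the range a cpu_set_t can represent
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

With the isolcpus example above, a benchmark thread could call pin_current_thread_to_core(18) before entering its measurement loop. Launching the process under taskset -c 18 achieves a similar effect from the shell.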

Sample usage

# Environment exports for development in a container
export NVLSTM_DATA_DIR=/app/dl-lowlat-infer/data
export NVLSTM_DIR=/code/dl-lowlat-infer
export NVLSTM_OUTPUT_DIR=$NVLSTM_DIR/output
export NVLSTM_BIN_DIR=$NVLSTM_DIR/build/src/cpp/nvlstm_inf
export NVLSTM_PYTHON_DIR=$NVLSTM_DIR/src/python

# Generate ONNX models for all three sizes (Optional)
python $NVLSTM_PYTHON_DIR/create_onnx_model.py --output-dir=$NVLSTM_DATA_DIR

# Convert lstm_s ONNX model to NumPy arrays (Optional). The weights will be dumped into
# $NVLSTM_DATA_DIR/models/lstm_s directory
python $NVLSTM_PYTHON_DIR/convert_model.py $NVLSTM_DATA_DIR/lstm_s.onnx $NVLSTM_DATA_DIR lstm_s

# Generate lstm_s model weights
python $NVLSTM_PYTHON_DIR/generate_model.py --num-steps=64...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New repo with low stars