NVIDIA/dl-lowlat-infer

Description: Low Latency inference for sliding window LSTMs

Language: Cuda

License: Apache-2.0

Stars: 14

Forks: 2

Open issues: 0

Created: 2026-02-16T11:11:41Z

Pushed: 2026-04-10T11:33:01Z

Default branch: main

Fork: no

Archived: no

README:

Low Latency inference for sliding window LSTMs

Obtaining the Source Code

Note that this project stores some files in Git Large File Storage (Git LFS). Make sure a Git LFS client is installed before cloning this repository. On Ubuntu, you can install the client by running:

sudo apt-get install git-lfs

Building with CMake

The project requires the following build tools:

Inside your code directory, configure and kick off a build as follows:

cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -B build
cmake --build build --parallel

Build Options

Key CMake options:

  • CMAKE_BUILD_TYPE: Build type (Release or Debug)
  • CPU_ARCH: Target CPU architecture (default: icelake-server, use native for current CPU)
  • CMAKE_CUDA_ARCHITECTURES: Target CUDA architecture (e.g., "90a-real", "100a-real", "120a-real")
  • CMAKE_INSTALL_PREFIX: Installation directory for the nvLstmInf binary
  • TBC_USE_MBARRIER: Use mbarrier primitives for thread block cluster synchronization (default: OFF)

To enable mbarrier optimization, add -DTBC_USE_MBARRIER=ON to the cmake configure command.
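The configure step can also be scripted. The following is a minimal sketch, assuming a hypothetical helper named configure_command that is not part of the repository; it only assembles the CMake options listed above into an argument list.

```python
import shlex


def configure_command(build_type="Release", cpu_arch="native",
                      cuda_arch="90a-real", use_mbarrier=False,
                      install_prefix=None, build_dir="build"):
    """Return the cmake configure command as an argument list.

    Hypothetical helper for illustration; only options named in this
    README are emitted.
    """
    if build_type not in ("Release", "Debug"):
        raise ValueError(f"unsupported build type: {build_type}")
    args = [
        "cmake",
        f"-DCMAKE_BUILD_TYPE={build_type}",
        f"-DCPU_ARCH={cpu_arch}",
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",
    ]
    if use_mbarrier:
        # Off by default, so the flag is only added when requested.
        args.append("-DTBC_USE_MBARRIER=ON")
    if install_prefix is not None:
        args.append(f"-DCMAKE_INSTALL_PREFIX={install_prefix}")
    args += ["-B", build_dir]
    return args


# Print a copy-pasteable command line with mbarrier enabled.
print(shlex.join(configure_command(use_mbarrier=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues around the architecture string.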

Testing with CMake

Run the tests with ctest, which uses the generated CTestTestfile.cmake:

cd build/ && ctest

Note that some tests are timing-sensitive and may fail, for example, when built with -DCMAKE_BUILD_TYPE=Debug.

Installation

You may also specify an installation directory to CMake by passing the CMAKE_INSTALL_PREFIX option like so:

cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -DCMAKE_INSTALL_PREFIX=../../nvLstmInf -B build

Running cd build && make install -j then installs the nvLstmInf executable and all relevant configuration files into the ../../nvLstmInf directory.

Source Code Structure

The source code is contained in the following directories:

  • src/cpp/nvLstmInf: CUDA based implementation of LSTM Inference
  • src/python: Python benchmark orchestration scripts
  • thirdparty: open source libraries used by the benchmark
      • libnpy
  • test: unit tests and benchmarking tools
  • docs: some additional documentation

Structure within the Implementation

The CUDA based implementation is organized as follows:

  • main.cu: The main entry point which parses command line parameters and runs the Benchmark.
  • Benchmark.cu, Benchmark.cuh: This class executes the benchmark. It instantiates one or several ModelInstances.
  • ModelInstance.cu, ModelInstance.cuh: An instantiation of an independent model instance in the Benchmark. It obtains new input data from a DataIterator and collects InferenceResults. Model implementations are expected to comply with the ModelInterface.
  • DataIterator.h: Provides access to new input data.
  • InferenceResult.cpp, InferenceResult.h: Collects inference results with timestamps.
  • ModelInterface.h: Defines the interface of virtual methods of a model implementation expected in ModelInstance.
  • ModelSingleStepBase.cu, ModelSingleStepBase.cuh: Base class for single-step LSTM implementations.
  • ModelSingleStep.cu, ModelSingleStep.cuh: Default single-step LSTM implementation using cuBLAS for matrix operations.
  • ModelConfig.cuh: Internal representation of model configuration parameters.
  • ModelDef.cpp, ModelDef.h: An internal representation of model weights and the network topology.
  • CudaEnv.cu, CudaEnv.cuh: A class that exposes basic information on the CUDA environment.
  • cuda_helpers.cu, cuda_helpers.cuh: A collection of helpers which wrap low level CUDA into C++17 constructs.
  • cuda_copy.cuh: Helper routines for copying data efficiently between the host and the GPU.
  • util.cpp, util.h: Various helper routines for data loading and saving, implementation of the random sequence generator, etc.
  • ss_gemv_impl/: Persistent timing kernel implementations for LSTM and ping pong models.
  • ModelPersistentBase.cuh: Base class for all persistent timing models.
  • PingPongModelPersistent.cuh: Persistent timing implementation of the ping pong model for baseline measurements.
  • ModelSingleStepGemv.cuh, ModelSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl/.
  • ModelAsSingleStepGemv.cuh, ModelAsSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl/ with more advanced signaling. The latency-critical kernels are launched once at the beginning of the benchmark.
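For orientation, the computation behind a single-step LSTM model can be sketched in plain Python. This is the textbook LSTM cell, not the repository's kernels: the function names, the weight layout, and the gate order (input, forget, cell, output) are assumptions of this sketch. Each step reduces to two matrix-vector products (GEMV), which is what the ss_gemv_impl name hints at.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gemv(matrix, vector):
    """Dense matrix-vector product, one output element per matrix row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_step(x, h, c, W, R, b):
    """Advance a standard LSTM cell by one time step.

    x: input (length I); h, c: hidden and cell state (length H).
    W: 4H x I input weights, R: 4H x H recurrent weights, b: 4H biases.
    The gate order (i, f, g, o) is an assumption of this sketch.
    """
    H = len(h)
    pre = [wx + rh + bias
           for wx, rh, bias in zip(gemv(W, x), gemv(R, h), b)]
    i = [sigmoid(v) for v in pre[0:H]]            # input gate
    f = [sigmoid(v) for v in pre[H:2 * H]]        # forget gate
    g = [math.tanh(v) for v in pre[2 * H:3 * H]]  # candidate cell value
    o = [sigmoid(v) for v in pre[3 * H:4 * H]]    # output gate
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def run_window(window, W, R, b, hidden_size):
    """Feed a window of inputs through the cell, starting from zero state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in window:
        h, c = lstm_step(x, h, c, W, R, b)
    return h, c


# Tiny example: input size 1, hidden size 1, small constant weights.
W = [[0.5], [0.5], [0.5], [0.5]]
R = [[0.1], [0.1], [0.1], [0.1]]
b = [0.0, 0.0, 0.0, 0.0]
h, c = run_window([[1.0], [0.5], [-0.25]], W, R, b, hidden_size=1)
print(h, c)
```

Because the state (h, c) is carried between calls, a new input sample costs one lstm_step rather than a pass over the whole window; whether and how the benchmark exploits this is defined by the model implementations listed above.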

Performance Tuning

Some useful commands:

# This benchmark is sensitive to the CPU clock frequency
sudo cpupower frequency-set --governor performance

# Lock the GPU clock to the maximum on GH200
sudo nvidia-smi -lgc 1980

# Isolate core IDs through GRUB
# E.g. to isolate 2 cores from the system,
# edit /etc/default/grub with
GRUB_CMDLINE_LINUX="isolcpus=18,19,54,55"
# where 54 and 55 are the sibling hardware threads, if multithreading is enabled
sudo update-grub
# Then point the benchmark to those core IDs

# Run with numactl on a multi-socket system
numactl --membind=0 -- $NVLSTM_BIN_DIR/nvLstmInf ...
# Show the NUMA topology:
nvidia-smi topo -m
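When a wrapper script launches the benchmark, the isolated core list can be reused to pin the process. The following is a minimal sketch, assuming hypothetical helpers parse_cpu_list and pin_to_cpus; how nvLstmInf itself is pointed at core IDs is not shown in this excerpt, so the sketch only covers generic process affinity on Linux.

```python
import os


def parse_cpu_list(spec):
    """Parse a kernel-style CPU list such as "18,19,54-55" into sorted IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "54-55" -> 54, 55.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_cpus(cpus):
    """Restrict the calling process to the given cores (Linux only).

    Child processes started afterwards inherit this affinity mask.
    Raises OSError if none of the requested cores exist on this machine.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("CPU affinity is not supported on this platform")
    os.sched_setaffinity(0, set(cpus))


# Same core IDs as the isolcpus example above.
isolated = parse_cpu_list("18,19,54,55")
print(isolated)
```

Calling pin_to_cpus(isolated) before starting the benchmark with subprocess.run would confine it to the isolated cores; it is left uncalled here because the core IDs are machine-specific.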

Sample usage

# Environment variables for development in a container
export NVLSTM_DATA_DIR=/app/dl-lowlat-infer/data
export NVLSTM_DIR=/code/dl-lowlat-infer
export NVLSTM_OUTPUT_DIR=$NVLSTM_DIR/output
export NVLSTM_BIN_DIR=$NVLSTM_DIR/build/src/cpp/nvlstm_inf
export NVLSTM_PYTHON_DIR=$NVLSTM_DIR/src/python

# Generate ONNX models for all three sizes (Optional)
python $NVLSTM_PYTHON_DIR/create_onnx_model.py --output-dir=$NVLSTM_DATA_DIR

# Convert the lstm_s ONNX model to NumPy arrays (Optional). The weights are written to the
# $NVLSTM_DATA_DIR/models/lstm_s directory
python $NVLSTM_PYTHON_DIR/convert_model.py $NVLSTM_DATA_DIR/lstm_s.onnx $NVLSTM_DATA_DIR lstm_s

# Generate lstm_s model weights
python $NVLSTM_PYTHON_DIR/generate_model.py --num-steps=64…
