NVIDIA/dl-lowlat-infer
Description: Low Latency inference for sliding window LSTMs
Language: Cuda
License: Apache-2.0
Stars: 14
Forks: 2
Open issues: 0
Created: 2026-02-16T11:11:41Z
Pushed: 2026-04-10T11:33:01Z
Default branch: main
Fork: no
Archived: no
README:
Low Latency inference for sliding window LSTMs
Obtaining the Source Code
Note that this project stores some files in Git Large File Storage (Git LFS). Make sure a Git LFS client is installed before cloning this repository. On Ubuntu, you can install the client by running
sudo apt-get install git-lfs
Building with CMake
The project requires the following build tools:
- a recent version of gcc supporting C++20
- CMake >= 3.20
- CUDA Toolkit >= 13.0
Inside your code directory, configure and kick off a build as follows:
cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -B build
cmake --build build --parallel
Build Options
Key CMake options:
- CMAKE_BUILD_TYPE: Build type (Release or Debug)
- CPU_ARCH: Target CPU architecture (default: icelake-server, use native for current CPU)
- CMAKE_CUDA_ARCHITECTURES: Target CUDA architecture (e.g., "90a-real", "100a-real", "120a-real")
- CMAKE_INSTALL_PREFIX: Installation directory for the nvLstmInf binary
- TBC_USE_MBARRIER: Use mbarrier primitives for thread block cluster synchronization (default: OFF)
To enable mbarrier optimization, add -DTBC_USE_MBARRIER=ON to the cmake configure command.
Testing with CMake
Tests can be run with ctest by using the generated CTestTestfile.cmake with
cd build/ && ctest
Note that some tests are timing related and may fail when, for example, the project is built with -DCMAKE_BUILD_TYPE=Debug.
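To illustrate why timing-related tests are sensitive to the build type, here is a minimal sketch of a latency check. It is not taken from the repository's test suite: `medianOf` and `medianLatencyUs` are hypothetical helpers, and the idea of comparing a median latency against a budget is an assumption about how such a test might be written.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: median of a non-empty sample set
// (upper median when the number of samples is even).
double medianOf(std::vector<double> samples) {
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}

// Hypothetical helper: runs `fn` `iterations` times (iterations > 0)
// and returns the median wall-clock duration in microseconds.
template <typename Fn>
double medianLatencyUs(Fn&& fn, std::size_t iterations) {
    using Clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t0 = Clock::now();
        fn();
        const auto t1 = Clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return medianOf(std::move(samples));
}
```

A test built on such a helper would assert that the measured median stays below a fixed budget. An unoptimized Debug build typically runs the same code more slowly, so a budget tuned for Release can be exceeded even though the code is functionally correct.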
Installation
You may also specify an installation directory to CMake by passing the CMAKE_INSTALL_PREFIX option like so:
cmake -DCMAKE_BUILD_TYPE=Release -DCPU_ARCH=native -DCMAKE_CUDA_ARCHITECTURES="90a-real" -DCMAKE_INSTALL_PREFIX=../../nvLstmInf -B build
Running cd build && make install -j then installs the nvLstmInf executable and all relevant configuration files into the directory ../../nvLstmInf.
Source Code Structure
The source code is contained in the following directories:
- src/cpp/nvLstmInf: CUDA based implementation of LSTM Inference
- src/python: Python benchmark orchestration scripts
- thirdparty: open source libraries used by the benchmark
  - `libnpy`
- test: unit tests and benchmarking tools
- docs: some additional documentation
Structure within the Implementation
The CUDA based implementation is organized as follows:
- main.cu: The main entry point which parses command line parameters and runs the Benchmark.
- Benchmark.cu, Benchmark.cuh: This class executes the benchmark. It instantiates one or several ModelInstances.
- ModelInstance.cu, ModelInstance.cuh: An instantiation of an independent model instance in the Benchmark. It obtains new input data from a DataIterator and collects InferenceResults. Model implementations are expected to comply with the ModelInterface.
- DataIterator.h: Provides access to new input data.
- InferenceResult.cpp, InferenceResult.h: Collects inference results with timestamps.
- ModelInterface.h: Defines the interface of virtual methods of a model implementation expected in ModelInstance.
- ModelSingleStepBase.cu, ModelSingleStepBase.cuh: Base class for single-step LSTM implementations.
- ModelSingleStep.cu, ModelSingleStep.cuh: Default single-step LSTM implementation using cuBLAS for matrix operations.
- ModelConfig.cuh: Internal representation of model configuration parameters.
- ModelDef.cpp, ModelDef.h: An internal representation of model weights and the network topology.
- CudaEnv.cu, CudaEnv.cuh: A class that exposes basic information on the CUDA environment.
- cuda_helpers.cu, cuda_helpers.cuh: A collection of helpers which wrap low level CUDA into C++17 constructs.
- cuda_copy.cuh: Helper routines for copying data efficiently between the host and the GPU.
- util.cpp, util.h: Various helper routines for data loading and saving, implementation of the random sequence generator, etc.
- ss_gemv_impl\: Persistent timing kernel implementations for LSTM and ping pong models
- ModelPersistentBase.cuh: Base class for all persistent timing models.
- PingPongModelPersistent.cuh: Persistent timing implementation of the ping pong model for baseline measurements.
- ModelSingleStepGemv.cuh, ModelSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl\.
- ModelAsSingleStepGemv.cuh, ModelAsSingleStepGemv.cu: Persistent timing implementation of the LSTM models using kernels from ss_gemv_impl\ with more advanced signaling. The latency critical kernels are launched once at the beginning of the benchmark.
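The relationship described above between ModelInstance, DataIterator, ModelInterface, and InferenceResult can be sketched in plain host-side C++ as follows. This README does not show the actual class definitions, so every type, method name, and signature below (`SketchDataIterator`, `SketchModelInterface`, `SketchResult`, `runInstance`, `step`, `next`) is a hypothetical stand-in, and the toy running-sum model merely takes the place of an LSTM.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical: plays the role of InferenceResult (an output plus timestamps).
struct SketchResult {
    float output;
    Clock::time_point start;
    Clock::time_point end;
};

// Hypothetical: plays the role of DataIterator (hands out new input data).
class SketchDataIterator {
public:
    explicit SketchDataIterator(std::vector<float> data) : data_(std::move(data)) {}
    // Returns the next input, or std::nullopt once the data is exhausted.
    std::optional<float> next() {
        if (pos_ >= data_.size()) return std::nullopt;
        return data_[pos_++];
    }
private:
    std::vector<float> data_;
    std::size_t pos_ = 0;
};

// Hypothetical: plays the role of ModelInterface (virtual methods a model must provide).
class SketchModelInterface {
public:
    virtual ~SketchModelInterface() = default;
    virtual float step(float input) = 0;
};

// Toy stateful model used only for illustration; it is not an LSTM.
class RunningSumModel final : public SketchModelInterface {
public:
    float step(float input) override {
        state_ += input;
        return state_;
    }
private:
    float state_ = 0.0f;
};

// Hypothetical: plays the role of ModelInstance. It pulls inputs from the
// iterator, runs one model step per input, and records timestamped results.
std::vector<SketchResult> runInstance(SketchModelInterface& model,
                                      SketchDataIterator& inputs) {
    std::vector<SketchResult> results;
    while (auto x = inputs.next()) {
        const auto t0 = Clock::now();
        const float y = model.step(*x);
        const auto t1 = Clock::now();
        results.push_back({y, t0, t1});
    }
    return results;
}
```

Separating the model behind a virtual interface is what would let one benchmark loop drive different implementations, such as the cuBLAS-based and persistent-kernel variants listed above, without changes to the loop itself.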
Performance Tuning
Some useful commands:
# This benchmark is sensitive to CPU clock
sudo cpupower frequency-set --governor performance
# Set max clocks on GH200
sudo nvidia-smi -lgc 1980
# Isolate core IDs through grub
# E.g. to isolate 2 cores from the system
# Edit /etc/default/grub with GRUB_CMDLINE_LINUX="isolcpus=18,19,54,55"
# Where 54-55 are multithreaded cores, if multithreading is enabled
sudo update-grub
# Then point the benchmark to those core IDs
# Run with numactl on multiple socket system
numactl --membind=0 -- $NVLSTM_BIN_DIR/nvLstmInf ...
# NUMA topology:
nvidia-smi topo -m
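For context on what pointing a program at specific core IDs involves, the following is a minimal Linux-only sketch of pinning the calling thread to one core with `sched_setaffinity`. It is not the benchmark's own mechanism, which this README excerpt does not show. The helper names `firstAllowedCore` and `pinCallingThreadToCore` are hypothetical, and the code assumes glibc with g++, where the `CPU_*` macros are available from `<sched.h>`.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux, glibc)

// Hypothetical helper: returns the lowest core ID the calling thread is
// currently allowed to run on, or -1 if the affinity mask cannot be read.
int firstAllowedCore() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int core = 0; core < CPU_SETSIZE; ++core) {
        if (CPU_ISSET(core, &set)) return core;
    }
    return -1;
}

// Hypothetical helper: restricts the calling thread to a single core.
// Passing pid 0 to sched_setaffinity targets the calling thread.
bool pinCallingThreadToCore(int core) {
    if (core < 0 || core >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

With cores isolated via `isolcpus` as in the commands above, a latency-critical thread could call `pinCallingThreadToCore(18)`, for example. The scheduler does not place other tasks on isolated cores by default, so explicit pinning is how a thread ends up running there.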
Sample usage
# development in container exports
export NVLSTM_DATA_DIR=/app/dl-lowlat-infer/data
export NVLSTM_DIR=/code/dl-lowlat-infer
export NVLSTM_OUTPUT_DIR=$NVLSTM_DIR/output
export NVLSTM_BIN_DIR=$NVLSTM_DIR/build/src/cpp/nvlstm_inf
export NVLSTM_PYTHON_DIR=$NVLSTM_DIR/src/python
# Generate ONNX models for all three sizes (Optional)
python $NVLSTM_PYTHON_DIR/create_onnx_model.py --output-dir=$NVLSTM_DATA_DIR
# Convert lstm_s ONNX model to NumPy arrays (Optional). The weights will be dumped into
# $NVLSTM_DATA_DIR/models/lstm_s directory
python $NVLSTM_PYTHON_DIR/convert_model.py $NVLSTM_DATA_DIR/lstm_s.onnx $NVLSTM_DATA_DIR lstm_s
# Generate lstm_s model weights
python $NVLSTM_PYTHON_DIR/generate_model.py --num-steps=64…