Repo · NVIDIA · published Mar 3, 2021

NVIDIA/nvbench

Cuda


NVIDIA/nvbench

Description: CUDA Kernel Benchmarking Library

Language: Cuda

License: Apache-2.0

Stars: 876

Forks: 109

Open issues: 68

Created: 2021-03-03T19:29:55Z

Pushed: 2026-06-22T17:22:33Z

Default branch: main

Fork: no

Archived: no

README:

Overview

This project is a work-in-progress. Everything is subject to change.

NVBench is a C++17 library designed to simplify CUDA kernel benchmarking. It features:

  • [Parameter sweeps](docs/benchmarks.md#parameter-axes): a powerful and flexible "axis" system explores a kernel's configuration space. Parameters may be dynamic numbers/strings or [static types](docs/benchmarks.md#type-axes).
  • [Runtime customization](docs/cli_help.md): A rich command-line interface allows [redefinition of parameter axes](docs/cli_help_axis.md), CUDA device selection, locking GPU clocks (Volta+), changing output formats, and more.
  • [Throughput calculations](docs/benchmarks.md#throughput-measurements): Compute and report:
    • Item throughput (elements/second)
    • Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
  • Multiple output formats: Currently supports markdown (default) and CSV output.
  • [Manual timer mode](docs/benchmarks.md#explicit-timer-mode-nvbenchexec_tagtimer): (optional) Explicitly start/stop timing in a benchmark implementation.
  • Multiple measurement types:
    • Cold Measurements:
      • Each sample runs the benchmark once with a clean device L2 cache.
      • GPU and CPU times are reported.
    • Batch Measurements:
      • Executes the benchmark multiple times back-to-back and records total time.
      • Reports the average execution time (total time / number of executions).
    • [CPU-only Measurements](docs/benchmarks.md#cpu-only-benchmarks)
      • Measures the host-side execution time of a non-GPU benchmark.
      • Not suitable for microbenchmarking.
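The arithmetic behind the cold and batch measurement types listed above can be sketched in a few lines of host-only C++. This is an illustration, not NVBench's implementation: the function names `mean_time`, `noise_percent`, and `batch_average` are hypothetical, and treating "noise" as the relative standard deviation of the samples, computed with the sample (n - 1) estimator, is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of per-sample times, e.g. one time per cold run.
double mean_time(const std::vector<double>& samples) {
  assert(!samples.empty());
  double sum = 0.0;
  for (double s : samples) sum += s;
  return sum / static_cast<double>(samples.size());
}

// Relative standard deviation in percent.
// Assumption: this is what a "noise" percentage means, and the
// sample (n - 1) estimator is used. Needs at least two samples.
double noise_percent(const std::vector<double>& samples) {
  assert(samples.size() >= 2);
  const double m = mean_time(samples);
  double sum_sq = 0.0;
  for (double s : samples) sum_sq += (s - m) * (s - m);
  const double stdev =
      std::sqrt(sum_sq / static_cast<double>(samples.size() - 1));
  return 100.0 * stdev / m;
}

// Batch measurement: a single total time divided by the number of
// back-to-back executions it covers.
double batch_average(double total_seconds, std::size_t executions) {
  assert(executions > 0);
  return total_seconds / static_cast<double>(executions);
}
```

A cold measurement yields one time per sample, so both a mean and a spread can be computed from it. A batch measurement records only a total, so it gives an average but no per-run spread.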

Check out this talk for an overview of the challenges inherent to CUDA kernel benchmarking and how NVBench solves them for you!

Supported Compilers and Tools

  • CMake > 3.30.4
  • CUDA Toolkit + nvcc: 12.0 and above
  • g++: 7 -> 14
  • clang++: 14 -> 19
  • Headers are tested with C++17 -> C++20.

Getting Started

Minimal Benchmark

A basic kernel benchmark can be created with just a few lines of CUDA C++:

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) {
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

See [Benchmarks](docs/benchmarks.md) for information on customizing benchmarks and implementing parameter sweeps.

Command Line Interface

Each benchmark executable produced by NVBench provides a rich set of command-line options for configuring benchmark execution at runtime. See the [CLI overview](docs/cli_help.md) and [CLI axis specification](docs/cli_help_axis.md) for more information.

Examples

This repository provides a number of [examples](examples/) that demonstrate various NVBench features and use cases:

  • [Runtime and compile-time parameter sweeps](examples/axes.cu)
  • [CPU-only benchmarking](examples/cpu_only.cu)
  • [Enums and compile-time-constant-integral parameter axes](examples/enums.cu)
  • [Reporting item/sec and byte/sec throughput statistics](examples/throughput.cu)
  • [Skipping benchmark configurations](examples/skip.cu)
  • [Benchmarking on a specific stream](examples/stream.cu)
  • [Adding / hiding columns (summaries) in markdown output](examples/summaries.cu)
  • [Benchmarks that sync CUDA devices: nvbench::exec_tag::sync](examples/exec_tag_sync.cu)
  • [Manual timing: nvbench::exec_tag::timer](examples/exec_tag_timer.cu)

Building Examples

To build the examples:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=70 .. && make

Be sure to set CMAKE_CUDA_ARCHITECTURES based on the GPU you are running on.

Examples are built by default into build/bin and are prefixed with nvbench.example.

Example output from nvbench.example.throughput

# Devices

## [0] `Quadro GV100`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1627 MHz
* Global Memory: 32163 MiB Free / 32508 MiB Total
* Global Memory Bus Peak: 870 GiB/sec (4096-bit DDR @850MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: throughput_bench [Device=0]
Warn: Current measurement timed out (15.00s) while over noise threshold (1.26% > 0.50%)
Pass: Cold: 0.262392ms GPU, 0.267860ms CPU, 7.19s total GPU, 27393x
Pass: Batch: 0.261963ms GPU, 7.18s total GPU, 27394x

# Benchmark Results

## throughput_bench

### [0] Quadro GV100

| NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWPeak | Batch GPU | Batch |
|-------------|------------|---------|------------|-------|------------|-------|---------|---------------|--------|------------|--------|
| 16777216 | 64.000 MiB | 27393x | 267.860 us | 1.25% | 262.392 us | 1.26% | 63.940G | 476.387 GiB/s | 58.77% | 261.963 us | 27394x |
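
The throughput columns in the table above can be reproduced by hand from the reported GPU time. The sketch below is host-only C++ with hypothetical helper names. It assumes the benchmark reads 64 MiB and writes 64 MiB per run (inferred from the DataSize column and the reported bandwidth), and that the peak is the double-data-rate bus figure from the device listing (4096 bits at 850 MHz). Neither assumption is taken from NVBench's source.

```cpp
#include <cassert>
#include <cstdint>

// Item throughput: elements processed per second of GPU time.
double elements_per_second(std::uint64_t num_elements, double gpu_seconds) {
  assert(gpu_seconds > 0.0);
  return static_cast<double>(num_elements) / gpu_seconds;
}

// Global memory bandwidth: bytes read plus bytes written, per second.
double bytes_per_second(std::uint64_t bytes_read,
                        std::uint64_t bytes_written,
                        double gpu_seconds) {
  assert(gpu_seconds > 0.0);
  return static_cast<double>(bytes_read + bytes_written) / gpu_seconds;
}

// Peak bus bandwidth in bytes/second.
// Assumption: a DDR bus moves two transfers per clock cycle.
double peak_bytes_per_second(std::uint64_t bus_width_bits,
                             double memory_clock_hz) {
  return 2.0 * memory_clock_hz * static_cast<double>(bus_width_bits) / 8.0;
}

// Achieved bandwidth as a percentage of the peak.
double percent_of_peak(double achieved, double peak) {
  assert(peak > 0.0);
  return 100.0 * achieved / peak;
}
```

With 16777216 four-byte elements and a GPU time of 262.392 us, these helpers give roughly 63.94G elements/second, 476.39 GiB/s, and 58.77% of peak, matching the Elem/s, GlobalMem BW, and BWPeak columns. The percentage only matches if the peak is taken as about 870.4e9 bytes/second, which suggests the "870 GiB/sec" figure in the device listing is in units of 10^9 bytes rather than GiB.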

Demo Project

To get started using NVBench with your own kernels, consider trying out the NVBench Demo Project.

nvbench_demo provides a simple CMake project that uses NVBench to build an example benchmark. It's a great way to experiment with the library without a lot of investment.

Contributing

Contributions are welcome!

Tests

To build nvbench tests:

mkdir -p build
cd build
cmake -DNVBench_ENABLE_TESTING=ON .. && make

Tests are built by default into build/bin and prefixed with nvbench.test.

To run all tests:

make test

or

ctest

License

NVBench is released under the Apache 2.0 License with LLVM exceptions. See [LICENSE](./LICENSE).

Scope and Related Projects

NVBench will measure the CPU and CUDA GPU execution time of a *single host-side critical region* per benchmark. It is intended for regression testing and parameter tuning of individual kernels. For in-depth analysis of end-to-end performance of multiple applications, the NVIDIA Nsight tools are more appropriate.

NVBench is focused on evaluating the performance of CUDA kernels. It also provides CPU-only benchmarking facilities intended for non-trivial CPU workloads, but is not optimized for CPU microbenchmarks. This may change in the future, but for now, consider using Google Benchmark for...


Notability

notability 6.0/10

New benchmarking tool by NVIDIA with 876 stars, solid interest.