Repo · NVIDIA · published Sep 17, 2020

NVIDIA/cccl

C++


NVIDIA/cccl

Description: CUDA Core Compute Libraries

Language: C++

License: NOASSERTION

Stars: 2374

Forks: 406

Open issues: 1526

Created: 2020-09-17T18:58:41Z

Pushed: 2026-06-11T03:44:24Z

Default branch: main

Fork: no

Archived: no

README: ![Open in GitHub Codespaces](https://codespaces.new/NVIDIA/cccl?quickstart=1&devcontainer_path=.devcontainer%2Fdevcontainer.json)

|Contributor Guide|Dev Containers|Discord|Godbolt|GitHub Project|Documentation|
|-|-|-|-|-|-|

CUDA Core Compute Libraries (CCCL)

Welcome to the CUDA Core Compute Libraries (CCCL) where our mission is to make CUDA more delightful.

This repository unifies three essential CUDA C++ libraries (Thrust, CUB, and libcudacxx) into a single, convenient repository.

The goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code. Bringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++. For more information about the decision to unify these projects, see the announcement here.

Overview

The concept for the CUDA Core Compute Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.

  • Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP).
  • CUB is a lower-level, CUDA-specific library designed for speed-of-light parallel algorithms across all GPU architectures. In addition to device-wide algorithms, it provides *cooperative algorithms* like block-wide reduction and warp-wide scan, providing CUDA kernel developers with building blocks to create speed-of-light, custom kernels.
  • libcudacxx is the CUDA C++ Standard Library. It provides an implementation of the C++ Standard Library that works in both host and device code. Additionally, it provides abstractions for CUDA-specific hardware features like synchronization primitives, cache control, atomics, and more.
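To give a feel for the kind of cooperative algorithm the CUB bullet mentions, here is a host-only C++ sketch of a warp-wide inclusive scan. It simulates the 32 lanes of a warp with a plain vector and uses the classic doubling-offset (Hillis-Steele) scheme. This is an illustration of the idea only, not CUB's implementation; `kWarpSize` and `warp_inclusive_sum` are names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of simulated lanes; 32 matches the warp size of current NVIDIA GPUs.
constexpr std::size_t kWarpSize = 32;

// Inclusive prefix sum across one simulated warp.
// Each pass mirrors one step of a shuffle-based scan: lane i reads the value
// held by lane (i - offset) and adds it to its own, with offset doubling.
std::vector<int> warp_inclusive_sum(std::vector<int> lanes) {
    assert(lanes.size() == kWarpSize && "expects exactly one value per lane");
    for (std::size_t offset = 1; offset < kWarpSize; offset *= 2) {
        // Snapshot the lanes so every lane reads before any lane writes,
        // as happens implicitly when a warp executes in lockstep.
        const std::vector<int> previous = lanes;
        for (std::size_t lane = offset; lane < kWarpSize; ++lane) {
            lanes[lane] = previous[lane] + previous[lane - offset];
        }
    }
    return lanes;
}
```

After five passes (offsets 1, 2, 4, 8, 16) every lane holds the sum of its own value and all values in lower-numbered lanes. On a GPU the inner loop disappears, because all 32 lanes perform their step at the same time.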

The main goal of CCCL is to fill a role similar to the one the C++ Standard Library fills for Standard C++: to provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter. Unifying these projects is the first step towards realizing that goal.

Example

This is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.

It shows how to use Thrust/CUB/libcudacxx to implement a simple parallel reduction kernel. Each thread block computes the sum of a subset of the array using cub::BlockReduce. The sum of each block is then reduced to a single value using an atomic add via cuda::atomic_ref from libcudacxx.

It then shows how the same reduction can be done using Thrust's reduce algorithm and compares the results.

Try it live on Godbolt!

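#include <cub/block/block_reduce.cuh>
#include <cuda/atomic>
#include <cuda/cmath>
#include <cuda/std/span>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <iostream>

template <int block_size>
__global__ void reduce(cuda::std::span<int const> data, cuda::std::span<int> result) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // Each thread loads at most one element; out-of-range threads contribute 0
  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (index < data.size()) {
    sum += data[index];
  }

  // Block-wide sum; the result is only valid in thread 0
  sum = BlockReduce(temp_storage).Sum(sum);

  // Thread 0 of each block adds its partial sum to the global result
  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> atomic_result(result.front());
    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);
  }
}

int main() {

  // Allocate and initialize input data
  int const N = 1000;
  thrust::device_vector<int> data(N);
  thrust::fill(data.begin(), data.end(), 1);

  // Allocate output data
  thrust::device_vector<int> kernel_result(1);

  // Compute the sum reduction of `data` using a custom kernel
  constexpr int block_size = 256;
  int const num_blocks = cuda::ceil_div(N, block_size);
  reduce<block_size><<<num_blocks, block_size>>>(
    cuda::std::span<int const>(thrust::raw_pointer_cast(data.data()), data.size()),
    cuda::std::span<int>(thrust::raw_pointer_cast(kernel_result.data()), 1));

  auto const err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::cout << "Error: " << cudaGetErrorString(err) << std::endl;
    return -1;
  }

  int const custom_result = kernel_result[0];

  // Compute the same sum reduction using Thrust
  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);

  // Compare the two results
  std::cout << "Custom kernel sum: " << custom_result << std::endl;
  std::cout << "Thrust reduce sum: " << thrust_result << std::endl;
  return custom_result == thrust_result ? 0 : 1;
}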
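The two-stage structure of the kernel example above (one partial sum per block, then an atomic add into a single result) can be checked on the host without a GPU. The following is a minimal sketch in standard C++, assuming `std::atomic<int>` as a stand-in for `cuda::atomic_ref` and a sequential loop as a stand-in for the grid of thread blocks. `ceil_div`, `reduce_block`, and `two_stage_sum` are names invented for this sketch and are not part of CCCL.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Integer ceiling division for non-negative a and positive b.
// Plays the role cuda::ceil_div has when computing the number of blocks.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Host-side stand-in for one thread block: sum this block's slice of `data`,
// then publish the partial sum with a relaxed atomic add.
void reduce_block(const std::vector<int>& data, std::size_t block,
                  std::size_t block_size, std::atomic<int>& result) {
    const std::size_t begin = block * block_size;
    // The last block may be only partially filled.
    const std::size_t end = std::min(begin + block_size, data.size());
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
    const int partial = std::accumulate(first, last, 0);
    result.fetch_add(partial, std::memory_order_relaxed);
}

// Runs every simulated block in turn and returns the combined sum.
int two_stage_sum(const std::vector<int>& data, int block_size) {
    assert(block_size > 0);
    std::atomic<int> result{0};
    const int num_blocks = ceil_div(static_cast<int>(data.size()), block_size);
    for (int block = 0; block < num_blocks; ++block) {
        reduce_block(data, static_cast<std::size_t>(block),
                     static_cast<std::size_t>(block_size), result);
    }
    return result.load();
}
```

With 1000 elements and a block size of 256, `ceil_div` yields 4 blocks, and the last block covers only 232 elements. That is why the kernel guards its load with a bounds check. Relaxed ordering is enough here because the partial sums only need to be added atomically; nothing else is ordered against them.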

GitHub

Users who want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub. Using a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around. For complete information on compatibility between CCCL and the CUDA Toolkit, see [our platform support](#platform-support).

Everything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:

git clone https://github.com/NVIDIA/cccl.git
nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub main.cu -o main

> Note
> Use -I and not -isystem to avoid collisions with the CCCL headers implicitly included by nvcc from the CUDA Toolkit. All CCCL headers use #pragma system_header to ensure warnings will still be silenced as if using -isystem. See https://github.com/NVIDIA/cccl/issues/527 for more information.
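Because nvcc also ships its own copy of the CCCL headers, it can be useful to confirm which copy a build actually picked up. The following is a minimal sketch, assuming that including `<cuda/std/version>` makes the `CCCL_MAJOR_VERSION`, `CCCL_MINOR_VERSION`, and `CCCL_PATCH_VERSION` macros visible; verify this against the CCCL version you use. `format_version` and `cccl_version_string` are hypothetical helper names. The include is guarded so the file still compiles when no CCCL headers are on the include path.

```cpp
#include <cassert>
#include <string>

// Only include the CCCL header when the compiler can find it.
#if defined(__has_include)
#  if __has_include(<cuda/std/version>)
#    include <cuda/std/version>
#  endif
#endif

// Formats a version triple as "major.minor.patch".
std::string format_version(int major, int minor, int patch) {
    assert(major >= 0 && minor >= 0 && patch >= 0);
    return std::to_string(major) + "." + std::to_string(minor) + "." +
           std::to_string(patch);
}

// Returns the CCCL version seen at compile time, or a fallback message
// when the version macros are not defined (assumed macro names, see above).
std::string cccl_version_string() {
#if defined(CCCL_MAJOR_VERSION) && defined(CCCL_MINOR_VERSION) && defined(CCCL_PATCH_VERSION)
    return format_version(CCCL_MAJOR_VERSION, CCCL_MINOR_VERSION, CCCL_PATCH_VERSION);
#else
    return "CCCL headers not found";
#endif
}
```

Printing `cccl_version_string()` from a program built with the `-I` flags above and comparing it to the version of the cloned repository shows whether the clone or the CUDA Toolkit copy was used.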

##### Installation

The default CMake options generate only installation rules, so the familiar cmake . && make install workflow just works:

git clone https://github.com/NVIDIA/cccl.git
cd cccl
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local
make install

A convenience script is also provided:

ci/install_cccl.sh /usr/local

###### Advanced installation using presets

CMake presets are also available with options for including experimental libraries:

cmake --preset install…
