NVIDIA/nvkind
Go
Language: Go
License: Apache-2.0
Stars: 204
Forks: 28
Open issues: 17
Created: 2024-03-26T20:52:32Z
Pushed: 2026-06-09T18:51:54Z
Default branch: main
Fork: no
Archived: no
README:
Running kind clusters with GPUs using nvkind
This repo provides a tool called nvkind to create and manage kind clusters with access to GPUs.
Unfortunately, running kind with access to GPUs is not very straightforward. There is no standard way to inject GPU support into a kind worker node, and even with a series of "hacks" to make it possible, some post-processing still needs to be performed to ensure that different sets of GPUs can be isolated to different worker nodes.
The nvkind tool encapsulates the set of steps required to do what is described above. It can either be run directly, or you can import pkg/nvkind as a starting point to write your own tool.
Prerequisites
The following prerequisites are required to build and run nvkind as well as follow all of the examples provided in this README:
Prerequisite | Link
------------ | -------------------------------------
go           | https://go.dev/doc/install
make         | https://www.gnu.org/software/make/#download
docker       | https://docs.docker.com/get-docker/
kind         | https://kind.sigs.k8s.io/docs/user/quick-start/#installation
kubectl      | https://kubernetes.io/docs/tasks/tools/
helm         | https://helm.sh/docs/intro/install/
You must also ensure that you are running on a host with a working NVIDIA driver and an nvidia-container-toolkit configured for use with docker.
Prerequisite             | Link
------------------------ | -------------------------------------
nvidia-driver            | https://www.nvidia.com/download/index.aspx
nvidia-container-toolkit | https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Running nvidia-smi -L on a host with a functioning driver should produce output such as the following:
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-ec9d53cc-125d-d4a3-9687-304df8eb4749)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-8216274a-c05d-def0-af18-c74647300267)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763)
Likewise, running the following on a host with a functioning nvidia-container-toolkit that has been configured for docker should produce the same output as above:
$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu:20.04 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-ec9d53cc-125d-d4a3-9687-304df8eb4749)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-8216274a-c05d-def0-af18-c74647300267)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763)
If you have the nvidia-container-toolkit installed but get an error when running the docker command above, skip to the [Setup](#setup) section below to see whether the configuration steps there resolve the issue.
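The check described above (the container should report the same GPUs as the host) can be scripted. The following is a minimal sketch, not part of nvkind: the helper names `gpuUUIDs` and `sameGPUs` are hypothetical, and it assumes the `nvidia-smi -L` output format shown in this README. It compares the two outputs by GPU UUID rather than by exact text.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// uuidPattern matches the UUID field printed by `nvidia-smi -L`,
// assuming the output format shown in this README.
var uuidPattern = regexp.MustCompile(`UUID: (GPU-[0-9a-f-]+)`)

// gpuUUIDs (hypothetical helper) extracts the GPU UUIDs from
// `nvidia-smi -L` output and returns them sorted.
func gpuUUIDs(output string) []string {
	var uuids []string
	for _, m := range uuidPattern.FindAllStringSubmatch(output, -1) {
		uuids = append(uuids, m[1])
	}
	sort.Strings(uuids)
	return uuids
}

// sameGPUs (hypothetical helper) reports whether both outputs list
// the same non-empty set of GPU UUIDs.
func sameGPUs(hostOut, containerOut string) bool {
	h, c := gpuUUIDs(hostOut), gpuUUIDs(containerOut)
	if len(h) == 0 || len(h) != len(c) {
		return false
	}
	for i := range h {
		if h[i] != c[i] {
			return false
		}
	}
	return true
}

func main() {
	// Sample text copied from the output above; in practice these strings
	// would be captured from the host and container commands.
	host := "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)\n" +
		"GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)\n"
	container := host
	fmt.Println("same GPUs:", sameGPUs(host, container))
}
```

Comparing by UUID keeps the check independent of line ordering and whitespace differences between the two outputs.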
Setup
With all of the prerequisites installed, run the following commands to configure the nvidia-container-toolkit for use with kind.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
sudo systemctl restart docker
The first command ensures that docker is configured for use with the toolkit and that the nvidia runtime is set as its default. The second command enables a feature flag of the toolkit (accept-nvidia-visible-devices-as-volume-mounts), as described in this document. This feature lets GPUs be selected through volume mounts, which is what allows us to inject GPU support into each kind worker node. The third command restarts docker so that the new configuration takes effect.
To verify that this feature has been enabled correctly, run the following command and check that its output is similar to what is shown:
$ docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:20.04 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-ec9d53cc-125d-d4a3-9687-304df8eb4749)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-8216274a-c05d-def0-af18-c74647300267)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-b1028956-cfa2-0990-bf4a-5da9abb51763)
Install nvkind
To install nvkind using go install, run the following command:
go install github.com/NVIDIA/nvkind/cmd/nvkind@latest
You can also build it in a golang container if you don't have Go set up on your system:
docker run --rm -v $PWD/bin/:/go/bin/ golang:1.23 go install github.com/NVIDIA/nvkind/cmd/nvkind@latest
Quickstart
Assuming all of the [prerequisites](#prerequisites) have been met and the [setup steps](#setup) have been followed, the following set of commands can be used to build nvkind, create a set of GPU-enabled clusters with it, and then print the set of GPUs available on all nodes of a given cluster.
Build nvkind:
make
Default cluster: One node with all GPUs
Create a default cluster with 1 worker node with access to all GPUs on the machine:
./nvkind cluster create
Example: One node per GPU
Create a cluster with 1 worker node per GPU on the machine:
./nvkind cluster create \
    --config-template=examples/one-worker-per-gpu.yaml
Example: Four nodes with two GPUs
Assuming a machine with 8 GPUs, create a cluster with 4 worker nodes and 2 GPUs evenly…