coreweave/k8s-device-plugin
forked from NVIDIA/k8s-device-plugin
Description: NVIDIA device plugin for Kubernetes
Language: Go
License: Apache-2.0
Stars: 1
Forks: 2
Open issues: 3
Created: 2023-04-03T17:07:46Z
Pushed: 2026-06-01T21:40:06Z
Default branch: coreweave
Fork: yes
Parent repository: NVIDIA/k8s-device-plugin
Archived: no
README:
NVIDIA device plugin for Kubernetes
Table of Contents
- [About](#about)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Preparing your GPU Nodes](#preparing-your-gpu-nodes)
- [Example for debian-based systems with `docker` and `containerd`](#example-for-debian-based-systems-with-docker-and-containerd)
- [Install the NVIDIA Container Toolkit](#install-the-nvidia-container-toolkit)
- [Notes on `CRI-O` configuration](#notes-on-cri-o-configuration)
- [Enabling GPU Support in Kubernetes](#enabling-gpu-support-in-kubernetes)
- [Running GPU Jobs](#running-gpu-jobs)
- [Configuring the NVIDIA device plugin binary](#configuring-the-nvidia-device-plugin-binary)
- [As command line flags or envvars](#as-command-line-flags-or-envvars)
- [As a configuration file](#as-a-configuration-file)
- [Configuration Option Details](#configuration-option-details)
- [Shared Access to GPUs](#shared-access-to-gpus)
- [With CUDA Time-Slicing](#with-cuda-time-slicing)
- [With CUDA MPS](#with-cuda-mps)
- [IMEX Support](#imex-support)
- [Catalog of Labels](#catalog-of-labels)
- [Deployment via `helm`](#deployment-via-helm)
- [Configuring the device plugin's `helm` chart](#configuring-the-device-plugins-helm-chart)
- [Passing configuration to the plugin via a `ConfigMap`](#passing-configuration-to-the-plugin-via-a-configmap)
- [Single Config File Example](#single-config-file-example)
- [Multiple Config File Example](#multiple-config-file-example)
- [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label)
- [Setting other helm chart values](#setting-other-helm-chart-values)
- [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels)
- [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode)
- [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)
- [Building and Running Locally](#building-and-running-locally)
- [With Docker](#with-docker)
- [Build](#build)
- [Run](#run)
- [Without Docker](#without-docker)
- [Build](#build-1)
- [Run](#run-1)
- [Changelog](#changelog)
- [Issues and Contributing](#issues-and-contributing)
- [Versioning](#versioning)
- [Upgrading Kubernetes with the Device Plugin](#upgrading-kubernetes-with-the-device-plugin)
About
The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:
- Expose the number of GPUs on each node of your cluster
- Keep track of the health of your GPUs
- Run GPU enabled containers in your Kubernetes cluster.
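One way to see the first point in practice is to list what each node advertises once the plugin is running. The helper below is a minimal sketch, not part of this repository: the function name `list_gpu_capacity` is hypothetical, and it assumes the plugin uses its default resource name, `nvidia.com/gpu`, and that `kubectl` is configured for the cluster.

```shell
# list_gpu_capacity: print each node alongside the GPU count it advertises.
# Assumption: the plugin registers GPUs under the default resource name
# nvidia.com/gpu. The dots in the resource name are escaped so that kubectl
# treats it as a single key.
list_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Calling `list_gpu_capacity` prints one row per node; nodes where the plugin is not running typically show `<none>` in the `GPUS` column.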
This repository contains NVIDIA's official implementation of the Kubernetes device plugin. As of v0.15.0, this repository also holds the implementation for GPU Feature Discovery labels. For further information, see the [GPU Feature Discovery README](docs/gpu-feature-discovery/README.md).
Please note that:
- The NVIDIA device plugin API is beta as of Kubernetes v1.10.
- The NVIDIA device plugin is currently lacking:
  - Comprehensive GPU health checking features
  - GPU cleanup features
- Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).
Prerequisites
The prerequisites for running the NVIDIA device plugin are:
- NVIDIA drivers ~= 384.81
- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
- nvidia-container-runtime configured as the default low-level runtime
- Kubernetes version >= 1.10
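The driver requirement can be checked on a node with a few lines of shell. This is a rough sketch under stated assumptions: it reads `~= 384.81` as "384.81 or newer", the helper name `version_ge` is hypothetical, and it relies on GNU `sort -V` for version ordering.

```shell
#!/bin/sh
# Sketch: compare the installed NVIDIA driver version with the README minimum.
# Assumption: "~= 384.81" is treated as "384.81 or newer".
MIN_DRIVER="384.81"

# version_ge A B: succeeds when version A >= version B (needs GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # nvidia-smi prints one line per GPU; the first line is enough here.
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$driver" "$MIN_DRIVER"; then
    echo "driver $driver meets the minimum ($MIN_DRIVER)"
  else
    echo "driver $driver is older than $MIN_DRIVER" >&2
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
fi
```

The same `version_ge` helper can be reused for the other version constraints in the list, given a command that reports the installed version.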
Quick Start
Preparing your GPU Nodes
The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use.
Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Example for debian-based systems with docker and containerd
##### Install the NVIDIA Container Toolkit
For instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the installation guide.
Also note the configuration instructions for each container runtime you use (such as containerd, CRI-O, or docker), and remember to restart each runtime after applying the configuration changes.
If the nvidia runtime should be set as the default runtime (required for docker), the `--set-as-default` argument must also be included in the `nvidia-ctk runtime configure` commands from those instructions. If this is not done, a RuntimeClass needs to be defined.
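For the case where the nvidia runtime is not the default, a RuntimeClass might look like the sketch below. The function name `runtimeclass_manifest` is hypothetical, and the object name and handler are both assumed to be `nvidia`; the handler has to match the runtime name configured in your container runtime. The `node.k8s.io/v1` API is available from Kubernetes 1.20 onward.

```shell
# runtimeclass_manifest: print a RuntimeClass that selects the "nvidia" handler.
# Assumption: the container runtime on the node registers its NVIDIA runtime
# under the name "nvidia"; adjust `handler` if yours differs.
runtimeclass_manifest() {
  cat <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
}
```

The manifest can be applied with `runtimeclass_manifest | kubectl apply -f -`. Pods that need GPUs would then set `runtimeClassName: nvidia` in their spec.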
##### Notes on CRI-O configuration
When running Kubernetes with CRI-O, add a config file at /etc/crio/crio.conf.d/99-nvidia.conf that sets nvidia-container-runtime as the default low-level OCI runtime. This file takes priority over the default crun config file at /etc/crio/crio.conf.d/10-crun.conf:
    [crio]

      [crio.runtime]
        default_runtime = "nvidia"

        [crio.runtime.runtimes]

          [crio.runtime.runtimes.nvidia]
            runtime_path = "/usr/bin/nvidia-container-runtime"
            runtime_type = "oci"
As stated in the linked documentation, this file can automatically be generated with the nvidia-ctk command:
    sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf
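After generating the file, it can be worth confirming that it sets the expected default before restarting CRI-O. The following is a minimal sketch: the helper name `crio_default_runtime` is hypothetical, and it only handles the simple `default_runtime = "..."` form shown in the config above.

```shell
# crio_default_runtime FILE: print the default_runtime value from a CRI-O
# drop-in config file (only the quoted single-line form is handled).
crio_default_runtime() {
  sed -n 's/^[[:space:]]*default_runtime[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

conf=/etc/crio/crio.conf.d/99-nvidia.conf
if [ -r "$conf" ]; then
  if [ "$(crio_default_runtime "$conf")" = "nvidia" ]; then
    echo "$conf sets nvidia as the default runtime"
  else
    echo "$conf does not set nvidia as the default runtime" >&2
  fi
fi
```

Once the check passes, restart the runtime so the change takes effect, for example with `sudo systemctl restart crio` on systemd-based hosts.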
CRI-O uses crun as default low-level OCI runtime so crun needs to be added to the runtimes of the nvidia-container-runtime in the config file at…