NVIDIA/k8s-device-plugin
Description: NVIDIA device plugin for Kubernetes
Language: Go
License: Apache-2.0
Stars: 3785
Forks: 830
Open issues: 73
Created: 2017-10-10T21:31:02Z
Pushed: 2026-06-10T18:17:49Z
Default branch: main
Fork: no
Archived: no
README:
NVIDIA device plugin for Kubernetes
Table of Contents
- [About](#about)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Preparing your GPU Nodes](#preparing-your-gpu-nodes)
- [Example for debian-based systems with `docker` and `containerd`](#example-for-debian-based-systems-with-docker-and-containerd)
- [Install the NVIDIA Container Toolkit](#install-the-nvidia-container-toolkit)
- [Notes on `CRI-O` configuration](#notes-on-cri-o-configuration)
- [Enabling GPU Support in Kubernetes](#enabling-gpu-support-in-kubernetes)
- [Running GPU Jobs](#running-gpu-jobs)
- [Configuring the NVIDIA device plugin binary](#configuring-the-nvidia-device-plugin-binary)
- [As command line flags or envvars](#as-command-line-flags-or-envvars)
- [As a configuration file](#as-a-configuration-file)
- [Configuration Option Details](#configuration-option-details)
- [Shared Access to GPUs](#shared-access-to-gpus)
- [With CUDA Time-Slicing](#with-cuda-time-slicing)
- [With CUDA MPS](#with-cuda-mps)
- [IMEX Support](#imex-support)
- [Catalog of Labels](#catalog-of-labels)
- [Deployment via `helm`](#deployment-via-helm)
- [Configuring the device plugin's `helm` chart](#configuring-the-device-plugins-helm-chart)
- [Passing configuration to the plugin via a `ConfigMap`](#passing-configuration-to-the-plugin-via-a-configmap)
- [Single Config File Example](#single-config-file-example)
- [Multiple Config File Example](#multiple-config-file-example)
- [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label)
- [Setting other helm chart values](#setting-other-helm-chart-values)
- [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels)
- [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode)
- [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)
- [Building and Running Locally](#building-and-running-locally)
- Advanced Topics
- [Using CDI](#docs/cdi/md)
- [With Docker](#with-docker)
- [Build](#build)
- [Run](#run)
- [Without Docker](#without-docker)
- [Build](#build-1)
- [Run](#run-1)
- [Changelog](#changelog)
- [Issues and Contributing](#issues-and-contributing)
- [Versioning](#versioning)
- [Upgrading Kubernetes with the Device Plugin](#upgrading-kubernetes-with-the-device-plugin)
About
The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:
- Expose the number of GPUs on each node of your cluster
- Keep track of the health of your GPUs
- Run GPU-enabled containers in your Kubernetes cluster
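The first two capabilities above (advertising GPUs and tracking their health) can be pictured with a small model. This is only a sketch: the `Device` type and `summarize` function are hypothetical stand-ins invented for illustration, not the plugin's actual types or the kubelet device plugin API. The resource name `nvidia.com/gpu` is the name under which the plugin's GPUs are requested in pod specs.

```go
package main

import "fmt"

// Device is a hypothetical, simplified record for one GPU on a node:
// an identifier plus the result of the most recent health check.
// It is NOT the plugin's real data structure.
type Device struct {
	ID      string
	Healthy bool
}

// summarize returns how many devices a node reports in total and how
// many of them are currently healthy (and therefore usable by pods).
func summarize(devices []Device) (total, healthy int) {
	for _, d := range devices {
		total++
		if d.Healthy {
			healthy++
		}
	}
	return total, healthy
}

func main() {
	// Made-up inventory for a three-GPU node with one failing device.
	devices := []Device{
		{ID: "GPU-0", Healthy: true},
		{ID: "GPU-1", Healthy: true},
		{ID: "GPU-2", Healthy: false},
	}
	total, healthy := summarize(devices)
	fmt.Printf("nvidia.com/gpu: %d reported, %d healthy\n", total, healthy)
}
```

In this model, a pod asking for `nvidia.com/gpu` could only be placed on the node while at least one device remains healthy.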
This repository contains NVIDIA's official implementation of the Kubernetes device plugin. As of v0.15.0, this repository also holds the implementation for GPU Feature Discovery labels. For further information on GPU Feature Discovery, see [the GPU Feature Discovery README](docs/gpu-feature-discovery/README.md).
Please note that:
- The NVIDIA device plugin API is beta as of Kubernetes v1.10.
- The NVIDIA device plugin is currently lacking:
  - Comprehensive GPU health checking features
  - GPU cleanup features
- Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).
Prerequisites
The prerequisites for running the NVIDIA device plugin are:
- NVIDIA drivers ~= 384.81
- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
- nvidia-container-runtime configured as the default low-level runtime
- Kubernetes version >= 1.10
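The `>=` requirements above need a numeric comparison of version components: a plain string comparison would wrongly rank `1.9` above `1.10`. The following is a minimal sketch of such a check. The helper names `parseVersion` and `atLeast` and the sample versions are assumptions made for illustration and are not part of the plugin. The `~=` driver constraint is deliberately left out, since its exact semantics are not spelled out here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "1.7.0" or "v1.10"
// into its numeric components.
func parseVersion(s string) ([]int, error) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "v")
	if s == "" {
		return nil, fmt.Errorf("empty version")
	}
	parts := strings.Split(s, ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid component %q in version %q", p, s)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version have is greater than or equal to
// version want. Missing trailing components count as zero, so
// "1.10" and "1.10.0" compare as equal.
func atLeast(have, want string) (bool, error) {
	h, err := parseVersion(have)
	if err != nil {
		return false, err
	}
	w, err := parseVersion(want)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(h) || i < len(w); i++ {
		var a, b int
		if i < len(h) {
			a = h[i]
		}
		if i < len(w) {
			b = w[i]
		}
		if a != b {
			return a > b, nil
		}
	}
	return true, nil
}

func main() {
	// Hypothetical installed versions; on a real node these would be
	// read from the tools themselves.
	checks := []struct{ name, have, want string }{
		{"nvidia-container-toolkit", "1.11.0", "1.7.0"},
		{"Kubernetes", "1.9", "1.10"},
	}
	for _, c := range checks {
		ok, err := atLeast(c.have, c.want)
		if err != nil {
			fmt.Printf("%s: %v\n", c.name, err)
			continue
		}
		fmt.Printf("%s %s >= %s: %t\n", c.name, c.have, c.want, ok)
	}
}
```

With the sample values, the toolkit check passes and the Kubernetes check fails, because `9 < 10` when the minor component is compared as a number.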
Quick Start
Preparing your GPU Nodes
The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use.
Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Example for debian-based systems with docker and containerd
##### Install the NVIDIA Container Toolkit
For instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the installation guide.
Also note the configuration instructions in that guide for the container runtime you use (for example, containerd, CRI-O, or docker), and remember to restart each runtime after applying the configuration changes.
If the nvidia runtime should be set as the default runtime (with non-cri docker versions, for example), the --set-as-default argument must also be included in the commands above. If this is not done, a RuntimeClass needs to be defined:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
##### Notes on CRI-O configuration
When running Kubernetes with CRI-O, add a config file at /etc/crio/crio.conf.d/99-nvidia.conf that sets the nvidia-container-runtime as the default low-level OCI runtime. This file takes priority over the default crun config file at /etc/crio/crio.conf.d/10-crun.conf:
[crio]

  [crio.runtime]
    default_runtime = "nvidia"

    [crio.runtime.runtimes]

      [crio.runtime.runtimes.nvidia]
        runtime_path = "/usr/bin/nvidia-container-runtime"
        runtime_type = "oci"
As stated in the linked documentation, this file can automatically be generated with the nvidia-ctk command:
sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf
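The priority of 99-nvidia.conf over 10-crun.conf described above can be illustrated with a small model of drop-in precedence. This sketch assumes that drop-in files are applied in lexical filename order and that a later file overrides a value set by an earlier one. The function `effectiveDefaultRuntime`, the map representation, and the assumption that 10-crun.conf sets `default_runtime = "crun"` are illustrative only and are not CRI-O's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// effectiveDefaultRuntime takes a map of drop-in filename to the
// default_runtime value that file sets (an empty string means the file
// does not set it). It visits the files in lexical order and lets each
// later file override the earlier ones, returning the winning value
// and the file it came from.
func effectiveDefaultRuntime(dropIns map[string]string) (runtime, from string) {
	names := make([]string, 0, len(dropIns))
	for name := range dropIns {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if v := dropIns[name]; v != "" {
			runtime, from = v, name
		}
	}
	return runtime, from
}

func main() {
	// Assumed contents of /etc/crio/crio.conf.d/ for this illustration.
	dropIns := map[string]string{
		"10-crun.conf":   "crun",
		"99-nvidia.conf": "nvidia",
	}
	rt, from := effectiveDefaultRuntime(dropIns)
	fmt.Printf("default_runtime = %q (from %s)\n", rt, from)
	// prints: default_runtime = "nvidia" (from 99-nvidia.conf)
}
```

Under this model, the `99-` prefix is what makes the NVIDIA setting win: any file that sorts after `10-crun.conf` and sets `default_runtime` replaces the crun default.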
CRI-O uses crun as default low-level OCI runtime so crun…