coreweave/k8s-device-plugin

forked from NVIDIA/k8s-device-plugin

Description: NVIDIA device plugin for Kubernetes

Language: Go

License: Apache-2.0

Stars: 1

Forks: 2

Open issues: 3

Created: 2023-04-03T17:07:46Z

Pushed: 2026-06-01T21:40:06Z

Default branch: coreweave

Fork: yes

Parent repository: NVIDIA/k8s-device-plugin

Archived: no

README:

NVIDIA device plugin for Kubernetes

Table of Contents

  • [About](#about)
  • [Prerequisites](#prerequisites)
  • [Quick Start](#quick-start)
  • [Preparing your GPU Nodes](#preparing-your-gpu-nodes)
  • [Example for debian-based systems with docker and containerd](#example-for-debian-based-systems-with-docker-and-containerd)
  • [Install the NVIDIA Container Toolkit](#install-the-nvidia-container-toolkit)
  • [Notes on CRI-O configuration](#notes-on-cri-o-configuration)
  • [Enabling GPU Support in Kubernetes](#enabling-gpu-support-in-kubernetes)
  • [Running GPU Jobs](#running-gpu-jobs)
  • [Configuring the NVIDIA device plugin binary](#configuring-the-nvidia-device-plugin-binary)
  • [As command line flags or envvars](#as-command-line-flags-or-envvars)
  • [As a configuration file](#as-a-configuration-file)
  • [Configuration Option Details](#configuration-option-details)
  • [Shared Access to GPUs](#shared-access-to-gpus)
  • [With CUDA Time-Slicing](#with-cuda-time-slicing)
  • [With CUDA MPS](#with-cuda-mps)
  • [IMEX Support](#imex-support)
  • [Catalog of Labels](#catalog-of-labels)
  • [Deployment via helm](#deployment-via-helm)
  • [Configuring the device plugin's helm chart](#configuring-the-device-plugins-helm-chart)
  • [Passing configuration to the plugin via a ConfigMap](#passing-configuration-to-the-plugin-via-a-configmap)
  • [Single Config File Example](#single-config-file-example)
  • [Multiple Config File Example](#multiple-config-file-example)
  • [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label)
  • [Setting other helm chart values](#setting-other-helm-chart-values)
  • [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels)
  • [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode)
  • [Deploying via helm install with a direct URL to the helm package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)
  • [Building and Running Locally](#building-and-running-locally)
  • [With Docker](#with-docker)
  • [Build](#build)
  • [Run](#run)
  • [Without Docker](#without-docker)
  • [Build](#build-1)
  • [Run](#run-1)
  • [Changelog](#changelog)
  • [Issues and Contributing](#issues-and-contributing)
  • [Versioning](#versioning)
  • [Upgrading Kubernetes with the Device Plugin](#upgrading-kubernetes-with-the-device-plugin)

About

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Expose the number of GPUs on each node of your cluster
  • Keep track of the health of your GPUs
  • Run GPU-enabled containers in your Kubernetes cluster

This repository contains NVIDIA's official implementation of the Kubernetes device plugin. As of v0.15.0, this repository also holds the implementation for GPU Feature Discovery labels. For more information on GPU Feature Discovery, see [the GPU Feature Discovery README](docs/gpu-feature-discovery/README.md).

Please note that:

  • The NVIDIA device plugin API is beta as of Kubernetes v1.10.
  • The NVIDIA device plugin is currently lacking:
      • Comprehensive GPU health checking features
      • GPU cleanup features
  • Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin).

Prerequisites

The prerequisites for running the NVIDIA device plugin are:

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10
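A quick way to check the driver prerequisite on a node is to compare the installed driver version against the documented baseline. The following is a minimal sketch, not part of the plugin: `version_ge` and `MIN_DRIVER` are names made up for this example, and it assumes that treating 384.81 as a lower bound is an acceptable reading of the `~= 384.81` constraint above. It relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed NVIDIA driver version with the documented baseline.
# MIN_DRIVER and version_ge are hypothetical names used only in this example.

MIN_DRIVER="384.81"

# Succeeds when version $1 is greater than or equal to version $2.
# sort -V orders version strings, so the smaller of the two comes first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Read the driver version reported for the first GPU on this node.
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$driver" "$MIN_DRIVER"; then
    echo "driver $driver meets the $MIN_DRIVER baseline"
  else
    echo "driver $driver is older than $MIN_DRIVER"
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

The same `version_ge` helper could be reused for the container toolkit and Kubernetes versions once you have extracted their version strings.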

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This README assumes that the NVIDIA drivers and the nvidia-container-toolkit have been pre-installed. It also assumes that you have configured the nvidia-container-runtime as the default low-level runtime to use.

Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Example for debian-based systems with docker and containerd

Install the NVIDIA Container Toolkit

For instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the installation guide.

Also note the configuration instructions in that guide for your container runtime (containerd, CRI-O, or Docker), and remember to restart each runtime after applying the configuration changes.

If the nvidia runtime should be set as the default runtime (required for docker), the --set-as-default argument must also be included in the nvidia-ctk runtime configure commands from that guide. If this is not done, a RuntimeClass needs to be defined.
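For the case where the nvidia runtime is not the default, a RuntimeClass could look like the following sketch. The file name and the class name `nvidia` are assumptions chosen for this example; the `handler` value must match the runtime name registered in your container runtime's configuration, which is assumed here to be `nvidia`.

```shell
# Sketch: write a RuntimeClass manifest that maps to the nvidia runtime handler.
# The file name and the class name "nvidia" are assumptions for this example;
# the handler must match the runtime name configured in containerd or CRI-O.
cat > nvidia-runtimeclass.yaml <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```

The manifest can then be applied with `kubectl apply -f nvidia-runtimeclass.yaml`, and pods that need GPU access would opt in by setting `runtimeClassName: nvidia` in their spec.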

Notes on CRI-O configuration

When running Kubernetes with CRI-O, add a config file at /etc/crio/crio.conf.d/99-nvidia.conf that sets nvidia-container-runtime as the default low-level OCI runtime. This file takes priority over the default crun config file at /etc/crio/crio.conf.d/10-crun.conf:

[crio]

[crio.runtime]
default_runtime = "nvidia"

[crio.runtime.runtimes]

[crio.runtime.runtimes.nvidia]
runtime_path = "/usr/bin/nvidia-container-runtime"
runtime_type = "oci"

As stated in the linked documentation, this file can be generated automatically with the nvidia-ctk command:

sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf

CRI-O uses crun as its default low-level OCI runtime, so crun needs to be added to the runtimes of the nvidia-container-runtime in the config file at…
