RepoNVIDIANVIDIApublished Nov 22, 2024seen 5d

NVIDIA/nodewright-packages

Python

Open original ↗

Captured source

source ↗
published Nov 22, 2024seen 5dcaptured 13hhttp 200method plain

NVIDIA/nodewright-packages

Description: Packages for the Skyhook Kubernetes Operator.

Language: Python

License: Apache-2.0

Stars: 9

Forks: 1

Open issues: 6

Created: 2024-11-22T17:40:35Z

Pushed: 2026-06-10T17:27:19Z

Default branch: main

Fork: no

Archived: no

README:

NodeWright Packages

*Formerly known as Skyhook Packages*

This repository contains pre-built packages for the NVIDIA NodeWright Operator, a Kubernetes-aware package manager for cluster administrators to safely modify and maintain underlying hosts declaratively at scale.

> Note: NodeWright is being renamed from Skyhook. Code, CRDs, Helm charts, and CLI commands still use skyhook for now. The rename will roll out incrementally to avoid breaking changes.

Overview

NodeWright packages follow a well-defined lifecycle with multiple stages (apply, config, interrupt, post-interrupt, upgrade, uninstall) that ensure proper installation, configuration, and management of node-level changes. Each package in this repository implements these lifecycle stages according to its specific purpose.

For detailed information about package lifecycle stages, see [PACKAGE_LIFECYCLE.md](./PACKAGE_LIFECYCLE.md).

Available Packages

1. Shellscript Package (shellscript/)

A versatile package that allows you to run arbitrary bash scripts defined in your NodeWright Custom Resource configMaps.

Capabilities:

  • Execute custom bash scripts during any lifecycle stage
  • Full lifecycle support (apply, config, post-interrupt, uninstall with checks)
  • Configurable through configMaps
  • Useful for custom automation and system modifications

Example use cases:

  • Custom software installation
  • System configuration changes
  • File management operations
  • Service management tasks

2. Tuning Package (tuning/)

A specialized package for system-level tuning and configuration management.

Capabilities:

  • System service configuration via drop-in files
  • Kernel parameter tuning (sysctl)
  • User limit configuration (ulimits)
  • Container runtime limit configuration
  • GRUB configuration management
  • Support for different interrupt types based on configuration changes

Supported configuration types:

  • grub.conf - GRUB kernel parameters (requires reboot)
  • sysctl.conf - Kernel parameters (requires reboot or service restart)
  • ulimit.conf - User limits (immediate effect + container limits on reboot)
  • service_{name}.conf - Systemd service configurations (requires service restart)

3. Tuned Package (tuned/)

A package for managing the tuned system tuning daemon on Linux systems for automated performance optimization.

Capabilities:

  • Multi-distribution support (Ubuntu/Debian, CentOS/RHEL/Amazon Linux, Fedora)
  • Automated tuned package installation and service management
  • Custom tuned profile deployment via configmaps
  • Built-in profile validation and verification
  • Handles necessary interrupts for tuning parameters that require reboots/restarts
  • Comprehensive installation and configuration validation

Key features:

  • Deploy custom tuned profiles from configmaps
  • Apply system-wide performance tuning profiles
  • Automatic service lifecycle management (install, configure, validate, uninstall)
  • Support for built-in profiles (balanced, powersave, throughput-performance, etc.)
  • Idempotent operations safe for repeated execution

4. Kdump Package (kdump/)

A package for automated installation and configuration of kdump crash dump collection on Linux systems.

Capabilities:

  • Multi-distribution support (Ubuntu/Debian, CentOS/RHEL/Amazon Linux, Fedora)
  • Automated kdump package installation and service management
  • Crashkernel parameter configuration in GRUB
  • Custom kdump configuration deployment via configmaps
  • Comprehensive validation and health checks
  • Safe uninstallation with complete cleanup

Key features:

  • Configure kernel crash dump functionality for debugging system failures
  • Automatic crashkernel memory reservation in GRUB
  • Support for custom kdump.conf configurations
  • Post-interrupt validation of crash kernel functionality
  • Complete lifecycle management (install, configure, validate, uninstall)

5. NVIDIA Setup Package (nvidia-setup/)

A package that applies the same node setup steps as the dgxcloud_aws_eks VMI for selected (service, accelerator) combinations. Runs after the machine is up (NodeWright on a live node).

Capabilities:

  • Opinionated defaults per (service, accelerator) with optional env overrides (EIDOS_KERNEL, EIDOS_EFA, EIDOS_LUSTRE)
  • Apply: upgrade, EFA driver, Lustre client, chrony, setup-local-disks (install and run); reboot may be required after apply
  • Apply-check: validate all steps for the selected combination are complete
  • Supported combinations: eks-h100, eks-gb200 (extensible via case + functions in apply.sh / apply_check.sh)

Key features:

  • ConfigMap: service and accelerator only; versions baked in defaults/*.conf
  • No OFI, hardening, or system-node-settings; see [nvidia-setup README](./nvidia-setup/README.md)

6. NVIDIA Tuning GKE Package (nvidia-tuning-gke/)

Extends the tuning package with baked-in H100 and GB200 configs for GKE Container Optimized OS. You supply only accelerator and intent; the package selects the matching sysctl (and optional containerd drop-in) and runs the base tuning apply. No grub—GKE nodes do not use grub. Note: this is a limited set from nvidia-tuned due to the limitations of the mainly read-only OS. For non COS GKE setups consider updating nvidia-tuned to support gke and use the base profiles.

Capabilities:

  • Sysctl and service drop-ins derived from [nvidia-tuned](./nvidia-tuned/)
  • ConfigMap: accelerator (h100, gb200) and intent (inference, multiNodeTraining)
  • Baked-in profiles under profiles/{accelerator}/{intent}/

Key features:

  • No manual sysctl.conf authoring; profile content is fixed in the image
  • See [nvidia-tuning-gke README](./nvidia-tuning-gke/README.md)

7. Copy-Fail Package (copy-fail/)

A temporary mitigation package for CVE-2026-31431 ("Copy Fail") that disables the vulnerable algif_aead kernel module.

Capabilities:

  • Writes /etc/modprobe.d/disable-algif.conf to prevent algif_aead from loading at boot
  • Best-effort rmmod algif_aead for the running kernel (tolerates the module being in use)
  • Strict config-check that loudly fails when algif_aead is still loaded; opt-out via…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

New repo, low stars.