RepoNVIDIANVIDIApublished Oct 2, 2025seen 15h

NVIDIA/open-nvdebug

Python

Open original ↗

Captured source

source ↗
published Oct 2, 2025seen 15hcaptured 15hhttp 200method plain

NVIDIA/open-nvdebug

Description: Tool to collect debug logs from NVIDIA server components, in band and out-of-band.

Language: Python

License: Apache-2.0

Stars: 6

Forks: 0

Open issues: 0

Created: 2025-10-02T15:44:51Z

Pushed: 2026-06-11T05:57:30Z

Default branch: main

Fork: no

Archived: no

README:

OPEN-NVDEBUG

> SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. > > SPDX-License-Identifier: Apache-2.0

Description

open-nvdebug is NVIDIA's comprehensive diagnostic collection tool that gathers system information from NVIDIA server platforms to troubleshoot issues effectively. It collects data through multiple methods including Out-of-Band (OOB) access via BMC and In-Band (IB) access via host systems using Redfish, SSH, and IPMI protocols.

Features

  • Comprehensive Data Collection: Gathers logs from multiple sources in a single command
  • Out-of-Band (OOB): Remote collection via BMC using Redfish and IPMI
  • In-Band (IB): Direct collection from host operating system via SSH
  • Combined Mode: Simultaneous OOB and IB collection for complete diagnostics
  • Multi-Protocol Support:
  • Redfish API: BMC log collection via Redfish interface
  • SSH: Direct SSH access to BMC and host systems
  • IPMI: IPMI-over-LAN for BMC communication
  • Broad Platform Support: Supports NVIDIA HGX™, MGX™, GB series, GH series, and Workstation platforms
  • Automated Platform Detection: Automatically detects baseboard type and platform architecture
  • Remote & Local Operation: Works from remote machines or directly on the target system
  • Standardized Output: Generates structured logs with HTML reports for easy analysis
  • Parallel Collection: Optimized multi-threaded collection for faster performance
  • Configurable Collectors: Spreadsheet-driven collector definitions for easy customization

Prerequisites

Before you begin, ensure you have met the following requirements:

Client Host Requirements

  • Operating System: Linux-based OS (Ubuntu 24.04 recommended, Ubuntu 20.04+ supported)
  • Kernel: Linux Kernel 4.4 or later (4.15+ recommended)
  • Python: Python 3.12 (required)
  • Required Packages:
sudo apt-get install ipmitool sshpass
  • Hardware: Minimum 4GB RAM, 2GB free disk space
  • Network: Access to target systems via BMC (Redfish/IPMI) and SSH

Server/BMC Requirements

For full functionality, target systems should have:

  • BMC accessible via Redfish, SSH, and IPMI-over-LAN
  • For host collection: SSH access to host OS with sudo privileges
  • For advanced collectors: Additional tools installed (nvme-cli, pciutils, dmidecode, lshw, nvidia-fabricmanager, mft-tools, NVIDIA Graphics Driver, doca-sosreport v4.8.0+, etc.)

Quick Start

Get started with nvdebug in 5 minutes:

Step 1: Verify Installation

python -m src.tool.main --version

Step 2: Run Your First Collection

Out-of-Band Collection (OOB)

Collect logs remotely via BMC without host OS access:

python -m src.tool.main collect -i -u -p

In-Band Collection (IB)

Collect logs directly from the host OS:

python -m src.tool.main collect -I -U -H

Combined OOB + IB Collection

Collect both BMC and host logs:

python -m src.tool.main collect -i -u -p \
-I -U -H

Step 3: Specify Baseboard (Optional)

nvdebug automatically detects your baseboard, but you can specify it manually:

# List available baseboards
python -m src.tool.main list-baseboards

# Collect with specific baseboard
python -m src.tool.main collect -i -u -p -b ""

Example Collection

# ARM64 system () with auto-detection
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123

# With verbose output for detailed progress
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123 -v

# With custom output directory
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123 -o /tmp/my_logs

# Combined OOB and IB collection for
python -m src.tool.main collect -i 192.168.1.100 -u bmc_user -p bmc_pass \
-I 192.168.1.101 -U host_user -H host_pass \
-b "" -o /tmp/nvdebug_output

Advanced Usage

Local Mode

Run nvdebug directly on the target system:

# With BMC access
python -m src.tool.main collect -i -u -p --local

# Without BMC access (host-only collection)
python -m src.tool.main collect --local

Preflight Checks

Run preflight checks to verify system readiness before collection:

python -m src.tool.main preflight -i -u -p

List Available Resources

# List all supported baseboards
python -m src.tool.main list-baseboards

# List all available collectors
python -m src.tool.main list-collectors

# List collectors for specific baseboard
python -m src.tool.main list-collectors -b ""

Configuration File Usage

Create a DUT configuration file (dut_config.yaml) for repeated collections:

duts:
- name: -node-01
bmc_ip: 192.168.1.100
bmc_user: admin
bmc_pass: password123
host_ip: 192.168.1.101
host_user: host_user
host_pass: host_password
baseboard: ""

Run collection using configuration file:

python -m src.tool.main collect --dut-config dut_config.yaml

Collection Options

# Verbose output for detailed progress
python -m src.tool.main collect -i -u -p -v

# Very verbose output for debugging
python -m src.tool.main collect -i -u -p -vv

# Specify custom output directory
python -m src.tool.main collect -i -u -p -o /custom/path

# Specify baseboard manually (skip auto-detection)
python -m src.tool.main collect -i -u -p -b ""

Understanding Output

After running nvdebug, you'll find a timestamped directory containing all collected data:

nvdebug_logs__/
├── .log_signature.txt # Log integrity verification
├── .nvdebug_stdout.log # nvdebug console output
├── reports/ # HTML reports
│ ├── index.html # Main summary report
│ ├── file_map.html # File organization map
│ ├── status_complete.html # Successfully collected data
│ ├── status_error.html # Failed collectors
│ ├── status_partial.html # Partially collected data
│ └── status_skipped.html # Skipped collectors
└── / # Per-device collection
├── config.json # Tool configuration used
├── dut_config.json # Device configuration
├── Execution_Summary_Report.txt # Collection status summary
├── nvdebug_runtime_output.txt # Detailed runtime logs
├──...

Excerpt shown — open the source for the full document.