NVIDIA/open-nvdebug
Description: Tool to collect debug logs from NVIDIA server components, in band and out-of-band.
Language: Python
License: Apache-2.0
Stars: 6
Forks: 0
Open issues: 0
Created: 2025-10-02T15:44:51Z
Pushed: 2026-06-11T05:57:30Z
Default branch: main
Fork: no
Archived: no
README:
OPEN-NVDEBUG
> SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>
> SPDX-License-Identifier: Apache-2.0
Description
open-nvdebug is NVIDIA's diagnostic collection tool. It gathers system information from NVIDIA server platforms to help troubleshoot issues. Data is collected Out-of-Band (OOB) through the BMC and In-Band (IB) through the host system, using the Redfish, SSH, and IPMI protocols.
Features
- Comprehensive Data Collection: Gathers logs from multiple sources in a single command
- Out-of-Band (OOB): Remote collection via BMC using Redfish and IPMI
- In-Band (IB): Direct collection from host operating system via SSH
- Combined Mode: Simultaneous OOB and IB collection for complete diagnostics
- Multi-Protocol Support:
  - Redfish API: BMC log collection via Redfish interface
  - SSH: Direct SSH access to BMC and host systems
  - IPMI: IPMI-over-LAN for BMC communication
- Broad Platform Support: Supports NVIDIA HGX™, MGX™, GB series, GH series, and Workstation platforms
- Automated Platform Detection: Automatically detects baseboard type and platform architecture
- Remote & Local Operation: Works from remote machines or directly on the target system
- Standardized Output: Generates structured logs with HTML reports for easy analysis
- Parallel Collection: Optimized multi-threaded collection for faster performance
- Configurable Collectors: Spreadsheet-driven collector definitions for easy customization
Prerequisites
Before you begin, ensure you have met the following requirements:
Client Host Requirements
- Operating System: Linux-based OS (Ubuntu 24.04 recommended, Ubuntu 20.04+ supported)
- Kernel: Linux Kernel 4.4 or later (4.15+ recommended)
- Python: Python 3.12 (required)
- Required Packages:
sudo apt-get install ipmitool sshpass
- Hardware: Minimum 4GB RAM, 2GB free disk space
- Network: Access to target systems via BMC (Redfish/IPMI) and SSH
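The software items in the list above can be checked before a first run. The following is a minimal sketch, not part of open-nvdebug: the function name `missing_prerequisites` is hypothetical, and reading "Python 3.12 (required)" as an exact minor-version match is an assumption. It only checks the Python version and whether `ipmitool` and `sshpass` are on `PATH`.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 12)                 # "Python 3.12 (required)" in the list above
REQUIRED_TOOLS = ("ipmitool", "sshpass")  # packages named in the list above


def missing_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper: it assumes an exact 3.12 match is what the README requires.
    """
    if version_info is None:
        version_info = sys.version_info
    problems = []
    if tuple(version_info[:2]) != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python 3.12 required, found {found}")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the local environment.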
Server/BMC Requirements
For full functionality, target systems should have:
- BMC accessible via Redfish, SSH, and IPMI-over-LAN
- For host collection: SSH access to host OS with sudo privileges
- For advanced collectors: Additional tools installed (nvme-cli, pciutils, dmidecode, lshw, nvidia-fabricmanager, mft-tools, NVIDIA Graphics Driver, doca-sosreport v4.8.0+, etc.)
Quick Start
Get started with nvdebug in 5 minutes:
Step 1: Verify Installation
python -m src.tool.main --version
Step 2: Run Your First Collection
Out-of-Band Collection (OOB)
Collect logs remotely via BMC without host OS access:
python -m src.tool.main collect -i -u -p
In-Band Collection (IB)
Collect logs directly from the host OS:
python -m src.tool.main collect -I -U -H
Combined OOB + IB Collection
Collect both BMC and host logs:
python -m src.tool.main collect -i -u -p \
  -I -U -H
Step 3: Specify Baseboard (Optional)
nvdebug automatically detects your baseboard, but you can specify it manually:
# List available baseboards
python -m src.tool.main list-baseboards

# Collect with specific baseboard
python -m src.tool.main collect -i -u -p -b ""
Example Collection
# ARM64 system () with auto-detection
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123

# With verbose output for detailed progress
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123 -v

# With custom output directory
python -m src.tool.main collect -i 192.168.1.100 -u admin -p password123 -o /tmp/my_logs

# Combined OOB and IB collection for
python -m src.tool.main collect -i 192.168.1.100 -u bmc_user -p bmc_pass \
  -I 192.168.1.101 -U host_user -H host_pass \
  -b "" -o /tmp/nvdebug_output
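When collections are scripted, the command line can be assembled as an argument list instead of a shell string. The sketch below is an illustration only: `build_collect_argv` is a hypothetical helper, it uses just the flags that appear in the examples above (`-i`, `-u`, `-p`, `-I`, `-U`, `-H`, `-b`, `-o`, `-v`, `-vv`), and it assumes the order of options does not matter to the tool.

```python
import sys


def build_collect_argv(bmc=None, host=None, baseboard=None, output_dir=None, verbosity=0):
    """Assemble the argv for `python -m src.tool.main collect`.

    bmc and host are optional (ip, user, password) tuples. Hypothetical helper;
    only flags shown in the README examples are emitted.
    """
    argv = [sys.executable, "-m", "src.tool.main", "collect"]
    if bmc is not None:
        ip, user, password = bmc
        argv += ["-i", ip, "-u", user, "-p", password]
    if host is not None:
        ip, user, password = host
        argv += ["-I", ip, "-U", user, "-H", password]
    if baseboard:
        argv += ["-b", baseboard]
    if output_dir:
        argv += ["-o", output_dir]
    if verbosity > 0:
        # The README shows -v and -vv; higher levels are capped at -vv here.
        argv.append("-" + "v" * min(verbosity, 2))
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the repository root. Passing a list avoids shell quoting issues, but credentials given as arguments remain visible in the process list.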
Advanced Usage
Local Mode
Run nvdebug directly on the target system:
# With BMC access
python -m src.tool.main collect -i -u -p --local

# Without BMC access (host-only collection)
python -m src.tool.main collect --local
Preflight Checks
Run preflight checks to verify system readiness before collection:
python -m src.tool.main preflight -i -u -p
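A script can run the preflight step first and skip the collection if it fails. This is a sketch under one loudly stated assumption: that `preflight` exits with a non-zero status when a check fails, which the README does not document. `run_until_failure` is a hypothetical helper, not part of the tool.

```python
import subprocess


def run_until_failure(commands, runner=subprocess.run):
    """Run argv lists in order and stop at the first non-zero exit status.

    Returns the index of the failing command, or None if all succeeded.
    ASSUMPTION: a failed preflight is signalled by a non-zero exit status.
    """
    for index, argv in enumerate(commands):
        completed = runner(argv)
        if completed.returncode != 0:
            return index
    return None
```

Called with the `preflight` argv followed by the `collect` argv, the helper returns `0` if preflight failed (and never starts the collection), `1` if the collection failed, and `None` if both succeeded.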
List Available Resources
# List all supported baseboards
python -m src.tool.main list-baseboards

# List all available collectors
python -m src.tool.main list-collectors

# List collectors for specific baseboard
python -m src.tool.main list-collectors -b ""
Configuration File Usage
Create a DUT configuration file (dut_config.yaml) for repeated collections:
duts:
  - name: -node-01
    bmc_ip: 192.168.1.100
    bmc_user: admin
    bmc_pass: password123
    host_ip: 192.168.1.101
    host_user: host_user
    host_pass: host_password
    baseboard: ""
Run collection using configuration file:
python -m src.tool.main collect --dut-config dut_config.yaml
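For fleets with many systems, the configuration file can be generated rather than written by hand. The sketch below renders YAML text using only the field names that appear in the example above; `render_dut_config` is a hypothetical helper, the node name in the usage note is made up, and whether the tool accepts entries with some fields omitted is not stated in the README. Values are written as double-quoted strings, which any YAML parser reads as plain strings.

```python
import json

# Field names copied from the dut_config.yaml example above; no others are assumed.
DUT_FIELDS = ("name", "bmc_ip", "bmc_user", "bmc_pass",
              "host_ip", "host_user", "host_pass", "baseboard")


def render_dut_config(duts):
    """Render a list of DUT dicts as YAML text with a top-level `duts:` list."""
    lines = ["duts:"]
    for dut in duts:
        if not dut:
            raise ValueError("empty DUT entry")
        unknown = set(dut) - set(DUT_FIELDS)
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")
        first = True
        for field in DUT_FIELDS:
            if field not in dut:
                continue
            prefix = "  - " if first else "    "
            # json.dumps yields a double-quoted string, which is also a valid YAML scalar.
            lines.append(f"{prefix}{field}: {json.dumps(str(dut[field]))}")
            first = False
    return "\n".join(lines) + "\n"
```

For example, `render_dut_config([{"name": "example-node-01", "bmc_ip": "192.168.1.100"}])` produces a two-field entry that can be written to `dut_config.yaml` with `pathlib.Path.write_text`.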
Collection Options
# Verbose output for detailed progress
python -m src.tool.main collect -i -u -p -v

# Very verbose output for debugging
python -m src.tool.main collect -i -u -p -vv

# Specify custom output directory
python -m src.tool.main collect -i -u -p -o /custom/path

# Specify baseboard manually (skip auto-detection)
python -m src.tool.main collect -i -u -p -b ""
Understanding Output
After running nvdebug, you'll find a timestamped directory containing all collected data:
nvdebug_logs__/
├── .log_signature.txt                # Log integrity verification
├── .nvdebug_stdout.log               # nvdebug console output
├── reports/                          # HTML reports
│   ├── index.html                    # Main summary report
│   ├── file_map.html                 # File organization map
│   ├── status_complete.html          # Successfully collected data
│   ├── status_error.html             # Failed collectors
│   ├── status_partial.html           # Partially collected data
│   └── status_skipped.html           # Skipped collectors
└── /                                 # Per-device collection
    ├── config.json                   # Tool configuration used
    ├── dut_config.json               # Device configuration
    ├── Execution_Summary_Report.txt  # Collection status summary
    ├── nvdebug_runtime_output.txt    # Detailed runtime logs
    ├──...
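After a run, a script can confirm that the HTML reports listed above were actually written. This is a minimal sketch: `report_inventory` is a hypothetical helper, the file names are copied from the listing above, and the caller supplies the path to the output directory because its full name is not shown here.

```python
from pathlib import Path

# File names copied from the reports/ directory in the listing above.
REPORT_FILES = ("index.html", "file_map.html", "status_complete.html",
                "status_error.html", "status_partial.html", "status_skipped.html")


def report_inventory(log_dir):
    """Map each expected report file name to True if it exists under <log_dir>/reports."""
    reports = Path(log_dir) / "reports"
    return {name: (reports / name).is_file() for name in REPORT_FILES}
```

A missing `index.html` in the returned mapping is a quick signal that report generation did not complete for that run.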