Multi-GPU support: redesign monitoring and display pipeline #1296

@shm11C3

Summary

The current architecture assumes a single GPU throughout the stack. Systems with multiple GPUs (e.g., AMD APU with integrated Radeon Graphics + discrete AMD GPU) suffer from data corruption, missing data, and incorrect display.

Current Problems

Backend

  1. Event payload is single-GPU: HardwareMonitorUpdate has flat fields (gpu_usage, gpu_name, gpu_temperature, etc.) for only one GPU
  2. Event emitter discards all but first GPU: emit_hardware_update() in system_monitor.rs uses gpu_samples.first(), throwing away data for every other GPU
  3. [Linux] GPU name collision: lspci.rs returns the first VGA device matching a vendor ID. Both AMD iGPU and dGPU have vendor 0x1002, so they get the same name. This causes data overwrite in the history HashMap and the archive DB
  4. [Linux] get_gpu_usage() returns only first GPU: early return on first AMD card in the DRM enumeration loop
  5. [Windows] get_amd_gpu_usage() averages all adapters: reports meaningless combined usage when iGPU and dGPU coexist
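
The backend problems above can be illustrated with a minimal, self-contained sketch. The `GpuSample` struct, the helper functions, and the sample values are hypothetical and are not taken from the project's code; only the `HashMap<String, VecDeque>` shape keyed by GPU name and the use of `.first()` come from this issue. In this simplified version, two GPUs that share a name end up interleaved in a single history series, which is one way the collision can corrupt per-GPU data.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical per-GPU sample; field names are illustrative only.
#[derive(Debug, Clone)]
pub struct GpuSample {
    pub name: String,
    pub usage: f32,
}

/// Appends each sample to a history keyed by GPU name, mirroring the
/// `HashMap<String, VecDeque>` shape described in this issue.
pub fn record_by_name(
    histories: &mut HashMap<String, VecDeque<f32>>,
    samples: &[GpuSample],
) {
    for s in samples {
        histories.entry(s.name.clone()).or_default().push_back(s.usage);
    }
}

/// Averages usage across all adapters, as the Windows path is described
/// as doing. Returns `None` when there are no samples.
pub fn average_usage(samples: &[GpuSample]) -> Option<f32> {
    if samples.is_empty() {
        return None;
    }
    let total: f32 = samples.iter().map(|s| s.usage).sum();
    Some(total / samples.len() as f32)
}

/// Two samples that share a name, as happens when an iGPU and a dGPU
/// both resolve to the same device name. Values are made up.
pub fn colliding_samples() -> Vec<GpuSample> {
    vec![
        GpuSample { name: "AMD Radeon Graphics".to_string(), usage: 5.0 },
        GpuSample { name: "AMD Radeon Graphics".to_string(), usage: 90.0 },
    ]
}
```

With these two samples, the name-keyed map holds a single entry for both GPUs, `samples.first()` only ever sees the idle iGPU, and the average describes neither device.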

Frontend

  1. Jotai atoms are single-value: graphicUsageHistoryAtom, gpuTempAtom, gpuDedicatedMemoryKbAtom store data for one GPU only
  2. Event listener handles single GPU: useHardwareEventListener.ts reads flat GPU fields from the event payload
  3. Dashboard displays one GPU: DashboardItems.tsx always uses hardwareInfo.gpus?.[0] for real-time data

What Already Works for Multi-GPU

  • sample_amd_gpu() uses get_amd_gpu_usage_per_adapter() (per-adapter with BDF)
  • update_gpu_histories() uses HashMap<String, VecDeque> keyed by GPU name
  • GPU_DATA_ARCHIVE stores per-GPU records with gpu_name
  • Insights page creates tabs per distinct GPU name from DB
  • get_gpu_info() on all platforms returns Vec<GraphicInfo>

Sub-Issues (in dependency order)

  1. #1297: [Linux] Fix GPU name collision for multi-AMD GPU systems
  2. #1298: [Backend] Redesign HardwareMonitorUpdate event payload for multi-GPU
  3. #1299: [Frontend] Redesign atoms and event listener for multi-GPU
  4. #1300: [Frontend] Dashboard multi-GPU display with GPU selector

Design Decisions

  • GPU identification: Use PCI BDF-based gpu_id as the internal key, gpu_name as the display label. Prevents name collision and is consistent with Windows ADL path (deduplicates by BDF)
  • Do NOT change GpuPlatform trait: The trait is only used by the on-demand get_gpu_usage command (unused by frontend). The monitoring loop uses sample_gpu() directly
  • Event payload: Replace flat GPU fields with Vec<GpuMonitorData>. Single-GPU systems emit a one-element array — the frontend handles both cases identically
  • Database: Add optional gpu_id column to GPU_DATA_ARCHIVE. Existing data remains valid since it is queried by gpu_name
  • Frontend: Add selectedGpuIdAtom for GPU selection. Derived atoms maintain backward-compatible single-GPU reads for chart components
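
The identification and payload decisions above could look roughly like the following sketch. The field set of `GpuMonitorData`, the `Bdf` type, and the `build_update` and `update_histories` helpers are assumptions made for illustration; the issue only specifies that the payload carries a `Vec<GpuMonitorData>` and that `gpu_id` is derived from the PCI BDF. Serialization is omitted to keep the example dependency-free.

```rust
use std::collections::{HashMap, VecDeque};
use std::fmt;

/// PCI address (domain:bus:device.function). Hypothetical helper type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Bdf {
    pub domain: u16,
    pub bus: u8,
    pub device: u8,
    pub function: u8,
}

impl fmt::Display for Bdf {
    /// Formats in the conventional hexadecimal `dddd:bb:dd.f` notation.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{:04x}:{:02x}:{:02x}.{:x}",
            self.domain, self.bus, self.device, self.function
        )
    }
}

/// Assumed per-GPU entry in the event payload; the real field set may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuMonitorData {
    pub gpu_id: String,   // internal key derived from the BDF
    pub gpu_name: String, // display label, may collide across GPUs
    pub usage: f32,
    pub temperature: Option<f32>,
}

/// Assumed payload shape: GPU data as a list instead of flat fields.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct HardwareMonitorUpdate {
    pub gpus: Vec<GpuMonitorData>,
}

/// Builds a payload from (BDF, name, usage, temperature) tuples.
/// A single-GPU system simply yields a one-element list.
pub fn build_update(samples: &[(Bdf, &str, f32, Option<f32>)]) -> HardwareMonitorUpdate {
    HardwareMonitorUpdate {
        gpus: samples
            .iter()
            .map(|(bdf, name, usage, temp)| GpuMonitorData {
                gpu_id: bdf.to_string(),
                gpu_name: (*name).to_string(),
                usage: *usage,
                temperature: *temp,
            })
            .collect(),
    }
}

/// Appends usage to a history keyed by `gpu_id`, dropping the oldest
/// entries so each history holds at most `max_len` values.
pub fn update_histories(
    histories: &mut HashMap<String, VecDeque<f32>>,
    update: &HardwareMonitorUpdate,
    max_len: usize,
) {
    for gpu in &update.gpus {
        let history = histories.entry(gpu.gpu_id.clone()).or_default();
        history.push_back(gpu.usage);
        while history.len() > max_len {
            history.pop_front();
        }
    }
}
```

Keying by `gpu_id` keeps two GPUs with an identical `gpu_name` in separate history series, while `gpu_name` stays available for display.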

Scope

  • In scope: Correct per-GPU data collection, storage, and display across all platforms
  • Out of scope: Simultaneous overlay of multiple GPU lines on a single chart (future enhancement)
