Problem
libvgpu.so registers signal handlers for SIGUSR1 and SIGUSR2 using signal(), which overwrites any previously installed handlers without saving them. This causes JVM processes to crash with SIGSEGV in Monitor::wait() because the JVM uses SIGUSR1/SIGUSR2 internally for GC safepoints and thread management.
Observed on HAMi volcano-vgpu nodes (hami-core mode) when running PyTorch jobs with a JVM component. The crash occurs at startup, before CUDA initializes.
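For illustration, one way to avoid clobbering existing handlers is to register with sigaction(), which hands back the previous disposition so it can be saved and called afterwards. The following is a minimal sketch, not the actual libvgpu.so code: every identifier in it (vgpu_install, vgpu_on_signal, prev_usr1, and so on) is hypothetical, and previous_handler merely stands in for whatever the host process, such as a JVM, had installed first.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* All identifiers below are hypothetical; none are taken from libvgpu.so. */
static struct sigaction prev_usr1, prev_usr2;   /* dispositions found at install time */
static volatile sig_atomic_t vgpu_ran = 0;      /* set by our handler */
static volatile sig_atomic_t previous_ran = 0;  /* set by the stand-in "previous" handler */

/* Our handler: do the library's own work, then forward to the saved handler. */
static void vgpu_on_signal(int signo, siginfo_t *info, void *uctx)
{
    const struct sigaction *prev = (signo == SIGUSR1) ? &prev_usr1 : &prev_usr2;

    vgpu_ran = 1;  /* placeholder for the library's real signal work */

    if (prev->sa_flags & SA_SIGINFO) {
        if (prev->sa_sigaction)
            prev->sa_sigaction(signo, info, uctx);
    } else if (prev->sa_handler != SIG_DFL && prev->sa_handler != SIG_IGN) {
        /* SIG_DFL and SIG_IGN are not callable, so they are skipped. */
        prev->sa_handler(signo);
    }
}

/* Install our handler for signo and store the previous disposition in *saved. */
static int vgpu_install(int signo, struct sigaction *saved)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = vgpu_on_signal;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, saved);
}

/* Stand-in for a handler that was already installed by the host process. */
static void previous_handler(int signo)
{
    (void)signo;
    previous_ran = 1;
}

/* Returns 1 if both our handler and the pre-existing handler ran. */
int vgpu_demo_chain(void)
{
    struct sigaction old;
    memset(&old, 0, sizeof old);
    old.sa_handler = previous_handler;
    sigemptyset(&old.sa_mask);

    if (sigaction(SIGUSR1, &old, NULL) != 0)
        return 0;
    if (vgpu_install(SIGUSR1, &prev_usr1) != 0)
        return 0;

    raise(SIGUSR1);  /* delivered synchronously to the calling thread */
    return vgpu_ran && previous_ran;
}
```

Calling vgpu_demo_chain() exercises the chain once. Note that this only preserves handlers installed before the library registers its own. If the host process installs its handlers later, it will replace the library's handler, so install order still matters.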
Additional risks from the current implementation:
- libvgpu.so intercepts dlsym(), which the JVM also uses for native library loading
- The ENSURE_RUNNING() spin loop can cause a deadlock if a Java thread holding a JVM monitor gets suspended