Skip to content

Fix race condition between zombie reaper and RPC command execution#1805

Closed
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
mshipkovenski:sidecar-wait-fix
Closed

Fix race condition between zombie reaper and RPC command execution#1805
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
mshipkovenski:sidecar-wait-fix

Conversation

@mshipkovenski
Copy link
Copy Markdown

The sidecar container had a race condition between the reapZombies goroutine and the RPC command handler (CmdExec.Run). Both called Wait4 on the same child processes:

  • reapZombies called Wait4(-1, WNOHANG) to reap any exited child.
  • CmdExec.Run called cmd.Run(), which internally calls cmd.Wait(), which calls waitid(P_PID, ) on the specific child process.

Since the kernel only stores one exit status per process and the first Wait4/waitid call to match a PID consumes it, the reaper could steal the exit status before cmd.Wait() collected it. This caused cmd.Wait() to fail with "waitid: no child processes" (ECHILD), which surfaced in kapp-controller logs as:

Fetching resources: waitid: no child processes

The fix introduces a centralized Reaper that is the sole caller of Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start() instead of cmd.Run() and registers the child PID with the Reaper via a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to tracked PIDs through their channels, and silently discards statuses for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start() and PID registration are atomic with respect to the reaping loop, preventing the Reaper from consuming an exit status before the channel is registered.

Made-with: Cursor

The sidecar container had a race condition between the reapZombies
goroutine and the RPC command handler (CmdExec.Run). Both called
Wait4 on the same child processes:

- reapZombies called Wait4(-1, WNOHANG) to reap any exited child.
- CmdExec.Run called cmd.Run(), which internally calls cmd.Wait(),
  which calls waitid(P_PID, <pid>) on the specific child process.

Since the kernel only stores one exit status per process and the first
Wait4/waitid call to match a PID consumes it, the reaper could steal
the exit status before cmd.Wait() collected it. This caused cmd.Wait()
to fail with "waitid: no child processes" (ECHILD), which surfaced in
kapp-controller logs as:

  Fetching resources: waitid: no child processes

The fix introduces a centralized Reaper that is the sole caller of
Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start()
instead of cmd.Run() and registers the child PID with the Reaper via
a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to
tracked PIDs through their channels, and silently discards statuses
for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start()
and PID registration are atomic with respect to the reaping loop,
preventing the Reaper from consuming an exit status before the
channel is registered.

Made-with: Cursor

ai-assisted=yes

Made-with: Cursor
Signed-off-by: Miroslav Shipkovenski <miroslav.shipkovenski@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants