Skip to content

Fix: Detect init container failure in EphemeralRunner controller#4457

Open
Okabe-Junya wants to merge 1 commit intoactions:masterfrom
mercari:Okabe-Junya/fix/init-container-failure-detection
Open

Fix: Detect init container failure in EphemeralRunner controller#4457
Okabe-Junya wants to merge 1 commit intoactions:masterfrom
mercari:Okabe-Junya/fix/init-container-failure-detection

Conversation

@Okabe-Junya
Copy link
Copy Markdown

@Okabe-Junya Okabe-Junya commented Apr 17, 2026

What

Add explicit InitContainerStatuses inspection to the EphemeralRunner reconciler so that init container failures are detected even before the pod phase transitions to Failed

  • Add initContainerFailed() helper that checks pod.Status.InitContainerStatuses for any terminated container with a non-zero exit code
  • Add a new branch in the reconcile switch statement (between the PodFailed and cs == nil cases) to handle init container failures
  • Add two test cases: init container failure without job assigned (pod re-creation) and with job assigned (EphemeralRunner deletion)

Why

Ref. #4452

When an init container fails (e.g., Init:StartError due to a containerd runtime error), the runner container never starts, so runnerContainerStatus() returns nil. If the pod phase has not yet transitioned to Failed, the controller falls into the cs == nil branch and logs "Waiting for runner container status to be available" without taking any action. This can cause EphemeralRunners to remain stuck in Pending status indefinitely, consuming runner pool capacity

@Okabe-Junya Okabe-Junya requested a review from mumoshu as a code owner April 17, 2026 07:55
Copilot AI review requested due to automatic review settings April 17, 2026 07:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the EphemeralRunner reconciler to detect init container failures via pod.Status.InitContainerStatuses so runners don’t get stuck in Pending when the main runner container never starts (and therefore has no ContainerStatus yet).

Changes:

  • Add initContainerFailed() helper to detect init container failures from pod status.
  • Add a reconcile branch to treat init-container failure as a pod failure and trigger deletion/restart (or EphemeralRunner deletion if a job is assigned).
  • Add controller tests covering init container failure with/without a job assigned.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
controllers/actions.github.com/ephemeralrunner_controller.go Adds init container failure detection and a new reconcile branch to restart/cleanup stuck runners earlier.
controllers/actions.github.com/ephemeralrunner_controller_test.go Adds test cases for init container failure handling paths (recreate pod vs delete EphemeralRunner).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

cs := &pod.Status.InitContainerStatuses[i]
if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
return true
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initContainerFailed only checks cs.State.Terminated. In common kubelet failure modes (e.g., init container CrashLoopBackOff / restart backoff), the current state is Waiting and the failure is recorded under cs.LastTerminationState.Terminated. In that case this helper will return false and the reconciler can still fall into the cs == nil branch, leaving the EphemeralRunner stuck. Consider also inspecting LastTerminationState.Terminated (ideally gated on cs.State.Waiting != nil) so restarts are triggered for restart/backoff scenarios too.

Suggested change
}
}
if cs.State.Waiting != nil &&
cs.LastTerminationState.Terminated != nil &&
cs.LastTerminationState.Terminated.ExitCode != 0 {
return true
}

Copilot uses AI. Check for mistakes.
Comment on lines +354 to +358
case initContainerFailed(pod):
log.Info("Pod has a failed init container, deleting pod as failed so it can be restarted",
"initContainerStatuses", pod.Status.InitContainerStatuses,
)
return ctrl.Result{}, r.deleteEphemeralRunnerOrPod(ctx, ephemeralRunner, pod, log)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the init-container-failure branch, deletePodAsFailed will copy pod.Status.Reason/pod.Status.Message onto the EphemeralRunner status. For Pending pods with init container termination details, those fields are often empty (the detail lives on the init container status), which reduces observability and can overwrite any existing status message with "". Consider deriving a reason/message from the failing init container status (reason/message/exitCode) and surfacing that on the EphemeralRunner before deleting/recreating the pod.

Copilot uses AI. Check for mistakes.
Comment on lines +369 to +383
// Simulate init container failure without PodFailed phase.
// This can happen when the kubelet has not yet transitioned the pod phase.
pod.Status.Phase = corev1.PodPending
pod.Status.InitContainerStatuses = []corev1.ContainerStatus{
{
Name: "setup",
State: corev1.ContainerState{
Terminated: &corev1.ContainerStateTerminated{
ExitCode: 1,
Reason: "StartError",
Message: "failed to create containerd task: context canceled",
},
},
},
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests only cover the init container being in State.Terminated with a non-zero exit code. If the real-world stuck scenario manifests as State.Waiting with the failure captured in LastTerminationState.Terminated (restart/backoff), this change won’t be exercised. Consider adding a test case that sets InitContainerStatuses to Waiting + LastTerminationState.Terminated (non-zero) to ensure the reconciler takes the init-failure branch in that mode too.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants