Fix: Detect init container failure in EphemeralRunner controller #4457
Okabe-Junya wants to merge 1 commit into actions:master
Conversation
Pull request overview
This PR updates the EphemeralRunner reconciler to detect init container failures via pod.Status.InitContainerStatuses so runners don’t get stuck in Pending when the main runner container never starts (and therefore has no ContainerStatus yet).
Changes:
- Add `initContainerFailed()` helper to detect init container failures from pod status.
- Add a reconcile branch to treat init-container failure as a pod failure and trigger pod deletion/restart (or EphemeralRunner deletion if a job is assigned).
- Add controller tests covering init container failure with and without a job assigned.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| controllers/actions.github.com/ephemeralrunner_controller.go | Adds init container failure detection and a new reconcile branch to restart/cleanup stuck runners earlier. |
| controllers/actions.github.com/ephemeralrunner_controller_test.go | Adds test cases for init container failure handling paths (recreate pod vs delete EphemeralRunner). |
```go
cs := &pod.Status.InitContainerStatuses[i]
if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
	return true
}
```
`initContainerFailed` only checks `cs.State.Terminated`. In common kubelet failure modes (e.g., init container CrashLoopBackOff / restart backoff), the current state is `Waiting` and the failure is recorded under `cs.LastTerminationState.Terminated`. In that case this helper will return `false` and the reconciler can still fall into the `cs == nil` branch, leaving the EphemeralRunner stuck. Consider also inspecting `LastTerminationState.Terminated` (ideally gated on `cs.State.Waiting != nil`) so restarts are triggered for restart/backoff scenarios too.
```go
		if cs.State.Waiting != nil &&
			cs.LastTerminationState.Terminated != nil &&
			cs.LastTerminationState.Terminated.ExitCode != 0 {
			return true
		}
	}
}
```
```go
case initContainerFailed(pod):
	log.Info("Pod has a failed init container, deleting pod as failed so it can be restarted",
		"initContainerStatuses", pod.Status.InitContainerStatuses,
	)
	return ctrl.Result{}, r.deleteEphemeralRunnerOrPod(ctx, ephemeralRunner, pod, log)
```
In the init-container-failure branch, `deletePodAsFailed` will copy `pod.Status.Reason`/`pod.Status.Message` onto the EphemeralRunner status. For `Pending` pods with init container termination details, those fields are often empty (the detail lives on the init container status), which reduces observability and can overwrite any existing status message with `""`. Consider deriving a reason/message from the failing init container status (reason/message/exit code) and surfacing that on the EphemeralRunner before deleting/recreating the pod.
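One way to act on this suggestion is a small helper that pulls reason/message from the first failed init container. The helper name and message format below are hypothetical (not part of the PR), and stand-in types replace `k8s.io/api/core/v1` so the sketch runs standalone:

```go
package main

import "fmt"

// Stand-ins for the corev1 status types (assumed shapes; field names
// mirror k8s.io/api/core/v1).
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name                 string
	State                ContainerState
	LastTerminationState ContainerState
}

// initContainerFailureMessage is a hypothetical helper that derives a
// reason/message from the first failed init container, falling back to
// LastTerminationState for the restart/backoff case, so the detail can
// be surfaced on the EphemeralRunner status before the pod is deleted.
func initContainerFailureMessage(statuses []ContainerStatus) (reason, message string, ok bool) {
	for i := range statuses {
		cs := &statuses[i]
		t := cs.State.Terminated
		if t == nil {
			t = cs.LastTerminationState.Terminated
		}
		if t != nil && t.ExitCode != 0 {
			reason = t.Reason
			message = fmt.Sprintf("init container %q failed with exit code %d: %s",
				cs.Name, t.ExitCode, t.Message)
			return reason, message, true
		}
	}
	return "", "", false
}

func main() {
	statuses := []ContainerStatus{{
		Name: "setup",
		State: ContainerState{Terminated: &ContainerStateTerminated{
			ExitCode: 1,
			Reason:   "StartError",
			Message:  "failed to create containerd task: context canceled",
		}},
	}}
	r, m, ok := initContainerFailureMessage(statuses)
	fmt.Println(ok, r)
	fmt.Println(m)
}
```

The reconciler could then write this reason/message to the EphemeralRunner status instead of copying the often-empty `pod.Status.Reason`/`pod.Status.Message`.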
```go
// Simulate init container failure without PodFailed phase.
// This can happen when the kubelet has not yet transitioned the pod phase.
pod.Status.Phase = corev1.PodPending
pod.Status.InitContainerStatuses = []corev1.ContainerStatus{
	{
		Name: "setup",
		State: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{
				ExitCode: 1,
				Reason:   "StartError",
				Message:  "failed to create containerd task: context canceled",
			},
		},
	},
}
```
The new tests only cover the init container being in `State.Terminated` with a non-zero exit code. If the real-world stuck scenario manifests as `State.Waiting` with the failure captured in `LastTerminationState.Terminated` (restart/backoff), this change won't be exercised. Consider adding a test case that sets `InitContainerStatuses` to `Waiting` + `LastTerminationState.Terminated` (non-zero) to ensure the reconciler takes the init-failure branch in that mode too.
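A test case for the backoff mode could construct the status as sketched below (stand-in types replace `corev1` so the snippet runs standalone; a real test would build a `corev1.ContainerStatus` and drive the reconciler):

```go
package main

import "fmt"

// Stand-ins for the corev1 types used in the test (assumed shapes;
// field names mirror k8s.io/api/core/v1).
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerStateWaiting struct {
	Reason string
}

type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name                 string
	State                ContainerState
	LastTerminationState ContainerState
}

func main() {
	// Init container stuck in restart backoff: current state is Waiting,
	// and the failure is recorded in LastTerminationState rather than
	// State.Terminated.
	status := ContainerStatus{
		Name:  "setup",
		State: ContainerState{Waiting: &ContainerStateWaiting{Reason: "CrashLoopBackOff"}},
		LastTerminationState: ContainerState{Terminated: &ContainerStateTerminated{
			ExitCode: 1,
			Reason:   "StartError",
			Message:  "failed to create containerd task: context canceled",
		}},
	}
	failed := status.State.Waiting != nil &&
		status.LastTerminationState.Terminated != nil &&
		status.LastTerminationState.Terminated.ExitCode != 0
	fmt.Println(failed) // prints "true"
}
```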
What
Add explicit `InitContainerStatuses` inspection to the EphemeralRunner reconciler so that init container failures are detected even before the pod phase transitions to `Failed`:
- Add an `initContainerFailed()` helper that checks `pod.Status.InitContainerStatuses` for any terminated container with a non-zero exit code
- Add a reconcile branch (between the `PodFailed` and `cs == nil` cases) to handle init container failures
Why
Ref. #4452
When an init container fails (e.g., `Init:StartError` due to a containerd runtime error), the runner container never starts, so `runnerContainerStatus()` returns `nil`. If the pod phase has not yet transitioned to `Failed`, the controller falls into the `cs == nil` branch and logs "Waiting for runner container status to be available" without taking any action. This can cause EphemeralRunners to remain stuck in `Pending` status indefinitely, consuming runner pool capacity.