Fix: Detect init container failure in EphemeralRunner controller by Okabe-Junya · Pull Request #4457 · actions/actions-runner-controller

Okabe-Junya · 2026-04-17T07:55:25Z

What

Add explicit InitContainerStatuses inspection to the EphemeralRunner reconciler so that init container failures are detected even before the pod phase transitions to Failed

Add initContainerFailed() helper that checks pod.Status.InitContainerStatuses for any terminated container with a non-zero exit code
Add a new branch in the reconcile switch statement (between the PodFailed and cs == nil cases) to handle init container failures
Add two test cases: init container failure without job assigned (pod re-creation) and with job assigned (EphemeralRunner deletion)

Why

Ref. #4452

When an init container fails (e.g., Init:StartError due to a containerd runtime error), the runner container never starts, so runnerContainerStatus() returns nil. If the pod phase has not yet transitioned to Failed, the controller falls into the cs == nil branch and logs "Waiting for runner container status to be available" without taking any action. This can cause EphemeralRunners to remain stuck in Pending status indefinitely, consuming runner pool capacity

…onciler

Copilot

Pull request overview

This PR updates the EphemeralRunner reconciler to detect init container failures via pod.Status.InitContainerStatuses so runners don’t get stuck in Pending when the main runner container never starts (and therefore has no ContainerStatus yet).

Changes:

Add initContainerFailed() helper to detect init container failures from pod status.
Add a reconcile branch to treat init-container failure as a pod failure and trigger deletion/restart (or EphemeralRunner deletion if a job is assigned).
Add controller tests covering init container failure with/without a job assigned.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
controllers/actions.github.com/ephemeralrunner_controller.go	Adds init container failure detection and a new reconcile branch to restart/cleanup stuck runners earlier.
controllers/actions.github.com/ephemeralrunner_controller_test.go	Adds test cases for init container failure handling paths (recreate pod vs delete EphemeralRunner).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-17T07:59:13Z

+		cs := &pod.Status.InitContainerStatuses[i]
+		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
+			return true
+		}


initContainerFailed only checks cs.State.Terminated. In common kubelet failure modes (e.g., init container CrashLoopBackOff / restart backoff), the current state is Waiting and the failure is recorded under cs.LastTerminationState.Terminated. In that case this helper will return false and the reconciler can still fall into the cs == nil branch, leaving the EphemeralRunner stuck. Consider also inspecting LastTerminationState.Terminated (ideally gated on cs.State.Waiting != nil) so restarts are triggered for restart/backoff scenarios too.

Suggested change

}

}

if cs.State.Waiting != nil &&

cs.LastTerminationState.Terminated != nil &&

cs.LastTerminationState.Terminated.ExitCode != 0 {

return true

}

Copilot · 2026-04-17T07:59:13Z

+	case initContainerFailed(pod):
+		log.Info("Pod has a failed init container, deleting pod as failed so it can be restarted",
+			"initContainerStatuses", pod.Status.InitContainerStatuses,
+		)
+		return ctrl.Result{}, r.deleteEphemeralRunnerOrPod(ctx, ephemeralRunner, pod, log)


In the init-container-failure branch, deletePodAsFailed will copy pod.Status.Reason/pod.Status.Message onto the EphemeralRunner status. For Pending pods with init container termination details, those fields are often empty (the detail lives on the init container status), which reduces observability and can overwrite any existing status message with "". Consider deriving a reason/message from the failing init container status (reason/message/exitCode) and surfacing that on the EphemeralRunner before deleting/recreating the pod.

Copilot · 2026-04-17T07:59:13Z

+			// Simulate init container failure without PodFailed phase.
+			// This can happen when the kubelet has not yet transitioned the pod phase.
+			pod.Status.Phase = corev1.PodPending
+			pod.Status.InitContainerStatuses = []corev1.ContainerStatus{
+				{
+					Name: "setup",
+					State: corev1.ContainerState{
+						Terminated: &corev1.ContainerStateTerminated{
+							ExitCode: 1,
+							Reason:   "StartError",
+							Message:  "failed to create containerd task: context canceled",
+						},
+					},
+				},
+			}


The new tests only cover the init container being in State.Terminated with a non-zero exit code. If the real-world stuck scenario manifests as State.Waiting with the failure captured in LastTerminationState.Terminated (restart/backoff), this change won’t be exercised. Consider adding a test case that sets InitContainerStatuses to Waiting + LastTerminationState.Terminated (non-zero) to ensure the reconciler takes the init-failure branch in that mode too.

fix(controller): detect init container failure in EphemeralRunner rec…

e0a6af4

…onciler

Okabe-Junya requested a review from mumoshu as a code owner April 17, 2026 07:55

Copilot AI review requested due to automatic review settings April 17, 2026 07:55

Okabe-Junya requested review from a team, Steve-Glass, nikola-jokic, rentziass and toast-gear as code owners April 17, 2026 07:55

Copilot started reviewing on behalf of Okabe-Junya April 17, 2026 07:55 View session

Copilot AI reviewed Apr 17, 2026

View reviewed changes

Okabe-Junya mentioned this pull request Apr 17, 2026

EphemeralRunnerSet controller does not clean up runners stuck in init container failure #4452

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Detect init container failure in EphemeralRunner controller#4457

Fix: Detect init container failure in EphemeralRunner controller#4457
Okabe-Junya wants to merge 1 commit intoactions:masterfrom
mercari:Okabe-Junya/fix/init-container-failure-detection

Okabe-Junya commented Apr 17, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Copilot AI Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-		}
+		}
+		if cs.State.Waiting != nil &&
+			cs.LastTerminationState.Terminated != nil &&
+			cs.LastTerminationState.Terminated.ExitCode != 0 {
+			return true
+		}

Conversation

Okabe-Junya commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Okabe-Junya commented Apr 17, 2026 •

edited

Loading