Fix Claude analysis gating to handle co-failures correctly (#25491)

iangmaia · claude · web-flow · commit 68ff59f389d5 · 2026-04-13T20:28:53.000Z
* Fix Claude analysis gating to handle co-failures correctly

The previous script exited early when any non-essential step (e.g. Danger)
failed, incorrectly skipping Claude analysis even when essential jobs also
failed in the same build.

Now counts non-essential failures first, then queries the Buildkite API for
the total failed job count. Claude is only skipped when all failures are
accounted for by non-essential steps. Fails safe: if the API call fails,
Claude runs anyway.

Also fixes shebang portability, uses a proper Bash array for step keys,
uses YAML literal scalar to preserve prompt formatting, and includes
timed_out jobs in the failure count.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

* Fix Claude analysis gating: use hard_failed and handle all-passed

Two bugs fixed:
- buildkite-agent step get outcome returns "hard_failed", not "failed",
  so the non-essential check never matched
- When all steps passed, the script fell through to uploading Claude
  analysis instead of exiting early

Also restructures the script to query the API first, which simplifies
the flow and avoids a redundant early-exit branch.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

---------

Co-authored-by: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/.buildkite/claude-analysis.yml b/.buildkite/claude-analysis.yml
@@ -24,7 +24,7 @@ steps:
           build_log_mode: "failed"
           trigger: "on-failure"
           max_log_lines: 1500
-          custom_prompt: >
+          custom_prompt: |
             Ignore these jobs — they are not real failures:
             (1) The "Claude Build Analysis" job itself (uses `exit 1` to trigger this hook).
             (2) The "Danger - PR Check" job (external linter, not indicative of build health).
diff --git a/.buildkite/commands/upload-claude-analysis.sh b/.buildkite/commands/upload-claude-analysis.sh
@@ -1,21 +1,50 @@
-#!/bin/bash -eu
+#!/usr/bin/env bash
+
+set -eu
 
 # Uploads the Claude analysis pipeline only when real build/test failures exist.
-# Skips when only non-essential jobs failed (e.g. Danger).
+# Skips when only non-essential jobs failed (e.g. Danger) or nothing failed.
 
 source .buildkite/shared-pipeline-vars
 
+# Query the Buildkite API to get the total count of failed jobs in this build.
+build_info=$(curl -sf \
+  -H "Authorization: Bearer ${BUILDKITE_TOKEN_FOR_CLAUDE:-}" \
+  "https://api.buildkite.com/v2/organizations/${BUILDKITE_ORGANIZATION_SLUG}/pipelines/${BUILDKITE_PIPELINE_SLUG}/builds/${BUILDKITE_BUILD_NUMBER}" \
+  2>/dev/null || echo "")
+
+if [ -z "${build_info}" ]; then
+  echo "Could not fetch build info; assuming real failures exist, running Claude analysis"
+  buildkite-agent pipeline upload .buildkite/claude-analysis.yml
+  exit 0
+fi
+
+total_failures=$(echo "${build_info}" | jq '[.jobs[] | select(.state == "failed" or .state == "timed_out")] | length' 2>/dev/null || echo "-1")
+
+# Nothing failed — skip Claude analysis.
+if [ "${total_failures}" -eq 0 ]; then
+  echo "All steps passed, skipping Claude analysis"
+  exit 0
+fi
+
 # Non-essential steps whose failure alone should NOT trigger Claude analysis.
 # Add step keys here as needed.
-NON_ESSENTIAL_STEPS="danger"
+NON_ESSENTIAL_STEPS=("danger")
 
-for step_key in ${NON_ESSENTIAL_STEPS}; do
-  OUTCOME=$(buildkite-agent step get outcome --step "${step_key}" 2>/dev/null || echo "not_run")
-  if [ "${OUTCOME}" = "failed" ]; then
-    echo "Only non-essential job '${step_key}' failed, skipping Claude analysis"
-    exit 0
+# Count how many non-essential steps have a "hard_failed" outcome.
+non_essential_failures=0
+for step_key in "${NON_ESSENTIAL_STEPS[@]}"; do
+  outcome=$(buildkite-agent step get outcome --step "${step_key}" 2>/dev/null || echo "not_run")
+  if [ "${outcome}" = "hard_failed" ]; then
+    echo "Non-essential job '${step_key}' failed"
+    non_essential_failures=$((non_essential_failures + 1))
   fi
 done
 
+if [ "${total_failures}" -eq "${non_essential_failures}" ]; then
+  echo "Only non-essential job(s) failed (${non_essential_failures}/${total_failures}), skipping Claude analysis"
+  exit 0
+fi
+
 echo "Real failures detected, running Claude analysis"
 buildkite-agent pipeline upload .buildkite/claude-analysis.yml