Skip to content

Secure discovery boot data with jwt signed tokens#750

Open
stefanhipfel wants to merge 7 commits intomainfrom
issue_749
Open

Secure discovery boot data with jwt signed tokens#750
stefanhipfel wants to merge 7 commits intomainfrom
issue_749

Conversation

@stefanhipfel
Copy link
Copy Markdown
Contributor

@stefanhipfel stefanhipfel commented Mar 23, 2026

Proposed Changes

This PR implements JWT-signed discovery tokens to secure the discovery boot data flow between metalprobe agents and the registry server.

Key Changes

Security Implementation:

  • JWT-based authentication - Implements RFC 7519 standard JWT tokens with HMAC-SHA256 (HS256) signing
  • Signing secret management - Controller creates and manages a 32-byte shared secret in Kubernetes
  • Token generation - Controller generates signed tokens for each server, embedded in ignition configs
  • Token validation - Registry validates tokens on /register and /bootstate endpoints
  • 1-hour token lifetime - Balances security (limited exposure window) with operational flexibility
  • Automatic expiration - JWT library enforces expiration validation

Code Changes:

  • Add internal/token/ package for JWT token generation and verification
  • Update controller to generate per-server discovery tokens
  • Update registry to validate discovery tokens
  • Update metalprobe to send discovery tokens in API requests
  • Add comprehensive unit tests for token functionality
  • Add integration tests validating end-to-end token flow

Documentation:

  • Add docs/discovery-token-security.md - Complete operational guide covering:
    • Token format and security properties
    • Secret lifecycle management
    • Troubleshooting and monitoring
    • Testing procedures

Dependencies:

  • Add github.com/golang-jwt/jwt/v5 - Industry-standard JWT library

Review Feedback Addressed

All CodeRabbitAI feedback has been addressed in commit 986d7cc:

  • ✅ Fixed probe authentication retry logic bug
  • ✅ Fixed registry secret loading race condition
  • ✅ Fixed controller secret creation race
  • ✅ Fixed test race condition with bcrypt hashes
  • ✅ Improved error logging
  • ✅ Added UUID validation
  • ✅ Added test mode security warning

Testing

  • ✅ Unit tests: 85/85 passing
  • ✅ Token tests: 16/16 passing (100% coverage)
  • ✅ Probe tests: All passing
  • ✅ Registry tests: All passing
  • ✅ Controller tests: All passing
  • ✅ Race detector: Clean

Verification

The implementation provides:

  • Authentication: Only controller can generate valid tokens (requires signing secret)
  • Integrity: Tokens cannot be tampered with (signature verification)
  • Binding: Each token bound to specific systemUUID
  • Freshness: Tokens expire after 1 hour
  • Standard format: Industry-standard JWT (RFC 7519)

Fixes #749

Summary by CodeRabbit

  • New Features

    • Added JWT-based discovery token authentication to secure server registration and validation endpoints, preventing unauthorized server discovery.
  • Documentation

    • Added security documentation covering token lifecycle, Kubernetes secret management, operational best practices, troubleshooting, and monitoring recommendations.

@stefanhipfel stefanhipfel marked this pull request as ready for review March 24, 2026 12:00
@stefanhipfel stefanhipfel requested a review from a team as a code owner March 24, 2026 12:00
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Controller generates per-server HMAC‑HS256 discovery tokens stored in a Kubernetes Secret, injects them into ignition; metalprobe presents the token to registry endpoints; registry loads the signing secret and validates tokens on /register and /bootstate, rejecting invalid or mismatched tokens.

Changes

Cohort / File(s) Summary
Token package & constants
internal/token/token.go, internal/token/constants.go
New token package: 32‑byte signing secret generation, HS256 JWT generation/verification, and exported constants for secret name/key.
Token tests
internal/token/token_test.go
Add unit tests for secret generation, token creation/verification, expiry, tampering, algorithm checks, and error cases.
Registry API payloads
internal/api/registry/registry.go
Add DiscoveryToken field to RegistrationPayload and BootstatePayload.
Registry server & tests
internal/registry/server.go, internal/registry/registry_suite_test.go
Server stores/caches signing secret, new helpers to load/validate secret and tokens; NewServer accepts signing-secret name+namespace; /register and /bootstate enforce token validation and UUID binding (return 401 on auth failures).
Controller: ignition & secret management
internal/controller/server_controller.go, internal/controller/server_controller_test.go, internal/controller/suite_test.go
Controller adds getOrCreateSigningSecret, generates signed discovery tokens per server, injects --discovery-token=<token> into probe args; tests updated to generate/extract tokens and pass signing-secret name/namespace.
Probe agent & tests
internal/probe/probe.go, internal/probe/probe_suite_test.go
Agent adds DiscoveryToken field and NewAgent parameter; registration payload includes token; registerServer treats 401/403 as authentication failures (stop retries).
CLI / entrypoints
cmd/main.go, cmd/metalprobe/main.go
Registry constructor calls extended to pass signing-secret name and namespace; metalprobe adds --discovery-token flag and fails startup if missing.
Docs
docs/discovery-token-security.md
New documentation describing token format, lifecycle, secret management, validation semantics, ops guidance, troubleshooting, and testing.
Test/wiring updates
internal/.../*_suite_test.go, internal/controller/*_test.go, internal/probe/*_test.go
Multiple test call-sites updated to pass additional parameters or test tokens and to assert token-related behavior.
Dependencies
go.mod
Add github.com/golang-jwt/jwt/v5 dependency.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes
    participant Ctrl as Controller
    participant Probe as Metalprobe
    participant Registry as Registry Server

    Ctrl->>K8s: getOrCreateSigningSecret()
    K8s-->>Ctrl: signing-key (32 bytes)
    Ctrl->>Ctrl: GenerateSignedDiscoveryToken(signing-key, systemUUID)
    Ctrl-->>K8s: Apply ignition with --discovery-token=token
    K8s-->>Probe: Boot with token flag

    Probe->>Registry: POST /register (DiscoveryToken)
    Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
    alt Token Valid
        Registry-->>Probe: 200 OK
    else Token Invalid/Missing
        Registry-->>Probe: 401 Unauthorized
    end

    Probe->>Registry: POST /bootstate (DiscoveryToken)
    Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
    alt Token Valid
        Registry-->>Probe: 200 OK
    else Token Invalid/Missing
        Registry-->>Probe: 401 Unauthorized
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement, api-change, highlight

Suggested reviewers

  • afritzler
  • xkonni
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the primary change: implementing JWT-signed tokens to secure discovery boot data, which is the main focus of this PR.
Description check ✅ Passed The description is comprehensive and well-structured, covering key changes, testing, verification, and addressing feedback. All critical details are present and properly organized.
Linked Issues check ✅ Passed The PR fully implements all acceptance criteria from issue #749: per-server token generation via controller [token.go], secret storage in Kubernetes [server_controller.go], ignition delivery [server_controller.go, cmd/main.go], registry validation [server.go], and rejection of invalid tokens [server.go].
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #749 objectives. The token package, controller modifications, registry updates, documentation, and tests all serve the goal of securing discovery boot data with per-server tokens.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue_749

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (7)
cmd/main.go (1)

627-633: Consider extracting the hardcoded secret name to a shared constant.

The secret name "discovery-token-signing-secret" is duplicated here and in server_controller.go (line 723). Extracting it to a shared constant would prevent accidental drift between the controller and registry server configurations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 627 - 633, The secret name
"discovery-token-signing-secret" is duplicated; extract it to a shared constant
and use that constant wherever the literal is used. Define a package-level
constant (e.g., DiscoveryTokenSigningSecret) in a common package/file imported
by both the registry setup and the controller, then replace the literal in the
registry.NewServer call and the usage in server_controller.go (and any other
occurrences) with that constant to ensure a single source of truth.
internal/token/token_test.go (2)

163-173: Test comment is misleading - this doesn't verify constant-time behavior.

The comment says "HMAC provides constant-time comparison internally" but the test only verifies that token verification works, not timing properties. Timing-safe comparison cannot be tested this way.

Consider renaming or updating the test description to reflect what it actually tests.

📝 Suggested description improvement
-	It("should use constant-time comparison (HMAC)", func() {
-		// HMAC provides constant-time comparison internally
-		// This test verifies the signature mechanism works
+	It("should verify tokens using HMAC signature mechanism", func() {
+		// Note: HMAC uses hmac.Equal internally for constant-time comparison
+		// This test verifies basic signature verification works
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 163 - 173, The test description is
misleading about timing; update the It description under Context("Token
Security") to reflect that it verifies signature correctness rather than
constant-time behavior: change the It("should use constant-time comparison
(HMAC)") text to something like It("verifies signed discovery token using HMAC
signature") and/or update the comment above the test; keep the test body using
GenerateSignedDiscoveryToken and VerifySignedDiscoveryToken unchanged so it
still asserts successful generation and valid verification.

67-76: Consider using explicit time manipulation instead of relying on timing.

The Eventually pattern here relies on wall-clock time passing to generate a different timestamp. This could be flaky in fast CI environments or if the timestamp resolution changes. Consider either:

  1. Mocking time in the token package
  2. Simply accepting that same-second tokens may be identical (expected behavior)

The current test is technically correct but may cause intermittent failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 67 - 76, Test relies on wall-clock
progress and can flake; update the code or test to control time
deterministically: either add a time provider to GenerateSignedDiscoveryToken
(e.g., inject a nowFunc or accept a timestamp parameter) and in the test call
GenerateSignedDiscoveryToken with two different timestamps to assert tokens
differ, or change the test to assert tokens may be equal within the same second
and avoid Eventually. Locate GenerateSignedDiscoveryToken and the token_test.go
case and implement the time-injection approach (introduce nowFunc or timestamp
arg) or alter the test to compare tokens produced with explicit different times
instead of using Eventually.
internal/controller/server_controller_test.go (2)

437-446: Fragile string parsing for token extraction.

This manual string parsing with IndexAny and hardcoded delimiters is brittle. If the ignition format changes slightly (e.g., different quoting or whitespace), this could break silently.

Consider using a regex pattern for more robust extraction, or better yet, parse the YAML structure to get the systemd unit contents and then extract the flag value.

♻️ Suggested regex-based extraction
-		// Extract token using regex or string parsing
-		tokenStart := strings.Index(contents, "--discovery-token=")
-		Expect(tokenStart).To(BeNumerically(">", 0))
-		tokenValueStart := tokenStart + len("--discovery-token=")
-		tokenEnd := strings.IndexAny(contents[tokenValueStart:], " \n\t\"'")
-		var actualToken string
-		if tokenEnd == -1 {
-			actualToken = contents[tokenValueStart:]
-		} else {
-			actualToken = contents[tokenValueStart : tokenValueStart+tokenEnd]
-		}
-		Expect(actualToken).ToNot(BeEmpty())
+		// Extract token using regex for more robust parsing
+		tokenRegex := regexp.MustCompile(`--discovery-token=([A-Za-z0-9_-]+)`)
+		matches := tokenRegex.FindStringSubmatch(contents)
+		Expect(matches).To(HaveLen(2), "should find discovery token in ignition")
+		actualToken := matches[1]
+		Expect(actualToken).ToNot(BeEmpty())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller_test.go` around lines 437 - 446, The
current brittle manual parsing of contents (variables tokenStart,
tokenValueStart, tokenEnd, actualToken) using strings.Index and IndexAny should
be replaced with a more robust extraction: use a regex to locate
--discovery-token=<value> (or parse the YAML/ignition to get the systemd unit
content first) and capture the token value, or parse the YAML structure to
retrieve the unit and then extract the flag; update the test to use that
regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic
so variations in quoting/whitespace won’t break the extraction.

271-289: Consider extracting repeated signing secret retrieval into a helper function.

The pattern of fetching the signing secret and generating a token is duplicated across multiple tests (here, lines 506-522, and lines 853-867). This could be extracted into a test helper to reduce duplication and improve maintainability.

♻️ Suggested helper function
// Add this helper near the top of the test file or in a test utilities file
func getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns, systemUUID string) string {
	signingSecret := &v1.Secret{}
	Eventually(func() error {
		return k8sClient.Get(ctx, client.ObjectKey{
			Name:      "discovery-token-signing-secret",
			Namespace: ns,
		}, signingSecret)
	}).Should(Succeed())

	signingKey := signingSecret.Data["signing-key"]
	Expect(signingKey).To(HaveLen(32))

	token, err := metaltoken.GenerateSignedDiscoveryToken(signingKey, systemUUID)
	Expect(err).NotTo(HaveOccurred())
	return token
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller_test.go` around lines 271 - 289, The
test duplicates code to fetch the "discovery-token-signing-secret" and generate
a signed discovery token; extract that logic into a helper function (e.g.,
getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns string,
systemUUID string) string) placed near the top of the test file or a test
utilities file, have it perform the Eventually k8sClient.Get into a v1.Secret,
assert signingKey length and GenerateSignedDiscoveryToken
(metaltoken.GenerateSignedDiscoveryToken), return the token, and then replace
the inline blocks in the tests (the creation of signingSecret, signingKey check,
metaltoken.GenerateSignedDiscoveryToken call, and probe.NewAgent usage sites)
with calls to getSignedDiscoveryToken to reduce duplication and keep assertions
intact.
internal/token/token.go (1)

86-97: Manual delimiter parsing could use strings.Index or strings.Split.

The manual loop to find || can be simplified using strings.Index:

♻️ Simplified delimiter parsing
-	// Parse by splitting on "||"
-	firstDelim := -1
-	for i := 0; i < len(payload)-1; i++ {
-		if payload[i] == '|' && payload[i+1] == '|' {
-			firstDelim = i
-			break
-		}
-	}
-
-	if firstDelim == -1 {
+	// Parse by finding "||" delimiter
+	firstDelim := strings.Index(payload, "||")
+	if firstDelim == -1 {
		return "", 0, false, nil // Invalid format
	}

Note: You would need to add "strings" to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token.go` around lines 86 - 97, Replace the manual loop that
searches for the "||" delimiter on the payload with strings.Index(payload, "||")
(or strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update
the code paths that use firstDelim (the existing early-return when delimiter not
found and subsequent substring logic) to use the Index/Split result, and add
"strings" to the imports. This keeps the same behavior (returning "", 0, false,
nil when not found) but removes the manual byte-by-byte scan and improves
readability around the payload delimiter handling.
docs/discovery-token-security.md (1)

13-15: Add language specifiers to fenced code blocks.

Static analysis flagged these code blocks as missing language specifiers. Adding text or plaintext for the token format and ASCII diagram would satisfy the linter and improve rendering in some viewers.

📝 Suggested fix
-```
+```text
 token = base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))

And for the sequence diagram:
```diff
-```
+```text
 ┌──────────────┐                  ┌──────────┐                  ┌──────────┐

Also applies to: 52-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 13 - 15, Update the fenced
code blocks that show the token format (the block containing "token =
base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))")
and the ASCII sequence diagram (the block starting with the box characters
"┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or
"plaintext" after the opening backticks; apply the same change to the other code
fences noted in the comment range (the blocks around lines 52–77) so the linter
accepts them and viewers render them as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 317-321: The manual command uses "openssl rand -base64 32" which
base64-encodes the 32 bytes and causes the stored signing-key to be ~44 bytes
after Kubernetes base64-ing it; instead generate 32 raw bytes, write them to a
temporary file, create the Kubernetes secret discovery-token-signing-secret
using --from-file=signing-key=<tempfile> so the exact 32-byte value is stored,
and then remove the temporary file; references: discovery-token-signing-secret,
signing-key, and avoid using openssl rand -base64 32.

In `@internal/controller/server_controller.go`:
- Around line 720-767: getOrCreateSigningSecret has two issues: concurrent
reconciles can race on creating the secret and invalid secrets are never
recovered. Modify getOrCreateSigningSecret to: when an existing secret is found
validate the "signing-key" and if invalid attempt to replace it (generate a new
key and Update/Patch the secret, handling Conflict by re-getting and
re-validating); when Create returns AlreadyExists after generating a new secret,
treat it as a benign race by fetching the newly-created secret and
validating/returning its signing-key instead of failing; in all fetch/update
paths handle transient IsNotFound/Conflict/AlreadyExists errors by re-fetching
and retrying validation once before returning an error. Reference:
getOrCreateSigningSecret, secret.Data["signing-key"], r.Create, r.Get, and use
metaltoken.GenerateSigningSecret for regeneration.

In `@internal/probe/probe.go`:
- Around line 214-219: The auth-error branch currently returns (false, nil)
which signals "not done, no error" so ExponentialBackoffWithContext keeps
retrying; change it to return a non-nil error to stop retries. In the block that
logs the auth failure (the a.log.Error call that checks resp.StatusCode ==
http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden), return a
terminal error (e.g., errors.New or fmt.Errorf with the status code and context)
instead of nil so ExponentialBackoffWithContext will cease retries.

In `@internal/registry/server.go`:
- Around line 89-96: The current validateDiscoveryToken function blindly allows
requests when s.k8sClient == nil which is risky; change the logic to require an
explicit test-mode flag instead of inferring from s.k8sClient by adding a Server
field like testMode (bool) and only bypass validation when testMode is true, and
also add a startup-time log warning if testMode is enabled; update
validateDiscoveryToken to check s.testMode (and still log that validation is
skipped) and ensure any creation/initialization paths set testMode explicitly
rather than relying on a nil k8sClient.
- Around line 104-133: validateDiscoveryToken performs lazy loading of
s.signingSecret without synchronization, causing a data race; add a synchronized
one-time loader and protect writes: introduce fields on Server (e.g.,
signingSecretOnce sync.Once and signingSecretMu sync.Mutex) and implement a
helper method loadSigningSecret() that uses signingSecretOnce.Do to fetch the
Secret via s.k8sClient, validate secret.Data["signing-key"], then assign
s.signingSecret while holding signingSecretMu; call loadSigningSecret() from
validateDiscoveryToken and only read s.signingSecret under the mutex or after
the Once to ensure safe concurrent access.

In `@internal/token/token.go`:
- Around line 40-42: Validate the systemUUID before constructing the payload to
prevent delimiter injection: in the function that builds the payload (where
systemUUID, timestamp, and payload are used) add an RFC‑4122 UUID regex check
against systemUUID and return an error if it doesn't match, so payload :=
fmt.Sprintf("%s||%d", systemUUID, timestamp) only runs for a valid UUID;
alternatively you may replace the delimiter approach with length‑prefixing or a
null byte separator and update the corresponding parsing logic (the token
parsing code referenced at lines 86-99) to use the new format.

---

Nitpick comments:
In `@cmd/main.go`:
- Around line 627-633: The secret name "discovery-token-signing-secret" is
duplicated; extract it to a shared constant and use that constant wherever the
literal is used. Define a package-level constant (e.g.,
DiscoveryTokenSigningSecret) in a common package/file imported by both the
registry setup and the controller, then replace the literal in the
registry.NewServer call and the usage in server_controller.go (and any other
occurrences) with that constant to ensure a single source of truth.

In `@docs/discovery-token-security.md`:
- Around line 13-15: Update the fenced code blocks that show the token format
(the block containing "token =
base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))")
and the ASCII sequence diagram (the block starting with the box characters
"┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or
"plaintext" after the opening backticks; apply the same change to the other code
fences noted in the comment range (the blocks around lines 52–77) so the linter
accepts them and viewers render them as plain text.

In `@internal/controller/server_controller_test.go`:
- Around line 437-446: The current brittle manual parsing of contents (variables
tokenStart, tokenValueStart, tokenEnd, actualToken) using strings.Index and
IndexAny should be replaced with a more robust extraction: use a regex to locate
--discovery-token=<value> (or parse the YAML/ignition to get the systemd unit
content first) and capture the token value, or parse the YAML structure to
retrieve the unit and then extract the flag; update the test to use that
regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic
so variations in quoting/whitespace won’t break the extraction.
- Around line 271-289: The test duplicates code to fetch the
"discovery-token-signing-secret" and generate a signed discovery token; extract
that logic into a helper function (e.g., getSignedDiscoveryToken(ctx
context.Context, k8sClient client.Client, ns string, systemUUID string) string)
placed near the top of the test file or a test utilities file, have it perform
the Eventually k8sClient.Get into a v1.Secret, assert signingKey length and
GenerateSignedDiscoveryToken (metaltoken.GenerateSignedDiscoveryToken), return
the token, and then replace the inline blocks in the tests (the creation of
signingSecret, signingKey check, metaltoken.GenerateSignedDiscoveryToken call,
and probe.NewAgent usage sites) with calls to getSignedDiscoveryToken to reduce
duplication and keep assertions intact.

In `@internal/token/token_test.go`:
- Around line 163-173: The test description is misleading about timing; update
the It description under Context("Token Security") to reflect that it verifies
signature correctness rather than constant-time behavior: change the It("should
use constant-time comparison (HMAC)") text to something like It("verifies signed
discovery token using HMAC signature") and/or update the comment above the test;
keep the test body using GenerateSignedDiscoveryToken and
VerifySignedDiscoveryToken unchanged so it still asserts successful generation
and valid verification.
- Around line 67-76: Test relies on wall-clock progress and can flake; update
the code or test to control time deterministically: either add a time provider
to GenerateSignedDiscoveryToken (e.g., inject a nowFunc or accept a timestamp
parameter) and in the test call GenerateSignedDiscoveryToken with two different
timestamps to assert tokens differ, or change the test to assert tokens may be
equal within the same second and avoid Eventually. Locate
GenerateSignedDiscoveryToken and the token_test.go case and implement the
time-injection approach (introduce nowFunc or timestamp arg) or alter the test
to compare tokens produced with explicit different times instead of using
Eventually.

In `@internal/token/token.go`:
- Around line 86-97: Replace the manual loop that searches for the "||"
delimiter on the payload with strings.Index(payload, "||") (or
strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update the
code paths that use firstDelim (the existing early-return when delimiter not
found and subsequent substring logic) to use the Index/Split result, and add
"strings" to the imports. This keeps the same behavior (returning "", 0, false,
nil when not found) but removes the manual byte-by-byte scan and improves
readability around the payload delimiter handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6424543f-9838-4804-8743-f03cb9c7032c

📥 Commits

Reviewing files that changed from the base of the PR and between 8937145 and f697e38.

📒 Files selected for processing (13)
  • cmd/main.go
  • cmd/metalprobe/main.go
  • docs/discovery-token-security.md
  • internal/api/registry/registry.go
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/controller/suite_test.go
  • internal/probe/probe.go
  • internal/probe/probe_suite_test.go
  • internal/registry/registry_suite_test.go
  • internal/registry/server.go
  • internal/token/token.go
  • internal/token/token_test.go

Comment thread docs/discovery-token-security.md Outdated
Comment thread internal/controller/server_controller.go
Comment thread internal/probe/probe.go
Comment on lines +89 to +96
func (s *Server) validateDiscoveryToken(receivedToken string) (string, bool) {
// Skip validation in unit test mode (no k8s client)
if s.k8sClient == nil {
s.log.V(1).Info("Skipping token validation (no K8s client - unit test mode)")
// In test mode, extract systemUUID from token if possible, otherwise return empty
// For now, just allow all requests in test mode
return "", true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Bypassing authentication in test mode is risky.

When k8sClient == nil, all token validation is skipped and requests are allowed. While this is convenient for unit tests, it creates a risk if someone accidentally deploys with a nil client or misconfigures the server.

Consider logging a warning at startup if running in test mode, or requiring an explicit flag to enable test mode rather than inferring it from client presence.

📝 Suggested improvement
 func NewServer(logger logr.Logger, addr string, k8sClient client.Client, signingSecretName, signingSecretNamespace string) *Server {
+	if k8sClient == nil {
+		logger.Info("WARNING: Running without K8s client - token validation DISABLED (test mode only)")
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 89 - 96, The current
validateDiscoveryToken function blindly allows requests when s.k8sClient == nil
which is risky; change the logic to require an explicit test-mode flag instead
of inferring from s.k8sClient by adding a Server field like testMode (bool) and
only bypass validation when testMode is true, and also add a startup-time log
warning if testMode is enabled; update validateDiscoveryToken to check
s.testMode (and still log that validation is skipped) and ensure any
creation/initialization paths set testMode explicitly rather than relying on a
nil k8sClient.

Comment thread internal/registry/server.go Outdated
Comment thread internal/token/token.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
docs/discovery-token-security.md (1)

339-342: ⚠️ Potential issue | 🟠 Major

Manual secret creation command still produces incorrect key length.

As previously flagged, openssl rand -base64 32 generates 32 random bytes and base64-encodes them to ~44 characters. When stored via --from-literal, kubectl base64-encodes this again, so the decoded signing-key will be 44 bytes, not the required 32 bytes.

🔧 Corrected command
-# Or manually create the secret
-kubectl create secret generic discovery-token-signing-secret \
-  --from-literal=signing-key=$(openssl rand -base64 32) \
-  -n metal-operator-system
+# Or manually create the secret (generate 32 raw bytes)
+kubectl create secret generic discovery-token-signing-secret \
+  --from-file=signing-key=<(openssl rand 32) \
+  -n metal-operator-system
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 339 - 342, The current command
uses openssl rand -base64 32 with --from-literal which yields a 44-byte base64
string that kubectl base64-encodes again, producing an incorrect 44-byte decoded
signing-key; instead generate raw 32 bytes to a file and create the secret from
that file so Kubernetes stores exactly 32 bytes: run openssl rand 32 >
signing-key.bin and then kubectl create secret generic
discovery-token-signing-secret --from-file=signing-key=signing-key.bin -n
metal-operator-system (then securely remove signing-key.bin); update the
documented command that references openssl rand -base64 32, --from-literal, and
signing-key accordingly.
🧹 Nitpick comments (2)
internal/token/token_test.go (1)

69-78: Consider using time.Sleep or handling the error in Eventually.

The test swallows the error from GenerateSignedDiscoveryToken on line 75. If token generation fails, the test would hang until timeout rather than fail fast with a clear error.

♻️ Proposed fix to handle error explicitly
 		It("should generate different tokens at different times", func() {
 			token1, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
 			Expect(err).NotTo(HaveOccurred())

-			// Small delay to ensure different timestamp
-			Eventually(func() string {
-				token2, _ := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
-				return token2
-			}).ShouldNot(Equal(token1))
+			// Wait to ensure different timestamp (JWT uses second precision)
+			time.Sleep(1100 * time.Millisecond)
+			
+			token2, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(token2).NotTo(Equal(token1))
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 69 - 78, The test swallows the
error from GenerateSignedDiscoveryToken inside Eventually causing silent
failures; update the "should generate different tokens at different times" test
to handle errors explicitly — either insert a short time.Sleep between calls and
call GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred())
before comparing token1 vs token2, or change the Eventually invocation to call a
helper that returns (string, error) and assert on the error immediately;
reference GenerateSignedDiscoveryToken, token1/token2 and signingSecret when
making the change so the test fails fast on token generation errors instead of
hanging.
docs/discovery-token-security.md (1)

13-15: Add language specifiers to fenced code blocks for better rendering.

The markdown linter flags code blocks without language specifiers. For non-code text blocks (token format examples and ASCII diagrams), use the text specifier.

📐 Proposed fix
-```
+```text
 token = header.payload.signature

JWT Structure:

- +text
..

And for the ASCII diagram around line 68:

-```
+```text
 ┌──────────────┐                  ┌──────────┐                  ┌──────────┐
 │  Controller  │                  │ Registry │                  │  Server  │
 └──────┬───────┘                  └────┬─────┘                  └────┬─────┘

Also applies to: 19-21, 68-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 13 - 15, The fenced code
blocks in docs/discovery-token-security.md (examples like the line "token =
header.payload.signature", the "JWT Structure" block showing
"<base64-header>.<base64-payload>.<base64-signature>" and the ASCII diagram
starting with "┌──────────────┐") lack language specifiers and fail the markdown
linter; update each such fenced block (including the blocks around lines 19-21
and 68-94) to use the text language specifier (i.e., replace ``` with ```text
for the token example, the JWT Structure example, and the ASCII diagram blocks)
so they render correctly and satisfy the linter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 441-447: The "24-hour token lifetime (increased from 1 hour)"
statement under the "Not Needed (Now Implemented)" section is inconsistent with
the rest of the document and code examples that use a 1-hour expiry; update that
statement to reflect a 1-hour token lifetime (or alternatively update the
code/examples and the "Why 1 hour?" section if you intend to change to 24
hours). Specifically, edit the text under the "Not Needed (Now Implemented)"
heading to say "1-hour token lifetime" (and remove the "(increased from 1 hour)"
clause) and verify the JWT example and the "Why 1 hour?" section wording match
this value so the document is consistent.
- Around line 28-36: In the "Claims (Payload):" section remove the "-
**secret**: 32-byte (256-bit) shared secret..." bullet since the signing secret
is not a JWT claim; keep only "- **sub**", "- **iat**", and "- **exp**" as the
transmitted claims, and instead add a short note near the "Signature:
HMAC-SHA256(header + payload, secret)" line clarifying that the secret is used
only to compute/verify the HMAC and is not included in the token payload.

---

Duplicate comments:
In `@docs/discovery-token-security.md`:
- Around line 339-342: The current command uses openssl rand -base64 32 with
--from-literal which yields a 44-byte base64 string that kubectl base64-encodes
again, producing an incorrect 44-byte decoded signing-key; instead generate raw
32 bytes to a file and create the secret from that file so Kubernetes stores
exactly 32 bytes: run openssl rand 32 > signing-key.bin and then kubectl create
secret generic discovery-token-signing-secret
--from-file=signing-key=signing-key.bin -n metal-operator-system (then securely
remove signing-key.bin); update the documented command that references openssl
rand -base64 32, --from-literal, and signing-key accordingly.

---

Nitpick comments:
In `@docs/discovery-token-security.md`:
- Around line 13-15: The fenced code blocks in docs/discovery-token-security.md
(examples like the line "token = header.payload.signature", the "JWT Structure"
block showing "<base64-header>.<base64-payload>.<base64-signature>" and the
ASCII diagram starting with "┌──────────────┐") lack language specifiers and
fail the markdown linter; update each such fenced block (including the blocks
around lines 19-21 and 68-94) to use the text language specifier (i.e., replace
``` with ```text for the token example, the JWT Structure example, and the ASCII
diagram blocks) so they render correctly and satisfy the linter.

In `@internal/token/token_test.go`:
- Around line 69-78: The test swallows the error from
GenerateSignedDiscoveryToken inside Eventually causing silent failures; update
the "should generate different tokens at different times" test to handle errors
explicitly — either insert a short time.Sleep between calls and call
GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred()) before
comparing token1 vs token2, or change the Eventually invocation to call a helper
that returns (string, error) and assert on the error immediately; reference
GenerateSignedDiscoveryToken, token1/token2 and signingSecret when making the
change so the test fails fast on token generation errors instead of hanging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34fd7cba-42bb-49b5-a86a-77dfe6d83e10

📥 Commits

Reviewing files that changed from the base of the PR and between f697e38 and 6388ba3.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • docs/discovery-token-security.md
  • go.mod
  • internal/token/token.go
  • internal/token/token_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/token/token.go

Comment thread docs/discovery-token-security.md
Comment thread docs/discovery-token-security.md Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/registry/server.go (1)

310-319: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference when k8sClient is nil and payload.Booted is true.

In test mode (k8sClient == nil), token validation passes at Lines 294-299. If payload.Booted is true, the code reaches Line 315 which calls s.k8sClient.List(), causing a panic.

🐛 Proposed fix to guard against nil client
 	s.log.Info("Received boot state for system", "SystemUUID", payload.SystemUUID, "BootState", payload.Booted)
 	if !payload.Booted {
 		w.WriteHeader(http.StatusOK)
 		return
 	}
+	// In test mode without k8s client, we cannot update ServerBootConfiguration
+	if s.k8sClient == nil {
+		s.log.Info("Running in test mode, skipping ServerBootConfiguration update")
+		w.WriteHeader(http.StatusOK)
+		return
+	}
 	var servers metalv1alpha1.ServerList
 	if err := s.k8sClient.List(r.Context(), &servers, client.MatchingFields{"spec.systemUUID": payload.SystemUUID}); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 310 - 319, The handler panics when
payload.Booted is true because s.k8sClient can be nil in test mode; before
calling s.k8sClient.List(...) check for s.k8sClient == nil and short-circuit
(e.g., write a success response and return or return a controlled error) to
avoid dereferencing nil. Update the code path that uses s.k8sClient.List,
referencing s.k8sClient and the payload.Booted check, to handle the nil client
case (log a clear message and respond without calling List).
♻️ Duplicate comments (1)
internal/controller/server_controller.go (1)

730-736: ⚠️ Potential issue | 🟠 Major

Invalid secret still causes permanent failure without auto-recovery.

When the signing secret exists but has an invalid key (wrong length or missing), this path returns an error without attempting to regenerate. This will cause permanent reconciliation failures until manual intervention, while the AlreadyExists path (Lines 778-790) does attempt recovery.

Consider applying the same regeneration logic here for consistency:

🐛 Proposed fix to add auto-recovery
 	if err == nil {
 		// Secret exists, return the signing key
 		if signingKey, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]; ok && len(signingKey) == 32 {
 			return signingKey, nil
 		}
-		// Secret exists but is invalid, regenerate
-		return nil, fmt.Errorf("existing signing secret is invalid")
+		// Secret exists but is invalid, update it with a valid key
+		ctrl.LoggerFrom(ctx).Info("Existing signing secret is invalid, regenerating", "name", secretName)
+		signingKey, err := metaltoken.GenerateSigningSecret()
+		if err != nil {
+			return nil, fmt.Errorf("failed to generate signing secret: %w", err)
+		}
+		secret.Data = map[string][]byte{metaltoken.DiscoveryTokenSigningSecretKey: signingKey}
+		if err := r.Update(ctx, secret); err != nil {
+			return nil, fmt.Errorf("failed to update signing secret: %w", err)
+		}
+		return signingKey, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller.go` around lines 730 - 736, The current
branch inside the secret-fetch logic returns an error when
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing or not 32
bytes, causing permanent reconciliation failure; update the block in the
function handling secret retrieval so that if signingKey is missing/invalid you
treat it like a corrupted secret and trigger the same regeneration/recreate
logic used in the AlreadyExists path (the code that handles creating a new
signing secret), i.e., delete or overwrite the invalid secret and call the
existing regeneration routine to create a valid 32-byte signing key rather than
returning fmt.Errorf("existing signing secret is invalid").
🧹 Nitpick comments (1)
internal/registry/server.go (1)

144-151: Test mode bypass logs at V(1) which may not be visible.

When running without a K8s client, authentication is completely bypassed. The warning is logged at verbosity level 1, which may not be visible in default logging configurations. Consider logging at Info level (V(0)) to ensure operators notice this security-sensitive configuration.

📝 Suggested improvement
 	// Skip validation in test mode (no k8s client)
 	if s.k8sClient == nil {
-		s.log.V(1).Info("Running in TEST MODE without K8s client - skipping secret loading")
+		s.log.Info("WARNING: Running in TEST MODE without K8s client - token validation DISABLED")
 		// In test mode, extract systemUUID from token if possible, otherwise return empty
 		// For now, just allow all requests in test mode
 		return "", true
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 144 - 151, The test-mode bypass in
validateDiscoveryToken currently logs the security-sensitive message at
verbosity V(1) which may be hidden; update the logging call on the
s.k8sClient==nil branch (the s.log.V(1).Info(...) invocation inside
validateDiscoveryToken) to use the default/info level (e.g. s.log.Info(...) or
s.log.V(0).Info(...)) and make the message explicit (mention "TEST MODE -
authentication bypassed, no K8s client") so operators will see it in normal log
configurations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/probe/probe.go`:
- Around line 223-226: The log call is passing a nil `err` when the HTTP request
returned a non-success status; replace that by creating a real error describing
the response (e.g. using fmt.Errorf to include resp.StatusCode and optionally
the response body) and pass that error to `a.log.Error`, or alternatively call
`a.log.Error` without an error parameter and include the status and
`a.RegistryURL` in the message; target the block that checks `resp.StatusCode`
(the `if resp.StatusCode != http.StatusOK && resp.StatusCode !=
http.StatusCreated { ... }`), update the logging to use the newly constructed
error (or message) instead of the nil `err` so the log is informative and
non-nil.

---

Outside diff comments:
In `@internal/registry/server.go`:
- Around line 310-319: The handler panics when payload.Booted is true because
s.k8sClient can be nil in test mode; before calling s.k8sClient.List(...) check
for s.k8sClient == nil and short-circuit (e.g., write a success response and
return or return a controlled error) to avoid dereferencing nil. Update the code
path that uses s.k8sClient.List, referencing s.k8sClient and the payload.Booted
check, to handle the nil client case (log a clear message and respond without
calling List).

---

Duplicate comments:
In `@internal/controller/server_controller.go`:
- Around line 730-736: The current branch inside the secret-fetch logic returns
an error when secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing
or not 32 bytes, causing permanent reconciliation failure; update the block in
the function handling secret retrieval so that if signingKey is missing/invalid
you treat it like a corrupted secret and trigger the same regeneration/recreate
logic used in the AlreadyExists path (the code that handles creating a new
signing secret), i.e., delete or overwrite the invalid secret and call the
existing regeneration routine to create a valid 32-byte signing key rather than
returning fmt.Errorf("existing signing secret is invalid").

---

Nitpick comments:
In `@internal/registry/server.go`:
- Around line 144-151: The test-mode bypass in validateDiscoveryToken currently
logs the security-sensitive message at verbosity V(1) which may be hidden;
update the logging call on the s.k8sClient==nil branch (the s.log.V(1).Info(...)
invocation inside validateDiscoveryToken) to use the default/info level (e.g.
s.log.Info(...) or s.log.V(0).Info(...)) and make the message explicit (mention
"TEST MODE - authentication bypassed, no K8s client") so operators will see it
in normal log configurations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2c4ad7fa-b741-4419-82bd-b7a396b8689d

📥 Commits

Reviewing files that changed from the base of the PR and between 6388ba3 and f7cff68.

📒 Files selected for processing (8)
  • cmd/main.go
  • docs/discovery-token-security.md
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/probe/probe.go
  • internal/registry/server.go
  • internal/token/constants.go
  • internal/token/token_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/discovery-token-security.md
  • internal/token/constants.go

Comment thread internal/probe/probe.go
@stefanhipfel stefanhipfel changed the title Secure discovery boot data with HMAC-SHA256 signed tokens (Issue #749) Secure discovery boot data with jwt signed tokens (Issue #749) Mar 24, 2026
@afritzler afritzler changed the title Secure discovery boot data with jwt signed tokens (Issue #749) Secure discovery boot data with jwt signed tokens Mar 30, 2026
Resolved merge conflicts between JWT authentication changes and
registry client timeout refactoring:

- Combined discoveryToken parameter with registryClientTimeout in probe.Agent
- Updated NewAgent signature to include both authentication and timeout
- Updated all test instantiations to include both parameters
- Preserved JWT validation in bootstate endpoint
- Maintained consistent error logging format

Conflicts resolved in:
- cmd/metalprobe/main.go
- internal/probe/probe.go
- internal/probe/probe_suite_test.go
- internal/controller/server_controller_test.go
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 17, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/controller/suite_test.go (1)

336-342: Use the token.DiscoveryTokenSigningSecretName constant instead of hardcoding the name.

cmd/main.go already passes metaltoken.DiscoveryTokenSigningSecretName (see internal/token/constants.go). Duplicating the literal string here creates a drift hazard if the constant is ever renamed. Import internal/token in this test file and reference the constant for consistency.

♻️ Proposed change
 	"github.com/ironcore-dev/metal-operator/internal/registry"
+	metaltoken "github.com/ironcore-dev/metal-operator/internal/token"
 			registryServer := registry.NewServer(
 				GinkgoLogr,
 				":30000",
 				k8sManager.GetClient(),
-				"discovery-token-signing-secret",
+				metaltoken.DiscoveryTokenSigningSecretName,
 				ns.Name,
 			)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/suite_test.go` around lines 336 - 342, Replace the
hard-coded "discovery-token-signing-secret" literal passed to registry.NewServer
in the suite_test setup with the shared constant by importing the internal/token
package and using token.DiscoveryTokenSigningSecretName; specifically update the
call to registry.NewServer(GinkgoLogr, ":30000", k8sManager.GetClient(),
token.DiscoveryTokenSigningSecretName, ns.Name) and add the token import to the
test file so the test references the canonical constant instead of duplicating
the string.
internal/registry/server.go (1)

44-73: Preload in NewServer is best-effort — consider logging at a higher level when it silently fails.

If the Secret doesn't exist yet at registry startup (the common first-boot case), err is logged at Error but the server still starts with signingSecret == nil. Combined with the sync.Once issue on the on-demand loader, the only signal of misconfiguration is a single startup log line that may be easy to miss. Once the on-demand loader is fixed to retry, this preload can arguably be simplified to a log-only probe (or removed entirely), since the lazy loader will handle it correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 44 - 73, The preload of the
discovery signing secret in NewServer currently logs the Get error at Error but
leaves the server running with signingSecret nil and a silent startup state;
update the preload to treat missing/invalid secret as a higher-visibility
condition and not just a low-level error: in the NewServer preload block (the
code that calls k8sClient.Get and inspects
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]) change the
error/invalid-secret logging to a more prominent level (e.g., logger.Warn or
logger.Info with clear context) and include a single clear message that token
validation will fail until the secret is available; alternatively
simplify/remove the eager load and keep only a prominent probe/log statement so
the on-demand loader (the sync.Once lazy loader) is the definitive mechanism for
retrieval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/server_controller_test.go`:
- Around line 442-459: The test reads the ignition secret and extracts the
discovery token before the Eventually assertion, allowing a reconciler run to
re-hash the password and make the MatchYAML comparison fail; move the secret
read, tokenRegex extraction and call to ignition.GenerateIgnitionDataFromFile
(which builds ignitionData using the extracted token) into the Eventually
closure so both actualToken and expected ignitionData are computed atomically
with the final secret state (references: DefaultIgnitionSecretKeyName,
tokenRegex, ignition.GenerateIgnitionDataFromFile, and the test's Eventually
block) — update the closure to re-extract
ignitionSecret.Data[DefaultIgnitionSecretKeyName], compute actualToken,
regenerate ignitionData, then perform the MatchYAML assertion.

In `@internal/registry/server.go`:
- Around line 89-139: The current use of sync.Once in loadSigningSecret along
with setting signingSecretErr causes permanent caching of failures so the server
never recovers; change the approach to remove/stop using signingSecretOnce and
instead use a mutex-guarded loader that only memoizes successful loads: let
getSigningSecret check the cached s.signingSecret under signingSecretMu (fast
read path) and if empty/invalid attempt to fetch from k8s (using
s.k8sClient.Get) while holding a write lock or using a separate load mutex, set
s.signingSecret only on success and clear/avoid setting signingSecretErr on
transient failures so subsequent calls retry; ensure compatibility with
NewServer which may pre-populate s.signingSecret by skipping reload when
s.signingSecret is already valid.

---

Nitpick comments:
In `@internal/controller/suite_test.go`:
- Around line 336-342: Replace the hard-coded "discovery-token-signing-secret"
literal passed to registry.NewServer in the suite_test setup with the shared
constant by importing the internal/token package and using
token.DiscoveryTokenSigningSecretName; specifically update the call to
registry.NewServer(GinkgoLogr, ":30000", k8sManager.GetClient(),
token.DiscoveryTokenSigningSecretName, ns.Name) and add the token import to the
test file so the test references the canonical constant instead of duplicating
the string.

In `@internal/registry/server.go`:
- Around line 44-73: The preload of the discovery signing secret in NewServer
currently logs the Get error at Error but leaves the server running with
signingSecret nil and a silent startup state; update the preload to treat
missing/invalid secret as a higher-visibility condition and not just a low-level
error: in the NewServer preload block (the code that calls k8sClient.Get and
inspects secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]) change the
error/invalid-secret logging to a more prominent level (e.g., logger.Warn or
logger.Info with clear context) and include a single clear message that token
validation will fail until the secret is available; alternatively
simplify/remove the eager load and keep only a prominent probe/log statement so
the on-demand loader (the sync.Once lazy loader) is the definitive mechanism for
retrieval.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8f6d3222-c160-4f30-9451-68aa4354b25c

📥 Commits

Reviewing files that changed from the base of the PR and between f7cff68 and 6a8c364.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (9)
  • cmd/main.go
  • cmd/metalprobe/main.go
  • go.mod
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/controller/suite_test.go
  • internal/probe/probe.go
  • internal/probe/probe_suite_test.go
  • internal/registry/server.go
✅ Files skipped from review due to trivial changes (1)
  • go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/probe/probe_suite_test.go
  • cmd/metalprobe/main.go
  • internal/probe/probe.go

Comment thread internal/controller/server_controller_test.go Outdated
Comment thread internal/registry/server.go Outdated
This commit addresses 12 issues identified by CodeRabbitAI:

**Critical/Major Fixes:**
1. Fix probe authentication retry logic - return terminal error on 401/403
   instead of (false, nil) which caused infinite retries
2. Fix registry secret loading race condition - replace sync.Once with
   mutex-guarded loader that only memoizes success, allowing recovery from
   transient failures
3. Fix controller secret creation race - handle AlreadyExists errors and
   regenerate invalid secrets instead of permanent failure
4. Fix test race condition - move bcrypt hash and token extraction inside
   Eventually closure to prevent stale data mismatches

**Improvements:**
5. Fix nil error logging in probe - change Error to Info for non-success
   HTTP status codes where err is nil
6. Add UUID validation - validate RFC 4122 format before token generation
   to prevent delimiter injection
7. Add test mode warning - log warning when registry starts without K8s
   client (token validation disabled)
8. Update test fixtures - use valid RFC 4122 UUIDs in token tests

All tests passing. No breaking changes.

Fixes issues identified in PR review:
- Race conditions in concurrent reconciliation
- Permanent failure caching with sync.Once
- Auth retry logic bug
- Test flakiness with bcrypt hashes
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/server_controller.go`:
- Around line 817-843: The fetched Secret in the AlreadyExists branch may have
secret.Data == nil causing a panic on assignment; before assigning into
secret.Data in the recovery path (after calling r.Get and after generating
signingKey via metaltoken.GenerateSigningSecret), ensure secret.Data is
initialized (e.g., set to make(map[string][]byte) when nil) and then assign
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey, then call
r.Update to persist the replacement key.

In `@internal/registry/server.go`:
- Around line 95-176: The getSigningSecret and validateDiscoveryToken functions
must accept a context so request cancellation and shutdown are respected: change
signature to getSigningSecret(ctx context.Context) and
validateDiscoveryToken(ctx context.Context, receivedToken string), remove
context.Background() and pass ctx into s.k8sClient.Get; update all call sites
(e.g., HTTP handlers that call validateDiscoveryToken) to pass r.Context(). Also
implement a single-retry secret refresh on signature/verification failures
inside validateDiscoveryToken: on VerifySignedDiscoveryToken returning an error
or invalid, clear the cached s.signingSecret under s.signingSecretMu lock, call
getSigningSecret(ctx) again to reload the secret, retry
VerifySignedDiscoveryToken once, and only then reject the request if still
failing; keep log calls as appropriate and preserve the 32-byte validation logic
in getSigningSecret.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d175cecf-f0cd-4901-b76f-412526c941ba

📥 Commits

Reviewing files that changed from the base of the PR and between 6a8c364 and 986d7cc.

📒 Files selected for processing (6)
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/probe/probe.go
  • internal/registry/server.go
  • internal/token/token.go
  • internal/token/token_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/probe/probe.go
  • internal/controller/server_controller_test.go
  • internal/token/token_test.go

Comment on lines +817 to +843
if err := r.Create(ctx, newSecret); err != nil {
if apierrors.IsAlreadyExists(err) {
// Another reconciler won the race - fetch the existing secret
ctrl.LoggerFrom(ctx).Info("Signing secret already exists, retrieving it")

secret := &v1.Secret{}
if err := r.Get(ctx, types.NamespacedName{
Name: secretName,
Namespace: r.ManagerNamespace,
}, secret); err != nil {
return nil, fmt.Errorf("failed to retrieve existing signing secret: %w", err)
}

if signingKey, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]; ok && len(signingKey) == 32 {
return signingKey, nil
}

// Existing secret is invalid - update it
signingKey, err := metaltoken.GenerateSigningSecret()
if err != nil {
return nil, fmt.Errorf("failed to generate replacement signing secret: %w", err)
}

secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey
if err := r.Update(ctx, secret); err != nil {
return nil, fmt.Errorf("failed to update invalid signing secret: %w", err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Initialize secret.Data before writing the replacement key.

In the AlreadyExists recovery path, the fetched Secret can exist with Data == nil if another actor created an empty Secret between the initial Get and Create. Line 840 would then panic with assignment to entry in nil map; initialize the map before assigning.

🐛 Proposed fix
 			// Existing secret is invalid - update it
 			signingKey, err := metaltoken.GenerateSigningSecret()
 			if err != nil {
 				return nil, fmt.Errorf("failed to generate replacement signing secret: %w", err)
 			}
 
+			if secret.Data == nil {
+				secret.Data = map[string][]byte{}
+			}
 			secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey
 			if err := r.Update(ctx, secret); err != nil {
 				return nil, fmt.Errorf("failed to update invalid signing secret: %w", err)
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller.go` around lines 817 - 843, The fetched
Secret in the AlreadyExists branch may have secret.Data == nil causing a panic
on assignment; before assigning into secret.Data in the recovery path (after
calling r.Get and after generating signingKey via
metaltoken.GenerateSigningSecret), ensure secret.Data is initialized (e.g., set
to make(map[string][]byte) when nil) and then assign
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey, then call
r.Update to persist the replacement key.

Comment on lines +95 to +176
func (s *Server) getSigningSecret() ([]byte, error) {
// Fast path: check if we already have a valid cached secret (read lock)
s.signingSecretMu.RLock()
if len(s.signingSecret) == 32 {
secret := s.signingSecret
s.signingSecretMu.RUnlock()
return secret, nil
}
s.signingSecretMu.RUnlock()

// Slow path: need to load the secret (write lock)
if s.signingSecretName == "" || s.signingSecretNs == "" {
return nil, fmt.Errorf("signing secret name or namespace not configured")
}

// Attempt to load the secret from Kubernetes
ctx := context.Background()
secret := &corev1.Secret{}
err := s.k8sClient.Get(ctx, client.ObjectKey{
Name: s.signingSecretName,
Namespace: s.signingSecretNs,
}, secret)

if err != nil {
return nil, fmt.Errorf("failed to load signing secret: %w", err)
}

key, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]
if !ok || len(key) != 32 {
return nil, fmt.Errorf("signing secret invalid or missing")
}

// Cache the loaded secret (write lock)
s.signingSecretMu.Lock()
s.signingSecret = key
s.signingSecretMu.Unlock()

s.log.Info("Loaded discovery token signing secret", "name", s.signingSecretName)
return key, nil
}

// validateDiscoveryToken verifies a JWT-signed discovery token.
// Returns (systemUUID, valid) where systemUUID is extracted from the token.
// If k8sClient is nil (unit test mode), validation is skipped.
func (s *Server) validateDiscoveryToken(receivedToken string) (string, bool) {
// Skip validation in test mode (no k8s client)
if s.k8sClient == nil {
s.log.V(1).Info("Running in TEST MODE without K8s client - skipping secret loading")
// In test mode, extract systemUUID from token if possible, otherwise return empty
// For now, just allow all requests in test mode
return "", true
}

// Reject if token is missing
if receivedToken == "" {
s.log.Info("Rejected request with missing discovery token")
return "", false
}

// Get signing secret (thread-safe, loads on first call)
secret, err := s.getSigningSecret()
if err != nil {
s.log.Error(err, "Signing secret not available")
return "", false
}

// Verify the signed token
systemUUID, timestamp, valid, err := metaltoken.VerifySignedDiscoveryToken(secret, receivedToken)
if err != nil {
s.log.Error(err, "Error verifying discovery token")
return "", false
}

if !valid {
s.log.Info("Rejected request with invalid discovery token", "tokenLength", len(receivedToken))
return "", false
}

// Token is valid
s.log.V(1).Info("Validated discovery token", "systemUUID", systemUUID, "timestamp", timestamp)
return systemUUID, true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify registry token validation no longer uses context.Background in the request path.

rg -n -C3 'context\.Background\(\)|validateDiscoveryToken\(|getSigningSecret\(' --type=go internal/registry/server.go

Repository: ironcore-dev/metal-operator

Length of output: 2438


Pass request context through secret loading and add secret refresh on validation failure.

The getSigningSecret() method uses context.Background() when called from HTTP handlers (lines 155, 201, 289), preventing request cancellation and respecting server shutdown. Additionally, once a valid 32-byte secret is cached, it is never reloaded; if the controller repairs or recreates the signing Secret, validation fails for all new tokens until the registry restarts.

Required changes:

  • Add ctx context.Context parameter to getSigningSecret(ctx) and remove the context.Background() call at line 111
  • Add ctx context.Context parameter to validateDiscoveryToken(ctx, receivedToken)
  • Update handler call sites (lines 201, 289) to pass r.Context() to validateDiscoveryToken()
  • When token validation fails due to signature errors, consider reloading the Secret from Kubernetes before rejecting the request, or implement a TTL/watch mechanism to refresh on Secret changes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 95 - 176, The getSigningSecret and
validateDiscoveryToken functions must accept a context so request cancellation
and shutdown are respected: change signature to getSigningSecret(ctx
context.Context) and validateDiscoveryToken(ctx context.Context, receivedToken
string), remove context.Background() and pass ctx into s.k8sClient.Get; update
all call sites (e.g., HTTP handlers that call validateDiscoveryToken) to pass
r.Context(). Also implement a single-retry secret refresh on
signature/verification failures inside validateDiscoveryToken: on
VerifySignedDiscoveryToken returning an error or invalid, clear the cached
s.signingSecret under s.signingSecretMu lock, call getSigningSecret(ctx) again
to reload the secret, retry VerifySignedDiscoveryToken once, and only then
reject the request if still failing; keep log calls as appropriate and preserve
the 32-byte validation logic in getSigningSecret.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/metal-automation documentation Improvements or additions to documentation size/XXL

Projects

Status: Backlog
Status: No status

Development

Successfully merging this pull request may close these issues.

Secure discovery boot data with per-server tokens

2 participants