Secure discovery boot data with jwt signed tokens#750
Secure discovery boot data with jwt signed tokens#750stefanhipfel wants to merge 7 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughController generates per-server HMAC‑HS256 discovery tokens stored in a Kubernetes Secret, injects them into ignition; metalprobe presents the token to registry endpoints; registry loads the signing secret and validates tokens on /register and /bootstate, rejecting invalid or mismatched tokens. Changes
Sequence DiagramsequenceDiagram
participant K8s as Kubernetes
participant Ctrl as Controller
participant Probe as Metalprobe
participant Registry as Registry Server
Ctrl->>K8s: getOrCreateSigningSecret()
K8s-->>Ctrl: signing-key (32 bytes)
Ctrl->>Ctrl: GenerateSignedDiscoveryToken(signing-key, systemUUID)
Ctrl-->>K8s: Apply ignition with --discovery-token=token
K8s-->>Probe: Boot with token flag
Probe->>Registry: POST /register (DiscoveryToken)
Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
alt Token Valid
Registry-->>Probe: 200 OK
else Token Invalid/Missing
Registry-->>Probe: 401 Unauthorized
end
Probe->>Registry: POST /bootstate (DiscoveryToken)
Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
alt Token Valid
Registry-->>Probe: 200 OK
else Token Invalid/Missing
Registry-->>Probe: 401 Unauthorized
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (7)
cmd/main.go (1)
627-633: Consider extracting the hardcoded secret name to a shared constant.The secret name
"discovery-token-signing-secret"is duplicated here and inserver_controller.go(line 723). Extracting it to a shared constant would prevent accidental drift between the controller and registry server configurations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/main.go` around lines 627 - 633, The secret name "discovery-token-signing-secret" is duplicated; extract it to a shared constant and use that constant wherever the literal is used. Define a package-level constant (e.g., DiscoveryTokenSigningSecret) in a common package/file imported by both the registry setup and the controller, then replace the literal in the registry.NewServer call and the usage in server_controller.go (and any other occurrences) with that constant to ensure a single source of truth.internal/token/token_test.go (2)
163-173: Test comment is misleading - this doesn't verify constant-time behavior.The comment says "HMAC provides constant-time comparison internally" but the test only verifies that token verification works, not timing properties. Timing-safe comparison cannot be tested this way.
Consider renaming or updating the test description to reflect what it actually tests.
📝 Suggested description improvement
- It("should use constant-time comparison (HMAC)", func() { - // HMAC provides constant-time comparison internally - // This test verifies the signature mechanism works + It("should verify tokens using HMAC signature mechanism", func() { + // Note: HMAC uses hmac.Equal internally for constant-time comparison + // This test verifies basic signature verification works🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/token/token_test.go` around lines 163 - 173, The test description is misleading about timing; update the It description under Context("Token Security") to reflect that it verifies signature correctness rather than constant-time behavior: change the It("should use constant-time comparison (HMAC)") text to something like It("verifies signed discovery token using HMAC signature") and/or update the comment above the test; keep the test body using GenerateSignedDiscoveryToken and VerifySignedDiscoveryToken unchanged so it still asserts successful generation and valid verification.
67-76: Consider using explicit time manipulation instead of relying on timing.The
Eventuallypattern here relies on wall-clock time passing to generate a different timestamp. This could be flaky in fast CI environments or if the timestamp resolution changes. Consider either:
- Mocking time in the token package
- Simply accepting that same-second tokens may be identical (expected behavior)
The current test is technically correct but may cause intermittent failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/token/token_test.go` around lines 67 - 76, Test relies on wall-clock progress and can flake; update the code or test to control time deterministically: either add a time provider to GenerateSignedDiscoveryToken (e.g., inject a nowFunc or accept a timestamp parameter) and in the test call GenerateSignedDiscoveryToken with two different timestamps to assert tokens differ, or change the test to assert tokens may be equal within the same second and avoid Eventually. Locate GenerateSignedDiscoveryToken and the token_test.go case and implement the time-injection approach (introduce nowFunc or timestamp arg) or alter the test to compare tokens produced with explicit different times instead of using Eventually.internal/controller/server_controller_test.go (2)
437-446: Fragile string parsing for token extraction.This manual string parsing with
IndexAnyand hardcoded delimiters is brittle. If the ignition format changes slightly (e.g., different quoting or whitespace), this could break silently.Consider using a regex pattern for more robust extraction, or better yet, parse the YAML structure to get the systemd unit contents and then extract the flag value.
♻️ Suggested regex-based extraction
- // Extract token using regex or string parsing - tokenStart := strings.Index(contents, "--discovery-token=") - Expect(tokenStart).To(BeNumerically(">", 0)) - tokenValueStart := tokenStart + len("--discovery-token=") - tokenEnd := strings.IndexAny(contents[tokenValueStart:], " \n\t\"'") - var actualToken string - if tokenEnd == -1 { - actualToken = contents[tokenValueStart:] - } else { - actualToken = contents[tokenValueStart : tokenValueStart+tokenEnd] - } - Expect(actualToken).ToNot(BeEmpty()) + // Extract token using regex for more robust parsing + tokenRegex := regexp.MustCompile(`--discovery-token=([A-Za-z0-9_-]+)`) + matches := tokenRegex.FindStringSubmatch(contents) + Expect(matches).To(HaveLen(2), "should find discovery token in ignition") + actualToken := matches[1] + Expect(actualToken).ToNot(BeEmpty())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller_test.go` around lines 437 - 446, The current brittle manual parsing of contents (variables tokenStart, tokenValueStart, tokenEnd, actualToken) using strings.Index and IndexAny should be replaced with a more robust extraction: use a regex to locate --discovery-token=<value> (or parse the YAML/ignition to get the systemd unit content first) and capture the token value, or parse the YAML structure to retrieve the unit and then extract the flag; update the test to use that regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic so variations in quoting/whitespace won’t break the extraction.
271-289: Consider extracting repeated signing secret retrieval into a helper function.The pattern of fetching the signing secret and generating a token is duplicated across multiple tests (here, lines 506-522, and lines 853-867). This could be extracted into a test helper to reduce duplication and improve maintainability.
♻️ Suggested helper function
// Add this helper near the top of the test file or in a test utilities file func getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns, systemUUID string) string { signingSecret := &v1.Secret{} Eventually(func() error { return k8sClient.Get(ctx, client.ObjectKey{ Name: "discovery-token-signing-secret", Namespace: ns, }, signingSecret) }).Should(Succeed()) signingKey := signingSecret.Data["signing-key"] Expect(signingKey).To(HaveLen(32)) token, err := metaltoken.GenerateSignedDiscoveryToken(signingKey, systemUUID) Expect(err).NotTo(HaveOccurred()) return token }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller_test.go` around lines 271 - 289, The test duplicates code to fetch the "discovery-token-signing-secret" and generate a signed discovery token; extract that logic into a helper function (e.g., getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns string, systemUUID string) string) placed near the top of the test file or a test utilities file, have it perform the Eventually k8sClient.Get into a v1.Secret, assert signingKey length and GenerateSignedDiscoveryToken (metaltoken.GenerateSignedDiscoveryToken), return the token, and then replace the inline blocks in the tests (the creation of signingSecret, signingKey check, metaltoken.GenerateSignedDiscoveryToken call, and probe.NewAgent usage sites) with calls to getSignedDiscoveryToken to reduce duplication and keep assertions intact.internal/token/token.go (1)
86-97: Manual delimiter parsing could use strings.Index or strings.Split.The manual loop to find
||can be simplified usingstrings.Index:♻️ Simplified delimiter parsing
- // Parse by splitting on "||" - firstDelim := -1 - for i := 0; i < len(payload)-1; i++ { - if payload[i] == '|' && payload[i+1] == '|' { - firstDelim = i - break - } - } - - if firstDelim == -1 { + // Parse by finding "||" delimiter + firstDelim := strings.Index(payload, "||") + if firstDelim == -1 { return "", 0, false, nil // Invalid format }Note: You would need to add
"strings"to imports.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/token/token.go` around lines 86 - 97, Replace the manual loop that searches for the "||" delimiter on the payload with strings.Index(payload, "||") (or strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update the code paths that use firstDelim (the existing early-return when delimiter not found and subsequent substring logic) to use the Index/Split result, and add "strings" to the imports. This keeps the same behavior (returning "", 0, false, nil when not found) but removes the manual byte-by-byte scan and improves readability around the payload delimiter handling.docs/discovery-token-security.md (1)
13-15: Add language specifiers to fenced code blocks.Static analysis flagged these code blocks as missing language specifiers. Adding
textorplaintextfor the token format and ASCII diagram would satisfy the linter and improve rendering in some viewers.📝 Suggested fix
-``` +```text token = base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))And for the sequence diagram: ```diff -``` +```text ┌──────────────┐ ┌──────────┐ ┌──────────┐Also applies to: 52-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/discovery-token-security.md` around lines 13 - 15, Update the fenced code blocks that show the token format (the block containing "token = base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))") and the ASCII sequence diagram (the block starting with the box characters "┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or "plaintext" after the opening backticks; apply the same change to the other code fences noted in the comment range (the blocks around lines 52–77) so the linter accepts them and viewers render them as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 317-321: The manual command uses "openssl rand -base64 32" which
base64-encodes the 32 bytes and causes the stored signing-key to be ~44 bytes
after Kubernetes base64-ing it; instead generate 32 raw bytes, write them to a
temporary file, create the Kubernetes secret discovery-token-signing-secret
using --from-file=signing-key=<tempfile> so the exact 32-byte value is stored,
and then remove the temporary file; references: discovery-token-signing-secret,
signing-key, and avoid using openssl rand -base64 32.
In `@internal/controller/server_controller.go`:
- Around line 720-767: getOrCreateSigningSecret has two issues: concurrent
reconciles can race on creating the secret and invalid secrets are never
recovered. Modify getOrCreateSigningSecret to: when an existing secret is found
validate the "signing-key" and if invalid attempt to replace it (generate a new
key and Update/Patch the secret, handling Conflict by re-getting and
re-validating); when Create returns AlreadyExists after generating a new secret,
treat it as a benign race by fetching the newly-created secret and
validating/returning its signing-key instead of failing; in all fetch/update
paths handle transient IsNotFound/Conflict/AlreadyExists errors by re-fetching
and retrying validation once before returning an error. Reference:
getOrCreateSigningSecret, secret.Data["signing-key"], r.Create, r.Get, and use
metaltoken.GenerateSigningSecret for regeneration.
In `@internal/probe/probe.go`:
- Around line 214-219: The auth-error branch currently returns (false, nil)
which signals "not done, no error" so ExponentialBackoffWithContext keeps
retrying; change it to return a non-nil error to stop retries. In the block that
logs the auth failure (the a.log.Error call that checks resp.StatusCode ==
http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden), return a
terminal error (e.g., errors.New or fmt.Errorf with the status code and context)
instead of nil so ExponentialBackoffWithContext will cease retries.
In `@internal/registry/server.go`:
- Around line 89-96: The current validateDiscoveryToken function blindly allows
requests when s.k8sClient == nil which is risky; change the logic to require an
explicit test-mode flag instead of inferring from s.k8sClient by adding a Server
field like testMode (bool) and only bypass validation when testMode is true, and
also add a startup-time log warning if testMode is enabled; update
validateDiscoveryToken to check s.testMode (and still log that validation is
skipped) and ensure any creation/initialization paths set testMode explicitly
rather than relying on a nil k8sClient.
- Around line 104-133: validateDiscoveryToken performs lazy loading of
s.signingSecret without synchronization, causing a data race; add a synchronized
one-time loader and protect writes: introduce fields on Server (e.g.,
signingSecretOnce sync.Once and signingSecretMu sync.Mutex) and implement a
helper method loadSigningSecret() that uses signingSecretOnce.Do to fetch the
Secret via s.k8sClient, validate secret.Data["signing-key"], then assign
s.signingSecret while holding signingSecretMu; call loadSigningSecret() from
validateDiscoveryToken and only read s.signingSecret under the mutex or after
the Once to ensure safe concurrent access.
In `@internal/token/token.go`:
- Around line 40-42: Validate the systemUUID before constructing the payload to
prevent delimiter injection: in the function that builds the payload (where
systemUUID, timestamp, and payload are used) add an RFC‑4122 UUID regex check
against systemUUID and return an error if it doesn't match, so payload :=
fmt.Sprintf("%s||%d", systemUUID, timestamp) only runs for a valid UUID;
alternatively you may replace the delimiter approach with length‑prefixing or a
null byte separator and update the corresponding parsing logic (the token
parsing code referenced at lines 86-99) to use the new format.
---
Nitpick comments:
In `@cmd/main.go`:
- Around line 627-633: The secret name "discovery-token-signing-secret" is
duplicated; extract it to a shared constant and use that constant wherever the
literal is used. Define a package-level constant (e.g.,
DiscoveryTokenSigningSecret) in a common package/file imported by both the
registry setup and the controller, then replace the literal in the
registry.NewServer call and the usage in server_controller.go (and any other
occurrences) with that constant to ensure a single source of truth.
In `@docs/discovery-token-security.md`:
- Around line 13-15: Update the fenced code blocks that show the token format
(the block containing "token =
base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))")
and the ASCII sequence diagram (the block starting with the box characters
"┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or
"plaintext" after the opening backticks; apply the same change to the other code
fences noted in the comment range (the blocks around lines 52–77) so the linter
accepts them and viewers render them as plain text.
In `@internal/controller/server_controller_test.go`:
- Around line 437-446: The current brittle manual parsing of contents (variables
tokenStart, tokenValueStart, tokenEnd, actualToken) using strings.Index and
IndexAny should be replaced with a more robust extraction: use a regex to locate
--discovery-token=<value> (or parse the YAML/ignition to get the systemd unit
content first) and capture the token value, or parse the YAML structure to
retrieve the unit and then extract the flag; update the test to use that
regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic
so variations in quoting/whitespace won’t break the extraction.
- Around line 271-289: The test duplicates code to fetch the
"discovery-token-signing-secret" and generate a signed discovery token; extract
that logic into a helper function (e.g., getSignedDiscoveryToken(ctx
context.Context, k8sClient client.Client, ns string, systemUUID string) string)
placed near the top of the test file or a test utilities file, have it perform
the Eventually k8sClient.Get into a v1.Secret, assert signingKey length and
GenerateSignedDiscoveryToken (metaltoken.GenerateSignedDiscoveryToken), return
the token, and then replace the inline blocks in the tests (the creation of
signingSecret, signingKey check, metaltoken.GenerateSignedDiscoveryToken call,
and probe.NewAgent usage sites) with calls to getSignedDiscoveryToken to reduce
duplication and keep assertions intact.
In `@internal/token/token_test.go`:
- Around line 163-173: The test description is misleading about timing; update
the It description under Context("Token Security") to reflect that it verifies
signature correctness rather than constant-time behavior: change the It("should
use constant-time comparison (HMAC)") text to something like It("verifies signed
discovery token using HMAC signature") and/or update the comment above the test;
keep the test body using GenerateSignedDiscoveryToken and
VerifySignedDiscoveryToken unchanged so it still asserts successful generation
and valid verification.
- Around line 67-76: Test relies on wall-clock progress and can flake; update
the code or test to control time deterministically: either add a time provider
to GenerateSignedDiscoveryToken (e.g., inject a nowFunc or accept a timestamp
parameter) and in the test call GenerateSignedDiscoveryToken with two different
timestamps to assert tokens differ, or change the test to assert tokens may be
equal within the same second and avoid Eventually. Locate
GenerateSignedDiscoveryToken and the token_test.go case and implement the
time-injection approach (introduce nowFunc or timestamp arg) or alter the test
to compare tokens produced with explicit different times instead of using
Eventually.
In `@internal/token/token.go`:
- Around line 86-97: Replace the manual loop that searches for the "||"
delimiter on the payload with strings.Index(payload, "||") (or
strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update the
code paths that use firstDelim (the existing early-return when delimiter not
found and subsequent substring logic) to use the Index/Split result, and add
"strings" to the imports. This keeps the same behavior (returning "", 0, false,
nil when not found) but removes the manual byte-by-byte scan and improves
readability around the payload delimiter handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6424543f-9838-4804-8743-f03cb9c7032c
📒 Files selected for processing (13)
cmd/main.gocmd/metalprobe/main.godocs/discovery-token-security.mdinternal/api/registry/registry.gointernal/controller/server_controller.gointernal/controller/server_controller_test.gointernal/controller/suite_test.gointernal/probe/probe.gointernal/probe/probe_suite_test.gointernal/registry/registry_suite_test.gointernal/registry/server.gointernal/token/token.gointernal/token/token_test.go
| func (s *Server) validateDiscoveryToken(receivedToken string) (string, bool) { | ||
| // Skip validation in unit test mode (no k8s client) | ||
| if s.k8sClient == nil { | ||
| s.log.V(1).Info("Skipping token validation (no K8s client - unit test mode)") | ||
| // In test mode, extract systemUUID from token if possible, otherwise return empty | ||
| // For now, just allow all requests in test mode | ||
| return "", true | ||
| } |
There was a problem hiding this comment.
Bypassing authentication in test mode is risky.
When k8sClient == nil, all token validation is skipped and requests are allowed. While this is convenient for unit tests, it creates a risk if someone accidentally deploys with a nil client or misconfigures the server.
Consider logging a warning at startup if running in test mode, or requiring an explicit flag to enable test mode rather than inferring it from client presence.
📝 Suggested improvement
func NewServer(logger logr.Logger, addr string, k8sClient client.Client, signingSecretName, signingSecretNamespace string) *Server {
+ if k8sClient == nil {
+ logger.Info("WARNING: Running without K8s client - token validation DISABLED (test mode only)")
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/registry/server.go` around lines 89 - 96, The current
validateDiscoveryToken function blindly allows requests when s.k8sClient == nil
which is risky; change the logic to require an explicit test-mode flag instead
of inferring from s.k8sClient by adding a Server field like testMode (bool) and
only bypass validation when testMode is true, and also add a startup-time log
warning if testMode is enabled; update validateDiscoveryToken to check
s.testMode (and still log that validation is skipped) and ensure any
creation/initialization paths set testMode explicitly rather than relying on a
nil k8sClient.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
docs/discovery-token-security.md (1)
339-342:⚠️ Potential issue | 🟠 MajorManual secret creation command still produces incorrect key length.
As previously flagged,
openssl rand -base64 32generates 32 random bytes and base64-encodes them to ~44 characters. When stored via--from-literal, kubectl base64-encodes this again, so the decodedsigning-keywill be 44 bytes, not the required 32 bytes.🔧 Corrected command
-# Or manually create the secret -kubectl create secret generic discovery-token-signing-secret \ - --from-literal=signing-key=$(openssl rand -base64 32) \ - -n metal-operator-system +# Or manually create the secret (generate 32 raw bytes) +kubectl create secret generic discovery-token-signing-secret \ + --from-file=signing-key=<(openssl rand 32) \ + -n metal-operator-system🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/discovery-token-security.md` around lines 339 - 342, The current command uses openssl rand -base64 32 with --from-literal which yields a 44-byte base64 string that kubectl base64-encodes again, producing an incorrect 44-byte decoded signing-key; instead generate raw 32 bytes to a file and create the secret from that file so Kubernetes stores exactly 32 bytes: run openssl rand 32 > signing-key.bin and then kubectl create secret generic discovery-token-signing-secret --from-file=signing-key=signing-key.bin -n metal-operator-system (then securely remove signing-key.bin); update the documented command that references openssl rand -base64 32, --from-literal, and signing-key accordingly.
🧹 Nitpick comments (2)
internal/token/token_test.go (1)
69-78: Consider usingtime.Sleepor handling the error in Eventually.The test swallows the error from
GenerateSignedDiscoveryTokenon line 75. If token generation fails, the test would hang until timeout rather than fail fast with a clear error.♻️ Proposed fix to handle error explicitly
It("should generate different tokens at different times", func() { token1, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid") Expect(err).NotTo(HaveOccurred()) - // Small delay to ensure different timestamp - Eventually(func() string { - token2, _ := GenerateSignedDiscoveryToken(signingSecret, "test-uuid") - return token2 - }).ShouldNot(Equal(token1)) + // Wait to ensure different timestamp (JWT uses second precision) + time.Sleep(1100 * time.Millisecond) + + token2, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid") + Expect(err).NotTo(HaveOccurred()) + Expect(token2).NotTo(Equal(token1)) })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/token/token_test.go` around lines 69 - 78, The test swallows the error from GenerateSignedDiscoveryToken inside Eventually causing silent failures; update the "should generate different tokens at different times" test to handle errors explicitly — either insert a short time.Sleep between calls and call GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred()) before comparing token1 vs token2, or change the Eventually invocation to call a helper that returns (string, error) and assert on the error immediately; reference GenerateSignedDiscoveryToken, token1/token2 and signingSecret when making the change so the test fails fast on token generation errors instead of hanging.docs/discovery-token-security.md (1)
13-15: Add language specifiers to fenced code blocks for better rendering.The markdown linter flags code blocks without language specifiers. For non-code text blocks (token format examples and ASCII diagrams), use the
textspecifier.📐 Proposed fix
-``` +```text token = header.payload.signatureJWT Structure:
-
+text
..And for the ASCII diagram around line 68:
-``` +```text ┌──────────────┐ ┌──────────┐ ┌──────────┐ │ Controller │ │ Registry │ │ Server │ └──────┬───────┘ └────┬─────┘ └────┬─────┘Also applies to: 19-21, 68-94
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/discovery-token-security.md` around lines 13 - 15, The fenced code blocks in docs/discovery-token-security.md (examples like the line "token = header.payload.signature", the "JWT Structure" block showing "<base64-header>.<base64-payload>.<base64-signature>" and the ASCII diagram starting with "┌──────────────┐") lack language specifiers and fail the markdown linter; update each such fenced block (including the blocks around lines 19-21 and 68-94) to use the text language specifier (i.e., replace ``` with ```text for the token example, the JWT Structure example, and the ASCII diagram blocks) so they render correctly and satisfy the linter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 441-447: The "24-hour token lifetime (increased from 1 hour)"
statement under the "Not Needed (Now Implemented)" section is inconsistent with
the rest of the document and code examples that use a 1-hour expiry; update that
statement to reflect a 1-hour token lifetime (or alternatively update the
code/examples and the "Why 1 hour?" section if you intend to change to 24
hours). Specifically, edit the text under the "Not Needed (Now Implemented)"
heading to say "1-hour token lifetime" (and remove the "(increased from 1 hour)"
clause) and verify the JWT example and the "Why 1 hour?" section wording match
this value so the document is consistent.
- Around line 28-36: In the "Claims (Payload):" section remove the "-
**secret**: 32-byte (256-bit) shared secret..." bullet since the signing secret
is not a JWT claim; keep only "- **sub**", "- **iat**", and "- **exp**" as the
transmitted claims, and instead add a short note near the "Signature:
HMAC-SHA256(header + payload, secret)" line clarifying that the secret is used
only to compute/verify the HMAC and is not included in the token payload.
---
Duplicate comments:
In `@docs/discovery-token-security.md`:
- Around line 339-342: The current command uses openssl rand -base64 32 with
--from-literal which yields a 44-byte base64 string that kubectl base64-encodes
again, producing an incorrect 44-byte decoded signing-key; instead generate raw
32 bytes to a file and create the secret from that file so Kubernetes stores
exactly 32 bytes: run openssl rand 32 > signing-key.bin and then kubectl create
secret generic discovery-token-signing-secret
--from-file=signing-key=signing-key.bin -n metal-operator-system (then securely
remove signing-key.bin); update the documented command that references openssl
rand -base64 32, --from-literal, and signing-key accordingly.
---
Nitpick comments:
In `@docs/discovery-token-security.md`:
- Around line 13-15: The fenced code blocks in docs/discovery-token-security.md
(examples like the line "token = header.payload.signature", the "JWT Structure"
block showing "<base64-header>.<base64-payload>.<base64-signature>" and the
ASCII diagram starting with "┌──────────────┐") lack language specifiers and
fail the markdown linter; update each such fenced block (including the blocks
around lines 19-21 and 68-94) to use the text language specifier (i.e., replace
``` with ```text for the token example, the JWT Structure example, and the ASCII
diagram blocks) so they render correctly and satisfy the linter.
In `@internal/token/token_test.go`:
- Around line 69-78: The test swallows the error from
GenerateSignedDiscoveryToken inside Eventually causing silent failures; update
the "should generate different tokens at different times" test to handle errors
explicitly — either insert a short time.Sleep between calls and call
GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred()) before
comparing token1 vs token2, or change the Eventually invocation to call a helper
that returns (string, error) and assert on the error immediately; reference
GenerateSignedDiscoveryToken, token1/token2 and signingSecret when making the
change so the test fails fast on token generation errors instead of hanging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 34fd7cba-42bb-49b5-a86a-77dfe6d83e10
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (4)
docs/discovery-token-security.mdgo.modinternal/token/token.gointernal/token/token_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/token/token.go
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/registry/server.go (1)
310-319:⚠️ Potential issue | 🔴 CriticalNil pointer dereference when k8sClient is nil and payload.Booted is true.
In test mode (
k8sClient == nil), token validation passes at Lines 294-299. Ifpayload.Bootedis true, the code reaches Line 315 which callss.k8sClient.List(), causing a panic.🐛 Proposed fix to guard against nil client
s.log.Info("Received boot state for system", "SystemUUID", payload.SystemUUID, "BootState", payload.Booted) if !payload.Booted { w.WriteHeader(http.StatusOK) return } + // In test mode without k8s client, we cannot update ServerBootConfiguration + if s.k8sClient == nil { + s.log.Info("Running in test mode, skipping ServerBootConfiguration update") + w.WriteHeader(http.StatusOK) + return + } var servers metalv1alpha1.ServerList if err := s.k8sClient.List(r.Context(), &servers, client.MatchingFields{"spec.systemUUID": payload.SystemUUID}); err != nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/registry/server.go` around lines 310 - 319, The handler panics when payload.Booted is true because s.k8sClient can be nil in test mode; before calling s.k8sClient.List(...) check for s.k8sClient == nil and short-circuit (e.g., write a success response and return or return a controlled error) to avoid dereferencing nil. Update the code path that uses s.k8sClient.List, referencing s.k8sClient and the payload.Booted check, to handle the nil client case (log a clear message and respond without calling List).
♻️ Duplicate comments (1)
internal/controller/server_controller.go (1)
730-736:⚠️ Potential issue | 🟠 MajorInvalid secret still causes permanent failure without auto-recovery.
When the signing secret exists but has an invalid key (wrong length or missing), this path returns an error without attempting to regenerate. This will cause permanent reconciliation failures until manual intervention, while the
AlreadyExistspath (Lines 778-790) does attempt recovery.Consider applying the same regeneration logic here for consistency:
🐛 Proposed fix to add auto-recovery
if err == nil { // Secret exists, return the signing key if signingKey, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]; ok && len(signingKey) == 32 { return signingKey, nil } - // Secret exists but is invalid, regenerate - return nil, fmt.Errorf("existing signing secret is invalid") + // Secret exists but is invalid, update it with a valid key + ctrl.LoggerFrom(ctx).Info("Existing signing secret is invalid, regenerating", "name", secretName) + signingKey, err := metaltoken.GenerateSigningSecret() + if err != nil { + return nil, fmt.Errorf("failed to generate signing secret: %w", err) + } + secret.Data = map[string][]byte{metaltoken.DiscoveryTokenSigningSecretKey: signingKey} + if err := r.Update(ctx, secret); err != nil { + return nil, fmt.Errorf("failed to update signing secret: %w", err) + } + return signingKey, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller.go` around lines 730 - 736, The current branch inside the secret-fetch logic returns an error when secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing or not 32 bytes, causing permanent reconciliation failure; update the block in the function handling secret retrieval so that if signingKey is missing/invalid you treat it like a corrupted secret and trigger the same regeneration/recreate logic used in the AlreadyExists path (the code that handles creating a new signing secret), i.e., delete or overwrite the invalid secret and call the existing regeneration routine to create a valid 32-byte signing key rather than returning fmt.Errorf("existing signing secret is invalid").
🧹 Nitpick comments (1)
internal/registry/server.go (1)
144-151: Test mode bypass logs at V(1) which may not be visible.When running without a K8s client, authentication is completely bypassed. The warning is logged at verbosity level 1, which may not be visible in default logging configurations. Consider logging at Info level (V(0)) to ensure operators notice this security-sensitive configuration.
📝 Suggested improvement
// Skip validation in test mode (no k8s client) if s.k8sClient == nil { - s.log.V(1).Info("Running in TEST MODE without K8s client - skipping secret loading") + s.log.Info("WARNING: Running in TEST MODE without K8s client - token validation DISABLED") // In test mode, extract systemUUID from token if possible, otherwise return empty // For now, just allow all requests in test mode return "", true }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/registry/server.go` around lines 144 - 151, The test-mode bypass in validateDiscoveryToken currently logs the security-sensitive message at verbosity V(1) which may be hidden; update the logging call on the s.k8sClient==nil branch (the s.log.V(1).Info(...) invocation inside validateDiscoveryToken) to use the default/info level (e.g. s.log.Info(...) or s.log.V(0).Info(...)) and make the message explicit (mention "TEST MODE - authentication bypassed, no K8s client") so operators will see it in normal log configurations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/probe/probe.go`:
- Around line 223-226: The log call is passing a nil `err` when the HTTP request
returned a non-success status; replace that by creating a real error describing
the response (e.g. using fmt.Errorf to include resp.StatusCode and optionally
the response body) and pass that error to `a.log.Error`, or alternatively call
`a.log.Error` without an error parameter and include the status and
`a.RegistryURL` in the message; target the block that checks `resp.StatusCode`
(the `if resp.StatusCode != http.StatusOK && resp.StatusCode !=
http.StatusCreated { ... }`), update the logging to use the newly constructed
error (or message) instead of the nil `err` so the log is informative and
non-nil.
---
Outside diff comments:
In `@internal/registry/server.go`:
- Around line 310-319: The handler panics when payload.Booted is true because
s.k8sClient can be nil in test mode; before calling s.k8sClient.List(...) check
for s.k8sClient == nil and short-circuit (e.g., write a success response and
return or return a controlled error) to avoid dereferencing nil. Update the code
path that uses s.k8sClient.List, referencing s.k8sClient and the payload.Booted
check, to handle the nil client case (log a clear message and respond without
calling List).
---
Duplicate comments:
In `@internal/controller/server_controller.go`:
- Around line 730-736: The current branch inside the secret-fetch logic returns
an error when secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing
or not 32 bytes, causing permanent reconciliation failure; update the block in
the function handling secret retrieval so that if signingKey is missing/invalid
you treat it like a corrupted secret and trigger the same regeneration/recreate
logic used in the AlreadyExists path (the code that handles creating a new
signing secret), i.e., delete or overwrite the invalid secret and call the
existing regeneration routine to create a valid 32-byte signing key rather than
returning fmt.Errorf("existing signing secret is invalid").
---
Nitpick comments:
In `@internal/registry/server.go`:
- Around line 144-151: The test-mode bypass in validateDiscoveryToken currently
logs the security-sensitive message at verbosity V(1) which may be hidden;
update the logging call on the s.k8sClient==nil branch (the s.log.V(1).Info(...)
invocation inside validateDiscoveryToken) to use the default/info level (e.g.
s.log.Info(...) or s.log.V(0).Info(...)) and make the message explicit (mention
"TEST MODE - authentication bypassed, no K8s client") so operators will see it
in normal log configurations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c4ad7fa-b741-4419-82bd-b7a396b8689d
📒 Files selected for processing (8)
cmd/main.godocs/discovery-token-security.mdinternal/controller/server_controller.gointernal/controller/server_controller_test.gointernal/probe/probe.gointernal/registry/server.gointernal/token/constants.gointernal/token/token_test.go
✅ Files skipped from review due to trivial changes (2)
- docs/discovery-token-security.md
- internal/token/constants.go
Resolved merge conflicts between JWT authentication changes and registry client timeout refactoring: - Combined discoveryToken parameter with registryClientTimeout in probe.Agent - Updated NewAgent signature to include both authentication and timeout - Updated all test instantiations to include both parameters - Preserved JWT validation in bootstate endpoint - Maintained consistent error logging format Conflicts resolved in: - cmd/metalprobe/main.go - internal/probe/probe.go - internal/probe/probe_suite_test.go - internal/controller/server_controller_test.go
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
internal/controller/suite_test.go (1)
336-342: Use thetoken.DiscoveryTokenSigningSecretNameconstant instead of hardcoding the name.
cmd/main.goalready passesmetaltoken.DiscoveryTokenSigningSecretName(seeinternal/token/constants.go). Duplicating the literal string here creates a drift hazard if the constant is ever renamed. Importinternal/tokenin this test file and reference the constant for consistency.♻️ Proposed change
"github.com/ironcore-dev/metal-operator/internal/registry" + metaltoken "github.com/ironcore-dev/metal-operator/internal/token"registryServer := registry.NewServer( GinkgoLogr, ":30000", k8sManager.GetClient(), - "discovery-token-signing-secret", + metaltoken.DiscoveryTokenSigningSecretName, ns.Name, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/suite_test.go` around lines 336 - 342, Replace the hard-coded "discovery-token-signing-secret" literal passed to registry.NewServer in the suite_test setup with the shared constant by importing the internal/token package and using token.DiscoveryTokenSigningSecretName; specifically update the call to registry.NewServer(GinkgoLogr, ":30000", k8sManager.GetClient(), token.DiscoveryTokenSigningSecretName, ns.Name) and add the token import to the test file so the test references the canonical constant instead of duplicating the string.internal/registry/server.go (1)
44-73: Preload inNewServeris best-effort — consider logging at a higher level when it silently fails.If the Secret doesn't exist yet at registry startup (the common first-boot case),
erris logged atErrorbut the server still starts withsigningSecret == nil. Combined with thesync.Onceissue on the on-demand loader, the only signal of misconfiguration is a single startup log line that may be easy to miss. Once the on-demand loader is fixed to retry, this preload can arguably be simplified to a log-only probe (or removed entirely), since the lazy loader will handle it correctly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/registry/server.go` around lines 44 - 73, The preload of the discovery signing secret in NewServer currently logs the Get error at Error but leaves the server running with signingSecret nil and a silent startup state; update the preload to treat missing/invalid secret as a higher-visibility condition and not just a low-level error: in the NewServer preload block (the code that calls k8sClient.Get and inspects secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]) change the error/invalid-secret logging to a more prominent level (e.g., logger.Warn or logger.Info with clear context) and include a single clear message that token validation will fail until the secret is available; alternatively simplify/remove the eager load and keep only a prominent probe/log statement so the on-demand loader (the sync.Once lazy loader) is the definitive mechanism for retrieval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/server_controller_test.go`:
- Around line 442-459: The test reads the ignition secret and extracts the
discovery token before the Eventually assertion, allowing a reconciler run to
re-hash the password and make the MatchYAML comparison fail; move the secret
read, tokenRegex extraction and call to ignition.GenerateIgnitionDataFromFile
(which builds ignitionData using the extracted token) into the Eventually
closure so both actualToken and expected ignitionData are computed atomically
with the final secret state (references: DefaultIgnitionSecretKeyName,
tokenRegex, ignition.GenerateIgnitionDataFromFile, and the test's Eventually
block) — update the closure to re-extract
ignitionSecret.Data[DefaultIgnitionSecretKeyName], compute actualToken,
regenerate ignitionData, then perform the MatchYAML assertion.
In `@internal/registry/server.go`:
- Around line 89-139: The current use of sync.Once in loadSigningSecret along
with setting signingSecretErr causes permanent caching of failures so the server
never recovers; change the approach to remove/stop using signingSecretOnce and
instead use a mutex-guarded loader that only memoizes successful loads: let
getSigningSecret check the cached s.signingSecret under signingSecretMu (fast
read path) and if empty/invalid attempt to fetch from k8s (using
s.k8sClient.Get) while holding a write lock or using a separate load mutex, set
s.signingSecret only on success and clear/avoid setting signingSecretErr on
transient failures so subsequent calls retry; ensure compatibility with
NewServer which may pre-populate s.signingSecret by skipping reload when
s.signingSecret is already valid.
---
Nitpick comments:
In `@internal/controller/suite_test.go`:
- Around line 336-342: Replace the hard-coded "discovery-token-signing-secret"
literal passed to registry.NewServer in the suite_test setup with the shared
constant by importing the internal/token package and using
token.DiscoveryTokenSigningSecretName; specifically update the call to
registry.NewServer(GinkgoLogr, ":30000", k8sManager.GetClient(),
token.DiscoveryTokenSigningSecretName, ns.Name) and add the token import to the
test file so the test references the canonical constant instead of duplicating
the string.
In `@internal/registry/server.go`:
- Around line 44-73: The preload of the discovery signing secret in NewServer
currently logs the Get error at Error but leaves the server running with
signingSecret nil and a silent startup state; update the preload to treat
missing/invalid secret as a higher-visibility condition and not just a low-level
error: in the NewServer preload block (the code that calls k8sClient.Get and
inspects secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]) change the
error/invalid-secret logging to a more prominent level (e.g., logger.Warn or
logger.Info with clear context) and include a single clear message that token
validation will fail until the secret is available; alternatively
simplify/remove the eager load and keep only a prominent probe/log statement so
the on-demand loader (the sync.Once lazy loader) is the definitive mechanism for
retrieval.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f6d3222-c160-4f30-9451-68aa4354b25c
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (9)
cmd/main.gocmd/metalprobe/main.gogo.modinternal/controller/server_controller.gointernal/controller/server_controller_test.gointernal/controller/suite_test.gointernal/probe/probe.gointernal/probe/probe_suite_test.gointernal/registry/server.go
✅ Files skipped from review due to trivial changes (1)
- go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/probe/probe_suite_test.go
- cmd/metalprobe/main.go
- internal/probe/probe.go
This commit addresses 12 issues identified by CodeRabbitAI: **Critical/Major Fixes:** 1. Fix probe authentication retry logic - return terminal error on 401/403 instead of (false, nil) which caused infinite retries 2. Fix registry secret loading race condition - replace sync.Once with mutex-guarded loader that only memoizes success, allowing recovery from transient failures 3. Fix controller secret creation race - handle AlreadyExists errors and regenerate invalid secrets instead of permanent failure 4. Fix test race condition - move bcrypt hash and token extraction inside Eventually closure to prevent stale data mismatches **Improvements:** 5. Fix nil error logging in probe - change Error to Info for non-success HTTP status codes where err is nil 6. Add UUID validation - validate RFC 4122 format before token generation to prevent delimiter injection 7. Add test mode warning - log warning when registry starts without K8s client (token validation disabled) 8. Update test fixtures - use valid RFC 4122 UUIDs in token tests All tests passing. No breaking changes. Fixes issues identified in PR review: - Race conditions in concurrent reconciliation - Permanent failure caching with sync.Once - Auth retry logic bug - Test flakiness with bcrypt hashes
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/server_controller.go`:
- Around line 817-843: The fetched Secret in the AlreadyExists branch may have
secret.Data == nil causing a panic on assignment; before assigning into
secret.Data in the recovery path (after calling r.Get and after generating
signingKey via metaltoken.GenerateSigningSecret), ensure secret.Data is
initialized (e.g., set to make(map[string][]byte) when nil) and then assign
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey, then call
r.Update to persist the replacement key.
In `@internal/registry/server.go`:
- Around line 95-176: The getSigningSecret and validateDiscoveryToken functions
must accept a context so request cancellation and shutdown are respected: change
signature to getSigningSecret(ctx context.Context) and
validateDiscoveryToken(ctx context.Context, receivedToken string), remove
context.Background() and pass ctx into s.k8sClient.Get; update all call sites
(e.g., HTTP handlers that call validateDiscoveryToken) to pass r.Context(). Also
implement a single-retry secret refresh on signature/verification failures
inside validateDiscoveryToken: on VerifySignedDiscoveryToken returning an error
or invalid, clear the cached s.signingSecret under s.signingSecretMu lock, call
getSigningSecret(ctx) again to reload the secret, retry
VerifySignedDiscoveryToken once, and only then reject the request if still
failing; keep log calls as appropriate and preserve the 32-byte validation logic
in getSigningSecret.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d175cecf-f0cd-4901-b76f-412526c941ba
📒 Files selected for processing (6)
internal/controller/server_controller.gointernal/controller/server_controller_test.gointernal/probe/probe.gointernal/registry/server.gointernal/token/token.gointernal/token/token_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/probe/probe.go
- internal/controller/server_controller_test.go
- internal/token/token_test.go
| if err := r.Create(ctx, newSecret); err != nil { | ||
| if apierrors.IsAlreadyExists(err) { | ||
| // Another reconciler won the race - fetch the existing secret | ||
| ctrl.LoggerFrom(ctx).Info("Signing secret already exists, retrieving it") | ||
|
|
||
| secret := &v1.Secret{} | ||
| if err := r.Get(ctx, types.NamespacedName{ | ||
| Name: secretName, | ||
| Namespace: r.ManagerNamespace, | ||
| }, secret); err != nil { | ||
| return nil, fmt.Errorf("failed to retrieve existing signing secret: %w", err) | ||
| } | ||
|
|
||
| if signingKey, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]; ok && len(signingKey) == 32 { | ||
| return signingKey, nil | ||
| } | ||
|
|
||
| // Existing secret is invalid - update it | ||
| signingKey, err := metaltoken.GenerateSigningSecret() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to generate replacement signing secret: %w", err) | ||
| } | ||
|
|
||
| secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey | ||
| if err := r.Update(ctx, secret); err != nil { | ||
| return nil, fmt.Errorf("failed to update invalid signing secret: %w", err) | ||
| } |
There was a problem hiding this comment.
Initialize secret.Data before writing the replacement key.
In the AlreadyExists recovery path, the fetched Secret can exist with Data == nil if another actor created an empty Secret between the initial Get and Create. Line 840 would then panic with assignment to entry in nil map; initialize the map before assigning.
🐛 Proposed fix
// Existing secret is invalid - update it
signingKey, err := metaltoken.GenerateSigningSecret()
if err != nil {
return nil, fmt.Errorf("failed to generate replacement signing secret: %w", err)
}
+ if secret.Data == nil {
+ secret.Data = map[string][]byte{}
+ }
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey
if err := r.Update(ctx, secret); err != nil {
return nil, fmt.Errorf("failed to update invalid signing secret: %w", err)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/server_controller.go` around lines 817 - 843, The fetched
Secret in the AlreadyExists branch may have secret.Data == nil causing a panic
on assignment; before assigning into secret.Data in the recovery path (after
calling r.Get and after generating signingKey via
metaltoken.GenerateSigningSecret), ensure secret.Data is initialized (e.g., set
to make(map[string][]byte) when nil) and then assign
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] = signingKey, then call
r.Update to persist the replacement key.
| func (s *Server) getSigningSecret() ([]byte, error) { | ||
| // Fast path: check if we already have a valid cached secret (read lock) | ||
| s.signingSecretMu.RLock() | ||
| if len(s.signingSecret) == 32 { | ||
| secret := s.signingSecret | ||
| s.signingSecretMu.RUnlock() | ||
| return secret, nil | ||
| } | ||
| s.signingSecretMu.RUnlock() | ||
|
|
||
| // Slow path: need to load the secret (write lock) | ||
| if s.signingSecretName == "" || s.signingSecretNs == "" { | ||
| return nil, fmt.Errorf("signing secret name or namespace not configured") | ||
| } | ||
|
|
||
| // Attempt to load the secret from Kubernetes | ||
| ctx := context.Background() | ||
| secret := &corev1.Secret{} | ||
| err := s.k8sClient.Get(ctx, client.ObjectKey{ | ||
| Name: s.signingSecretName, | ||
| Namespace: s.signingSecretNs, | ||
| }, secret) | ||
|
|
||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to load signing secret: %w", err) | ||
| } | ||
|
|
||
| key, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] | ||
| if !ok || len(key) != 32 { | ||
| return nil, fmt.Errorf("signing secret invalid or missing") | ||
| } | ||
|
|
||
| // Cache the loaded secret (write lock) | ||
| s.signingSecretMu.Lock() | ||
| s.signingSecret = key | ||
| s.signingSecretMu.Unlock() | ||
|
|
||
| s.log.Info("Loaded discovery token signing secret", "name", s.signingSecretName) | ||
| return key, nil | ||
| } | ||
|
|
||
| // validateDiscoveryToken verifies a JWT-signed discovery token. | ||
| // Returns (systemUUID, valid) where systemUUID is extracted from the token. | ||
| // If k8sClient is nil (unit test mode), validation is skipped. | ||
| func (s *Server) validateDiscoveryToken(receivedToken string) (string, bool) { | ||
| // Skip validation in test mode (no k8s client) | ||
| if s.k8sClient == nil { | ||
| s.log.V(1).Info("Running in TEST MODE without K8s client - skipping secret loading") | ||
| // In test mode, extract systemUUID from token if possible, otherwise return empty | ||
| // For now, just allow all requests in test mode | ||
| return "", true | ||
| } | ||
|
|
||
| // Reject if token is missing | ||
| if receivedToken == "" { | ||
| s.log.Info("Rejected request with missing discovery token") | ||
| return "", false | ||
| } | ||
|
|
||
| // Get signing secret (thread-safe, loads on first call) | ||
| secret, err := s.getSigningSecret() | ||
| if err != nil { | ||
| s.log.Error(err, "Signing secret not available") | ||
| return "", false | ||
| } | ||
|
|
||
| // Verify the signed token | ||
| systemUUID, timestamp, valid, err := metaltoken.VerifySignedDiscoveryToken(secret, receivedToken) | ||
| if err != nil { | ||
| s.log.Error(err, "Error verifying discovery token") | ||
| return "", false | ||
| } | ||
|
|
||
| if !valid { | ||
| s.log.Info("Rejected request with invalid discovery token", "tokenLength", len(receivedToken)) | ||
| return "", false | ||
| } | ||
|
|
||
| // Token is valid | ||
| s.log.V(1).Info("Validated discovery token", "systemUUID", systemUUID, "timestamp", timestamp) | ||
| return systemUUID, true | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify registry token validation no longer uses context.Background in the request path.
rg -n -C3 'context\.Background\(\)|validateDiscoveryToken\(|getSigningSecret\(' --type=go internal/registry/server.goRepository: ironcore-dev/metal-operator
Length of output: 2438
Pass request context through secret loading and add secret refresh on validation failure.
The getSigningSecret() method uses context.Background() when called from HTTP handlers (lines 155, 201, 289), preventing request cancellation and respecting server shutdown. Additionally, once a valid 32-byte secret is cached, it is never reloaded; if the controller repairs or recreates the signing Secret, validation fails for all new tokens until the registry restarts.
Required changes:
- Add
ctx context.Contextparameter togetSigningSecret(ctx)and remove thecontext.Background()call at line 111 - Add
ctx context.Contextparameter tovalidateDiscoveryToken(ctx, receivedToken) - Update handler call sites (lines 201, 289) to pass
r.Context()tovalidateDiscoveryToken() - When token validation fails due to signature errors, consider reloading the Secret from Kubernetes before rejecting the request, or implement a TTL/watch mechanism to refresh on Secret changes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/registry/server.go` around lines 95 - 176, The getSigningSecret and
validateDiscoveryToken functions must accept a context so request cancellation
and shutdown are respected: change signature to getSigningSecret(ctx
context.Context) and validateDiscoveryToken(ctx context.Context, receivedToken
string), remove context.Background() and pass ctx into s.k8sClient.Get; update
all call sites (e.g., HTTP handlers that call validateDiscoveryToken) to pass
r.Context(). Also implement a single-retry secret refresh on
signature/verification failures inside validateDiscoveryToken: on
VerifySignedDiscoveryToken returning an error or invalid, clear the cached
s.signingSecret under s.signingSecretMu lock, call getSigningSecret(ctx) again
to reload the secret, retry VerifySignedDiscoveryToken once, and only then
reject the request if still failing; keep log calls as appropriate and preserve
the 32-byte validation logic in getSigningSecret.
Proposed Changes
This PR implements JWT-signed discovery tokens to secure the discovery boot data flow between metalprobe agents and the registry server.
Key Changes
Security Implementation:
/registerand/bootstateendpointsCode Changes:
internal/token/package for JWT token generation and verificationDocumentation:
docs/discovery-token-security.md- Complete operational guide covering:Dependencies:
github.com/golang-jwt/jwt/v5- Industry-standard JWT libraryReview Feedback Addressed
All CodeRabbitAI feedback has been addressed in commit 986d7cc:
Testing
Verification
The implementation provides:
Fixes #749
Summary by CodeRabbit
New Features
Documentation