Add ServerNetworkConfig CRD to gate server availability on network check#777
Add ServerNetworkConfig CRD to gate server availability on network check#777
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
3e2f81c to
1cb3c91
Compare
1cb3c91 to
c49c5ba
Compare
c49c5ba to
394eb5c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (4)
internal/controller/server_controller.go (2)
424-431:⚠️ Potential issue | 🟠 MajorKeep the discovery snapshot until the network gate passes.
The registry entry is deleted before
runNetworkCheck; if the gate fails, the server remains inDiscoverybut later SNC corrections may not have discovery data to re-check unless the agent re-posts.💡 Suggested flow adjustment
- if err := r.invalidateRegistryEntryForServer(ctx, server); err != nil { - return false, fmt.Errorf("failed to invalidate registry entry for server: %w", err) - } - log.V(1).Info("Removed Server from Registry") - if done, err := r.runNetworkCheck(ctx, server); err != nil || !done { return false, err } + if err := r.invalidateRegistryEntryForServer(ctx, server); err != nil { + return false, fmt.Errorf("failed to invalidate registry entry for server: %w", err) + } + log.V(1).Info("Removed Server from Registry") + log.V(1).Info("Setting Server state to available")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller.go` around lines 424 - 431, The code deletes the registry discovery snapshot before performing the network gate check, which can leave the server in Discovery without data if the gate fails; change the flow in the surrounding function so that runNetworkCheck(ctx, server) is called first and only when it returns done==true (and no error) call invalidateRegistryEntryForServer(ctx, server) and log removal; keep runNetworkCheck and invalidateRegistryEntryForServer calls and their error handling but swap their order and ensure errors from runNetworkCheck are returned without deleting the registry entry.
1099-1110:⚠️ Potential issue | 🟠 MajorResolve
ServerNetworkConfigby the deterministic object name.Both paths list all SNCs and select the first matching
spec.serverRef.name, whileensureServerNetworkConfigcreates/gets the canonical object namedserver.Name. A second SNC referencing the same server can make gate evaluation order-dependent.♻️ Suggested helper-based fix
+func (r *ServerReconciler) getServerNetworkConfig(ctx context.Context, server *metalv1alpha1.Server) (*metalv1alpha1.ServerNetworkConfig, error) { + networkConfig := &metalv1alpha1.ServerNetworkConfig{} + if err := r.Get(ctx, client.ObjectKey{Name: server.Name}, networkConfig); err != nil { + return nil, err + } + return networkConfig, nil +} + func (r *ServerReconciler) runNetworkCheck(ctx context.Context, server *metalv1alpha1.Server) (bool, error) { log := ctrl.LoggerFrom(ctx) - networkConfigList := &metalv1alpha1.ServerNetworkConfigList{} - if err := r.List(ctx, networkConfigList); err != nil { - return false, fmt.Errorf("failed to list ServerNetworkConfigs: %w", err) - } - - var networkConfig *metalv1alpha1.ServerNetworkConfig - for i := range networkConfigList.Items { - if networkConfigList.Items[i].Spec.ServerRef.Name == server.Name { - networkConfig = &networkConfigList.Items[i] - break - } - } - - if networkConfig == nil { + networkConfig, err := r.getServerNetworkConfig(ctx, server) + if apierrors.IsNotFound(err) { return false, fmt.Errorf("ServerNetworkConfig not found for server %s — should have been created at Discovery start", server.Name) } + if err != nil { + return false, fmt.Errorf("failed to get ServerNetworkConfig: %w", err) + }Apply the same
getServerNetworkConfighelper inpopulateCurrentStatus.Also applies to: 1279-1293
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller.go` around lines 1099 - 1110, populateCurrentStatus currently lists ServerNetworkConfigList and picks the first item whose Spec.ServerRef.Name matches the server, which can pick the wrong object; replace that logic to call the existing getServerNetworkConfig helper used by ensureServerNetworkConfig to deterministically fetch the ServerNetworkConfig by the canonical name (server.Name) and return/handle not-found appropriately; apply the same replacement for the duplicate logic at the other occurrence (around lines 1279-1293) so both populateCurrentStatus and the second block consistently use getServerNetworkConfig instead of scanning ServerNetworkConfigList.internal/controller/server_controller_test.go (1)
1397-1407:⚠️ Potential issue | 🟡 MinorLet the SNC watch drive the recovery path.
After correcting
spec.interfaces, this test should wait for theServerNetworkConfigwatch to enqueue theServer; re-posting discovery data can hide a broken watch or stale-snapshot regression.🧪 Suggested test adjustment
By("Correcting the SNC spec to match real data") Eventually(func(g Gomega) { g.Expect(k8sClient.Get(ctx, client.ObjectKey{Name: server.Name}, snc)).To(Succeed()) snc.Spec.Interfaces = []metalv1alpha1.SpecNetworkInterface{ {Name: "eth0", MACAddress: testMAC1, Expected: &metalv1alpha1.ExpectedNetworkDetails{Switch: testSwitch1, Port: testPort1}}, } g.Expect(k8sClient.Update(ctx, snc)).To(Succeed()) }).Should(Succeed()) - By("Re-posting discovery data to trigger another reconcile") - registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1) + By("Waiting for the ServerNetworkConfig watch to trigger another reconcile")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/server_controller_test.go` around lines 1397 - 1407, The test currently forces another reconcile by calling registerFakeDiscovery after updating the ServerNetworkConfig (snc); instead, remove that call and wait for the SNC watch to drive reconciliation by polling the Server resource until the controller-side change is observed. Concretely: after k8sClient.Update(ctx, snc) (the SNC update in the Eventually block), replace registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1) with an Eventually that fetches the Server (client.ObjectKey{Name: server.Name}) via k8sClient.Get and asserts the controller-applied change (e.g., updated Status fields, populated network interface info, or changed ResourceVersion/annotation the controller sets) so the test relies on the SNC watch triggering reconcile rather than reposting discovery.api/v1alpha1/servernetworkconfig_types.go (1)
14-16:⚠️ Potential issue | 🟠 MajorMake
serverRef.namenon-empty at the API boundary.
LocalObjectReferencemakesserverRefpresent but does not by itself require a non-emptyname;serverRef: {}is then ignored by the watch mapper and cannot be matched by the gate.🛡️ Suggested API shape
import ( - v1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" ) +// ServerNetworkConfigServerReference references the Server this config applies to. +type ServerNetworkConfigServerReference struct { + // Name is the name of the referenced Server. + // +required + // +kubebuilder:validation:MinLength=1 + Name string `json:"name"` +} + // ServerNetworkConfigSpec defines the expected network topology for a server, // written by an external source such as argora/NetBox. type ServerNetworkConfigSpec struct { // ServerRef references the Server this config applies to. // +required - ServerRef v1.LocalObjectReference `json:"serverRef"` + ServerRef ServerNetworkConfigServerReference `json:"serverRef"`Run this read-only check after regenerating manifests;
spec.serverRef.nameshould include a non-empty validation such asminLength: 1:#!/bin/bash set -euo pipefail crd="$(fd 'metal\.ironcore\.dev_servernetworkconfigs\.yaml$' . | head -n1)" test -n "$crd" sed -n '/serverRef:/,/interfaces:/p' "$crd"Based on learnings, "When a Kubernetes controller (reconciler) errors out and does not proceed if a CRD spec field is nil/empty, that field should not be treated as optional in the API schema."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/servernetworkconfig_types.go` around lines 14 - 16, The ServerRef field currently uses v1.LocalObjectReference which allows an empty name; replace it with a small local struct (e.g., type ServerRef struct { Name string `json:"name"` // +kubebuilder:validation:MinLength=1 } ) or an inline anonymous struct for the ServerRef field so that spec.serverRef.name has a kubebuilder JSONSchema validation of minLength 1; update the ServerRef field declaration (currently ServerRef v1.LocalObjectReference) to use that new struct type/name, regenerate CRDs/manifests, and verify spec.serverRef.name appears with minLength: 1 in the generated CRD.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/servernetworkconfig_types.go`:
- Around line 33-35: The schema currently allows spec.interfaces entries with
Name/MAC but Expected==nil which runNetworkCheck skips; update
servernetworkconfig_types.go to prevent this by adding a validation tag to the
interface type so Expected is required for populated interfaces (e.g. add a
kubebuilder validation tag / remove +optional on Expected in the Interface
struct), and also make runNetworkCheck treat Expected == nil as a mismatch (do
not skip) by changing the check in runNetworkCheck to record/return a mismatch
error when Expected is nil for any non-empty interface entry; reference the
Expected field and the runNetworkCheck function when making these edits.
In `@internal/controller/server_controller.go`:
- Around line 105-106: The RBAC markers for ServerNetworkConfig are missing the
create verb so in-cluster reconciliation fails when ensureServerNetworkConfig
calls r.Create; update the kubebuilder RBAC marker that references the
servernetworkconfigs resource to include create (e.g. change the verbs list from
get;list;watch;update;patch to get;list;watch;create;update;patch) so the
controller can successfully call r.Create in ensureServerNetworkConfig and
handle status/finalizer operations as needed.
---
Duplicate comments:
In `@api/v1alpha1/servernetworkconfig_types.go`:
- Around line 14-16: The ServerRef field currently uses v1.LocalObjectReference
which allows an empty name; replace it with a small local struct (e.g., type
ServerRef struct { Name string `json:"name"` //
+kubebuilder:validation:MinLength=1 } ) or an inline anonymous struct for the
ServerRef field so that spec.serverRef.name has a kubebuilder JSONSchema
validation of minLength 1; update the ServerRef field declaration (currently
ServerRef v1.LocalObjectReference) to use that new struct type/name, regenerate
CRDs/manifests, and verify spec.serverRef.name appears with minLength: 1 in the
generated CRD.
In `@internal/controller/server_controller_test.go`:
- Around line 1397-1407: The test currently forces another reconcile by calling
registerFakeDiscovery after updating the ServerNetworkConfig (snc); instead,
remove that call and wait for the SNC watch to drive reconciliation by polling
the Server resource until the controller-side change is observed. Concretely:
after k8sClient.Update(ctx, snc) (the SNC update in the Eventually block),
replace registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1) with an
Eventually that fetches the Server (client.ObjectKey{Name: server.Name}) via
k8sClient.Get and asserts the controller-applied change (e.g., updated Status
fields, populated network interface info, or changed ResourceVersion/annotation
the controller sets) so the test relies on the SNC watch triggering reconcile
rather than reposting discovery.
In `@internal/controller/server_controller.go`:
- Around line 424-431: The code deletes the registry discovery snapshot before
performing the network gate check, which can leave the server in Discovery
without data if the gate fails; change the flow in the surrounding function so
that runNetworkCheck(ctx, server) is called first and only when it returns
done==true (and no error) call invalidateRegistryEntryForServer(ctx, server) and
log removal; keep runNetworkCheck and invalidateRegistryEntryForServer calls and
their error handling but swap their order and ensure errors from runNetworkCheck
are returned without deleting the registry entry.
- Around line 1099-1110: populateCurrentStatus currently lists
ServerNetworkConfigList and picks the first item whose Spec.ServerRef.Name
matches the server, which can pick the wrong object; replace that logic to call
the existing getServerNetworkConfig helper used by ensureServerNetworkConfig to
deterministically fetch the ServerNetworkConfig by the canonical name
(server.Name) and return/handle not-found appropriately; apply the same
replacement for the duplicate logic at the other occurrence (around lines
1279-1293) so both populateCurrentStatus and the second block consistently use
getServerNetworkConfig instead of scanning ServerNetworkConfigList.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 30515c59-5110-48b5-be0f-63b8645565c5
⛔ Files ignored due to path filters (3)
dist/chart/templates/crd/metal.ironcore.dev_servernetworkconfigs.yamlis excluded by!**/dist/**dist/chart/templates/crd/metal.ironcore.dev_servers.yamlis excluded by!**/dist/**dist/chart/templates/rbac/role.yamlis excluded by!**/dist/**
📒 Files selected for processing (9)
api/v1alpha1/server_types.goapi/v1alpha1/servernetworkconfig_types.goapi/v1alpha1/zz_generated.deepcopy.goconfig/crd/bases/metal.ironcore.dev_servernetworkconfigs.yamlconfig/crd/bases/metal.ironcore.dev_servers.yamlconfig/rbac/role.yamldocs/api-reference/api.mdinternal/controller/server_controller.gointernal/controller/server_controller_test.go
💤 Files with no reviewable changes (1)
- config/crd/bases/metal.ironcore.dev_servers.yaml
✅ Files skipped from review due to trivial changes (2)
- config/crd/bases/metal.ironcore.dev_servernetworkconfigs.yaml
- config/rbac/role.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/api-reference/api.md
- SNC is cluster-scoped, auto-created by metal-operator at Discovery start - spec.interfaces: expected topology (argora writes) - status.interfaces: all interfaces with full detail (metal-operator writes) - Server.Status.NetworkInterfaces: active (up) interfaces only, name/MAC/IPs/carrierStatus - NetworkCheck condition on Server.Status.Conditions (NoConstraint/Pending/Passed/Failed) - Gate is opt-in: empty spec.interfaces → NoConstraint, server proceeds - runNetworkCheck reads LLDP from snc.Status.Interfaces - 3 controller tests covering all condition states
394eb5c to
6e40406
Compare
- Make Expected field required (non-pointer value) in SpecNetworkInterface - Add create verb to servernetworkconfigs RBAC marker
6e40406 to
af4bd07
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
api/v1alpha1/server_types.go (1)
27-30: Constant placement: consider a dedicated condition-types block.
ServerConditionTypeNetworkCheckis added to the sameconst (...)block that mixesPower-typed values and topology annotation keys. It works, but grouping server condition type constants in their own block (or nearServerStatus.Conditions) would improve discoverability as more condition types are added. Purely a readability nit — no functional issue.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/server_types.go` around lines 27 - 30, Move the ServerConditionTypeNetworkCheck constant out of the mixed const block and place it in its own "Server condition types" const block located near ServerStatus.Conditions (or wherever ServerStatus is defined) so condition-type constants are grouped together for discoverability; update any references to ServerConditionTypeNetworkCheck if you change its grouping but do not rename the constant itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@api/v1alpha1/server_types.go`:
- Around line 27-30: Move the ServerConditionTypeNetworkCheck constant out of
the mixed const block and place it in its own "Server condition types" const
block located near ServerStatus.Conditions (or wherever ServerStatus is defined)
so condition-type constants are grouped together for discoverability; update any
references to ServerConditionTypeNetworkCheck if you change its grouping but do
not rename the constant itself.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c111f6cc-0fa1-4ba5-921a-dc6f98ea4eae
⛔ Files ignored due to path filters (3)
dist/chart/templates/crd/metal.ironcore.dev_servernetworkconfigs.yamlis excluded by!**/dist/**dist/chart/templates/crd/metal.ironcore.dev_servers.yamlis excluded by!**/dist/**dist/chart/templates/rbac/role.yamlis excluded by!**/dist/**
📒 Files selected for processing (9)
api/v1alpha1/server_types.goapi/v1alpha1/servernetworkconfig_types.goapi/v1alpha1/zz_generated.deepcopy.goconfig/crd/bases/metal.ironcore.dev_servernetworkconfigs.yamlconfig/crd/bases/metal.ironcore.dev_servers.yamlconfig/rbac/role.yamldocs/api-reference/api.mdinternal/controller/server_controller.gointernal/controller/server_controller_test.go
💤 Files with no reviewable changes (1)
- config/crd/bases/metal.ironcore.dev_servers.yaml
✅ Files skipped from review due to trivial changes (2)
- internal/controller/server_controller_test.go
- api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (3)
- config/rbac/role.yaml
- api/v1alpha1/servernetworkconfig_types.go
- internal/controller/server_controller.go
Closes #819
Summary
Adds
ServerNetworkConfigCRD (short name:snc, cluster-scoped) — auto-created by metal-operator at the start of Discovery for every serverspec.interfaces: expected topology per MAC — name, macAddress, expected.{switch, port} (argora writes)status.interfaces: discovered state after each probe run — all interfaces with IPs, carrierStatus, speed, LLDP neighborsServer.Status.NetworkInterfaces: simplified to active (up) interfaces only, name/MAC/IPs/carrierStatusNetwork check result written as a
NetworkCheckcondition onServer.Status.Conditions:spec.interfacescondition.reasonNoConstraintPendingPassedFailedGate is opt-in: argora enables it by writing
spec.interfaces; empty spec means no constraintTests
spec.interfaces→ server reaches Available,condition.reason: NoConstraintcondition.reason: Passedcondition.reason: Failed; correctingspec.interfaces→ server reaches Available