Skip to content

Add ServerNetworkConfig CRD to gate server availability on network check#777

Open
xkonni wants to merge 2 commits intomainfrom
feat/networkCheck
Open

Add ServerNetworkConfig CRD to gate server availability on network check#777
xkonni wants to merge 2 commits intomainfrom
feat/networkCheck

Conversation

@xkonni
Copy link
Copy Markdown
Contributor

@xkonni xkonni commented Apr 2, 2026

Closes #819

Summary

  • Adds ServerNetworkConfig CRD (short name: snc, cluster-scoped) — auto-created by metal-operator at the start of Discovery for every server

  • spec.interfaces: expected topology per MAC — name, macAddress, expected.{switch, port} (argora writes)

  • status.interfaces: discovered state after each probe run — all interfaces with IPs, carrierStatus, speed, LLDP neighbors

  • Server.Status.NetworkInterfaces: simplified to active (up) interfaces only, name/MAC/IPs/carrierStatus

  • Network check result written as a NetworkCheck condition on Server.Status.Conditions:

    spec.interfaces gate condition.reason server proceeds?
    empty off NoConstraint Yes
    populated, discovery not yet run on, waiting Pending No
    populated, all matched on Passed Yes
    populated, mismatch found on Failed No
  • Gate is opt-in: argora enables it by writing spec.interfaces; empty spec means no constraint

Tests

  • SNC with empty spec.interfaces → server reaches Available, condition.reason: NoConstraint
  • SNC with matching interfaces → server reaches Available, condition.reason: Passed
  • SNC with mismatched interfaces → server stays in Discovery, condition.reason: Failed; correcting spec.interfaces → server reaches Available

@xkonni xkonni requested a review from a team as a code owner April 2, 2026 11:48
@github-actions github-actions Bot added size/XL api-change documentation Improvements or additions to documentation labels Apr 2, 2026
@coderabbitai

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

@xkonni xkonni force-pushed the feat/networkCheck branch from 1cb3c91 to c49c5ba Compare April 21, 2026 15:15
@github-actions github-actions Bot added size/XXL and removed size/XL labels Apr 21, 2026
coderabbitai[bot]

This comment was marked as outdated.

@xkonni xkonni force-pushed the feat/networkCheck branch from c49c5ba to 394eb5c Compare April 22, 2026 06:18
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
internal/controller/server_controller.go (2)

424-431: ⚠️ Potential issue | 🟠 Major

Keep the discovery snapshot until the network gate passes.

The registry entry is deleted before runNetworkCheck; if the gate fails, the server remains in Discovery but later SNC corrections may not have discovery data to re-check unless the agent re-posts.

💡 Suggested flow adjustment
-	if err := r.invalidateRegistryEntryForServer(ctx, server); err != nil {
-		return false, fmt.Errorf("failed to invalidate registry entry for server: %w", err)
-	}
-	log.V(1).Info("Removed Server from Registry")
-
 	if done, err := r.runNetworkCheck(ctx, server); err != nil || !done {
 		return false, err
 	}
 
+	if err := r.invalidateRegistryEntryForServer(ctx, server); err != nil {
+		return false, fmt.Errorf("failed to invalidate registry entry for server: %w", err)
+	}
+	log.V(1).Info("Removed Server from Registry")
+
 	log.V(1).Info("Setting Server state to available")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller.go` around lines 424 - 431, The code
deletes the registry discovery snapshot before performing the network gate
check, which can leave the server in Discovery without data if the gate fails;
change the flow in the surrounding function so that runNetworkCheck(ctx, server)
is called first and only when it returns done==true (and no error) call
invalidateRegistryEntryForServer(ctx, server) and log removal; keep
runNetworkCheck and invalidateRegistryEntryForServer calls and their error
handling but swap their order and ensure errors from runNetworkCheck are
returned without deleting the registry entry.

1099-1110: ⚠️ Potential issue | 🟠 Major

Resolve ServerNetworkConfig by the deterministic object name.

Both paths list all SNCs and select the first matching spec.serverRef.name, while ensureServerNetworkConfig creates/gets the canonical object named server.Name. A second SNC referencing the same server can make gate evaluation order-dependent.

♻️ Suggested helper-based fix
+func (r *ServerReconciler) getServerNetworkConfig(ctx context.Context, server *metalv1alpha1.Server) (*metalv1alpha1.ServerNetworkConfig, error) {
+	networkConfig := &metalv1alpha1.ServerNetworkConfig{}
+	if err := r.Get(ctx, client.ObjectKey{Name: server.Name}, networkConfig); err != nil {
+		return nil, err
+	}
+	return networkConfig, nil
+}
+
 func (r *ServerReconciler) runNetworkCheck(ctx context.Context, server *metalv1alpha1.Server) (bool, error) {
 	log := ctrl.LoggerFrom(ctx)
 
-	networkConfigList := &metalv1alpha1.ServerNetworkConfigList{}
-	if err := r.List(ctx, networkConfigList); err != nil {
-		return false, fmt.Errorf("failed to list ServerNetworkConfigs: %w", err)
-	}
-
-	var networkConfig *metalv1alpha1.ServerNetworkConfig
-	for i := range networkConfigList.Items {
-		if networkConfigList.Items[i].Spec.ServerRef.Name == server.Name {
-			networkConfig = &networkConfigList.Items[i]
-			break
-		}
-	}
-
-	if networkConfig == nil {
+	networkConfig, err := r.getServerNetworkConfig(ctx, server)
+	if apierrors.IsNotFound(err) {
 		return false, fmt.Errorf("ServerNetworkConfig not found for server %s — should have been created at Discovery start", server.Name)
 	}
+	if err != nil {
+		return false, fmt.Errorf("failed to get ServerNetworkConfig: %w", err)
+	}

Apply the same getServerNetworkConfig helper in populateCurrentStatus.

Also applies to: 1279-1293

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller.go` around lines 1099 - 1110,
populateCurrentStatus currently lists ServerNetworkConfigList and picks the
first item whose Spec.ServerRef.Name matches the server, which can pick the
wrong object; replace that logic to call the existing getServerNetworkConfig
helper used by ensureServerNetworkConfig to deterministically fetch the
ServerNetworkConfig by the canonical name (server.Name) and return/handle
not-found appropriately; apply the same replacement for the duplicate logic at
the other occurrence (around lines 1279-1293) so both populateCurrentStatus and
the second block consistently use getServerNetworkConfig instead of scanning
ServerNetworkConfigList.
internal/controller/server_controller_test.go (1)

1397-1407: ⚠️ Potential issue | 🟡 Minor

Let the SNC watch drive the recovery path.

After correcting spec.interfaces, this test should wait for the ServerNetworkConfig watch to enqueue the Server; re-posting discovery data can hide a broken watch or stale-snapshot regression.

🧪 Suggested test adjustment
 			By("Correcting the SNC spec to match real data")
 			Eventually(func(g Gomega) {
 				g.Expect(k8sClient.Get(ctx, client.ObjectKey{Name: server.Name}, snc)).To(Succeed())
 				snc.Spec.Interfaces = []metalv1alpha1.SpecNetworkInterface{
 					{Name: "eth0", MACAddress: testMAC1, Expected: &metalv1alpha1.ExpectedNetworkDetails{Switch: testSwitch1, Port: testPort1}},
 				}
 				g.Expect(k8sClient.Update(ctx, snc)).To(Succeed())
 			}).Should(Succeed())
 
-			By("Re-posting discovery data to trigger another reconcile")
-			registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1)
+			By("Waiting for the ServerNetworkConfig watch to trigger another reconcile")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller_test.go` around lines 1397 - 1407, The
test currently forces another reconcile by calling registerFakeDiscovery after
updating the ServerNetworkConfig (snc); instead, remove that call and wait for
the SNC watch to drive reconciliation by polling the Server resource until the
controller-side change is observed. Concretely: after k8sClient.Update(ctx, snc)
(the SNC update in the Eventually block), replace
registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1) with an Eventually
that fetches the Server (client.ObjectKey{Name: server.Name}) via k8sClient.Get
and asserts the controller-applied change (e.g., updated Status fields,
populated network interface info, or changed ResourceVersion/annotation the
controller sets) so the test relies on the SNC watch triggering reconcile rather
than reposting discovery.
api/v1alpha1/servernetworkconfig_types.go (1)

14-16: ⚠️ Potential issue | 🟠 Major

Make serverRef.name non-empty at the API boundary.

LocalObjectReference makes serverRef present but does not by itself require a non-empty name; serverRef: {} is then ignored by the watch mapper and cannot be matched by the gate.

🛡️ Suggested API shape
 import (
-	v1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 )
 
+// ServerNetworkConfigServerReference references the Server this config applies to.
+type ServerNetworkConfigServerReference struct {
+	// Name is the name of the referenced Server.
+	// +required
+	// +kubebuilder:validation:MinLength=1
+	Name string `json:"name"`
+}
+
 // ServerNetworkConfigSpec defines the expected network topology for a server,
 // written by an external source such as argora/NetBox.
 type ServerNetworkConfigSpec struct {
 	// ServerRef references the Server this config applies to.
 	// +required
-	ServerRef v1.LocalObjectReference `json:"serverRef"`
+	ServerRef ServerNetworkConfigServerReference `json:"serverRef"`

Run this read-only check after regenerating manifests; spec.serverRef.name should include a non-empty validation such as minLength: 1:

#!/bin/bash
set -euo pipefail

crd="$(fd 'metal\.ironcore\.dev_servernetworkconfigs\.yaml$' . | head -n1)"
test -n "$crd"

sed -n '/serverRef:/,/interfaces:/p' "$crd"

Based on learnings, "When a Kubernetes controller (reconciler) errors out and does not proceed if a CRD spec field is nil/empty, that field should not be treated as optional in the API schema."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/servernetworkconfig_types.go` around lines 14 - 16, The
ServerRef field currently uses v1.LocalObjectReference which allows an empty
name; replace it with a small local struct (e.g., type ServerRef struct { Name
string `json:"name"` // +kubebuilder:validation:MinLength=1 } ) or an inline
anonymous struct for the ServerRef field so that spec.serverRef.name has a
kubebuilder JSONSchema validation of minLength 1; update the ServerRef field
declaration (currently ServerRef v1.LocalObjectReference) to use that new struct
type/name, regenerate CRDs/manifests, and verify spec.serverRef.name appears
with minLength: 1 in the generated CRD.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/servernetworkconfig_types.go`:
- Around line 33-35: The schema currently allows spec.interfaces entries with
Name/MAC but Expected==nil which runNetworkCheck skips; update
servernetworkconfig_types.go to prevent this by adding a validation tag to the
interface type so Expected is required for populated interfaces (e.g. add a
kubebuilder validation tag / remove +optional on Expected in the Interface
struct), and also make runNetworkCheck treat Expected == nil as a mismatch (do
not skip) by changing the check in runNetworkCheck to record/return a mismatch
error when Expected is nil for any non-empty interface entry; reference the
Expected field and the runNetworkCheck function when making these edits.

In `@internal/controller/server_controller.go`:
- Around line 105-106: The RBAC markers for ServerNetworkConfig are missing the
create verb so in-cluster reconciliation fails when ensureServerNetworkConfig
calls r.Create; update the kubebuilder RBAC marker that references the
servernetworkconfigs resource to include create (e.g. change the verbs list from
get;list;watch;update;patch to get;list;watch;create;update;patch) so the
controller can successfully call r.Create in ensureServerNetworkConfig and
handle status/finalizer operations as needed.

---

Duplicate comments:
In `@api/v1alpha1/servernetworkconfig_types.go`:
- Around line 14-16: The ServerRef field currently uses v1.LocalObjectReference
which allows an empty name; replace it with a small local struct (e.g., type
ServerRef struct { Name string `json:"name"` //
+kubebuilder:validation:MinLength=1 } ) or an inline anonymous struct for the
ServerRef field so that spec.serverRef.name has a kubebuilder JSONSchema
validation of minLength 1; update the ServerRef field declaration (currently
ServerRef v1.LocalObjectReference) to use that new struct type/name, regenerate
CRDs/manifests, and verify spec.serverRef.name appears with minLength: 1 in the
generated CRD.

In `@internal/controller/server_controller_test.go`:
- Around line 1397-1407: The test currently forces another reconcile by calling
registerFakeDiscovery after updating the ServerNetworkConfig (snc); instead,
remove that call and wait for the SNC watch to drive reconciliation by polling
the Server resource until the controller-side change is observed. Concretely:
after k8sClient.Update(ctx, snc) (the SNC update in the Eventually block),
replace registerFakeDiscovery(testSystemUUID, testSwitch1, testPort1) with an
Eventually that fetches the Server (client.ObjectKey{Name: server.Name}) via
k8sClient.Get and asserts the controller-applied change (e.g., updated Status
fields, populated network interface info, or changed ResourceVersion/annotation
the controller sets) so the test relies on the SNC watch triggering reconcile
rather than reposting discovery.

In `@internal/controller/server_controller.go`:
- Around line 424-431: The code deletes the registry discovery snapshot before
performing the network gate check, which can leave the server in Discovery
without data if the gate fails; change the flow in the surrounding function so
that runNetworkCheck(ctx, server) is called first and only when it returns
done==true (and no error) call invalidateRegistryEntryForServer(ctx, server) and
log removal; keep runNetworkCheck and invalidateRegistryEntryForServer calls and
their error handling but swap their order and ensure errors from runNetworkCheck
are returned without deleting the registry entry.
- Around line 1099-1110: populateCurrentStatus currently lists
ServerNetworkConfigList and picks the first item whose Spec.ServerRef.Name
matches the server, which can pick the wrong object; replace that logic to call
the existing getServerNetworkConfig helper used by ensureServerNetworkConfig to
deterministically fetch the ServerNetworkConfig by the canonical name
(server.Name) and return/handle not-found appropriately; apply the same
replacement for the duplicate logic at the other occurrence (around lines
1279-1293) so both populateCurrentStatus and the second block consistently use
getServerNetworkConfig instead of scanning ServerNetworkConfigList.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 30515c59-5110-48b5-be0f-63b8645565c5

📥 Commits

Reviewing files that changed from the base of the PR and between c49c5ba and 394eb5c.

⛔ Files ignored due to path filters (3)
  • dist/chart/templates/crd/metal.ironcore.dev_servernetworkconfigs.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_servers.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/role.yaml is excluded by !**/dist/**
📒 Files selected for processing (9)
  • api/v1alpha1/server_types.go
  • api/v1alpha1/servernetworkconfig_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/metal.ironcore.dev_servernetworkconfigs.yaml
  • config/crd/bases/metal.ironcore.dev_servers.yaml
  • config/rbac/role.yaml
  • docs/api-reference/api.md
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
💤 Files with no reviewable changes (1)
  • config/crd/bases/metal.ironcore.dev_servers.yaml
✅ Files skipped from review due to trivial changes (2)
  • config/crd/bases/metal.ironcore.dev_servernetworkconfigs.yaml
  • config/rbac/role.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/api-reference/api.md

Comment thread api/v1alpha1/servernetworkconfig_types.go Outdated
Comment thread internal/controller/server_controller.go Outdated
- SNC is cluster-scoped, auto-created by metal-operator at Discovery start
- spec.interfaces: expected topology (argora writes)
- status.interfaces: all interfaces with full detail (metal-operator writes)
- Server.Status.NetworkInterfaces: active (up) interfaces only, name/MAC/IPs/carrierStatus
- NetworkCheck condition on Server.Status.Conditions (NoConstraint/Pending/Passed/Failed)
- Gate is opt-in: empty spec.interfaces → NoConstraint, server proceeds
- runNetworkCheck reads LLDP from snc.Status.Interfaces
- 3 controller tests covering all condition states
@xkonni xkonni force-pushed the feat/networkCheck branch from 394eb5c to 6e40406 Compare April 22, 2026 07:06
- Make Expected field required (non-pointer value) in SpecNetworkInterface
- Add create verb to servernetworkconfigs RBAC marker
@xkonni xkonni force-pushed the feat/networkCheck branch from 6e40406 to af4bd07 Compare April 22, 2026 07:47
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
api/v1alpha1/server_types.go (1)

27-30: Constant placement: consider a dedicated condition-types block.

ServerConditionTypeNetworkCheck is added to the same const (...) block that mixes Power-typed values and topology annotation keys. It works, but grouping server condition type constants in their own block (or near ServerStatus.Conditions) would improve discoverability as more condition types are added. Purely a readability nit — no functional issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/server_types.go` around lines 27 - 30, Move the
ServerConditionTypeNetworkCheck constant out of the mixed const block and place
it in its own "Server condition types" const block located near
ServerStatus.Conditions (or wherever ServerStatus is defined) so condition-type
constants are grouped together for discoverability; update any references to
ServerConditionTypeNetworkCheck if you change its grouping but do not rename the
constant itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@api/v1alpha1/server_types.go`:
- Around line 27-30: Move the ServerConditionTypeNetworkCheck constant out of
the mixed const block and place it in its own "Server condition types" const
block located near ServerStatus.Conditions (or wherever ServerStatus is defined)
so condition-type constants are grouped together for discoverability; update any
references to ServerConditionTypeNetworkCheck if you change its grouping but do
not rename the constant itself.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c111f6cc-0fa1-4ba5-921a-dc6f98ea4eae

📥 Commits

Reviewing files that changed from the base of the PR and between 394eb5c and af4bd07.

⛔ Files ignored due to path filters (3)
  • dist/chart/templates/crd/metal.ironcore.dev_servernetworkconfigs.yaml is excluded by !**/dist/**
  • dist/chart/templates/crd/metal.ironcore.dev_servers.yaml is excluded by !**/dist/**
  • dist/chart/templates/rbac/role.yaml is excluded by !**/dist/**
📒 Files selected for processing (9)
  • api/v1alpha1/server_types.go
  • api/v1alpha1/servernetworkconfig_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/metal.ironcore.dev_servernetworkconfigs.yaml
  • config/crd/bases/metal.ironcore.dev_servers.yaml
  • config/rbac/role.yaml
  • docs/api-reference/api.md
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
💤 Files with no reviewable changes (1)
  • config/crd/bases/metal.ironcore.dev_servers.yaml
✅ Files skipped from review due to trivial changes (2)
  • internal/controller/server_controller_test.go
  • api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • config/rbac/role.yaml
  • api/v1alpha1/servernetworkconfig_types.go
  • internal/controller/server_controller.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change area/metal-automation documentation Improvements or additions to documentation size/XXL

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Add ServerNetworkConfig CRD and network topology check at Discovery

2 participants