Boot Policy and Deterministic Boot Process #807

@afritzler

Problem

Currently, the operator hardcodes PXE as the sole network boot method (SetPXEBootOnce) and uses it for every power-on cycle. There is no distinction between an initial provisioning boot (network) and a regular operational boot (local disk). This leads to two issues:

  1. No HTTP Boot support. HTTP Boot (UEFI HTTP) is increasingly preferred in modern data centers — it works across routed networks without DHCP relay and supports HTTPS for secure image delivery. The operator cannot serve this path today.

  2. No safe boot behavior on unexpected power transitions. If someone manually power-cycles a server outside the operator's control, the server boots from whatever the BIOS boot order dictates — potentially re-entering a PXE loop or booting an unintended image. There is no "safe stop" mechanism.

Design Principles

Boot overrides are best-effort signals

The Redfish BootSourceOverrideTarget property (Pxe, UefiHttp, Hdd, etc.) is not reliably honored across all vendors. On some hardware, setting UefiHttp still results in a PXE boot if that is what the BIOS network boot order defines. Because the operator cannot guarantee the actual boot method (due to vendor firmware behavior, DHCP configuration, and boot-operator behavior), the boot overrides serve primarily as validation gates and intent signals.

The actual boot method is determined by:

  1. BIOS boot order — configured by the operator (see below)
  2. Image content — a UKI is served via HTTP Boot; a traditional kernel/initramfs via PXE
  3. DHCP / boot-operator — infrastructure-level concerns outside the operator's direct control
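Because the override is only an intent signal, one option is to read the setting back from the BMC after writing it. The sketch below builds the Redfish PATCH body for a one-shot override and checks what a later GET of the system resource reports. `BootSourceOverrideTarget` and `BootSourceOverrideEnabled` are standard Redfish `Boot` properties; the helper names `overridePatch` and `overrideAccepted` are hypothetical and not part of the operator. A matching read-back only confirms that the BMC stored the request, not that the firmware will honor it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bootOverride mirrors the two Redfish Boot properties used for a one-shot override.
type bootOverride struct {
	Target  string `json:"BootSourceOverrideTarget"`
	Enabled string `json:"BootSourceOverrideEnabled"`
}

// overridePatch (hypothetical helper) builds the JSON body for a PATCH on the
// ComputerSystem resource that requests a one-shot boot from target.
func overridePatch(target string) ([]byte, error) {
	return json.Marshal(map[string]bootOverride{
		"Boot": {Target: target, Enabled: "Once"},
	})
}

// overrideAccepted (hypothetical helper) reports whether the system resource
// returned by the BMC echoes the requested one-shot override.
func overrideAccepted(systemJSON []byte, want string) (bool, error) {
	var sys struct {
		Boot bootOverride `json:"Boot"`
	}
	if err := json.Unmarshal(systemJSON, &sys); err != nil {
		return false, err
	}
	return sys.Boot.Target == want && sys.Boot.Enabled == "Once", nil
}

func main() {
	body, err := overridePatch("UefiHttp")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	// Example read-back where the BMC silently kept Pxe instead of UefiHttp.
	readBack := []byte(`{"Boot":{"BootSourceOverrideTarget":"Pxe","BootSourceOverrideEnabled":"Once"}}`)
	ok, err := overrideAccepted(readBack, "UefiHttp")
	if err != nil {
		panic(err)
	}
	fmt.Println("override accepted:", ok)
}
```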

Deterministic boot order: EFI Shell → HDD → Network

For the boot policy to work as intended, the BIOS boot order must be configured to:

| Priority | Device | Purpose |
|---|---|---|
| 1 | EFI Shell | Safe stop. Catches unexpected power transitions. |
| 2 | HDD | Regular operation. Boots the installed OS from local disk. |
| 3 | Network | Provisioning. Used when the operator sets a network boot override. |

This boot order is an external prerequisite. The metal-operator does not enforce it — it is the responsibility of the infrastructure team or a separate provisioning tool to configure the BIOS boot order before servers are onboarded.

Given this boot order, three distinct boot behaviors emerge:

  • Operator-controlled network boot (first boot): The operator sets a Once boot override to Pxe or UefiHttp. The override takes precedence over the BIOS order → server network boots → installs to disk.
  • Operator-controlled regular boot (subsequent boots): The operator sets a Once boot override to Hdd. The override takes precedence → server boots from local disk.
  • Unexpected power transition (manual intervention, power glitch): No boot override is active (the previous Once override was consumed). The BIOS order takes over → EFI Shell is first → boot process stops. The server is in a safe, deterministic state and requires operator intervention to proceed.
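The three behaviors above follow from one rule: a pending Once override wins and is then consumed, otherwise the first BIOS entry applies. The sketch below is a toy model of that rule. The `server` type and `powerOn` method are hypothetical, and it assumes the firmware honors the override, which real hardware does not always do.

```go
package main

import "fmt"

// biosOrder is the prerequisite boot order: EFI Shell, then HDD, then Network.
var biosOrder = []string{"EFI Shell", "HDD", "Network"}

// server models only the state relevant here: a pending one-shot override.
type server struct {
	onceOverride string // empty means no override is pending
}

// powerOn returns the device the server boots from. A pending Once override
// takes precedence and is consumed; without one, the first BIOS entry is used.
func (s *server) powerOn() string {
	if s.onceOverride != "" {
		dev := s.onceOverride
		s.onceOverride = ""
		return dev
	}
	return biosOrder[0]
}

func main() {
	s := &server{}

	s.onceOverride = "Pxe" // operator-controlled first boot
	fmt.Println(s.powerOn())

	s.onceOverride = "Hdd" // operator-controlled regular boot
	fmt.Println(s.powerOn())

	// Manual power cycle: the override was consumed, so the BIOS order applies.
	fmt.Println(s.powerOn())
}
```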

Proposed Changes

1. Boot Policy API

Add a BootPolicy struct to ServerClaimSpec and ServerBootConfigurationSpec. Since ServerMaintenance embeds ServerBootConfigurationSpec via serverBootConfigurationTemplate.spec, it inherits the field automatically.

Types

// BootPolicy defines the boot behavior for a server across its lifecycle.
type BootPolicy struct {
    // FirstBoot specifies the network boot method for the initial provisioning boot.
    // The operator uses this to validate that the image contains the correct artifacts
    // and to set the boot override for the first boot cycle.
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Enum=Pxe;UefiHttp
    FirstBoot FirstBootMode `json:"firstBoot"`

    // Boot specifies the boot method for regular operation after initial provisioning.
    // +kubebuilder:validation:Enum=Hdd
    // +kubebuilder:default="Hdd"
    // +optional
    Boot BootMode `json:"boot,omitempty"`
}

// FirstBootMode specifies the network boot method for initial provisioning.
// +kubebuilder:validation:Enum=Pxe;UefiHttp
type FirstBootMode string

const (
    // FirstBootModePxe boots via PXE. Requires the image to contain
    // traditional kernel/initramfs artifacts.
    FirstBootModePxe FirstBootMode = "Pxe"

    // FirstBootModeUefiHttp boots via UEFI HTTP Boot. Requires the image
    // to contain a UKI (Unified Kernel Image) artifact.
    FirstBootModeUefiHttp FirstBootMode = "UefiHttp"
)

// BootMode specifies the boot method for regular operation.
// +kubebuilder:validation:Enum=Hdd
type BootMode string

const (
    // BootModeHdd boots from the local hard disk.
    BootModeHdd BootMode = "Hdd"
)

Image validation by firstBoot mode

The firstBoot mode determines what image artifacts are required. Validation happens before the ServerBootConfiguration is created — the controller that creates the SBC (ServerClaim controller or ServerMaintenance controller) validates the image against the firstBoot mode first:

| firstBoot | Required artifacts |
|---|---|
| Pxe | Kernel + initramfs in OCI image |
| UefiHttp | UKI media type in OCI image |

If the image does not contain the required artifacts, the SBC is not created. The controller emits an event on the requesting resource (ServerClaim or ServerMaintenance) and does not proceed. This avoids creating a resource that the boot-operator or other controllers might react to and fight over.

ServerBootConfigurationStatus — new field

| Status field | Type | Description |
|---|---|---|
| httpBootURI | string (URI) | The resolved HTTP Boot URI. Set by the boot-operator after resolving the UKI artifact from the OCI image. Optional: if absent, the BMC obtains the URI via DHCP (DHCPv4 option 67 / DHCPv6 option 59). |

First boot tracking — annotation on ServerBootConfiguration

The operator tracks whether the initial provisioning boot has occurred using an annotation on the ServerBootConfiguration resource.

| Annotation | Value | Description |
|---|---|---|
| metal.ironcore.dev/provisioned | "true" | Set by the controller on the SBC after the first boot succeeds. |

Controller logic:

  • SBC has provisioned annotation → regular boot (use bootPolicy.boot)
  • SBC does not have provisioned annotation → first boot (use bootPolicy.firstBoot)

No explicit cleanup logic is needed. The provisioning state is naturally scoped to the SBC lifecycle:

  • New claim → new SBC is created without the annotation → first boot
  • Claim deleted → SBC is garbage collected → state gone
  • Discovery → internal SBC is created and deleted after discovery → no state leaks
  • Maintenance → separate SBC with its own annotation state, which is ignored because maintenance always uses the firstBoot mode
  • CAPI move → SBC is moved with its annotations intact → state preserved
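The controller logic and scoping rules above reduce to a small decision function. The following is a sketch only: `bootPolicy` is a simplified stand-in for the proposed BootPolicy type, and `overrideTarget` is a hypothetical helper name.

```go
package main

import "fmt"

// provisionedAnnotation is the annotation key proposed above.
const provisionedAnnotation = "metal.ironcore.dev/provisioned"

// bootPolicy is a simplified stand-in for the proposed BootPolicy type.
type bootPolicy struct {
	FirstBoot string
	Boot      string
}

// overrideTarget picks the boot override for the next power-on.
// Maintenance SBCs always use FirstBoot; claim SBCs switch to Boot once the
// provisioned annotation is set to "true".
func overrideTarget(sbcAnnotations map[string]string, p bootPolicy, maintenance bool) string {
	if !maintenance && sbcAnnotations[provisionedAnnotation] == "true" {
		return p.Boot
	}
	return p.FirstBoot
}

func main() {
	p := bootPolicy{FirstBoot: "UefiHttp", Boot: "Hdd"}
	provisioned := map[string]string{provisionedAnnotation: "true"}

	fmt.Println(overrideTarget(nil, p, false))         // new SBC: first boot
	fmt.Println(overrideTarget(provisioned, p, false)) // provisioned claim SBC: regular boot
	fmt.Println(overrideTarget(provisioned, p, true))  // maintenance: always first boot
}
```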

2. Resource Examples

ServerClaim — PXE provisioning (default, same as today)

apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerClaim
metadata:
  name: my-claim
spec:
  power: "On"
  image: my-osimage:latest
  ignitionSecretRef:
    name: my-ignition-secret
  bootPolicy:
    firstBoot: Pxe
    boot: Hdd

ServerClaim — HTTP Boot provisioning with UKI

apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerClaim
metadata:
  name: my-claim-httpboot
spec:
  power: "On"
  image: my-uki-osimage:latest           # OCI image containing a UKI artifact
  ignitionSecretRef:
    name: my-ignition-secret
  bootPolicy:
    firstBoot: UefiHttp
    boot: Hdd

ServerBootConfiguration — created from claim

apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerBootConfiguration
metadata:
  name: my-claim-sbc
spec:
  serverRef:
    name: my-server
  image: my-uki-osimage:latest
  ignitionSecretRef:
    name: my-ignition-secret
  bootPolicy:
    firstBoot: UefiHttp
    boot: Hdd
# After boot-operator resolves the UKI artifact:
# status:
#   state: Ready
#   httpBootURI: "https://boot.example.com/artifacts/abc123/my-osimage.efi"

ServerMaintenance — firmware update via HTTP Boot

Since ServerMaintenance.spec.serverBootConfigurationTemplate.spec embeds ServerBootConfigurationSpec, the bootPolicy field is automatically available:

apiVersion: metal.ironcore.dev/v1alpha1
kind: ServerMaintenance
metadata:
  name: firmware-update
  annotations:
    metal.ironcore.dev/reason: "Scheduled firmware update"
spec:
  serverRef:
    name: my-server
  policy: Enforced
  priority: 100
  serverPower: "On"
  serverBootConfigurationTemplate:
    name: firmware-update-boot
    spec:
      serverRef:
        name: my-server
      image: firmware-update-uki:latest
      ignitionSecretRef:
        name: firmware-update-ignition
      bootPolicy:
        firstBoot: UefiHttp
        boot: Hdd

3. Boot Lifecycle

First boot tracking

  • After the server powers on successfully for the first time in Reserved state, the controller sets the annotation metal.ironcore.dev/provisioned: "true" on the SBC.
  • No reset or cleanup logic is needed — the annotation is scoped to the SBC's lifecycle. When the SBC is deleted (claim removed, re-provisioning), the state is gone. A new SBC starts without the annotation.
  • Entry into Maintenance state does not affect the claim SBC's annotation — maintenance uses its own SBC and always performs a network boot. When maintenance ends, the server returns to Reserved and the claim SBC still has its annotation → resumes HDD boot.
  • CAPI move: The SBC is moved alongside the Server with all annotations intact → provisioning state is preserved.

Discovery lifecycle (no claim, no maintenance)

Discovery is an internal operator concern. No ServerClaim or ServerMaintenance exists, so no boot policy is consulted. The operator always uses PXE for the discovery boot — this is hardcoded, not configurable.

1. Server enters Initial state:
   a. Operator creates internal SBC for discovery (no provisioned annotation)
   b. PXE boot override → PowerOn
2. Server enters Discovery state:
   a. Server PXE boots, probe agent registers
3. Server enters Available state:
   a. Server powers off
   b. Internal discovery SBC is deleted — no state leaks
4. ServerClaim arrives → Normal claim lifecycle begins (see below)

Normal claim lifecycle

1. ServerClaim created with bootPolicy: {firstBoot: Pxe, boot: Hdd}
2. ServerClaim controller creates SBC with bootPolicy propagated (no provisioned annotation)
3. Boot-operator validates image contains kernel/initramfs → SBC state = Ready
4. Server enters Reserved state:
   a. SBC has no provisioned annotation → network boot override (Pxe) → PowerOn
   b. Server network boots, installs OS to disk
   c. Server reaches running state → provisioned annotation set on SBC
5. Operator reboots server (e.g. power cycle via spec):
   a. SBC has provisioned annotation → HDD boot override → PowerOn
   b. Server boots from local disk
6. Manual power cycle (outside operator control):
   a. No boot override active (Once was consumed)
   b. BIOS order: EFI Shell → boot stops
   c. Operator must intervene to resume

Maintenance lifecycle

1. ServerMaintenance created with bootPolicy: {firstBoot: UefiHttp, boot: Hdd}
2. Server enters Maintenance state, maintenance SBC is created (no provisioned annotation)
3. Boot-operator validates UKI, writes httpBootURI → SBC state = Ready
4. Maintenance boot:
   a. Always uses firstBoot mode → network boot override (UefiHttp)
   b. Server boots maintenance image via HTTP Boot
5. Maintenance completes, ServerMaintenance removed, maintenance SBC deleted
6. Server returns to Reserved:
   a. Claim SBC still has provisioned annotation (unaffected by maintenance)
   b. HDD boot override → server resumes from local disk

Backwards Compatibility

  • bootPolicy is optional on ServerClaimSpec. If absent, the ServerClaim controller defaults to {firstBoot: Pxe, boot: Hdd} — identical to today's PXE-only behavior for the first boot. The HDD boot override for subsequent boots is new behavior but safe: servers provisioned via PXE already have an OS on disk.
  • CRD upgrade is additive — new optional fields require no migration.
  • Existing SBCs have no provisioned annotation, so existing servers in Reserved state will perform one network boot on the next power cycle (matching current behavior), then switch to HDD boot.
  • The BIOS boot order (EFI Shell → HDD → Network) is an external prerequisite, not enforced by the operator. Existing setups with a different boot order continue to work but do not benefit from the EFI Shell safe-stop behavior.
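The defaulting rule for an absent bootPolicy could look like the sketch below. `BootPolicy` here is a simplified copy of the proposed type with plain string fields, and `effectivePolicy` is a hypothetical helper; in practice the `boot` default may come from the CRD default marker rather than from controller code.

```go
package main

import "fmt"

// BootPolicy is a simplified copy of the proposed API type.
type BootPolicy struct {
	FirstBoot string
	Boot      string
}

// effectivePolicy returns the policy the controller acts on. A nil policy
// (field absent on an existing ServerClaim) maps to today's PXE behavior,
// and an empty Boot falls back to Hdd.
func effectivePolicy(p *BootPolicy) BootPolicy {
	if p == nil {
		return BootPolicy{FirstBoot: "Pxe", Boot: "Hdd"}
	}
	out := *p // copy so the caller's object is not mutated
	if out.Boot == "" {
		out.Boot = "Hdd"
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", effectivePolicy(nil))
	fmt.Printf("%+v\n", effectivePolicy(&BootPolicy{FirstBoot: "UefiHttp"}))
}
```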

Future Extensions

  • VirtualMedia — A third FirstBootMode value for mounting an ISO image via Redfish Virtual Media. Planned as a near-term addition.
  • Additional BootMode values — e.g., Network for always-network-boot setups (stateless servers).
