Problem
The four primary Windows MDM tables are effectively append-only. Nothing in the codebase prunes them on a time basis; the only deletions are (a) per-command dequeue at ack time inside MDMWindowsSaveResponse and (b) FK CASCADE when an entire enrollment row is removed on explicit unenroll. That means steady-state usage accumulates rows indefinitely:
windows_mdm_command_results (PRIMARY KEY (enrollment_id, command_uuid), plus mediumtext raw_result): one row per command ack. Never pruned. For a 40k-host fleet with regular profile reconciliation, this grows into the hundreds of millions or low billions of rows within the first year of deployment.
windows_mdm_responses (auto-increment id, plus mediumtext raw_response): one row per management session POST, containing the raw SyncML body. Never pruned. Grows faster than results and takes substantially more disk per row.
windows_mdm_commands (PRIMARY KEY command_uuid, plus mediumtext raw_command): one row per unique issued command. Never pruned. Shared across hosts, so growth is lower than results, but still unbounded.
windows_mdm_command_queue (PRIMARY KEY (enrollment_id, command_uuid)): shrinks as devices ack, but accumulates stale rows for devices that enroll and then go offline without ever acking — each subsequent profile reconciliation adds more queue entries that will never be removed without explicit unenrollment.
The reader pool cost of the pending-commands query scales with these tables:
The NOT EXISTS anti-join against windows_mdm_command_results is a per-row PK probe. Per call it is O(log R), which is cheap while the results PK index fits in the InnoDB buffer pool. Once the index exceeds buffer-pool capacity (which happens within months on a large fleet at current growth rates), probes start missing and each one becomes a disk seek. The per-probe constant factor rises by two to three orders of magnitude.
windows_mdm_responses has no effect on query cost directly, but the mediumtext column means storage footprint grows far faster than row count — customers see pressure on disk space and backup volume size before they see query latency.
There is no equivalent of Apple's CleanupHostMDMCommands (cron-driven, 1-day threshold on host_mdm_commands) or the Apple refetch-specific pruning tracked in #34950 for these Windows tables. The companion reliability fix in bugs/windows-mdm-pending-commands-flood.md mitigates the hot-path scan with a SELECT EXISTS(...) short-circuit but does not address unbounded table growth.
Impact
Severity: medium-high, growing with deployment age. Correctness is unaffected; performance degrades gradually and then sharply once the results index leaves buffer-pool residency.
Frequency: continuous accumulation. Every Windows MDM check-in that produces an ack writes a permanent row.
Blast radius: every Fleet deployment with Windows MDM enabled. Effect compounds with fleet size and deployment lifetime. Large customers (40k+ Windows hosts, 12+ months uptime) are the first to feel it.
Related symptom: during high-traffic phases (post re-enrollment bursts, profile fan-outs), the pending-commands query's anti-join starts touching cold pages, which evicts unrelated hot pages from the buffer pool and degrades adjacent read paths (host listing, policy ingestion). Same failure mode as documented for Apple in Periodically clean up nano_ tables so that they do not grow unreasonably large #34950, which the product group has already labelled as P2.
Proposed fix
Model this after the approach agreed on in #34950 for the Apple nano_ tables: piggyback-on-ack plus a bounded cron sweep, with a conservative age threshold and small batch sizes to avoid any single DELETE becoming expensive.
Ack-time piggyback in MDMWindowsSaveResponse (server/datastore/mysql/microsoft_mdm.go). After the existing dequeue-and-insert block succeeds, issue a bounded DELETE against windows_mdm_command_results in the same transaction: up to 3 rows for this enrollment whose created_at < NOW() - INTERVAL 30 DAY and whose command_uuid is not the one just inserted. Matching Apple's choice, keep the batch small (3) to spread the deletion load across the natural ack rate rather than concentrate it in a sweep.
Hourly cron sweep in newCleanupsAndAggregationSchedule (cmd/fleet/cron.go). Add a windows_mdm job alongside the existing policy_membership, carves, etc. jobs. On each run:
SELECT command_uuid FROM windows_mdm_commands WHERE created_at < NOW() - INTERVAL 30 DAY LIMIT 100, then for each candidate verify there are no remaining references in windows_mdm_command_queue or windows_mdm_command_results, and delete orphans. FK ON DELETE CASCADE from both child tables guarantees a clean delete once the parent goes.
DELETE FROM windows_mdm_responses WHERE created_at < NOW() - INTERVAL 30 DAY LIMIT 100 — raw responses have no downstream dependents and are pure storage overhead after processing.
DELETE FROM windows_mdm_command_queue WHERE created_at < NOW() - INTERVAL 30 DAY LIMIT 100 — queue entries that have been sitting for a month belong to devices that will not ack them; their continued presence bloats the PK index that the hot-path probe scans.
Transactional invariant. At ack time the queue delete and results insert are already in the same withRetryTxx closure. The new ack-time results prune must live inside that same closure so that a crash between "insert new result" and "prune old result" cannot leave a half-applied state. This matches the invariant called out on Periodically clean up nano_ tables so that they do not grow unreasonably large #34950: queue and results must delete atomically for the same command_uuid.
Age threshold: 30 days. Matches Apple. Long enough that operator-visible history (e.g., "when did host X last ack this profile") stays useful; short enough to keep the index footprint bounded.
Batch sizes: 3 (ack) and 100 (cron). Matches Apple's load-tested choice. Bigger batches in the cron caused contention on the Apple side; no reason to expect Windows to behave differently.
No schema migration, no API surface change, no UI change. Internal reliability fix only. One docs callout on fleetdm.com/guides/mdm-commands#troubleshooting explaining that results history for Windows commands is retained for at least 30 days.
Different from Apple in two ways worth flagging:
No REFETCH- prefix filter. Apple's #34950 narrows to REFETCH-* because 95% of Apple-side bloat comes from periodic refetch commands. Windows has no equivalent high-frequency idempotent command; the bloat source is ordinary profile install/remove. The pruner applies to all commands uniformly.
windows_mdm_responses is in scope. Apple has no direct analog of this table. It is append-only, uses mediumtext, and is a bigger disk-space concern per row than the others. Treat it as a first-class target for the sweep.
Follow-up candidate after this lands: the same retention pattern can be applied to nano_commands / nano_command_results / nano_responses (Apple) which #34950 did not cover — it addresses only refetch-prefixed rows.
Evidence
Schema: server/datastore/mysql/schema.sql — all four tables use FK ON DELETE CASCADE from mdm_windows_enrollments, no time-based indexes, no retention mechanism.
Code: the only DELETE statements against these tables in non-test code are the per-command dequeue in MDMWindowsSaveResponse and the per-command delete in ResendWindowsMDMCommand. Neither is time-based; both are correctness-driven. The cron in cmd/fleet/cron.go has no Windows-MDM-table cleanup job.
Apple reference: #34950 — Periodically clean up nano_ tables so that they do not grow unreasonably large, labelled P2 and tagged customer-fairbank / customer-deebradel / customer-hawking, is the canonical prior art. It identified the same class of unbounded table growth on the Apple side and settled on the ack-piggyback + hourly cron pattern this ticket mirrors.
Companion fix: bugs/windows-mdm-pending-commands-flood.md reduces per-call cost on the pending-commands query via a PK-seek short-circuit. That fix leaves unbounded table growth as the remaining scaling risk.