
perf: fix 5 cancelled CI jobs — Diffusion models OOM/timeout from eager weight allocation #1136

@ooples


Summary

Seven GitHub Actions jobs hit the 45-minute workflow timeout and get cancelled on every PR run (baseline: PR #1135, run 24398739627):

  • Tests (net10.0) - ModelFamily - Diffusion A-I
  • Tests (net10.0) - ModelFamily - Diffusion J-R
  • Tests (net10.0) - ModelFamily - Diffusion S-Z
  • Tests (net10.0) - ModelFamily - Generated Layers
  • Tests (net10.0) - ModelFamily - NeuralNetworks
  • Tests (net10.0) - Unit - 03 Diffusion/Encoding
  • Tests (net10.0) - Unit - 08e NN-Remaining (catch-all)

Across the cancelled jobs there are ~370 per-test timeouts (xunit's `[Fact(Timeout = ...)]` at 60 s or 120 s) and ~950 `OutOfMemoryException` failures. Depending on the shard, only 18 to 638 tests finish before the wall-clock kill.

Tests purposely use production defaults to catch real performance bugs, so the fix must be in the model code — not in the tests.

This issue covers PR 1 of a series: the Diffusion jobs. Follow-up issues/PRs will clean up NeuralNetworks, Generated Layers, and the two Unit shards using the same infrastructure.

Root causes (verified against code)

Every Diffusion OOM traces through the same path:

```
System.OutOfMemoryException
  at TensorAllocator.Rent[T](Int32[] shape)
  at DenseLayer`1..ctor (DenseLayer.cs:367)
  at DiTNoisePredictor`1.InitializeLayers (DiTNoisePredictor.cs:334 or 338)
  at DiTNoisePredictor`1.EnsureLayersInitialized (DiTNoisePredictor.cs:301)
  at DiTNoisePredictor`1.PredictNoise (DiTNoisePredictor.cs:442)  OR  .GetParameters (DiTNoisePredictor.cs:966)
  at <Model>.GetParameters
  at DiffusionModelTestBase.Parameters_ShouldBeNonEmpty (DiffusionModelTestBase.cs:332)
```

DiT-XL defaults: hiddenSize = 1152, numLayers = 28, mlpRatio = 4.0 → ~4 GB of eagerly-allocated weight tensors per model instance. Rented tensors are never returned to the pool, so sequential tests across 255 diffusion models stack up on 16 GB Windows runners.
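As a sanity check on the ~4 GB figure, the matmul weights alone can be estimated from the stated defaults. This is an illustrative lower bound under assumptions not in the issue: standard DiT block layout (Q/K/V/O attention projections plus a two-layer MLP), ignoring biases, LayerNorm, adaLN modulation, and embedding layers, with `T = double` at 8 bytes per parameter.

```csharp
using System;

class DitMemoryEstimate
{
    static void Main()
    {
        // Assumed DiT-XL defaults from the issue text.
        long hidden = 1152;
        int numLayers = 28;
        long mlpHidden = (long)(hidden * 4.0); // mlpRatio = 4.0

        // Per transformer block (biases/LayerNorm are negligible next to
        // the matmul weights):
        //   attention Q/K/V/O projections: 4 * hidden^2
        //   MLP up + down projections:     2 * hidden * mlpHidden
        long perBlock = 4 * hidden * hidden + 2 * hidden * mlpHidden;
        long totalParams = perBlock * numLayers;

        // 8 bytes/param if T = double, 4 if T = float.
        double gbDouble = totalParams * 8.0 / 1e9;
        double gbFloat = totalParams * 4.0 / 1e9;

        Console.WriteLine($"params ~ {totalParams}");          // 445906944
        Console.WriteLine($"double: {gbDouble:F1} GB, float: {gbFloat:F1} GB");
    }
}
```

That lands at roughly 3.6 GB for double-precision weights before counting embeddings or pool overhead, which is consistent with the ~4 GB observed per model instance.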

Five contributing factors:

  1. Noise predictor ctors allocate eagerly. DenseLayer ctor at src/NeuralNetworks/Layers/DenseLayer.cs:367 calls TensorAllocator.Rent<T> unless the caller passes a LazyInitializationStrategy<T>. DiT/MMDiT/UNet predictors don't, so every new <Model>() allocates the full parameter set up front.
  2. No Dispose path returns rented weights to the pool. TensorAllocator.Return exists (used for transient tensors only). LayerBase.Dispose (src/NeuralNetworks/Layers/LayerBase.cs:3202) unregisters GPU state but doesn't return rented weight buffers. NeuralNetworkBase.Dispose doesn't dispose child layers.
  3. GetParameters() forces lazy init. NeuralNetworkBase.GetParameters iterates all layers and calls layer.GetParameters(), which in DenseLayer unconditionally calls EnsureInitialized() (DenseLayer.cs:1194), defeating any lazy optimization.
  4. MultiHeadAttentionLayer and LayerNormalizationLayer have no lazy path — every DiT block packs MHA + MLP + LayerNorm, all eager.
  5. Tests don't dispose models. DiffusionModelTestBase uses using var _arena = TensorArena.Create() for transient tensors but lets the model reference fall out of scope normally.

Top offenders (Diffusion job)

| Model | Timeouts | OOMs |
| --- | ---: | ---: |
| IPAdapterFaceID | 10 | |
| PaintByExample | 9 | |
| SDXL | 8 | |
| InstantID | 6 | |
| HDPainter | | 50 |
| CogVideo | | 50 |
| ControlNetUnionPro | | 49 |
| RecraftV | | 47 |
| CatVTON | | 46 |
| Imagen | | 36 |

Fix plan (5 parts, one PR)

Part 1 — Lazy init in noise predictors

Thread LazyInitializationStrategy<T> into internal DenseLayer/ConvolutionalLayer creations in:

  • DiTNoisePredictor.cs, MMDiTNoisePredictor.cs, MMDiTXNoisePredictor.cs, EMMDiTPredictor.cs, FlagDiTPredictor.cs, FluxDoubleStreamPredictor.cs, AsymmDiTPredictor.cs, SiTPredictor.cs, UViTNoisePredictor.cs, UNetNoisePredictor.cs, VideoUNetPredictor.cs

Part 2 — Add lazy init to MHA and LayerNorm

  • src/NeuralNetworks/Layers/MultiHeadAttentionLayer.cs — add IsLazy branch, defer Q/K/V/O tensors
  • src/NeuralNetworks/Layers/LayerNormalizationLayer.cs — same for gamma/beta
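The deferred-allocation pattern Parts 1-2 describe can be sketched as follows. This is illustrative only: the real layers use `TensorAllocator` and `LazyInitializationStrategy<T>` from the issue; here a plain array and a bool flag stand in for rented tensors and the strategy object, and all member names besides the pattern itself are assumptions.

```csharp
using System;

// Illustrative sketch of Part 2's IsLazy branch for LayerNormalizationLayer:
// gamma/beta allocation is deferred from the ctor to first use.
sealed class LayerNormSketch
{
    private readonly int _size;
    private double[]? _gamma; // scale, deferred when lazy
    private double[]? _beta;  // shift, deferred when lazy

    public LayerNormSketch(int size, bool isLazy = false)
    {
        _size = size;
        if (!isLazy)
            AllocateWeights(); // eager path: today's behavior
    }

    public bool IsInitialized => _gamma != null;

    // Cheap metadata query: answers "how many parameters?" without
    // touching the weight buffers (what Part 5's ParameterCount relies on).
    public int ParameterCount => 2 * _size;

    private void AllocateWeights()
    {
        _gamma = new double[_size];
        Array.Fill(_gamma, 1.0); // identity scale
        _beta = new double[_size];
    }

    public double[] Forward(double[] input)
    {
        if (_gamma == null) AllocateWeights(); // lazy path: allocate on first use
        // ... normalization elided ...
        return input;
    }
}

class Program
{
    static void Main()
    {
        var lazy = new LayerNormSketch(1152, isLazy: true);
        Console.WriteLine($"allocated on ctor: {lazy.IsInitialized}");      // False
        Console.WriteLine($"param count: {lazy.ParameterCount}");           // 2304
        lazy.Forward(new double[1152]);
        Console.WriteLine($"allocated after forward: {lazy.IsInitialized}"); // True
    }
}
```

The key property for the OOM fix: a test that only constructs the model and queries metadata never pays for the weight buffers.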

Part 3 — Return rented weights on Dispose

  • LayerBase.cs — new protected ReturnPooledTensors() hook, called from Dispose(bool)
  • DenseLayer, ConvolutionalLayer, MultiHeadAttentionLayer, LayerNormalizationLayer — override to call TensorAllocator.Return(...) for their weight tensors
  • NeuralNetworkBase.Dispose — iterate Layers, dispose each IDisposable
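The ownership chain Part 3 proposes can be sketched like this. It is simplified: a `Stack<double[]>` stands in for `TensorAllocator`'s pool, the real `LayerBase` uses the full `Dispose(bool)` pattern, and everything except the names taken from the issue (`LayerBase`, `ReturnPooledTensors`, the cascading network dispose) is assumed.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for TensorAllocator: rent from the pool if possible, else allocate.
static class PoolSketch
{
    public static readonly Stack<double[]> Free = new();
    public static double[] Rent(int n) => Free.Count > 0 ? Free.Pop() : new double[n];
    public static void Return(double[] t) => Free.Push(t);
}

abstract class LayerBaseSketch : IDisposable
{
    private bool _disposed;
    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        ReturnPooledTensors(); // the new protected hook Part 3 adds
    }
    protected virtual void ReturnPooledTensors() { }
}

sealed class DenseLayerSketch : LayerBaseSketch
{
    private double[]? _weights = PoolSketch.Rent(1024);
    protected override void ReturnPooledTensors()
    {
        if (_weights != null) { PoolSketch.Return(_weights); _weights = null; }
    }
}

sealed class NetworkSketch : IDisposable
{
    public List<LayerBaseSketch> Layers { get; } =
        new() { new DenseLayerSketch(), new DenseLayerSketch() };

    public void Dispose()
    {
        foreach (var layer in Layers) layer.Dispose(); // cascade, per Part 3
    }
}

class Program
{
    static void Main()
    {
        using (var net = new NetworkSketch()) { /* run a test against net */ }
        Console.WriteLine($"tensors back in pool: {PoolSketch.Free.Count}"); // 2
    }
}
```

With this in place, the next model constructed in the same process reuses the returned buffers instead of growing the heap.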

Part 4 — Test lifecycle: dispose + GC hint

  • DiffusionModelTestBase.cs and NeuralNetworkModelTestBase.cs — wrap models in using var model = CreateModel();
  • Add IAsyncLifetime.DisposeAsync override that forces GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); between tests
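A hypothetical shape for that Part 4 lifecycle change, assuming xunit v2's `IAsyncLifetime` (`Task`-returning `InitializeAsync`/`DisposeAsync`); the interface is redeclared locally only to keep the sketch self-contained, and `CreateModel` plus all class names here are illustrative, not the actual test-base API.

```csharp
using System;
using System.Threading.Tasks;

// Mirrors xunit v2's Xunit.IAsyncLifetime so the sketch compiles standalone;
// the real test base would implement the xunit interface directly.
interface IAsyncLifetime
{
    Task InitializeAsync();
    Task DisposeAsync();
}

abstract class ModelTestBaseSketch<TModel> : IAsyncLifetime
    where TModel : class, IDisposable
{
    protected TModel? Model;

    protected abstract TModel CreateModel();

    public Task InitializeAsync()
    {
        Model = CreateModel();
        return Task.CompletedTask;
    }

    public Task DisposeAsync()
    {
        Model?.Dispose();
        Model = null;
        // Collect, run finalizers, then collect again so finalizable debris
        // freed by the first pass is reclaimed before the next test starts.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return Task.CompletedTask;
    }
}

sealed class DummyModel : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

sealed class DummyTests : ModelTestBaseSketch<DummyModel>
{
    protected override DummyModel CreateModel() => new();
    public DummyModel? Exposed => Model;
}

class Program
{
    static async Task Main()
    {
        var test = new DummyTests();
        await test.InitializeAsync();
        var model = test.Exposed!;
        await test.DisposeAsync();
        Console.WriteLine($"model disposed: {model.Disposed}"); // True
    }
}
```

xunit awaits `DisposeAsync` after each test class instance, so the forced GC runs between tests rather than inside a timed test body.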

Part 5 — Lazy-friendly Parameters_ShouldBeNonEmpty

  • Switch assertion from model.GetParameters().Length > 0 to model.ParameterCount > 0 — semantically identical, avoids forcing full flat-vector materialization just to count.

Branch

New branch off master: perf/diffusion-lazy-init-oom. Separate PR, not stacked on #1135.

Verification plan

  1. Local build on net10.0 and net471, 0 errors.
  2. Run 5 representative Diffusion tests locally (HDPainter, SDXL, IPAdapterFaceID, LumaRay3, CogVideo) and confirm construction memory is flat.
  3. CI: the 3 Diffusion ModelFamily jobs must transition CANCELLED → SUCCESS or FAILURE (test failures are acceptable at this stage — the goal is to unblock the CI signal).
  4. Memory spot-check with dotnet-counters: System.Runtime gen-2 size flat across 20 sequential Diffusion tests; TensorAllocator pool reuse climbing.

Follow-up PRs (not in this issue's scope)

  • PR 2: NeuralNetworks ModelFamily shard (VGG, Capsule, NEAT, FastText, VoxelCNN)
  • PR 3: Generated Layers shard
  • PR 4 (if needed): Unit-03 Diffusion/Encoding + Unit-08e catch-all

Labels: diffusion (Diffusion pipelines and schedulers), enhancement (New feature or request)