Summary
Seven GitHub Actions jobs hit the 45-minute workflow timeout and get cancelled on every PR run (baseline: PR #1135 run 24398739627):
Tests (net10.0) - ModelFamily - Diffusion A-I
Tests (net10.0) - ModelFamily - Diffusion J-R
Tests (net10.0) - ModelFamily - Diffusion S-Z
Tests (net10.0) - ModelFamily - Generated Layers
Tests (net10.0) - ModelFamily - NeuralNetworks
Tests (net10.0) - Unit - 03 Diffusion/Encoding
Tests (net10.0) - Unit - 08e NN-Remaining (catch-all)
Across the seven cancelled jobs: ~370 per-test timeouts (xunit 60s/120s [Fact(Timeout=...)]) and ~950 OutOfMemoryException failures. Only 18-638 tests per job finish before the wall-clock kill.
Tests purposely use production defaults to catch real performance bugs, so the fix must be in the model code — not in the tests.
This issue covers PR 1 of a series: the Diffusion jobs. Follow-up issues/PRs will clean up NeuralNetworks, Generated Layers, and the two Unit shards using the same infrastructure.
Root causes (verified against code)
Every Diffusion OOM traces through the same path:
System.OutOfMemoryException
at TensorAllocator.Rent[T](Int32[] shape)
at DenseLayer`1..ctor (DenseLayer.cs:367)
at DiTNoisePredictor`1.InitializeLayers (DiTNoisePredictor.cs:334 or 338)
at DiTNoisePredictor`1.EnsureLayersInitialized (DiTNoisePredictor.cs:301)
at DiTNoisePredictor`1.PredictNoise (442) or .GetParameters (966)
at <Model>.GetParameters
at DiffusionModelTestBase.Parameters_ShouldBeNonEmpty (332)
DiT-XL defaults: hiddenSize = 1152, numLayers = 28, mlpRatio = 4.0 → ~4 GB of eagerly-allocated weight tensors per model instance. Rented tensors are never returned to the pool, so sequential tests across 255 diffusion models stack up on 16 GB Windows runners.
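The ~4 GB figure checks out with back-of-envelope arithmetic. This Python sketch assumes standard transformer blocks (4·h² for the Q/K/V/O projections, 8·h² for a ratio-4 MLP) and 8-byte double-precision weights; both are assumptions for illustration, not figures taken from the codebase:

```python
# Back-of-envelope weight memory for the DiT-XL defaults cited above.
# Assumes 4*h^2 attention params (Q/K/V/O) and 2*mlpRatio*h^2 MLP params
# per block, with 8-byte (double) weights -- assumptions, not code facts.
hidden_size, num_layers, mlp_ratio = 1152, 28, 4.0

attn_params_per_block = 4 * hidden_size ** 2                   # Q, K, V, O
mlp_params_per_block = 2 * int(mlp_ratio) * hidden_size ** 2   # up + down proj
params = num_layers * (attn_params_per_block + mlp_params_per_block)

bytes_total = params * 8  # double-precision weights
print(f"{params / 1e6:.0f}M params, ~{bytes_total / 1e9:.1f} GB")
# ~446M params, ~3.6 GB per instance -- LayerNorm, embeddings, and biases
# account for the rest of the ~4 GB observed.
```

Four such never-returned instances are already enough to exhaust a 16 GB runner, which matches the OOM counts in the table below.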
Five contributing factors:
- Noise predictor ctors allocate eagerly.
DenseLayer ctor at src/NeuralNetworks/Layers/DenseLayer.cs:367 calls TensorAllocator.Rent<T> unless the caller passes a LazyInitializationStrategy<T>. DiT/MMDiT/UNet predictors don't, so every new <Model>() allocates the full parameter set up front.
- No Dispose path returns rented weights to the pool.
TensorAllocator.Return exists (used for transient tensors only). LayerBase.Dispose (src/NeuralNetworks/Layers/LayerBase.cs:3202) unregisters GPU state but doesn't return rented weight buffers. NeuralNetworkBase.Dispose doesn't dispose child layers.
- GetParameters() forces lazy init.
NeuralNetworkBase.GetParameters iterates all layers and calls layer.GetParameters(), which in DenseLayer unconditionally calls EnsureInitialized() (line 1194) — defeating any lazy optimization.
- MultiHeadAttentionLayer and LayerNormalizationLayer have no lazy path.
Every DiT block packs MHA + MLP + LayerNorm, and all of them allocate eagerly.
- Tests don't dispose models.
DiffusionModelTestBase uses using var _arena = TensorArena.Create() for transient tensors but lets the model reference fall out of scope normally.
Top offenders (Diffusion job)
| Model | Timeouts | OOMs |
| --- | --- | --- |
| IPAdapterFaceID | 10 | — |
| PaintByExample | 9 | — |
| SDXL | 8 | — |
| InstantID | 6 | — |
| HDPainter | — | 50 |
| CogVideo | — | 50 |
| ControlNetUnionPro | — | 49 |
| RecraftV | — | 47 |
| CatVTON | — | 46 |
| Imagen | — | 36 |
Fix plan (5 parts, one PR)
Part 1 — Lazy init in noise predictors
Thread LazyInitializationStrategy<T> into internal DenseLayer/ConvolutionalLayer creations in:
DiTNoisePredictor.cs, MMDiTNoisePredictor.cs, MMDiTXNoisePredictor.cs, EMMDiTPredictor.cs, FlagDiTPredictor.cs, FluxDoubleStreamPredictor.cs, AsymmDiTPredictor.cs, SiTPredictor.cs, UViTNoisePredictor.cs, UNetNoisePredictor.cs, VideoUNetPredictor.cs
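The real change is in C#, but the shape of Part 1 is easy to sketch. In this Python illustration, PooledDense and rent are hypothetical stand-ins for DenseLayer and TensorAllocator.Rent: the constructor only records the shape and defers the rent to first use, so constructing a 28-block predictor becomes cheap:

```python
# Illustrative sketch of the lazy-init pattern (the real code is C#;
# PooledDense, rent, and _pool are hypothetical stand-ins for
# DenseLayer / TensorAllocator, not actual APIs).
_pool = {}  # shape -> list of free buffers

def rent(shape):
    free = _pool.setdefault(shape, [])
    return free.pop() if free else bytearray(8 * shape[0] * shape[1])

class PooledDense:
    def __init__(self, in_dim, out_dim, lazy=True):
        self.shape = (in_dim, out_dim)
        # Eager path allocates in the ctor; lazy path records the shape only.
        self.weights = None if lazy else rent(self.shape)

    def ensure_initialized(self):
        if self.weights is None:
            self.weights = rent(self.shape)  # deferred to first use

layer = PooledDense(1152, 4608)   # constructing many of these is now cheap
assert layer.weights is None      # nothing rented yet
layer.ensure_initialized()        # first forward pass pays the cost
assert layer.weights is not None
```

Tests that only construct a model (or only count its parameters) never pay the allocation at all.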
Part 2 — Add lazy init to MHA and LayerNorm
src/NeuralNetworks/Layers/MultiHeadAttentionLayer.cs — add IsLazy branch, defer Q/K/V/O tensors
src/NeuralNetworks/Layers/LayerNormalizationLayer.cs — same for gamma/beta
Part 3 — Return rented weights on Dispose
LayerBase.cs — new protected ReturnPooledTensors() hook, called from Dispose(bool)
DenseLayer, ConvolutionalLayer, MultiHeadAttentionLayer, LayerNormalizationLayer — override to call TensorAllocator.Return(...) for their weight tensors
NeuralNetworkBase.Dispose — iterate Layers, dispose each IDisposable
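Part 3 in miniature (again a Python sketch; Pool, PooledLayer, and Network are hypothetical stand-ins for TensorAllocator, LayerBase subclasses, and NeuralNetworkBase): disposing the network walks its layers, and each layer hands its rented buffer back so the next model reuses it instead of growing the heap:

```python
# Sketch of the dispose-returns-to-pool chain (real code is C#; all names
# here are hypothetical stand-ins, not actual APIs).
class Pool:
    def __init__(self):
        self.free = []
    def rent(self):
        return self.free.pop() if self.free else bytearray(1024)
    def give_back(self, buf):
        self.free.append(buf)

pool = Pool()

class PooledLayer:
    def __init__(self):
        self.weights = pool.rent()
    def close(self):                 # mirrors the ReturnPooledTensors() hook
        if self.weights is not None:
            pool.give_back(self.weights)
            self.weights = None

class Network:
    def __init__(self):
        self.layers = [PooledLayer() for _ in range(3)]
    def close(self):                 # mirrors NeuralNetworkBase.Dispose
        for layer in self.layers:
            layer.close()

net = Network()
net.close()
assert len(pool.free) == 3           # every buffer back in the pool

net2 = Network()                     # the next model reuses them
assert len(pool.free) == 0
```

close() is idempotent per layer (the None guard), mirroring the usual Dispose(bool) contract.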
Part 4 — Test lifecycle: dispose + GC hint
DiffusionModelTestBase.cs and NeuralNetworkModelTestBase.cs — wrap models in using var model = CreateModel();
- Add an IAsyncLifetime.DisposeAsync override that forces GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); between tests
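The lifecycle ordering matters more than the mechanism, so here it is in pytest terms (the real change is an xunit IAsyncLifetime override in C#; ModelTestBase and create_model are hypothetical stand-ins): dispose deterministically first, then hint the GC, so a multi-GB model is reclaimed before the next test constructs its own:

```python
# Pytest-flavored sketch of the Part 4 test lifecycle (real code is C#/xunit;
# ModelTestBase and create_model are hypothetical stand-ins).
import gc

class ModelTestBase:
    def create_model(self):
        raise NotImplementedError

    def setup_method(self, method):
        self.model = self.create_model()   # fresh model per test

    def teardown_method(self, method):
        self.model.close()                 # mirrors `using var model = ...`
        self.model = None
        gc.collect()                       # mirrors GC.Collect(); the C# version
                                           # also waits for pending finalizers
                                           # and collects a second time

# Tiny demonstration that teardown really disposes the model:
closed = []

class DummyModel:
    def close(self):
        closed.append(True)

class DummyTests(ModelTestBase):
    def create_model(self):
        return DummyModel()

t = DummyTests()
t.setup_method(None)
t.teardown_method(None)
assert closed == [True]
```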
Part 5 — Lazy-friendly Parameters_ShouldBeNonEmpty
- Switch the assertion from model.GetParameters().Length > 0 to model.ParameterCount > 0 — semantically identical, but avoids materializing the full flat parameter vector just to count.
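Why Part 5 matters, sketched in Python (the real code is C#; LazyDense here is a hypothetical stand-in): a parameter count can be computed from layer shapes alone, while flattening the parameters forces every lazy layer to allocate first:

```python
# Sketch of the ParameterCount-vs-GetParameters distinction (real code is C#;
# LazyDense is a hypothetical stand-in for a lazily initialized DenseLayer).
class LazyDense:
    def __init__(self, in_dim, out_dim):
        self.shape = (in_dim, out_dim)
        self.weights = None                    # lazy: nothing allocated yet

    @property
    def parameter_count(self):
        return self.shape[0] * self.shape[1]   # pure shape arithmetic

    def get_parameters(self):
        if self.weights is None:               # flattening forces init
            self.weights = [0.0] * self.parameter_count
        return self.weights

layer = LazyDense(1152, 4608)
assert layer.parameter_count > 0        # the new assertion: no allocation
assert layer.weights is None            # still lazy
assert len(layer.get_parameters()) > 0  # the old assertion...
assert layer.weights is not None        # ...materialized everything
```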
Branch
New branch off master: perf/diffusion-lazy-init-oom. Separate PR, not stacked on #1135.
Verification plan
- Local build on net10.0 and net471, 0 errors.
- Run 5 representative Diffusion tests locally (HDPainter, SDXL, IPAdapterFaceID, LumaRay3, CogVideo) and confirm construction memory is flat.
- CI: the 3 Diffusion ModelFamily jobs must transition CANCELLED → SUCCESS or FAILURE (test failures are acceptable at this stage — the goal is to unblock the CI signal).
- Memory spot-check with dotnet-counters:
System.Runtime gen-2 size flat across 20 sequential Diffusion tests; TensorAllocator pool reuse climbing.
Follow-up PRs (not in this issue's scope)
- PR 2: NeuralNetworks ModelFamily shard (VGG, Capsule, NEAT, FastText, VoxelCNN)
- PR 3: Generated Layers shard
- PR 4 (if needed): Unit-03 Diffusion/Encoding + Unit-08e catch-all