
perf: fix 5 cancelled CI jobs — Diffusion models OOM/timeout from eager weight allocation #1136

@ooples


Summary

Seven GitHub Actions jobs hit the 45-minute workflow timeout and get cancelled on every PR run (baseline: PR #1135, run 24398739627):

  • Tests (net10.0) - ModelFamily - Diffusion A-I
  • Tests (net10.0) - ModelFamily - Diffusion J-R
  • Tests (net10.0) - ModelFamily - Diffusion S-Z
  • Tests (net10.0) - ModelFamily - Generated Layers
  • Tests (net10.0) - ModelFamily - NeuralNetworks
  • Tests (net10.0) - Unit - 03 Diffusion/Encoding
  • Tests (net10.0) - Unit - 08e NN-Remaining (catch-all)

Across the cancelled jobs there are ~370 per-test timeouts (xunit's `[Fact(Timeout = ...)]` at 60 s or 120 s) and ~950 `OutOfMemoryException` failures. Depending on the shard, only 18 to 638 tests finish before the wall-clock kill.

Tests purposely use production defaults to catch real performance bugs, so the fix must be in the model code — not in the tests.

This issue covers PR 1 of a series: the Diffusion jobs. Follow-up issues/PRs will clean up NeuralNetworks, Generated Layers, and the two Unit shards using the same infrastructure.

Root causes (verified against code)

Every Diffusion OOM traces through the same path:

```
System.OutOfMemoryException
  at TensorAllocator.Rent[T](Int32[] shape)
  at DenseLayer`1..ctor (DenseLayer.cs:367)
  at DiTNoisePredictor`1.InitializeLayers (DiTNoisePredictor.cs:334 or 338)
  at DiTNoisePredictor`1.EnsureLayersInitialized (DiTNoisePredictor.cs:301)
  at DiTNoisePredictor`1.PredictNoise (DiTNoisePredictor.cs:442)  OR  .GetParameters (DiTNoisePredictor.cs:966)
  at <Model>.GetParameters
  at DiffusionModelTestBase.Parameters_ShouldBeNonEmpty (DiffusionModelTestBase.cs:332)
```

DiT-XL defaults: hiddenSize = 1152, numLayers = 28, mlpRatio = 4.0 → ~4 GB of eagerly-allocated weight tensors per model instance. Rented tensors are never returned to the pool, so sequential tests across 255 diffusion models stack up on 16 GB Windows runners.
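As a sanity check on the ~4 GB figure, the matmul weights alone can be estimated from the stated defaults. This is an illustrative lower bound under assumptions not in the issue: standard DiT block layout (Q/K/V/O attention projections plus a two-layer MLP), ignoring biases, LayerNorm, adaLN modulation, and embedding layers, with `T = double` at 8 bytes per parameter.

```csharp
using System;

class DitMemoryEstimate
{
    static void Main()
    {
        // Assumed DiT-XL defaults from the issue text.
        long hidden = 1152;
        int numLayers = 28;
        long mlpHidden = (long)(hidden * 4.0); // mlpRatio = 4.0

        // Per transformer block (biases/LayerNorm are negligible next to
        // the matmul weights):
        //   attention Q/K/V/O projections: 4 * hidden^2
        //   MLP up + down projections:     2 * hidden * mlpHidden
        long perBlock = 4 * hidden * hidden + 2 * hidden * mlpHidden;
        long totalParams = perBlock * numLayers;

        // 8 bytes/param if T = double, 4 if T = float.
        double gbDouble = totalParams * 8.0 / 1e9;
        double gbFloat = totalParams * 4.0 / 1e9;

        Console.WriteLine($"params ~ {totalParams}");          // 445906944
        Console.WriteLine($"double: {gbDouble:F1} GB, float: {gbFloat:F1} GB");
    }
}
```

That lands at roughly 3.6 GB for double-precision weights before counting embeddings or pool overhead, which is consistent with the ~4 GB observed per model instance.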

Five contributing factors:

  1. Noise predictor ctors allocate eagerly. DenseLayer ctor at src/NeuralNetworks/Layers/DenseLayer.cs:367 calls TensorAllocator.Rent<T> unless the caller passes a LazyInitializationStrategy<T>. DiT/MMDiT/UNet predictors don't, so every new <Model>() allocates the full parameter set up front.
  2. No Dispose path returns rented weights to the pool. TensorAllocator.Return exists (used for transient tensors only). LayerBase.Dispose (src/NeuralNetworks/Layers/LayerBase.cs:3202) unregisters GPU state but doesn't return rented weight buffers. NeuralNetworkBase.Dispose doesn't dispose child layers.
  3. GetParameters() forces lazy init. NeuralNetworkBase.GetParameters iterates all layers and calls layer.GetParameters(), which in DenseLayer unconditionally calls EnsureInitialized() (DenseLayer.cs:1194), defeating any lazy optimization.
  4. MultiHeadAttentionLayer and LayerNormalizationLayer have no lazy path — every DiT block packs MHA + MLP + LayerNorm, all eager.
  5. Tests don't dispose models. DiffusionModelTestBase uses using var _arena = TensorArena.Create() for transient tensors but lets the model reference fall out of scope normally.

Top offenders (Diffusion job)

| Model | Timeouts | OOMs |
| --- | ---: | ---: |
| IPAdapterFaceID | 10 | |
| PaintByExample | 9 | |
| SDXL | 8 | |
| InstantID | 6 | |
| HDPainter | | 50 |
| CogVideo | | 50 |
| ControlNetUnionPro | | 49 |
| RecraftV | | 47 |
| CatVTON | | 46 |
| Imagen | | 36 |

Fix plan (5 parts, one PR)

Part 1 — Lazy init in noise predictors

Thread LazyInitializationStrategy<T> into internal DenseLayer/ConvolutionalLayer creations in:

  • DiTNoisePredictor.cs, MMDiTNoisePredictor.cs, MMDiTXNoisePredictor.cs, EMMDiTPredictor.cs, FlagDiTPredictor.cs, FluxDoubleStreamPredictor.cs, AsymmDiTPredictor.cs, SiTPredictor.cs, UViTNoisePredictor.cs, UNetNoisePredictor.cs, VideoUNetPredictor.cs

Part 2 — Add lazy init to MHA and LayerNorm

  • src/NeuralNetworks/Layers/MultiHeadAttentionLayer.cs — add IsLazy branch, defer Q/K/V/O tensors
  • src/NeuralNetworks/Layers/LayerNormalizationLayer.cs — same for gamma/beta
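The deferred-allocation pattern Parts 1-2 describe can be sketched as follows. This is illustrative only: the real layers use `TensorAllocator` and `LazyInitializationStrategy<T>` from the issue; here a plain array and a bool flag stand in for rented tensors and the strategy object, and all member names besides the pattern itself are assumptions.

```csharp
using System;

// Illustrative sketch of Part 2's IsLazy branch for LayerNormalizationLayer:
// gamma/beta allocation is deferred from the ctor to first use.
sealed class LayerNormSketch
{
    private readonly int _size;
    private double[]? _gamma; // scale, deferred when lazy
    private double[]? _beta;  // shift, deferred when lazy

    public LayerNormSketch(int size, bool isLazy = false)
    {
        _size = size;
        if (!isLazy)
            AllocateWeights(); // eager path: today's behavior
    }

    public bool IsInitialized => _gamma != null;

    // Cheap metadata query: answers "how many parameters?" without
    // touching the weight buffers (what Part 5's ParameterCount relies on).
    public int ParameterCount => 2 * _size;

    private void AllocateWeights()
    {
        _gamma = new double[_size];
        Array.Fill(_gamma, 1.0); // identity scale
        _beta = new double[_size];
    }

    public double[] Forward(double[] input)
    {
        if (_gamma == null) AllocateWeights(); // lazy path: allocate on first use
        // ... normalization elided ...
        return input;
    }
}

class Program
{
    static void Main()
    {
        var lazy = new LayerNormSketch(1152, isLazy: true);
        Console.WriteLine($"allocated on ctor: {lazy.IsInitialized}");      // False
        Console.WriteLine($"param count: {lazy.ParameterCount}");           // 2304
        lazy.Forward(new double[1152]);
        Console.WriteLine($"allocated after forward: {lazy.IsInitialized}"); // True
    }
}
```

The key property for the OOM fix: a test that only constructs the model and queries metadata never pays for the weight buffers.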

Part 3 — Return rented weights on Dispose

  • LayerBase.cs — new protected ReturnPooledTensors() hook, called from Dispose(bool)
  • DenseLayer, ConvolutionalLayer, MultiHeadAttentionLayer, LayerNormalizationLayer — override to call TensorAllocator.Return(...) for their weight tensors
  • NeuralNetworkBase.Dispose — iterate Layers, dispose each IDisposable
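The ownership chain Part 3 proposes can be sketched like this. It is simplified: a `Stack<double[]>` stands in for `TensorAllocator`'s pool, the real `LayerBase` uses the full `Dispose(bool)` pattern, and everything except the names taken from the issue (`LayerBase`, `ReturnPooledTensors`, the cascading network dispose) is assumed.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for TensorAllocator: rent from the pool if possible, else allocate.
static class PoolSketch
{
    public static readonly Stack<double[]> Free = new();
    public static double[] Rent(int n) => Free.Count > 0 ? Free.Pop() : new double[n];
    public static void Return(double[] t) => Free.Push(t);
}

abstract class LayerBaseSketch : IDisposable
{
    private bool _disposed;
    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        ReturnPooledTensors(); // the new protected hook Part 3 adds
    }
    protected virtual void ReturnPooledTensors() { }
}

sealed class DenseLayerSketch : LayerBaseSketch
{
    private double[]? _weights = PoolSketch.Rent(1024);
    protected override void ReturnPooledTensors()
    {
        if (_weights != null) { PoolSketch.Return(_weights); _weights = null; }
    }
}

sealed class NetworkSketch : IDisposable
{
    public List<LayerBaseSketch> Layers { get; } =
        new() { new DenseLayerSketch(), new DenseLayerSketch() };

    public void Dispose()
    {
        foreach (var layer in Layers) layer.Dispose(); // cascade, per Part 3
    }
}

class Program
{
    static void Main()
    {
        using (var net = new NetworkSketch()) { /* run a test against net */ }
        Console.WriteLine($"tensors back in pool: {PoolSketch.Free.Count}"); // 2
    }
}
```

With this in place, the next model constructed in the same process reuses the returned buffers instead of growing the heap.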

Part 4 — Test lifecycle: dispose + GC hint

  • DiffusionModelTestBase.cs and NeuralNetworkModelTestBase.cs — wrap models in using var model = CreateModel();
  • Add IAsyncLifetime.DisposeAsync override that forces GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); between tests
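A hypothetical shape for that Part 4 lifecycle change, assuming xunit v2's `IAsyncLifetime` (`Task`-returning `InitializeAsync`/`DisposeAsync`); the interface is redeclared locally only to keep the sketch self-contained, and `CreateModel` plus all class names here are illustrative, not the actual test-base API.

```csharp
using System;
using System.Threading.Tasks;

// Mirrors xunit v2's Xunit.IAsyncLifetime so the sketch compiles standalone;
// the real test base would implement the xunit interface directly.
interface IAsyncLifetime
{
    Task InitializeAsync();
    Task DisposeAsync();
}

abstract class ModelTestBaseSketch<TModel> : IAsyncLifetime
    where TModel : class, IDisposable
{
    protected TModel? Model;

    protected abstract TModel CreateModel();

    public Task InitializeAsync()
    {
        Model = CreateModel();
        return Task.CompletedTask;
    }

    public Task DisposeAsync()
    {
        Model?.Dispose();
        Model = null;
        // Collect, run finalizers, then collect again so finalizable debris
        // freed by the first pass is reclaimed before the next test starts.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return Task.CompletedTask;
    }
}

sealed class DummyModel : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

sealed class DummyTests : ModelTestBaseSketch<DummyModel>
{
    protected override DummyModel CreateModel() => new();
    public DummyModel? Exposed => Model;
}

class Program
{
    static async Task Main()
    {
        var test = new DummyTests();
        await test.InitializeAsync();
        var model = test.Exposed!;
        await test.DisposeAsync();
        Console.WriteLine($"model disposed: {model.Disposed}"); // True
    }
}
```

xunit awaits `DisposeAsync` after each test class instance, so the forced GC runs between tests rather than inside a timed test body.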

Part 5 — Lazy-friendly Parameters_ShouldBeNonEmpty

  • Switch assertion from model.GetParameters().Length > 0 to model.ParameterCount > 0 — semantically identical, avoids forcing full flat-vector materialization just to count.

Branch

New branch off master: perf/diffusion-lazy-init-oom. Separate PR, not stacked on #1135.

Verification plan

  1. Local build on net10.0 and net471, 0 errors.
  2. Run 5 representative Diffusion tests locally (HDPainter, SDXL, IPAdapterFaceID, LumaRay3, CogVideo) and confirm construction memory is flat.
  3. CI: the 3 Diffusion ModelFamily jobs must transition CANCELLED → SUCCESS or FAILURE (test failures are acceptable at this stage — the goal is to unblock the CI signal).
  4. Memory spot-check with dotnet-counters: System.Runtime gen-2 size flat across 20 sequential Diffusion tests; TensorAllocator pool reuse climbing.

Follow-up PRs (not in this issue's scope)

  • PR 2: NeuralNetworks ModelFamily shard (VGG, Capsule, NEAT, FastText, VoxelCNN)
  • PR 3: Generated Layers shard
  • PR 4 (if needed): Unit-03 Diffusion/Encoding + Unit-08e catch-all

Labels: diffusion (Diffusion pipelines and schedulers), enhancement (New feature or request)