Adopt Tensors' CacheOptimizer / LoopOptimizer in hand-rolled CPU hot paths #1158

@ooples

Context

Tensors ships cache-aware and loop-aware helpers that AiDotNet's CPU code should use instead of hand-rolling its own loops:

AiDotNet.Tensors.Engines.Optimization.CacheOptimizer

  • `ComputeOptimalTiling(m, n, k, elementSize)` — matmul tile picker based on L1/L2/L3 sizes
  • `L1BlockSize` / `L2BlockSize` / `L3BlockSize` — platform-detected sizes
  • `CopyWithPrefetch` — software-prefetched copy
  • `MortonEncode` / `MortonDecode` — space-filling curve for cache locality
  • `TransposeBlocked` — cache-aware transpose
  • `EstimateCacheMisses` — for pre-flight analysis
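For readers unfamiliar with the technique behind a cache-aware transpose, here is a minimal sketch of a blocked transpose. It is an illustration only, not the Tensors implementation: the class name `BlockedTransposeSketch`, the method signature, and the default block size of 32 are all assumptions, and the real `TransposeBlocked` may pick its block size from the detected cache sizes.

```csharp
using System;

// Illustrative only: NOT CacheOptimizer.TransposeBlocked.
// Names, signature, and default block size are assumptions.
public static class BlockedTransposeSketch
{
    // Transposes a row-major rows x cols matrix in src into a
    // row-major cols x rows matrix in dst, one square block at a time,
    // so that both the reads and the writes stay within a small working set.
    public static void Transpose(float[] src, float[] dst, int rows, int cols, int blockSize = 32)
    {
        if (src.Length < rows * cols || dst.Length < rows * cols)
            throw new ArgumentException("Buffers are too small for the given shape.");
        if (blockSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(blockSize));

        for (int i0 = 0; i0 < rows; i0 += blockSize)
        {
            int iMax = Math.Min(i0 + blockSize, rows);   // clamp the last partial block
            for (int j0 = 0; j0 < cols; j0 += blockSize)
            {
                int jMax = Math.Min(j0 + blockSize, cols);
                for (int i = i0; i < iMax; i++)
                {
                    for (int j = j0; j < jMax; j++)
                    {
                        dst[j * rows + i] = src[i * cols + j];
                    }
                }
            }
        }
    }
}
```

A naive transpose walks one of the two buffers with a large stride; blocking bounds that stride to one tile at a time, which is the effect the library helper is meant to provide.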

AiDotNet.Tensors.Engines.Optimization.LoopOptimizer

  • `DetermineOptimalTileSize(dim, elementSize)`
  • `Tile2D` / `Tile3D` / `ParallelTile2D` — correct tiling with closure
  • `OptimalOrder2D(rows, cols, rowMajorAccess, action)` — picks row-or-column-major order based on access pattern
  • `StripMine(totalSize, stripSize, action)`
  • `UnrollBy4` / `UnrollBy8`
  • `Fuse(length, actions)`
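As a rough picture of what closure-based 2D tiling looks like, the sketch below walks a `rows x cols` iteration space tile by tile and hands each tile's bounds to a callback. The class `TilingSketch` and this particular `Tile2D` signature are hypothetical; the actual `LoopOptimizer.Tile2D` parameters are not shown in this issue and may differ.

```csharp
using System;

// Illustrative only: a hypothetical stand-in, NOT LoopOptimizer.Tile2D.
public static class TilingSketch
{
    // Invokes tileBody(rowStart, rowEnd, colStart, colEnd) once per tile.
    // The end bounds are exclusive and clamped at the edges of the iteration space.
    public static void Tile2D(int rows, int cols, int tileRows, int tileCols,
                              Action<int, int, int, int> tileBody)
    {
        if (tileRows <= 0) throw new ArgumentOutOfRangeException(nameof(tileRows));
        if (tileCols <= 0) throw new ArgumentOutOfRangeException(nameof(tileCols));

        for (int r0 = 0; r0 < rows; r0 += tileRows)
        {
            int r1 = Math.Min(r0 + tileRows, rows);
            for (int c0 = 0; c0 < cols; c0 += tileCols)
            {
                int c1 = Math.Min(c0 + tileCols, cols);
                tileBody(r0, r1, c0, c1);
            }
        }
    }

    // Small self-check: every cell should be covered exactly once,
    // so the tile areas must add up to rows * cols.
    public static int CountVisitedCells(int rows, int cols, int tileRows, int tileCols)
    {
        int visited = 0;
        Tile2D(rows, cols, tileRows, tileCols,
               (r0, r1, c0, c1) => visited += (r1 - r0) * (c1 - c0));
        return visited;
    }
}
```

Centralizing the edge clamping in one helper is the main correctness benefit: hand-rolled tiled loops tend to get the partial last tile wrong.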

Hot paths in AiDotNet that should adopt these

Most layers already go through `Engine.*` (which internally uses these). The real opportunity is in AiDotNet-owned CPU code that still has hand-rolled loops:

  1. `src/NeuralNetworks/Attention/FlashAttention.cs` (900 LOC): block iteration over Q/KV blocks, currently driven by a manual outer loop. It would benefit from `LoopOptimizer.Tile2D` for the block iteration and `LoopOptimizer.UnrollBy8` on the inner dot-product accumulator.
  2. `src/LinearAlgebra/*.cs` — any remaining hand-rolled matmul fallbacks.
  3. Custom loss function gradients that iterate element-by-element.
  4. Sparse / embedding layers with custom scatter/gather loops.
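To make the "unroll the inner dot-product accumulator" idea concrete, here is a hand-written before/after sketch. It does not call `LoopOptimizer.UnrollBy8`, whose signature is not shown in this issue; `DotSketch` and its methods are hypothetical names used only to show the shape of the transformation a retrofit would delegate to the helper.

```csharp
using System;

// Illustrative only: shows the unroll-by-8 pattern by hand.
// It does not use or reproduce LoopOptimizer.UnrollBy8.
public static class DotSketch
{
    // Baseline: one accumulator, one element per iteration.
    public static float DotScalar(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Length mismatch.");
        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Unrolled by 8 with independent accumulators, plus a scalar tail loop.
    // Independent accumulators shorten the dependency chain on the sum.
    public static float DotUnrolled8(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Length mismatch.");
        int n = a.Length;
        int limit = n - (n % 8);   // largest multiple of 8 that is <= n
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f, s4 = 0f, s5 = 0f, s6 = 0f, s7 = 0f;

        int i = 0;
        for (; i < limit; i += 8)
        {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }

        float sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
        for (; i < n; i++)          // remaining 0..7 elements
        {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Note that the unrolled version sums in a different order, so floating-point results can differ from the scalar loop in the last bits. Any retrofit should compare against the old path with a tolerance rather than exact equality.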

Suggested path

Retrofit file by file rather than in a single atomic PR: each file's loops should be rewritten with care and benchmarked. `PerformanceProfiler` (now wired via `EnableTensorsOpProfiling()`) surfaces the bottlenecks to prioritize.

Estimated scope

~50-100 LOC per retrofit across ~10 hot files. The retrofits are independent of each other, so they can proceed in parallel, one file per change.
