Context
AiDotNet.Tensors ships cache-aware and loop-aware helpers that AiDotNet's hand-rolled CPU code should adopt instead of rolling its own loops:
AiDotNet.Tensors.Engines.Optimization.CacheOptimizer
- `ComputeOptimalTiling(m, n, k, elementSize)` — matmul tile picker based on L1/L2/L3 sizes
- `L1BlockSize` / `L2BlockSize` / `L3BlockSize` — platform-detected sizes
- `CopyWithPrefetch` — software-prefetched copy
- `MortonEncode` / `MortonDecode` — space-filling curve for cache locality
- `TransposeBlocked` — cache-aware transpose
- `EstimateCacheMisses` — for pre-flight analysis
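The Morton-curve helpers interleave the bits of 2D coordinates so that elements adjacent in the index space tend to stay adjacent in memory, which improves cache locality for blocked traversals. A minimal standalone sketch of the technique follows; the names and signatures here (`Encode2D`, `Decode2D`) are illustrative, not the exact shape of `CacheOptimizer.MortonEncode`/`MortonDecode`:

```csharp
using System;

static class MortonDemo
{
    // Spread the low 16 bits of v so each bit lands in an even position.
    static uint Part1By1(uint v)
    {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Inverse of Part1By1: compact the even bits back together.
    static uint Compact1By1(uint v)
    {
        v &= 0x55555555;
        v = (v | (v >> 1)) & 0x33333333;
        v = (v | (v >> 2)) & 0x0F0F0F0F;
        v = (v | (v >> 4)) & 0x00FF00FF;
        v = (v | (v >> 8)) & 0x0000FFFF;
        return v;
    }

    // x occupies the even bits of the code, y the odd bits.
    public static uint Encode2D(uint x, uint y) => Part1By1(x) | (Part1By1(y) << 1);

    public static (uint x, uint y) Decode2D(uint code) =>
        (Compact1By1(code), Compact1By1(code >> 1));

    static void Main()
    {
        Console.WriteLine(MortonDemo.Encode2D(3, 5));  // 39: bits of 3 (011) and 5 (101) interleaved
        var (x, y) = MortonDemo.Decode2D(39);
        Console.WriteLine($"{x},{y}");                 // 3,5
    }
}
```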
AiDotNet.Tensors.Engines.Optimization.LoopOptimizer
- `DetermineOptimalTileSize(dim, elementSize)`
- `Tile2D` / `Tile3D` / `ParallelTile2D` — tiled iteration with the loop body supplied as a closure
- `OptimalOrder2D(rows, cols, rowMajorAccess, action)` — picks row-or-column-major order based on access pattern
- `StripMine(totalSize, stripSize, action)`
- `UnrollBy4` / `UnrollBy8`
- `Fuse(length, actions)`
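For reference, this is the blocked-traversal pattern a helper like `Tile2D` packages behind a closure: walk an m × n index space in cache-sized blocks instead of row by row. The delegate shape below is an assumption for illustration, not the library's actual signature:

```csharp
using System;

static class TilingDemo
{
    // Hand-rolled 2D tiling: visit an m x n index space in tileM x tileN
    // blocks so each block's working set fits in cache. body receives the
    // half-open bounds [i0, iEnd) x [j0, jEnd) of one block.
    public static void Tile2D(int m, int n, int tileM, int tileN,
                              Action<int, int, int, int> body)
    {
        for (int i0 = 0; i0 < m; i0 += tileM)
        {
            int iEnd = Math.Min(i0 + tileM, m);
            for (int j0 = 0; j0 < n; j0 += tileN)
            {
                int jEnd = Math.Min(j0 + tileN, n);
                body(i0, iEnd, j0, jEnd);   // process one cache-sized block
            }
        }
    }

    static void Main()
    {
        long visited = 0;
        Tile2D(100, 100, 32, 32, (i0, iEnd, j0, jEnd) =>
        {
            for (int i = i0; i < iEnd; i++)
                for (int j = j0; j < jEnd; j++)
                    visited++;
        });
        Console.WriteLine(visited); // 10000: every index visited exactly once
    }
}
```

Note the `Math.Min` clamps: they handle dimensions that are not multiples of the tile size, which is exactly the edge case hand-rolled tiling tends to get wrong.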
Hot paths in AiDotNet that should adopt these
Most layers already go through `Engine.*` (which internally uses these). The real opportunity is in AiDotNet-owned CPU code that still has hand-rolled loops:
- `src/NeuralNetworks/Attention/FlashAttention.cs` (900 LOC) — the block iteration over Q/KV blocks is currently a hand-written outer loop; it would benefit from `LoopOptimizer.Tile2D` plus `LoopOptimizer.UnrollBy8` on the inner dot-product accumulator.
- `src/LinearAlgebra/*.cs` — any remaining hand-rolled matmul fallbacks.
- Custom loss function gradients that iterate element-by-element.
- Sparse / embedding layers with custom scatter/gather loops.
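To make the `UnrollBy8` suggestion concrete, here is a sketch of the unroll-by-8 accumulator pattern it is meant to replace hand-writing. This is illustrative, not the current FlashAttention.cs code: eight independent accumulators break the serial dependence on a single sum, letting the CPU keep multiple multiply-adds in flight.

```csharp
using System;

static class UnrollDemo
{
    // Dot product unrolled by 8 with independent partial sums, plus a
    // scalar remainder loop for lengths that are not multiples of 8.
    public static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        int i = 0, len = Math.Min(a.Length, b.Length);
        for (; i + 8 <= len; i += 8)
        {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        // Pairwise reduction of the partial sums, then the tail.
        float sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
        for (; i < len; i++) sum += a[i] * b[i];
        return sum;
    }

    static void Main()
    {
        var ones = new float[10];
        Array.Fill(ones, 1f);
        Console.WriteLine(UnrollDemo.Dot(ones, ones)); // 10: exercises both the unrolled body and the remainder
    }
}
```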
Suggested path
Per-file retrofit. Not a single atomic PR — each file's loops should be rewritten with care and benchmarked. `PerformanceProfiler` (now wired via `EnableTensorsOpProfiling()`) surfaces the bottlenecks to prioritize.
Estimated scope
~50-100 LOC per retrofit, ~10 hot files. Retrofits are independent and can proceed in parallel, one per file.