Context
AiDotNet.Tensors ships cache-aware and loop-aware helpers that AiDotNet's hand-rolled CPU code should adopt instead of rolling its own loops:
AiDotNet.Tensors.Engines.Optimization.CacheOptimizer
- `ComputeOptimalTiling(m, n, k, elementSize)` — matmul tile picker based on L1/L2/L3 sizes
- `L1BlockSize` / `L2BlockSize` / `L3BlockSize` — platform-detected sizes
- `CopyWithPrefetch` — software-prefetched copy
- `MortonEncode` / `MortonDecode` — space-filling curve for cache locality
- `TransposeBlocked` — cache-aware transpose
- `EstimateCacheMisses` — for pre-flight analysis
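The Morton-curve helpers interleave the bits of 2D coordinates so that elements adjacent in the index space tend to stay adjacent in memory, which improves cache locality for blocked traversals. A minimal standalone sketch of the technique follows; the names and signatures here (`Encode2D`, `Decode2D`) are illustrative, not the exact shape of `CacheOptimizer.MortonEncode`/`MortonDecode`:

```csharp
using System;

static class MortonDemo
{
    // Spread the low 16 bits of v so each bit lands in an even position.
    static uint Part1By1(uint v)
    {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Inverse of Part1By1: compact the even bits back together.
    static uint Compact1By1(uint v)
    {
        v &= 0x55555555;
        v = (v | (v >> 1)) & 0x33333333;
        v = (v | (v >> 2)) & 0x0F0F0F0F;
        v = (v | (v >> 4)) & 0x00FF00FF;
        v = (v | (v >> 8)) & 0x0000FFFF;
        return v;
    }

    // x occupies the even bits of the code, y the odd bits.
    public static uint Encode2D(uint x, uint y) => Part1By1(x) | (Part1By1(y) << 1);

    public static (uint x, uint y) Decode2D(uint code) =>
        (Compact1By1(code), Compact1By1(code >> 1));

    static void Main()
    {
        Console.WriteLine(MortonDemo.Encode2D(3, 5));  // 39: bits of 3 (011) and 5 (101) interleaved
        var (x, y) = MortonDemo.Decode2D(39);
        Console.WriteLine($"{x},{y}");                 // 3,5
    }
}
```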
AiDotNet.Tensors.Engines.Optimization.LoopOptimizer
- `DetermineOptimalTileSize(dim, elementSize)`
- `Tile2D` / `Tile3D` / `ParallelTile2D` — tiled iteration with the loop body supplied as a closure
- `OptimalOrder2D(rows, cols, rowMajorAccess, action)` — picks row-or-column-major order based on access pattern
- `StripMine(totalSize, stripSize, action)`
- `UnrollBy4` / `UnrollBy8`
- `Fuse(length, actions)`
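For reference, this is the blocked-traversal pattern a helper like `Tile2D` packages behind a closure: walk an m × n index space in cache-sized blocks instead of row by row. The delegate shape below is an assumption for illustration, not the library's actual signature:

```csharp
using System;

static class TilingDemo
{
    // Hand-rolled 2D tiling: visit an m x n index space in tileM x tileN
    // blocks so each block's working set fits in cache. body receives the
    // half-open bounds [i0, iEnd) x [j0, jEnd) of one block.
    public static void Tile2D(int m, int n, int tileM, int tileN,
                              Action<int, int, int, int> body)
    {
        for (int i0 = 0; i0 < m; i0 += tileM)
        {
            int iEnd = Math.Min(i0 + tileM, m);
            for (int j0 = 0; j0 < n; j0 += tileN)
            {
                int jEnd = Math.Min(j0 + tileN, n);
                body(i0, iEnd, j0, jEnd);   // process one cache-sized block
            }
        }
    }

    static void Main()
    {
        long visited = 0;
        Tile2D(100, 100, 32, 32, (i0, iEnd, j0, jEnd) =>
        {
            for (int i = i0; i < iEnd; i++)
                for (int j = j0; j < jEnd; j++)
                    visited++;
        });
        Console.WriteLine(visited); // 10000: every index visited exactly once
    }
}
```

Note the `Math.Min` clamps: they handle dimensions that are not multiples of the tile size, which is exactly the edge case hand-rolled tiling tends to get wrong.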
Hot paths in AiDotNet that should adopt these
Most layers already go through `Engine.*` (which internally uses these). The real opportunity is in AiDotNet-owned CPU code that still has hand-rolled loops:
- `src/NeuralNetworks/Attention/FlashAttention.cs` (900 LOC) — the block iteration over Q/KV blocks is currently a hand-written outer loop; it would benefit from `LoopOptimizer.Tile2D` plus `LoopOptimizer.UnrollBy8` on the inner dot-product accumulator.
- `src/LinearAlgebra/*.cs` — any remaining hand-rolled matmul fallbacks.
- Custom loss function gradients that iterate element-by-element.
- Sparse / embedding layers with custom scatter/gather loops.
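To make the `UnrollBy8` suggestion concrete, here is a sketch of the unroll-by-8 accumulator pattern it is meant to replace hand-writing. This is illustrative, not the current FlashAttention.cs code: eight independent accumulators break the serial dependence on a single sum, letting the CPU keep multiple multiply-adds in flight.

```csharp
using System;

static class UnrollDemo
{
    // Dot product unrolled by 8 with independent partial sums, plus a
    // scalar remainder loop for lengths that are not multiples of 8.
    public static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        int i = 0, len = Math.Min(a.Length, b.Length);
        for (; i + 8 <= len; i += 8)
        {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        // Pairwise reduction of the partial sums, then the tail.
        float sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
        for (; i < len; i++) sum += a[i] * b[i];
        return sum;
    }

    static void Main()
    {
        var ones = new float[10];
        Array.Fill(ones, 1f);
        Console.WriteLine(UnrollDemo.Dot(ones, ones)); // 10: exercises both the unrolled body and the remainder
    }
}
```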
Suggested path
Per-file retrofit. Not a single atomic PR — each file's loops should be rewritten with care and benchmarked. `PerformanceProfiler` (now wired via `EnableTensorsOpProfiling()`) surfaces the bottlenecks to prioritize.
Estimated scope
~50-100 LOC per retrofit, ~10 hot files. Retrofits are independent and can proceed in parallel, one per file.