Context
Tensors 0.46.0 exposes high-level cuDNN / cuBLAS wrappers:
- `CuDnnConvolution.Conv2DForward(...)` (float)
- `CuDnnBatchNorm.ForwardInference(...)` (float)
- `CuBlasMatMul.MatMulFloat(...)` + `MatMulWithCachedWeightsFloat(...)`
These are the NVIDIA fast paths (tensor cores, cuBLAS-LT autotune). AiDotNet layers go through `Engine.Conv2D` / `Engine.BatchNorm` / `Engine.MatMul` which auto-dispatch to `DirectGpuEngine`. It is unclear from the public Tensors API whether `DirectGpuEngine.Conv2D` internally routes to `CuDnnConvolution.Conv2DForward` when CUDA + cuDNN are present, or whether it uses a generic kernel.
Ask
- Verify via instrumentation (or direct Tensors code inspection) whether the engine auto-routes through cuDNN/cuBLAS when they're available. If yes: document this and close.
- If not: wrap the three major layer ops (Conv2D, BatchNorm, MatMul) with an AiDotNet-side dispatcher:
```csharp
public static class GpuOptimalDispatch
{
    public static Tensor Conv2D(Tensor input, Tensor kernel, ...)
    {
        // Prefer the cuDNN fast path when CUDA + cuDNN are present (float only).
        if (CuDnnConvolution.IsAvailable)
            return CuDnnConvolution.ForwardWithCache(input, kernel, ...); // needs new overload

        // Otherwise fall through to the existing engine dispatch.
        return AiDotNetEngine.Current.Conv2D(input, kernel, ...);
    }

    // similar for BatchNorm, MatMul
}
```
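The MatMul arm of the dispatcher could reuse the cached-weights entry point listed in the Context section. A sketch only: `CuBlasMatMul.IsAvailable` and the `weightsAreStatic` parameter are assumptions to be confirmed against the actual Tensors 0.46.0 surface.

```csharp
// Sketch: mirrors the Conv2D dispatcher above for dense layers.
// CuBlasMatMul.IsAvailable is an assumed availability check.
public static Tensor MatMul(Tensor input, Tensor weights, bool weightsAreStatic)
{
    if (CuBlasMatMul.IsAvailable)
    {
        // The cached-weights path amortizes cuBLAS-LT autotune across calls
        // when the weight tensor does not change between forward passes.
        return weightsAreStatic
            ? CuBlasMatMul.MatMulWithCachedWeightsFloat(input, weights)
            : CuBlasMatMul.MatMulFloat(input, weights);
    }

    return AiDotNetEngine.Current.MatMul(input, weights);
}
```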
The public `Engine.*` methods stay unchanged; `GpuOptimalDispatch` is used from `Conv2DLayer` / `BatchNormalizationLayer` / `DenseLayer` when `T == float` and cuDNN is available.
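The layer-side gating could look roughly like this inside `DenseLayer<T>`; the method and field names are illustrative, not the actual AiDotNet layer internals.

```csharp
// Illustrative only: the typeof(T) check keeps double/decimal paths untouched,
// and GpuOptimalDispatch itself handles the cuDNN/cuBLAS availability check.
protected Tensor ForwardMatMul(Tensor input)
{
    if (typeof(T) == typeof(float))
        return GpuOptimalDispatch.MatMul(input, _weights, weightsAreStatic: true);

    return AiDotNetEngine.Current.MatMul(input, _weights);
}
```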
Acceptance
- Tests on an NVIDIA host verify `CuDnnConvolution.Conv2DForward` is actually invoked (can be asserted via `PerformanceProfiler` trace once `EnableTensorsOpProfiling()` is on).
- CPU-only hosts + non-NVIDIA GPUs fall through to existing paths with no change.
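The first acceptance item could be asserted roughly as below. `EnableTensorsOpProfiling()` comes from the issue text, but the trace accessor, the tensor factory, and the exact trace entry name are assumptions.

```csharp
[Fact] // xUnit; meaningful only on an NVIDIA host with cuDNN installed
public void Conv2D_Routes_Through_CuDnn()
{
    PerformanceProfiler.EnableTensorsOpProfiling();

    var input  = Tensor.Random(1, 3, 32, 32); // NCHW; hypothetical factory
    var kernel = Tensor.Random(8, 3, 3, 3);
    GpuOptimalDispatch.Conv2D(input, kernel);

    // GetTrace() is an assumed accessor on PerformanceProfiler.
    var trace = PerformanceProfiler.GetTrace();
    Assert.Contains("CuDnnConvolution.Conv2DForward", trace);
}
```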
Relationship
Blocked on: nothing. Can be done as a follow-up after the current Tensors-parity PR lands.
Parallels: a Tensors-side issue would be cleaner: Tensors' `DirectGpuEngine` should auto-route to cuDNN for Conv2D/BatchNorm when its backend is CUDA and cuDNN is available. That's the real fix.