Context
Tensors 0.46.0 exposes high-level cuDNN / cuBLAS wrappers:
- `CuDnnConvolution.Conv2DForward(...)` (float)
- `CuDnnBatchNorm.ForwardInference(...)` (float)
- `CuBlasMatMul.MatMulFloat(...)` + `MatMulWithCachedWeightsFloat(...)`
These are the NVIDIA fast paths (tensor cores, cuBLAS-LT autotune). AiDotNet layers go through `Engine.Conv2D` / `Engine.BatchNorm` / `Engine.MatMul` which auto-dispatch to `DirectGpuEngine`. It is unclear from the public Tensors API whether `DirectGpuEngine.Conv2D` internally routes to `CuDnnConvolution.Conv2DForward` when CUDA + cuDNN are present, or whether it uses a generic kernel.
Ask
- Verify via instrumentation (or direct Tensors code inspection) whether the engine auto-routes through cuDNN/cuBLAS when they're available. If yes: document this and close.
- If not: wrap the three major layer ops (Conv2D, BatchNorm, MatMul) with an AiDotNet-side dispatcher:
```csharp
public static class GpuOptimalDispatch
{
    public static Tensor Conv2D(Tensor input, Tensor kernel, ...)
    {
        // Prefer the cuDNN fast path when CUDA + cuDNN are present (float only).
        if (CuDnnConvolution.IsAvailable)
            return CuDnnConvolution.ForwardWithCache(input, kernel, ...); // needs new overload

        // Otherwise fall through to the existing engine dispatch.
        return AiDotNetEngine.Current.Conv2D(input, kernel, ...);
    }

    // similar for BatchNorm, MatMul
}
```
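The MatMul arm of the dispatcher could reuse the cached-weights entry point listed in the Context section. A sketch only: `CuBlasMatMul.IsAvailable` and the `weightsAreStatic` parameter are assumptions to be confirmed against the actual Tensors 0.46.0 surface.

```csharp
// Sketch: mirrors the Conv2D dispatcher above for dense layers.
// CuBlasMatMul.IsAvailable is an assumed availability check.
public static Tensor MatMul(Tensor input, Tensor weights, bool weightsAreStatic)
{
    if (CuBlasMatMul.IsAvailable)
    {
        // The cached-weights path amortizes cuBLAS-LT autotune across calls
        // when the weight tensor does not change between forward passes.
        return weightsAreStatic
            ? CuBlasMatMul.MatMulWithCachedWeightsFloat(input, weights)
            : CuBlasMatMul.MatMulFloat(input, weights);
    }

    return AiDotNetEngine.Current.MatMul(input, weights);
}
```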
The public `Engine.*` methods stay unchanged; `GpuOptimalDispatch` is used from `Conv2DLayer` / `BatchNormalizationLayer` / `DenseLayer` when `T == float` and cuDNN is available.
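The layer-side gating could look roughly like this inside `DenseLayer<T>`; the method and field names are illustrative, not the actual AiDotNet layer internals.

```csharp
// Illustrative only: the typeof(T) check keeps double/decimal paths untouched,
// and GpuOptimalDispatch itself handles the cuDNN/cuBLAS availability check.
protected Tensor ForwardMatMul(Tensor input)
{
    if (typeof(T) == typeof(float))
        return GpuOptimalDispatch.MatMul(input, _weights, weightsAreStatic: true);

    return AiDotNetEngine.Current.MatMul(input, _weights);
}
```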
Acceptance
- Tests on an NVIDIA host verify `CuDnnConvolution.Conv2DForward` is actually invoked (can be asserted via `PerformanceProfiler` trace once `EnableTensorsOpProfiling()` is on).
- CPU-only hosts + non-NVIDIA GPUs fall through to existing paths with no change.
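The first acceptance item could be asserted roughly as below. `EnableTensorsOpProfiling()` comes from the issue text, but the trace accessor, the tensor factory, and the exact trace entry name are assumptions.

```csharp
[Fact] // xUnit; meaningful only on an NVIDIA host with cuDNN installed
public void Conv2D_Routes_Through_CuDnn()
{
    PerformanceProfiler.EnableTensorsOpProfiling();

    var input  = Tensor.Random(1, 3, 32, 32); // NCHW; hypothetical factory
    var kernel = Tensor.Random(8, 3, 3, 3);
    GpuOptimalDispatch.Conv2D(input, kernel);

    // GetTrace() is an assumed accessor on PerformanceProfiler.
    var trace = PerformanceProfiler.GetTrace();
    Assert.Contains("CuDnnConvolution.Conv2DForward", trace);
}
```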
Relationship
Blocked on: nothing. Can be done as a follow-up after the current Tensors-parity PR lands.
Parallels: a Tensors-side issue would be cleaner: Tensors' `DirectGpuEngine` should auto-route to cuDNN for Conv2D/BatchNorm when its backend is CUDA and cuDNN is available. That's the real fix.