I tested the performance of the gemm_a16w8 kernel on AMD MI200 and found that when M is large, its performance is worse than both PyTorch (rocBLAS) and Triton's gemm example (https://github.com/xiaonans/triton-gemm-benchmark/blob/main/03-matrix-multiplication.py).
I attached my performance testing results below:

In my performance testing, I added some code so that autotuning runs only on the first invocation, and subsequent benchmarks use the saved best_config. The changes I made are in main...xiaonans:FLASHNN:main. I ran the test with python tests/quant_gemm/test_gemm_weight_only.py.
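For reference, the pattern I used is roughly the following: time each candidate config once on the first run, persist the winner to disk, and reuse it on later runs so the benchmark measures only the tuned kernel. This is a minimal, self-contained sketch; the `autotune`, `run_kernel`, and cache-file names are placeholders I chose for illustration, not the actual FLASHNN/Triton APIs.

```python
import json
import os

CACHE = "best_config.json"

def autotune(configs, run_kernel):
    # Stand-in for Triton-style autotuning: time every candidate config
    # and return the name of the fastest one. A real implementation would
    # launch the kernel and measure with triton.testing.do_bench.
    timings = {name: run_kernel(cfg) for name, cfg in configs.items()}
    return min(timings, key=timings.get)

def get_best_config(configs, run_kernel):
    # First run: autotune and persist the winner.
    # Later runs: load the saved best config and skip tuning entirely.
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)["best"]
    best = autotune(configs, run_kernel)
    with open(CACHE, "w") as f:
        json.dump({"best": best}, f)
    return best
```

This keeps autotuning overhead out of the timed region, which is why the numbers below should reflect steady-state kernel performance only.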
I want to ask whether these performance results are expected, or whether there is something I missed?