I tested the performance of the gemm_a16w8 kernel on AMD MI200 and found that when M is large, its performance is worse than both PyTorch (rocBLAS) and Triton's gemm example (https://github.com/xiaonans/triton-gemm-benchmark/blob/main/03-matrix-multiplication.py).
I attached my performance testing results below:

In my performance testing, I added some code so that autotuning runs only on the first invocation, and subsequent benchmarks use the saved best_config. The changes I made are in main...xiaonans:FLASHNN:main. I ran the test with python tests/quant_gemm/test_gemm_weight_only.py.
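For reference, the pattern I used is roughly the following: time each candidate config once on the first run, persist the winner to disk, and reuse it on later runs so the benchmark measures only the tuned kernel. This is a minimal, self-contained sketch; the `autotune`, `run_kernel`, and cache-file names are placeholders I chose for illustration, not the actual FLASHNN/Triton APIs.

```python
import json
import os

CACHE = "best_config.json"

def autotune(configs, run_kernel):
    # Stand-in for Triton-style autotuning: time every candidate config
    # and return the name of the fastest one. A real implementation would
    # launch the kernel and measure with triton.testing.do_bench.
    timings = {name: run_kernel(cfg) for name, cfg in configs.items()}
    return min(timings, key=timings.get)

def get_best_config(configs, run_kernel):
    # First run: autotune and persist the winner.
    # Later runs: load the saved best config and skip tuning entirely.
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)["best"]
    best = autotune(configs, run_kernel)
    with open(CACHE, "w") as f:
        json.dump({"best": best}, f)
    return best
```

This keeps autotuning overhead out of the timed region, which is why the numbers below should reflect steady-state kernel performance only.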
I want to ask whether these performance results are expected, or whether there is something I missed?