-
Notifications
You must be signed in to change notification settings - Fork 3.5k
qgemm: optimize avxvnni QGEMM inner kernel for M=1 #22952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Posting raw qgemm benchmark numbers for clarity: Before: After: |
|
Pushed a commit that should fix the 2 CI failures. |
|
will avx2 benefits from the same - given redundant registers in RowCount=1 cases? |
Not sure. It does have the extra spare registers but the AVX2 code path runs on CPU prior to Icelake (2019), so the CPU is much older technology and not sure if it will show the same benefits as the newer ones. |
|
Friendly Ping. |
I see this PR is to utilize extra registers not utilized when M=1. it is not specific to vnni or microprocessors. I am worried for the extra code path just for M1 VNNI, I am thinking if avx2 will also benefit from the extra registers there will be only one code path. That makes code more maintainable. |
I don't think this code is hard to maintain and I have reservations about spending time and effort in making this work for AVX2 for 2 reasons: (1) Processors that exercise AVX2 code path are nearly 9 yrs old (Skylake generation). An average laptop/desktop lifespan is lesser than that. (2) Even if we were to exercise the AVX2 code path with this special case for M = 1, I doubt we will see the performance benefits. Skylake generation has just 2 load ports (P2 and P3) and is very likely memory bound in that loop. |
|
Friendly ping :) |
thank for the explanation! |
|
Can someone trigger the CI, please? |
231f0fa to
38098e9
Compare
|
Rebased with main. |
|
Hi @r-devulap |
38098e9 to
f1a666c
Compare
|
@hariharans29 rebased. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
@snnn - Do you know why some Linux Minimal Build E2E jobs just seem to be stuck at "Starting job"? It was stuck for 15 hours. I just cancelled and re-tried but so far it still seems to be stuck in the same state. |
|
Hi @r-devulap: For some reason, some jobs are not finding runners. I see other PRs using forks work ok. FWIW- Could you please rebase again to see if it goes away after that ? Thanks. |
f1a666c to
9768177
Compare
|
rebased. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
I think it may need one more fix that just got merged to fix the failing x86 Windows pipeline: #26559 |
QGEMM Benchmarks when M = 1 on an 13th Gen Intel(R) Core(TM) i9-13900K shows a 1.4x improvement on a single thread. |--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------| | Benchmark | Time | CPU | Time Old | Time New | CPU Old | CPU New | |--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------| | QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time | -0.275 | -0.2756 | 4330 | 3137 | 4330 | 3136 | | QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time | -0.292 | -0.2927 | 9027 | 6385 | 9027 | 6385 | | QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867 | 12499 | 17866 | 12498 | | OVERALL_GEOMEAN | -0.289 | -0.2897 | | | | | |--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
9768177 to
19a7885
Compare
|
rebased. |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Thanks for patiently rebasing all these times. Fingers crossed, all pipelines will be green this time. |
Add specialized path for M=1 case that exploits additional available ymm registers for deeper inner kernel loop unrolling.
Performance impact (measured on 13th Gen Intel(R) Core(TM) i9-13900K):