qgemm: optimize avxvnni QGEMM inner kernel for M=1 #22952

r-devulap · 2024-11-26T22:00:00Z

Add specialized path for M=1 case that exploits additional available ymm registers for deeper inner kernel loop unrolling.
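The idea of spending spare registers on deeper unrolling can be illustrated with a small scalar model. This is only a sketch, not the kernel from this PR: `vpdpbusd_lane`, `dot_reference`, `dot_unrolled` and the `unroll` parameter are hypothetical names. It assumes each accumulator stands in for one ymm register, that one step mimics a single 32-bit lane of a VPDPBUSD-style instruction (four u8 x s8 products added to an int32), and it ignores 32-bit wraparound.

```python
import random


def vpdpbusd_lane(acc, a4, b4):
    """Model one 32-bit lane of a VPDPBUSD-style step: acc += sum of 4 (u8 * s8) products."""
    return acc + sum(a * b for a, b in zip(a4, b4))


def dot_reference(a_row, b_col):
    """Plain dot product over K, used as the ground truth."""
    return sum(a * b for a, b in zip(a_row, b_col))


def dot_unrolled(a_row, b_col, unroll=4):
    """K loop unrolled `unroll` times with one independent accumulator per step.

    Independent accumulators break the dependency chain between consecutive
    multiply-accumulate steps; each one would occupy its own register.
    """
    assert len(a_row) == len(b_col) and len(a_row) % 4 == 0
    accs = [0] * unroll
    groups = len(a_row) // 4  # number of 4-byte groups along K
    g = 0
    while g + unroll <= groups:  # main unrolled loop
        for u in range(unroll):
            k = (g + u) * 4
            accs[u] = vpdpbusd_lane(accs[u], a_row[k:k + 4], b_col[k:k + 4])
        g += unroll
    while g < groups:  # remainder groups go to the first accumulator
        k = g * 4
        accs[0] = vpdpbusd_lane(accs[0], a_row[k:k + 4], b_col[k:k + 4])
        g += 1
    return sum(accs)  # final reduction across accumulators


rng = random.Random(0)
a_row = [rng.randrange(256) for _ in range(512)]        # unsigned 8-bit A row (M = 1)
b_col = [rng.randrange(-128, 128) for _ in range(512)]  # signed 8-bit B column
print(dot_unrolled(a_row, b_col, unroll=8) == dot_reference(a_row, b_col))
```

With M = 1 only one row of A needs accumulators, so the registers that a multi-row tile would use for other rows are free to hold the extra accumulators modeled here.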

Performance impact (measured on 13th Gen Intel(R) Core(TM) i9-13900K):

About 29% lower run time (roughly 1.4x faster) for single-threaded QGEMM kernels with M = 1
7% reduction in average inference time on a small quantized model where all kernels have M = 1

|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 | 4330     | 3137     | 4330    | 3136    |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 | 9027     | 6385     | 9027    | 6385    |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867    | 12499    | 17866   | 12498   |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
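For readers checking the table, the short sketch below recomputes the relative changes and the overall speedup from the Time Old and Time New columns. It assumes the Time column is (new - old) / old and that OVERALL_GEOMEAN is the geometric mean of the new/old ratios minus one; the dictionary keys are shorthand labels, not benchmark names.

```python
import math

# (old_ns, new_ns) real-time values copied from the table above
times = {
    "N512_K512": (4330, 3137),
    "N512_K1024": (9027, 6385),
    "N1024_K1024": (17867, 12499),
}


def relative_change(old, new):
    """Fractional change in run time; negative means the new code is faster."""
    return (new - old) / old


changes = {name: relative_change(old, new) for name, (old, new) in times.items()}

# Geometric mean of the new/old ratios, then expressed as a change and as a speedup
ratios = [new / old for old, new in times.values()]
geomean_ratio = math.prod(ratios) ** (1 / len(ratios))
geomean_change = geomean_ratio - 1
speedup = 1 / geomean_ratio

for name, change in changes.items():
    print(f"{name}: {change:+.4f}")
print(f"geomean change: {geomean_change:+.4f}, speedup: {speedup:.2f}x")
```

Under these assumptions a run-time reduction of about 29% and a speedup of about 1.4x describe the same result.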

r-devulap · 2024-11-26T22:17:54Z

Posting raw qgemm benchmark numbers for clarity:

Before:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         4330 ns         4330 ns       161969
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        9027 ns         9027 ns        77210
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      17867 ns        17866 ns        39329
-------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         3137 ns         3136 ns       221932
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        6385 ns         6385 ns       109727
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      12499 ns        12498 ns        55934

r-devulap · 2024-12-08T09:46:47Z

Pushed a commit that should fix the 2 CI failures.

liqunfu · 2024-12-11T02:21:04Z

Will AVX2 benefit from the same change, given the spare registers in RowCount=1 cases?

r-devulap · 2024-12-15T16:53:46Z

> Will AVX2 benefit from the same change, given the spare registers in RowCount=1 cases?

Not sure. AVX2 does have the spare registers, but the AVX2 code path only runs on CPUs older than Ice Lake (2019). That hardware is a much older design, so I am not sure it would show the same benefit as the newer CPUs.

r-devulap · 2024-12-31T06:57:41Z

Friendly Ping.

liqunfu · 2025-01-09T19:56:56Z

> > Will AVX2 benefit from the same change, given the spare registers in RowCount=1 cases?
>
> Not sure. AVX2 does have the spare registers, but the AVX2 code path only runs on CPUs older than Ice Lake (2019). That hardware is a much older design, so I am not sure it would show the same benefit as the newer CPUs.

I see this PR makes use of the extra registers that sit unused when M=1. That idea is not specific to VNNI or to any particular microprocessor.

I am worried about adding an extra code path just for M=1 on VNNI. If AVX2 also benefits from the extra registers, there could be a single code path, which would make the code more maintainable.

r-devulap · 2025-01-22T17:48:03Z

> I am worried about adding an extra code path just for M=1 on VNNI. If AVX2 also benefits from the extra registers, there could be a single code path, which would make the code more maintainable.

I don't think this code is hard to maintain, and I have reservations about spending time and effort on making this work for AVX2, for two reasons:

(1) Processors that exercise the AVX2 code path are nearly 9 years old (Skylake generation). The average laptop/desktop lifespan is shorter than that.

(2) Even if we exercised the AVX2 code path with this special case for M = 1, I doubt we would see the performance benefit. The Skylake generation has just 2 load ports (P2 and P3), and that loop is very likely memory bound.

r-devulap · 2025-02-05T16:43:47Z

Friendly ping :)

liqunfu · 2025-02-06T01:55:03Z

> > I am worried about adding an extra code path just for M=1 on VNNI. If AVX2 also benefits from the extra registers, there could be a single code path, which would make the code more maintainable.
>
> I don't think this code is hard to maintain, and I have reservations about spending time and effort on making this work for AVX2, for two reasons:
>
> (1) Processors that exercise the AVX2 code path are nearly 9 years old (Skylake generation). The average laptop/desktop lifespan is shorter than that.
>
> (2) Even if we exercised the AVX2 code path with this special case for M = 1, I doubt we would see the performance benefit. The Skylake generation has just 2 load ports (P2 and P3), and that loop is very likely memory bound.

Thanks for the explanation!

r-devulap · 2025-02-11T04:55:23Z

Can someone trigger the CI, please?

r-devulap · 2025-02-20T20:55:47Z

Rebased with main.

hariharans29 · 2025-11-10T18:33:40Z

Hi @r-devulap
We are sorry we couldn't get to this earlier. Can you please rebase with main (that is needed to re-trigger CI)? I will have this merged after that. Thanks.

r-devulap · 2025-11-10T19:37:30Z

@hariharans29 rebased.

hariharans29 · 2025-11-10T19:38:40Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-11-10T19:38:58Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-11-11T17:46:36Z

@snnn - Do you know why some Linux Minimal Build E2E jobs just seem to be stuck at "Starting job"? It was stuck for 15 hours. I just cancelled and re-tried but so far it still seems to be stuck in the same state.

hariharans29 · 2025-11-13T19:58:42Z

Hi @r-devulap: For some reason, some jobs are not finding runners. I see that other PRs from forks work OK. FWIW, could you please rebase again to see if the problem goes away? Thanks.

r-devulap · 2025-11-14T17:12:25Z

rebased.

hariharans29 · 2025-11-14T19:42:17Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-11-14T19:42:35Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-11-14T20:48:52Z

I think it may need one more fix that just got merged to fix the failing x86 Windows pipeline: #26559

QGEMM benchmarks for M = 1 on a 13th Gen Intel(R) Core(TM) i9-13900K show a 1.4x improvement on a single thread.

|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 | 4330     | 3137     | 4330    | 3136    |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 | 9027     | 6385     | 9027    | 6385    |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867    | 12499    | 17866   | 12498   |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|

r-devulap · 2025-11-14T20:53:25Z

rebased.

hariharans29 · 2025-11-14T20:58:22Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2025-11-14T20:58:40Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2025-11-14T21:02:18Z

> rebased.

Thanks for patiently rebasing all these times. Fingers crossed, all pipelines will be green this time.

r-devulap requested a review from a team as a code owner November 26, 2024 22:00

liqunfu approved these changes Feb 6, 2025

r-devulap force-pushed the avxvnni-unroll branch from 231f0fa to 38098e9 Compare February 20, 2025 20:55

r-devulap force-pushed the avxvnni-unroll branch from 38098e9 to f1a666c Compare November 10, 2025 19:37

r-devulap force-pushed the avxvnni-unroll branch from f1a666c to 9768177 Compare November 14, 2025 17:12

Raghuveer Devulapalli added 3 commits November 14, 2025 12:52

Add QGEMM benchmarks for M = 1 on a single thread

e41e01f

Add missing comma

19a7885

r-devulap force-pushed the avxvnni-unroll branch from 9768177 to 19a7885 Compare November 14, 2025 20:53

hariharans29 approved these changes Nov 14, 2025

hariharans29 enabled auto-merge (squash) November 14, 2025 22:59

hariharans29 merged commit d6a372a into microsoft:main Nov 15, 2025
90 checks passed
