Skip to content

Conversation

@r-devulap
Copy link
Contributor

Add specialized path for M=1 case that exploits additional available ymm registers for deeper inner kernel loop unrolling.

Performance impact (measured on 13th Gen Intel(R) Core(TM) i9-13900K):

  • 30% improvement in single threaded QGEMM kernels with M = 1
  • 7% reduction in average inference time on small quantized model where all kernels have M=1
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 | 4330     | 3137     | 4330    | 3136    |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 | 9027     | 6385     | 9027    | 6385    |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867    | 12499    | 17866   | 12498   |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|

@r-devulap r-devulap requested a review from a team as a code owner November 26, 2024 22:00
@r-devulap
Copy link
Contributor Author

Posting raw qgemm benchmark numbers for clarity:

Before:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         4330 ns         4330 ns       161969
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        9027 ns         9027 ns        77210
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      17867 ns        17866 ns        39329
-------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time         3137 ns         3136 ns       221932
QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time        6385 ns         6385 ns       109727
QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time      12499 ns        12498 ns        55934

@r-devulap
Copy link
Contributor Author

Pushed a commit that should fix the 2 CI failures.

@liqunfu
Copy link
Contributor

liqunfu commented Dec 11, 2024

will avx2 benefits from the same - given redundant registers in RowCount=1 cases?

@r-devulap
Copy link
Contributor Author

will avx2 benefits from the same - given redundant registers in RowCount=1 cases?

Not sure. It does have the extra spare registers but the AVX2 code path runs on CPU prior to Icelake (2019), so the CPU is much older technology and not sure if it will show the same benefits as the newer ones.

@r-devulap
Copy link
Contributor Author

Friendly Ping.

@liqunfu
Copy link
Contributor

liqunfu commented Jan 9, 2025

will avx2 benefits from the same - given redundant registers in RowCount=1 cases?

Not sure. It does have the extra spare registers but the AVX2 code path runs on CPU prior to Icelake (2019), so the CPU is much older technology and not sure if it will show the same benefits as the newer ones.

I see this PR is to utilize extra registers not utilized when M=1. it is not specific to vnni or microprocessors.

I am worried for the extra code path just for M1 VNNI, I am thinking if avx2 will also benefit from the extra registers there will be only one code path. That makes code more maintainable.

@r-devulap
Copy link
Contributor Author

I am worried for the extra code path just for M1 VNNI, I am thinking if avx2 will also benefit from the extra registers there will be only one code path. That makes code more maintainable.

I don't think this code is hard to maintain and I have reservations about spending time and effort in making this work for AVX2 for 2 reasons:

(1) Processors that exercise AVX2 code path are nearly 9 yrs old (Skylake generation). An average laptop/desktop lifespan is lesser than that.

(2) Even if we were to exercise the AVX2 code path with this special case for M = 1, I doubt we will see the performance benefits. Skylake generation has just 2 load ports (P2 and P3) and is very likely memory bound in that loop.

@r-devulap
Copy link
Contributor Author

Friendly ping :)

@liqunfu
Copy link
Contributor

liqunfu commented Feb 6, 2025

I am worried for the extra code path just for M1 VNNI, I am thinking if avx2 will also benefit from the extra registers there will be only one code path. That makes code more maintainable.

I don't think this code is hard to maintain and I have reservations about spending time and effort in making this work for AVX2 for 2 reasons:

(1) Processors that exercise AVX2 code path are nearly 9 yrs old (Skylake generation). An average laptop/desktop lifespan is lesser than that.

(2) Even if we were to exercise the AVX2 code path with this special case for M = 1, I doubt we will see the performance benefits. Skylake generation has just 2 load ports (P2 and P3) and is very likely memory bound in that loop.

thank for the explanation!

@r-devulap
Copy link
Contributor Author

Can someone trigger the CI, please?

@r-devulap
Copy link
Contributor Author

Rebased with main.

@hariharans29
Copy link
Member

hariharans29 commented Nov 10, 2025

Hi @r-devulap
We are sorry we couldn't get to this earlier. Can you please re-base with main (that is needed to re-trigger CI) ? I will have this merged after that. Thanks.

@r-devulap
Copy link
Contributor Author

@hariharans29 rebased.

@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Copy link
Member

@snnn - Do you know why some Linux Minimal Build E2E jobs just seem to be stuck at "Starting job"? It was stuck for 15 hours. I just cancelled and re-tried but so far it still seems to be stuck in the same state.

@hariharans29
Copy link
Member

hariharans29 commented Nov 13, 2025

Hi @r-devulap: For some reason, some jobs are not finding runners. I see other PRs using forks work ok. FWIW- Could you please rebase again to see if it goes away after that ? Thanks.

@r-devulap
Copy link
Contributor Author

rebased.

@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Copy link
Member

hariharans29 commented Nov 14, 2025

I think it may need one more fix that just got merged to fix the failing x86 Windows pipeline: #26559

Raghuveer Devulapalli added 3 commits November 14, 2025 12:52
QGEMM Benchmarks when M = 1 on an 13th Gen Intel(R) Core(TM) i9-13900K
shows a 1.4x improvement on a single thread.

|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| Benchmark                                                          | Time   | CPU     | Time Old | Time New | CPU Old | CPU New |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
| QGEMM/UnsignedAPackB/M:1/N:512/K:512/Batch:1/Threads:1/real_time   | -0.275 | -0.2756 | 4330     | 3137     | 4330    | 3136    |
| QGEMM/UnsignedAPackB/M:1/N:512/K:1024/Batch:1/Threads:1/real_time  | -0.292 | -0.2927 | 9027     | 6385     | 9027    | 6385    |
| QGEMM/UnsignedAPackB/M:1/N:1024/K:1024/Batch:1/Threads:1/real_time | -0.300 | -0.3005 | 17867    | 12499    | 17866   | 12498   |
| OVERALL_GEOMEAN                                                    | -0.289 | -0.2897 |          |          |         |         |
|--------------------------------------------------------------------+--------+---------+----------+----------+---------+---------|
@r-devulap
Copy link
Contributor Author

rebased.

@hariharans29
Copy link
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines
Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Copy link
Member

hariharans29 commented Nov 14, 2025

rebased.

Thanks for patiently rebasing all these times. Fingers crossed, all pipelines will be green this time.

@hariharans29 hariharans29 enabled auto-merge (squash) November 14, 2025 22:59
@hariharans29 hariharans29 merged commit d6a372a into microsoft:main Nov 15, 2025
90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants