webgpu: optimize Gemm and MatMul using subgroup feature #26433
base: main
Conversation
@guschmue @fs-eire @qjia7 @jchen10 @Jiawei-Shao
@xhcao It seems you didn't include your test cases in this PR. Is there a reason for that? Also, it would help to provide some performance data for reference.
Pull Request Overview
This PR adds Intel-specific optimizations for GEMM and MatMul operations in the WebGPU provider, targeting Intel Xe-2 GPU architectures (xe-2lpg and xe-2hpg) using subgroup operations.
- Implements subgroup-based GEMM and MatMul kernels optimized for Intel GPUs
- Adds vendor-specific code paths that are conditionally executed based on adapter info and matrix dimensions (see the sketch after this list)
- Integrates Intel optimizations into existing GEMM and MatMul operators
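As a rough sketch of that conditional gating, the check might look something like the following. The struct, function name, and size threshold are illustrative placeholders, not the actual symbols or values added by this PR:

```cpp
#include <cstdint>
#include <string>

// Illustrative only: the real PR reads adapter info from the WebGPU
// execution provider; names and thresholds here are assumptions.
struct AdapterInfo {
  std::string vendor;        // e.g. "intel"
  std::string architecture;  // e.g. "xe-2lpg" or "xe-2hpg"
};

bool CanApplyMatMulIntel(const AdapterInfo& info, int64_t m, int64_t n, int64_t k) {
  const bool is_xe2 = info.vendor == "intel" &&
                      (info.architecture == "xe-2lpg" || info.architecture == "xe-2hpg");
  // Small matrices are unlikely to benefit from the subgroup path;
  // the threshold below is a placeholder, not the value used in the PR.
  const bool large_enough = m >= 32 && n >= 32 && k >= 32;
  return is_xe2 && large_enough;
}
```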
Reviewed Changes
Copilot reviewed 2 out of 10 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| matmul_intel.h | Defines MatMulSubgroupProgram class and Intel MatMul entry points |
| matmul_intel.cc | Implements Intel-optimized MatMul with subgroup support and batched optimizations |
| gemm_utils_intel.h | Declares helper functions for reading/writing matrix data in Intel kernels |
| gemm_utils_intel.cc | Implements shader generation helpers for Intel-specific GEMM operations |
| gemm_subgroup.h | Defines constants and functions for subgroup-based matrix operations |
| gemm_subgroup.cc | Implements vec4 and scalar variants of subgroup matrix multiplication |
| gemm_intel.h | Defines GemmSubgroupProgram class and Intel GEMM entry points |
| gemm_intel.cc | Implements Intel-optimized GEMM with subgroup support |
| matmul.cc | Integrates Intel MatMul optimization check into existing MatMul operator |
| gemm.cc | Integrates Intel GEMM optimization check into existing GEMM operator |
First, I would like to get agreement on creating a vendor-specific directory.
You don't need to depend on that. You can simply improve your PR with more test cases and better perf data, and make it easier to review.
Could you please merge the latest main and push? The pipeline failures should be unrelated and may be fixed by a rerun.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
I understand the desire for a vendor directory. We can try it and see how that works; we just need to be careful with the testing.
@guschmue Thank you for your understanding and support. We will fully test our changes and closely monitor the status on LNL/BMG. Once it is mature enough, we can roll it out to more devices.
fs-eire left a comment
The code generally looks good to me, but I think a few changes are needed for maintainability:
1. We can keep the directory structure (onnxruntime/core/providers/webgpu/vendor/intel), but there is no need to add an `_intel` suffix to the file names, since the vendor is already part of each file's full path (the folder name).
2. Use a sub-namespace `intel` instead of putting the vendor in function/class names (there are exceptions, see 3); for example (see the sketch below):
   - use a namespace like this: `namespace onnxruntime { namespace webgpu { namespace intel { ... } } }`
   - `MatMulReadFnSourceIntel(...)` --> `intel::MatMulReadFnSource(...)`
3. For `CanApplyXXXIntel` or `ApplyXXXIntel`, it is OK to keep them as-is, because these specific functions are considered the "entry" points into the vendor-specific code.
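To make the suggestion concrete, here is a minimal sketch of the proposed layout, assuming a header such as `gemm_utils.h` under the vendor directory; the declaration shown is illustrative rather than the exact signature from the PR:

```cpp
// onnxruntime/core/providers/webgpu/vendor/intel/gemm_utils.h
// (no _intel suffix; the folder already identifies the vendor)
#include <string>

namespace onnxruntime {
namespace webgpu {
namespace intel {

// Previously MatMulReadFnSourceIntel(...); the vendor now lives in the
// namespace rather than the identifier. Parameters are omitted here.
std::string MatMulReadFnSource();

}  // namespace intel
}  // namespace webgpu
}  // namespace onnxruntime
```

Callers outside the vendor directory would then qualify the call as `intel::MatMulReadFnSource(...)`, while the `CanApplyXXXIntel` / `ApplyXXXIntel` entry points keep their suffix as noted in item 3.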
I haven't looked closely at all the code yet, but here are several high-level comments.
Attach shader files