# [webgpu] add support of output_qk for MHA #26553
base: main
Conversation
Force-pushed from `b6e6d54` to `7d0139a`.
Pull Request Overview
This pull request adds support for the output_qk parameter in the MultiHeadAttention operator for the WebGPU execution provider. The output_qk output contains the scaled Q*K^T attention scores (before softmax), which is useful for debugging and visualization purposes.
Key Changes
- Modified `ComputeContext` to add a `CopyTensor` method for copying tensor data, requiring an `OpKernel` reference in the constructor
- Updated the `ApplyAttention` function signature to accept an optional `output_qk` parameter
- Modified `MultiHeadAttention` to compute and output QK scores when requested, with flash attention disabled when QK output is needed
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `compute_context.h` | Added `CopyTensor` method and updated constructor to accept an `OpKernel` reference |
| `compute_context.cc` | Updated constructor implementation to match header changes |
| `multihead_attention.cc` | Added logic to create the `output_qk` tensor and conditionally disable flash attention when QK output is requested |
| `group_query_attention.cc` | Passed `nullptr` for the `output_qk` parameter as GQA doesn't support this feature |
| `attention_common.h` | Updated `ApplyAttention` function signature to include the `output_qk` parameter |
| `attention.cc` | Implemented QK score copying logic and updated function signatures |
### Description
Fixed compilation error in `webgpu_kernel.cc` where `ComputeContext`
constructor call was missing the required `OpKernel` parameter.
**Change:**
```cpp
// Before (compilation error - 3 params, needs 4)
ComputeContext context{*p_op_kernel_context, ep_, webgpu_context};
// After (correct)
ComputeContext context{*p_op_kernel_context, *this, ep_, webgpu_context};
```
The constructor signature requires the `OpKernel` reference as the
second parameter. Since `WebGpuKernel` inherits from `OpKernel`, passing
`*this` satisfies the requirement. The `op_kernel_` member is used
internally by `CopyTensor()` to access the data transfer manager.
### Motivation and Context
Addresses review feedback from
#26553 (comment)
which identified the missing parameter that would cause a compilation
failure.
---
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: fs-eire <[email protected]>
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
```cpp
TensorShape output_qk_shape(output_qk_dims);
Tensor* output_qk = context.Output(3, output_qk_shape);
```

```cpp
if (output_qk == nullptr &&  // Flash attention does not output QK scores
```

**Copilot AI** reviewed on Nov 15, 2025:

> [nitpick] Extra space between `nullptr` and `&&` should be removed for consistent formatting.
```cpp
parameters, past_sequence_length, total_sequence_length, seqlen_k));
```

```cpp
if (output_qk != nullptr) {
  // Copy the attention scores (scaled Q*K^T) to output_qk
```

**Copilot AI** reviewed on Nov 15, 2025:

> The comment is incomplete. According to lines 172-175 of the `ComputeAttentionProbs` function, the `probs` tensor contains `sum * uniforms.alpha + attention_bias[outputIdx]` when `attention_bias` is present, not just scaled Q*K^T. The comment should be updated to reflect this, for example:

```cpp
// Copy the raw attention scores (scaled Q*K^T plus attention_bias if present, before softmax) to output_qk
```
### Description
The WebGPU EP does not support MHA's `qk` output yet. This PR makes it handle the `qk` output correctly.