Conversation

fs-eire (Contributor) commented Nov 12, 2025

Description

The WebGPU EP does not yet support MHA's qk output. This PR adds correct handling of the qk output.

fs-eire changed the title from "[WIP][webgpu] add support of output_qk for MHA" to "[webgpu] add support of output_qk for MHA" on Nov 15, 2025
fs-eire marked this pull request as ready for review November 15, 2025 00:14
fs-eire requested a review from Copilot November 15, 2025 00:14
Copilot finished reviewing on behalf of fs-eire November 15, 2025 00:17
Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request adds support for the optional output_qk output of the MultiHeadAttention operator in the WebGPU execution provider. output_qk contains the scaled Q*K^T attention scores (before softmax), which is useful for debugging and visualization.
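
For reference, the scores written to output_qk have the standard scaled-dot-product form; the bias term B applies only when an attention_bias input is present (a review comment further down makes this same point), and α is the softmax scale, 1/√d_k by default:

```math
S = \alpha\,QK^{\top} + B,\qquad \alpha = \frac{1}{\sqrt{d_k}}\ \text{(default scale)}
```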

Key Changes

  • Modified ComputeContext to add a CopyTensor method for copying tensor data, requiring an OpKernel reference in the constructor
  • Updated the ApplyAttention function signature to accept an optional output_qk parameter
  • Modified MultiHeadAttention to compute and output the QK scores when requested, disabling flash attention when the QK output is needed (see the sketch after this list)
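
A minimal standalone sketch of that last gating decision, under the assumption that flash attention cannot materialize the raw scores; all names here are illustrative, not the actual ORT WebGPU code:

```cpp
#include <iostream>

// Illustrative stand-in for the kernel's per-call state; not an ORT type.
struct AttentionCall {
  bool qk_output_requested;  // true when the optional output_qk is bound
};

// Flash attention fuses the softmax into the QK pass, so the raw
// pre-softmax scores are never materialized; requesting output_qk
// therefore forces the non-fused attention path.
bool CanUseFlashAttention(const AttentionCall& call) {
  return !call.qk_output_requested;
}

int main() {
  std::cout << std::boolalpha
            << CanUseFlashAttention({/*qk_output_requested=*/true}) << '\n'   // false
            << CanUseFlashAttention({/*qk_output_requested=*/false}) << '\n'; // true
}
```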

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| compute_context.h | Added `CopyTensor` method and updated the constructor to accept an `OpKernel` reference |
| compute_context.cc | Updated the constructor implementation to match the header changes |
| multihead_attention.cc | Added logic to create the `output_qk` tensor and conditionally disable flash attention when QK output is requested |
| group_query_attention.cc | Passed `nullptr` for the `output_qk` parameter, as GQA does not support this feature |
| attention_common.h | Updated the `ApplyAttention` function signature to include the `output_qk` parameter |
| attention.cc | Implemented the QK score copying logic and updated function signatures |


Copilot AI (Contributor) commented Nov 15, 2025

@fs-eire I've opened a new pull request, #26582, to work on those changes. Once the pull request is ready, I'll request review from you.

### Description

Fixed compilation error in `webgpu_kernel.cc` where `ComputeContext`
constructor call was missing the required `OpKernel` parameter.

**Change:**
```cpp
// Before (compilation error - 3 params, needs 4)
ComputeContext context{*p_op_kernel_context, ep_, webgpu_context};

// After (correct)
ComputeContext context{*p_op_kernel_context, *this, ep_, webgpu_context};
```

The constructor signature requires the `OpKernel` reference as the
second parameter. Since `WebGpuKernel` inherits from `OpKernel`, passing
`*this` satisfies the requirement. The `op_kernel_` member is used
internally by `CopyTensor()` to access the data transfer manager.
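
For context, a minimal sketch of how such a `CopyTensor` helper can route through the kernel's data transfer manager. `OpKernel::Info()`, `OpKernelInfo::GetDataTransferManager()`, and `DataTransferManager::CopyTensor()` are existing ORT framework APIs, but the member name `op_kernel_` and the exact wiring are assumptions, not the actual diff:

```cpp
#include "core/framework/op_kernel.h"              // OpKernel, OpKernelInfo
#include "core/framework/data_transfer_manager.h"  // DataTransferManager

namespace onnxruntime {
namespace webgpu {

// Sketch only: assumes ComputeContext stores the OpKernel reference passed
// to its constructor in an op_kernel_ member.
Status ComputeContext::CopyTensor(const Tensor& src, Tensor& dst) {
  // The kernel's info object exposes the data transfer manager, which
  // picks the appropriate copy implementation (e.g. GPU readback) for
  // the source/destination device pair.
  return op_kernel_.Info().GetDataTransferManager().CopyTensor(src, dst);
}

}  // namespace webgpu
}  // namespace onnxruntime
```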

### Motivation and Context

Addresses review feedback from #26553 (comment), which identified the missing parameter that would cause a compilation failure.


---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: fs-eire <[email protected]>
fs-eire requested a review from Copilot November 15, 2025 01:21
Copilot finished reviewing on behalf of fs-eire November 15, 2025 01:23
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.



```cpp
TensorShape output_qk_shape(output_qk_dims);
Tensor* output_qk = context.Output(3, output_qk_shape);

if (output_qk == nullptr &&  // Flash attention does not output QK scores
```
Copilot AI commented on Nov 15, 2025

[nitpick] The extra space between `nullptr` and `&&` should be removed for consistent formatting.

Suggested change
```diff
-if (output_qk == nullptr  &&  // Flash attention does not output QK scores
+if (output_qk == nullptr &&  // Flash attention does not output QK scores
```

```cpp
parameters, past_sequence_length, total_sequence_length, seqlen_k));

if (output_qk != nullptr) {
  // Copy the attention scores (scaled Q*K^T) to output_qk
```
Copilot AI commented on Nov 15, 2025

The comment is incomplete. According to lines 172-175 of the ComputeAttentionProbs function, the probs tensor contains `sum * uniforms.alpha + attention_bias[outputIdx]` when attention_bias is present, not just the scaled Q*K^T. The comment should be updated to reflect this: 'Copy the attention scores (scaled Q*K^T with attention_bias if present) to output_qk' or 'Copy the raw attention scores (before softmax) to output_qk'.

Suggested change
```diff
-// Copy the attention scores (scaled Q*K^T) to output_qk
+// Copy the raw attention scores (scaled Q*K^T plus attention_bias if present, before softmax) to output_qk
```
