
Add: paged attention unroll scene test with 4D input shapes #380

Draft

chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:add-paged-attention-unroll-4dims-st

Conversation

@chenshengxin2026
Contributor

  • New paged_attention_unroll_4dims test under tensormap_and_ringbuffer
  • Query and output tensors use 4D format (batch, seq_len, num_heads, head_dim)
  • 6 kernels: QK/PV matmul (AIC), softmax_prepare/online_update (AIV), hub stubs
  • Orchestration with N_UNROLL=64, 4 tasks per group, online softmax accumulation
  • Golden wraps shared paged_attention_golden with 4D reshape adapter
  • Three test cases: varying batch/heads/head_dim at production scale (bfloat16)
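The "online softmax accumulation" in the orchestration bullet above can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the AIV kernel's actual API: each unrolled chunk folds into a running max, running denominator, and running weighted sum, so the full softmax row is never materialized. The names `online_softmax_update` and `attention_online` are hypothetical.

```python
import numpy as np

def online_softmax_update(state, scores_chunk, values_chunk):
    """Fold one chunk of attention scores/values into the running state."""
    m_prev, d_prev, acc_prev = state
    m_new = np.maximum(m_prev, scores_chunk.max(axis=-1, keepdims=True))
    scale = np.exp(m_prev - m_new)            # rescale previous contributions
    p = np.exp(scores_chunk - m_new)          # probabilities w.r.t. new max
    d_new = d_prev * scale + p.sum(axis=-1, keepdims=True)
    acc_new = acc_prev * scale + p @ values_chunk
    return m_new, d_new, acc_new

def attention_online(scores, values, chunk=4):
    """Chunked softmax(scores) @ values via the online update."""
    m = np.full((scores.shape[0], 1), -np.inf)
    d = np.zeros((scores.shape[0], 1))
    acc = np.zeros((scores.shape[0], values.shape[1]))
    for i in range(0, scores.shape[1], chunk):
        m, d, acc = online_softmax_update(
            (m, d, acc), scores[:, i:i + chunk], values[i:i + chunk])
    return acc / d

np.random.seed(0)
scores = np.random.randn(2, 8)
values = np.random.randn(8, 3)
e = np.exp(scores - scores.max(-1, keepdims=True))
ref = (e / e.sum(-1, keepdims=True)) @ values   # one-shot reference
out = attention_online(scores, values)
```

The same algebra is what lets the orchestration process N_UNROLL-sized slices of the KV cache while keeping only O(head_dim) state per query row.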


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a paged attention implementation with unrolling support, featuring specialized AIC kernels for matrix multiplications (QK and PV) and AIV kernels for vector operations (Softmax and Online Update). It also includes an orchestration layer to manage task scheduling and memory layout. Review feedback suggests improving the robustness of the golden computation logic in golden.py to avoid potential issues with tensor reshaping and correcting a documentation inconsistency regarding the N_UNROLL value in the orchestration code.
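The "4D reshape adapter" mentioned in the summary can be sketched as a thin wrapper: flatten the (batch, seq_len, num_heads, head_dim) query into the shape the shared golden expects, run it, then restore the 4D layout for comparison with kernel output. This is a hypothetical illustration; the real shared `paged_attention_golden` and its signature live in the repo's golden.py, and the placeholder below stands in for it.

```python
import numpy as np

def paged_attention_golden_3d(query_3d):
    # Placeholder for the shared 3D reference implementation.
    return query_3d * 2.0

def paged_attention_golden_4d(query_4d):
    """Adapter: 4D query -> shared 3D golden -> 4D output.

    Using explicit dimensions (rather than -1) in both reshapes keeps the
    adapter robust: a shape mismatch raises immediately instead of silently
    reinterpreting the layout.
    """
    b, s, h, d = query_4d.shape
    out_3d = paged_attention_golden_3d(query_4d.reshape(b * s, h, d))
    return out_3d.reshape(b, s, h, d)

q = np.arange(2 * 1 * 4 * 8, dtype=np.float32).reshape(2, 1, 4, 8)
out = paged_attention_golden_4d(q)
```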

@chenshengxin2026 force-pushed the add-paged-attention-unroll-4dims-st branch from 2125ab0 to d8ccf87 on March 27, 2026 08:35
```cpp
uint32_t value_cache_shapes[4] = {(uint32_t)total_blocks_count, (uint32_t)block_size, (uint32_t)kv_head_num, (uint32_t)head_dim};
uint32_t out_shapes[4] = {(uint32_t)batch, (uint32_t)seq_len, (uint32_t)num_heads, (uint32_t)head_dim};

Tensor query = make_tensor_external(query_ptr, query_shapes, 4, data_type, false);
```
Collaborator


If the shape here is the same as what the golden returns, you can call the to_tensor() interface directly.

@ChaoWao marked this pull request as draft on March 30, 2026 01:16
