Fix: profiling buffer recycling and implicit task record emission #419

Merged: ChaoWao merged 1 commit into hw-native-sys:main from ChaoZheng109:a2a3/pref_bug on Apr 1, 2026

@ChaoZheng109 (Collaborator):

Overhaul the profiling subsystem to fix buffer leaks, lost records, and excessive device memory allocation:

Buffer recycling (Host):

  • Replace alloc/free-per-cycle with closed-loop buffer pools: completed buffers are recycled into per-type pools (recycled_perf_buffers_, recycled_phase_buffers_) instead of being freed
  • Pre-allocate all PerfBuffers and PhaseBuffers at init time via PLATFORM_PROF_BUFFERS_PER_CORE and PLATFORM_PROF_BUFFERS_PER_THREAD, eliminating runtime rtMalloc calls entirely
  • Reduce PLATFORM_PROF_SLOT_COUNT from 8 to 4 (sufficient with recycling)
  • Process ready buffers during the expected_tasks wait phase to prevent device memory leaks when AICPU is slow to report total_tasks
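
The closed-loop idea in the Host-side bullets can be sketched as follows. This is a minimal illustration, not the PR's code: `PerfBuffer` here is a hypothetical stand-in struct, `RecycledPool` is an invented name, and a `std::vector` takes the place of device memory obtained through the runtime. The point is only that every buffer is created once up front and afterwards moves between "in use" and "recycled" without being freed.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device-side profiling buffer handle.
struct PerfBuffer {
    int id = 0;
};

// Minimal closed-loop pool: all buffers are created in the constructor and
// afterwards only move between "handed out" and "recycled".
class RecycledPool {
public:
    explicit RecycledPool(std::size_t count) : storage_(count) {
        recycled_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            storage_[i].id = static_cast<int>(i);
            recycled_.push_back(&storage_[i]);
        }
    }

    // Returns nullptr when the pool is empty instead of allocating at runtime.
    PerfBuffer* acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (recycled_.empty()) return nullptr;
        PerfBuffer* buf = recycled_.back();
        recycled_.pop_back();
        return buf;
    }

    // A completed buffer goes back into the pool rather than being freed.
    void recycle(PerfBuffer* buf) {
        assert(buf != nullptr);
        std::lock_guard<std::mutex> lock(mu_);
        recycled_.push_back(buf);
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mu_);
        return recycled_.size();
    }

private:
    std::mutex mu_;
    std::vector<PerfBuffer> storage_;    // owns every buffer for the pool's lifetime
    std::vector<PerfBuffer*> recycled_;  // buffers currently idle
};
```

With a pool like this, a caller that gets `nullptr` from `acquire()` knows all buffers are in flight and has to wait for one to be recycled, which is the trade-off of removing runtime allocation.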

Buffer recycling (AICPU):

  • Replace per-cycle rtMalloc/rtFree with pre-allocated buffer arrays and free-list indices, matching the Host-side closed-loop design
  • Pre-allocate phase buffers per thread and perf buffers per core at profiling init, recycling them on buffer switch
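
A rough sketch of "pre-allocated buffer arrays and free-list indices" is shown below, under stated assumptions: `CoreBufferSet`, `kBuffersPerCore`, and `kNoBuffer` are hypothetical names, the count of 4 is arbitrary (the real value comes from PLATFORM_PROF_BUFFERS_PER_CORE, which this description does not state), and plain integers stand in for the buffer payloads. It uses fixed-size arrays only, so no allocation happens after construction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical buffer count for illustration only.
constexpr std::size_t kBuffersPerCore = 4;
constexpr int32_t kNoBuffer = -1;

struct CoreBufferSet {
    std::array<uint64_t, kBuffersPerCore> data{};      // stand-in buffer payloads
    std::array<int32_t, kBuffersPerCore> free_list{};  // stack of free buffer indices
    std::size_t free_count = 0;

    CoreBufferSet() {
        for (std::size_t i = 0; i < kBuffersPerCore; ++i) {
            free_list[free_count++] = static_cast<int32_t>(i);
        }
    }

    // Pop a free index, or kNoBuffer if every buffer is currently in flight.
    int32_t take() {
        if (free_count == 0) return kNoBuffer;
        return free_list[--free_count];
    }

    // Push an index back on buffer switch; rejects bad or surplus indices.
    bool give_back(int32_t index) {
        if (index < 0 || static_cast<std::size_t>(index) >= kBuffersPerCore) return false;
        if (free_count == kBuffersPerCore) return false;  // more returns than takes
        free_list[free_count++] = index;
        assert(free_count <= kBuffersPerCore);
        return true;
    }
};
```

Tracking indices rather than pointers keeps the bookkeeping small and makes range checks trivial, which suits code running on a constrained control core.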

Implicit task profiling (AICPU):

  • When pipelining two tasks on the same core, AICore may transition from FIN(task_A) to ACK(task_B) before AICPU reads the register. Both implicit completion paths (pending FIN and pending ACK) counted the old task but never called perf_aicpu_complete_record, losing its profiling record
  • Add perf_aicpu_complete_record calls in both paths, completing the implicit task's record before the explicit task's to preserve buffer ordering
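
The ordering requirement above can be illustrated with a small model. Everything in it is an assumption made for the sketch: `CoreState`, `RegisterSnapshot`, `CoreTracker`, and `on_register_read` are invented names, and `complete_record` is a stand-in that only logs a task id where the real code calls perf_aicpu_complete_record. The model shows the implicit task's record being completed before the record of the task the register now reports.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical register states for the sketch.
enum class CoreState { kAck, kFin };

struct RegisterSnapshot {
    CoreState state;
    uint32_t task_id;
};

struct CoreTracker {
    bool has_pending = false;       // a task was acknowledged but not yet seen finishing
    uint32_t pending_task = 0;
    uint32_t completed_count = 0;
    std::vector<uint32_t> records;  // order in which profiling records were completed
};

// Stand-in for perf_aicpu_complete_record: only logs the task id.
inline void complete_record(CoreTracker& t, uint32_t task_id) {
    t.records.push_back(task_id);
}

inline void on_register_read(CoreTracker& t, const RegisterSnapshot& snap) {
    if (t.has_pending && snap.task_id != t.pending_task) {
        // The register already shows a different task, so the pending one
        // finished without its FIN being observed. Count it and complete its
        // record first so records stay in execution order.
        ++t.completed_count;
        complete_record(t, t.pending_task);
        t.has_pending = false;
    }
    if (snap.state == CoreState::kFin) {
        ++t.completed_count;
        complete_record(t, snap.task_id);
        t.has_pending = false;
    } else {
        t.pending_task = snap.task_id;
        t.has_pending = true;
    }
}
```

Before the fix described above, the equivalent of the first branch incremented the counter but skipped the record call, so the count was right while the profiling output was missing a task.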

Other:

  • Reduce RUNTIME_MAX_FANOUT from 512 to 128 to match actual usage
  • Enable profiling setup in device_runner for both onboard and sim

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a closed-loop buffer recycling system for performance profiling to eliminate runtime memory allocations and reduce data loss. Key changes include the reduction of maximum fanout, the introduction of recycled buffer pools, proactive replenishment of free queues, and updated profiling logic in the AICPU executor to maintain record ordering. Review feedback identified a potential stack buffer overflow during fanout copying, inefficient mutex locking in replenishment loops, overly restrictive allocation fallback conditions, and missing bounds checks for thread indices.

@ChaoZheng109 ChaoZheng109 force-pushed the a2a3/pref_bug branch 6 times, most recently from cf3963e to be73eb9, March 31, 2026 14:10
…cords

Buffer recycling:
- Replace alloc-per-swap with closed-loop buffer recycling in
  ProfMemoryManager. Completed buffers go to recycled pools instead of
  being freed, and process_ready_entry replenishes free_queues from
  recycled pool → done_queue drain → alloc (last resort).
- Pre-allocate PLATFORM_PROF_BUFFERS_PER_CORE / _PER_THREAD buffers at
  init, seeding 1 into each free_queue and the rest into recycled pools.
- Reduce PLATFORM_PROF_SLOT_COUNT from 8 to 4 (recycling makes deep
  slot rings unnecessary).
- Add proactive replenishment scan in mgmt_loop as safety net for
  depleted cores/threads.
- Free recycled buffers in ProfMemoryManager::stop().
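
The replenishment order named above (recycled pool, then done_queue drain, then alloc as a last resort) might look roughly like this. The sketch is illustrative only: `Replenisher`, `Source`, and `next` are hypothetical names, integers stand in for buffers, and the step where a real implementation copies records out of a done buffer before reusing it is reduced to a comment.

```cpp
#include <deque>
#include <functional>
#include <vector>

enum class Source { kRecycled, kDoneQueue, kAlloc, kNone };

struct Replenisher {
    std::vector<int> recycled;    // buffers already drained and idle
    std::deque<int> done_queue;   // full buffers waiting to be drained
    std::function<int()> alloc;   // last-resort allocator; may be empty

    // Produces one buffer id for a free_queue and reports where it came from.
    Source next(int& out) {
        if (!recycled.empty()) {
            out = recycled.back();
            recycled.pop_back();
            return Source::kRecycled;
        }
        if (!done_queue.empty()) {
            // A real implementation would copy the records out here first.
            out = done_queue.front();
            done_queue.pop_front();
            return Source::kDoneQueue;
        }
        if (alloc) {
            out = alloc();
            return Source::kAlloc;
        }
        return Source::kNone;
    }
};
```

Putting allocation last means the steady state never touches the allocator, while a burst that exhausts both pools still has a way to make progress.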

Mgmt thread device context:
- Add PerfSetDeviceCallback (dependency-injected like existing alloc/
  register/free callbacks) so the mgmt thread can call rtSetDevice once
  at startup. Without this, rtMalloc fails on the mgmt thread because
  CANN device context is per-thread.
- Onboard device_runner passes rtSetDevice wrapper; sim passes nullptr.
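
As a sketch of the dependency-injection pattern described above: the callback type below is an assumption, since this description names PerfSetDeviceCallback but does not show its signature. `SetDeviceFn`, `MgmtContext`, and `mgmt_thread_entry` are hypothetical names, and the function body stands for what a management thread would run at startup. An empty callback models the sim case, where nullptr is passed.

```cpp
#include <functional>

// Assumed signature: takes a device id, returns 0 on success.
using SetDeviceFn = std::function<int(int)>;

struct MgmtContext {
    int device_id = 0;
    SetDeviceFn set_device;     // left empty on sim
    bool device_bound = false;
};

// Entry point a management thread would run. The device context is bound
// once, before anything that might allocate device memory.
inline int mgmt_thread_entry(MgmtContext& ctx) {
    if (ctx.set_device) {
        const int rc = ctx.set_device(ctx.device_id);
        if (rc != 0) return rc;  // without a device context, allocation would fail
        ctx.device_bound = true;
    }
    // ... the management loop would run here ...
    return 0;
}
```

Injecting the call keeps the profiling code free of a direct runtime dependency, so the same loop can run on hardware and in simulation.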

Implicit task record collection:
- Process ready buffers during the expected_tasks wait phase to prevent
  device memory buildup.
- Add execution_complete signal so poll_and_collect exits promptly after
  stream synchronization instead of relying solely on record counts.
- Add scan_remaining_perf_buffers() to recover partial records from
  active buffers after device execution completes.
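
The exit condition described above can be sketched like this, assuming a simplified single-threaded model. `Collector` is a hypothetical name, the `ready` queue holds per-buffer record counts instead of real buffers, and the real poll_and_collect may differ in signature and behaviour. The loop stops either when the expected record count is reached or when execution is known to be complete and nothing is left to drain.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>

struct Collector {
    std::deque<std::size_t> ready;                // record counts of ready buffers
    std::atomic<bool> execution_complete{false};  // set after stream synchronization
    std::size_t collected = 0;

    // Drains ready buffers until the expected count is met, or until the
    // device is known to be finished and nothing more is queued.
    void poll_and_collect(std::size_t expected) {
        while (collected < expected) {
            if (!ready.empty()) {
                collected += ready.front();
                ready.pop_front();
                continue;
            }
            if (execution_complete.load(std::memory_order_acquire)) {
                break;  // no more records will arrive, so stop waiting
            }
            // A real loop would sleep or yield here before polling again.
        }
    }
};
```

Records still sitting in partially filled buffers are not covered by this loop; in the change described above that is the job of scan_remaining_perf_buffers().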

Housekeeping:
- Add copyright headers to platform_config.h, runtime.h, aicpu_executor.cpp
- Normalize include guard names to match file paths
- Replace C-style casts with reinterpret_cast in aicpu_executor.cpp
- Reduce RUNTIME_MAX_FANOUT from 512 to 128
- Fix formatting (alignment, line length) across touched files
@ChaoZheng109 ChaoZheng109 marked this pull request as ready for review April 1, 2026 00:56
@ChaoWao ChaoWao merged commit 012675a into hw-native-sys:main Apr 1, 2026
16 checks passed
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Apr 1, 2026
…ndency tracking

Synchronize A5 platform, runtimes, and tests with a2a3 improvements. Follows the established sync pattern.

Platform (src/a5/platform/):
- 012675a (hw-native-sys#419): Add PerfSetDeviceCallback for device context setup in mgmt_loop, buffer recycling improvements
- fe63325 (hw-native-sys#403): Fix include paths in perf_profiling.h, rename params_cycle to args_cycle

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 012675a (hw-native-sys#419): Add implicit task profiling records in Case 1/Case 2, reinterpret_cast cleanup, license header

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- 7059fff (hw-native-sys#389): Encapsulate TaskOutputTensors materialization in orchestrator
- 27a85c8 (hw-native-sys#390): Const-qualify set_tensor_data Tensor parameter
- 1d97ac5 (hw-native-sys#395): Arg inherits TaskArgs, simplify orchestrator arg passing
- 121a1d5 (hw-native-sys#387): Add to_u64/from_u64 type-safe conversion utilities in orchestration API
- fe63325 (hw-native-sys#403): Defer output tensor materialization, TensorCreateInfo pointer, tensormap link_entry
- cd59b47 (hw-native-sys#404): Add SPMD context accessors, intrinsic.h, build_payload with LocalContext/GlobalContext
- 4917d12 (hw-native-sys#415): Refine tensor dependency tracking, TensorCreateInfo alignment, owner tracking
- 34a6e1c (hw-native-sys#417): SPMD multi-block dispatch, scheduler dual-queue, submit_types extensions

Tests (examples/a5/, tests/st/a5/):
- be765f1 (hw-native-sys#392): Migrate paged_attention orchestration to ChipStorageTaskArgs API
- 121a1d5 (hw-native-sys#387): Use from_u64<float> in softmax_prepare kernels
- fe63325 (hw-native-sys#403): Add license headers, NOLINT annotations, output tensor view(true)
- cd59b47 (hw-native-sys#404): Add spmd_basic example (AIC+AIV SPMD read test)
- 34a6e1c (hw-native-sys#417): Add spmd_multiblock_aiv and spmd_multiblock_mix examples
- Remove redundant end-of-kernel sync barriers in paged_attention test kernels
- Adjust paged_attention orch_thread_num (2→1), paged_attention_unroll block_dim (36→24)