Fix: profiling buffer recycling and implicit task record emission #419
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 1, 2026
Code Review
This pull request implements a closed-loop buffer recycling system for performance profiling to eliminate runtime memory allocations and reduce data loss. Key changes include the reduction of maximum fanout, the introduction of recycled buffer pools, proactive replenishment of free queues, and updated profiling logic in the AICPU executor to maintain record ordering. Review feedback identified a potential stack buffer overflow during fanout copying, inefficient mutex locking in replenishment loops, overly restrictive allocation fallback conditions, and missing bounds checks for thread indices.
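Two of the review findings above, the possible stack buffer overflow during fanout copying and the missing bounds checks for thread indices, can be illustrated with a minimal sketch. The names `kMaxFanout`, `kMaxThreads`, `FanoutRecord`, `copy_fanout_clamped`, and `thread_index_valid` are hypothetical and are not taken from the PR; only the limit of 128 comes from the commit message. The idea is to clamp the copy length to the destination capacity and to validate an index before using it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constants: 128 mirrors the reduced RUNTIME_MAX_FANOUT
// mentioned in the commit message; the thread count is an assumption.
constexpr std::size_t kMaxFanout = 128;
constexpr std::size_t kMaxThreads = 8;

// Hypothetical fixed-size record, standing in for a stack-allocated
// profiling record with an embedded fanout array.
struct FanoutRecord {
    std::uint32_t count = 0;
    std::uint64_t targets[kMaxFanout] = {};
    bool truncated = false;
};

// Copies at most kMaxFanout entries into the record and returns the
// number copied. Excess entries are dropped and flagged rather than
// written past the end of the array.
inline std::size_t copy_fanout_clamped(const std::uint64_t* src,
                                       std::size_t src_count,
                                       FanoutRecord& out) {
    assert(src != nullptr || src_count == 0);
    const std::size_t n = std::min(src_count, kMaxFanout);
    std::copy_n(src, n, out.targets);
    out.count = static_cast<std::uint32_t>(n);
    out.truncated = src_count > kMaxFanout;
    return n;
}

// Rejects negative and out-of-range thread indices before they are used
// to index per-thread state.
inline bool thread_index_valid(int thread_idx) {
    return thread_idx >= 0 &&
           static_cast<std::size_t>(thread_idx) < kMaxThreads;
}
```

Whether the actual fix truncates, rejects, or asserts on oversized fanout is not stated in the review summary; truncation with a flag is only one plausible choice.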
cf3963e to
be73eb9
…cords

Buffer recycling:
- Replace alloc-per-swap with closed-loop buffer recycling in ProfMemoryManager. Completed buffers go to recycled pools instead of being freed, and process_ready_entry replenishes free_queues from recycled pool → done_queue drain → alloc (last resort).
- Pre-allocate PLATFORM_PROF_BUFFERS_PER_CORE / _PER_THREAD buffers at init, seeding 1 into each free_queue and the rest into recycled pools.
- Reduce PLATFORM_PROF_SLOT_COUNT from 8 to 4 (recycling makes deep slot rings unnecessary).
- Add proactive replenishment scan in mgmt_loop as safety net for depleted cores/threads.
- Free recycled buffers in ProfMemoryManager::stop().

Mgmt thread device context:
- Add PerfSetDeviceCallback (dependency-injected like existing alloc/register/free callbacks) so the mgmt thread can call rtSetDevice once at startup. Without this, rtMalloc fails on the mgmt thread because CANN device context is per-thread.
- Onboard device_runner passes rtSetDevice wrapper; sim passes nullptr.

Implicit task record collection:
- Process ready buffers during the expected_tasks wait phase to prevent device memory buildup.
- Add execution_complete signal so poll_and_collect exits promptly after stream synchronization instead of relying solely on record counts.
- Add scan_remaining_perf_buffers() to recover partial records from active buffers after device execution completes.

Housekeeping:
- Add copyright headers to platform_config.h, runtime.h, aicpu_executor.cpp
- Normalize include guard names to match file paths
- Replace C-style casts with reinterpret_cast in aicpu_executor.cpp
- Reduce RUNTIME_MAX_FANOUT from 512 to 128
- Fix formatting (alignment, line length) across touched files
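The replenishment order described in the commit message (recycled pool → done_queue drain → alloc as last resort) can be sketched as follows. This is an illustration under assumptions, not the code in ProfMemoryManager: `RecyclingPool`, `Source`, and `replenish_one` are hypothetical names, buffers are modeled as opaque pointers, and "draining" the done queue is assumed to mean that the buffer's records have been consumed so the buffer can be reused.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <vector>

using Buffer = void*;  // opaque handle; the real buffer type is not shown in the PR page

// Which tier supplied the buffer (hypothetical, for illustration only).
enum class Source { Recycled, DoneQueue, Alloc, None };

// Hypothetical stand-in for the per-core or per-thread state.
struct RecyclingPool {
    std::deque<Buffer> free_queue;   // buffers ready to be handed to the device
    std::vector<Buffer> recycled;    // completed buffers kept for reuse
    std::deque<Buffer> done_queue;   // filled buffers waiting to be drained
    std::function<Buffer()> alloc;   // injected allocator; may be empty

    // Adds one buffer to free_queue, trying the cheapest source first.
    Source replenish_one() {
        if (!recycled.empty()) {
            Buffer b = recycled.back();
            recycled.pop_back();
            assert(b != nullptr);
            free_queue.push_back(b);
            return Source::Recycled;
        }
        if (!done_queue.empty()) {
            // Assumption: records are copied out here before reuse.
            Buffer b = done_queue.front();
            done_queue.pop_front();
            free_queue.push_back(b);
            return Source::DoneQueue;
        }
        if (alloc) {
            // Last resort: a fresh allocation, which the design tries to avoid.
            if (Buffer b = alloc()) {
                free_queue.push_back(b);
                return Source::Alloc;
            }
        }
        return Source::None;
    }
};
```

Because the allocator is only reached when both cheaper tiers are empty, a pool that is pre-seeded with enough buffers should rarely allocate during a run, which matches the stated goal of the change.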
be73eb9 to
728dd97
ChaoWao approved these changes Apr 1, 2026
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Apr 1, 2026

…ndency tracking

Synchronize A5 platform, runtimes, and tests with a2a3 improvements. Follows the established sync pattern.

Platform (src/a5/platform/):
- 012675a (hw-native-sys#419): Add PerfSetDeviceCallback for device context setup in mgmt_loop, buffer recycling improvements
- fe63325 (hw-native-sys#403): Fix include paths in perf_profiling.h, rename params_cycle to args_cycle

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 012675a (hw-native-sys#419): Add implicit task profiling records in Case 1/Case 2, reinterpret_cast cleanup, license header

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- 7059fff (hw-native-sys#389): Encapsulate TaskOutputTensors materialization in orchestrator
- 27a85c8 (hw-native-sys#390): Const-qualify set_tensor_data Tensor parameter
- 1d97ac5 (hw-native-sys#395): Arg inherits TaskArgs, simplify orchestrator arg passing
- 121a1d5 (hw-native-sys#387): Add to_u64/from_u64 type-safe conversion utilities in orchestration API
- fe63325 (hw-native-sys#403): Defer output tensor materialization, TensorCreateInfo pointer, tensormap link_entry
- cd59b47 (hw-native-sys#404): Add SPMD context accessors, intrinsic.h, build_payload with LocalContext/GlobalContext
- 4917d12 (hw-native-sys#415): Refine tensor dependency tracking, TensorCreateInfo alignment, owner tracking
- 34a6e1c (hw-native-sys#417): SPMD multi-block dispatch, scheduler dual-queue, submit_types extensions

Tests (examples/a5/, tests/st/a5/):
- be765f1 (hw-native-sys#392): Migrate paged_attention orchestration to ChipStorageTaskArgs API
- 121a1d5 (hw-native-sys#387): Use from_u64<float> in softmax_prepare kernels
- fe63325 (hw-native-sys#403): Add license headers, NOLINT annotations, output tensor view(true)
- cd59b47 (hw-native-sys#404): Add spmd_basic example (AIC+AIV SPMD read test)
- 34a6e1c (hw-native-sys#417): Add spmd_multiblock_aiv and spmd_multiblock_mix examples
- Remove redundant end-of-kernel sync barriers in paged_attention test kernels
- Adjust paged_attention orch_thread_num (2→1), paged_attention_unroll block_dim (36→24)
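The device-context callback mentioned in both commit messages follows a dependency-injection pattern: the platform layer receives a function pointer instead of calling the runtime directly, and a null pointer means "nothing to set up" (the simulator case). A minimal sketch of that pattern is shown below. The signature `SetDeviceFn`, the struct `MgmtLoopConfig`, and the functions `mgmt_loop_init`, `fake_set_device`, and `failing_set_device` are assumptions made for illustration; the real PerfSetDeviceCallback signature is not shown on this page, and the sketch deliberately does not call rtSetDevice.

```cpp
#include <cassert>

// Assumed callback shape: takes a device id, returns 0 on success.
using SetDeviceFn = int (*)(int device_id);

// Hypothetical configuration handed to the management thread.
struct MgmtLoopConfig {
    SetDeviceFn set_device = nullptr;  // nullptr in the simulator case
    int device_id = 0;
};

// Intended to run once at the top of the management thread function,
// before any allocation is attempted on that thread. Returns true when
// the thread may proceed.
inline bool mgmt_loop_init(const MgmtLoopConfig& cfg) {
    if (cfg.set_device == nullptr) {
        return true;  // no per-thread device context required
    }
    return cfg.set_device(cfg.device_id) == 0;
}

// Test doubles standing in for an onboard wrapper (hypothetical).
static int g_fake_calls = 0;
inline int fake_set_device(int /*device_id*/) {
    ++g_fake_calls;
    return 0;
}
inline int failing_set_device(int /*device_id*/) {
    return -1;
}
```

In this arrangement an onboard runner would pass a thin wrapper around its runtime's set-device call, while the simulator passes nullptr, so the platform code stays free of a direct runtime dependency.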
Overhaul the profiling subsystem to fix buffer leaks, lost records, and excessive device memory allocation:
Buffer recycling (Host):
Buffer recycling (AICPU):
Implicit task profiling (AICPU):
Other: