Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive framework for distributed (multi-rank) kernel execution, enabling the development and testing of collective operations across multiple devices. It includes a new backend-neutral C API for inter-rank communication, a Python-based distributed test runner, and enhancements to the CI script to support multi-device task scheduling. Additionally, it adds support for asynchronous task completion in the scheduler.
Code Review
This pull request introduces a significant and well-architected implementation for asynchronous and distributed test execution. The changes include a new distributed test runner, a backend-neutral communication API with both HCCL and simulation implementations, and a robust mechanism for handling asynchronous task completion in the scheduler. The addition of comprehensive examples for different runtimes is also a great contribution.
My review found a few minor opportunities for code cleanup by removing unused variables in the new kernel files. The provided rule regarding cache line size for structs does not apply to these comments. Overall, the changes are of high quality and significantly enhance the testing capabilities of the project.
Review comments (now resolved) on:
- examples/a2a3/aicpu_build_graph/allreduce_distributed/kernels/aiv/allreduce_kernel.cpp
- examples/a2a3/host_build_graph/allreduce_distributed/kernels/aiv/allreduce_kernel.cpp
- examples/a2a3/tensormap_and_ringbuffer/allreduce_distributed/kernels/aiv/allreduce_kernel.cpp
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h (2 comments)
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (2 comments)
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
Force-pushed from ff85027 to 14b9757.
- Introduced async_completion_demo with producer-consumer model for single and two-card hardware paths.
- Implemented async_notify_demo utilizing TNOTIFY for inter-rank notifications, ensuring consumer launch-gating based on notification counters.
- Added kernel configurations and orchestration for both demos, supporting distributed execution.
- Created golden scripts for input generation and result validation in both demos.
- Enhanced distributed code runner to facilitate multi-card kernel execution.
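The launch-gating described in this commit (the consumer is only dispatched once producer ranks have posted their notifications) can be sketched with plain Python threads. `NotifyCounter`, `run_demo`, and the counter semantics here are illustrative assumptions; in the real runtime the increments come from TNOTIFY inter-rank atomics, not threads.

```python
import threading

# Sketch: producer ranks bump a shared notification counter; the consumer's
# launch is gated until the counter reaches the expected notification count.
class NotifyCounter:
    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def notify(self):
        # Stands in for one TNOTIFY-style remote increment.
        with self._cond:
            self._value += 1
            self._cond.notify_all()

    def wait_until(self, expected):
        # Consumer launch gate: block until enough notifications arrived.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= expected)
            return self._value

def run_demo(num_producers=2):
    counter = NotifyCounter()
    result = {}
    producers = [threading.Thread(target=counter.notify)
                 for _ in range(num_producers)]
    consumer = threading.Thread(
        target=lambda: result.update(seen=counter.wait_until(num_producers)))
    consumer.start()
    for p in producers:
        p.start()
    for t in producers + [consumer]:
        t.join()
    return result["seen"]

print(run_demo())  # 2
```

The two-card hardware path corresponds to `num_producers=2`; each extra rank is just one more notifier.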
Review comments (now resolved) on:
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
- Add the runtime async design note as local reference material
- Add the PR alignment note so later refactor design changes are easy to compare against the current branch state
- Record the problems with scheduler-side pre-launch gating
- Compare launch-gating and explicit wait-task designs for notification waits
- Recommend converging on deferred completion with a proxy wait task
- Call out the required counter-polling cache-invalidation fix for remote notify
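The recommended design above can be sketched as follows, with every name an illustrative assumption: instead of gating a task's launch inside the scheduler, a proxy wait task carries the counter condition, and the executor polls deferred completions on each loop iteration. `cache_invalidate_range` is a stub for the AICPU cache invalidation the note calls out, which must happen before every counter read because the counter writers bypass the cache.

```python
# Sketch of deferred completion via a proxy wait task.
def cache_invalidate_range(addr, size):
    pass  # stub: the real runtime must invalidate before reading the counter

class DeferredWait:
    def __init__(self, counter, threshold, on_complete):
        self.counter = counter        # one-element list standing in for device memory
        self.threshold = threshold
        self.on_complete = on_complete

    def poll(self):
        # Invalidate first: remote writers (SDMA / TNOTIFY atomics) bypass cache.
        cache_invalidate_range(id(self.counter), 8)
        if self.counter[0] >= self.threshold:
            self.on_complete()
            return True
        return False

pending = []
done = []
counter = [0]
pending.append(DeferredWait(counter, 2, lambda: done.append("consumer")))

for step in range(3):  # executor loop: poll pending waits, then a notify lands
    pending = [w for w in pending if not w.poll()]
    counter[0] += 1    # simulated TNOTIFY atomic from a remote rank

print(done)  # ['consumer']
```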
- Explain that tensormap_and_ringbuffer needs a token tensor because dependencies are discovered through TensorMap producer edges
- Add orch/worker API sketches for async_notify_demo and allreduce/barrier flows
- Note briefly that aicpu_build_graph can use explicit wait-task dependencies instead
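The token-tensor point above can be made concrete with a toy TensorMap, all names illustrative: dependencies are discovered by looking up the producer of each input tensor, so a wait task must "produce" something for the consumer to reference. The token tensor carries no data; it exists solely so the consumer's input lookup finds the wait task as a producer edge.

```python
# Toy TensorMap: tensor name -> producing task.
tensor_map = {}

def register_output(task, tensor):
    tensor_map[tensor] = task

def producers_of(task_inputs):
    # Dependency discovery: the producers of a task's inputs are its deps.
    return {tensor_map[t] for t in task_inputs if t in tensor_map}

register_output("NotifyWait", "token0")   # empty token tensor, no payload
register_output("Producer", "payload")

deps = producers_of(["payload", "token0"])  # consumer's input list
print(sorted(deps))  # ['NotifyWait', 'Producer']
```

In aicpu_build_graph, by contrast, the dependency can be stated as an explicit wait-task edge, so no token tensor is needed.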
I updated the design note. Key points:

One implementation detail I called out explicitly: if we move notify waiting onto deferred completion.
- Introduced a new MOE Dispatch V2 example featuring an 8-rank multi-expert dispatch system.
- Added orchestration for a 3-phase task DAG: Prepare, Send, and RecvAssemble.
- Implemented kernels for each phase, including token routing, data sending, and cumulative assembly.
- Created supporting Python scripts for input generation and validation.
- Enhanced runtime with updated completion handling and notification mechanisms.
- Refactored existing completion APIs to unify handling of counter-based completions.
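The 3-phase DAG can be sketched per rank as below. This is a minimal stand-in, not the example's actual kernels: 2 ranks instead of 8, a trivial modulo routing rule, and all function names are assumptions for illustration.

```python
# Per-rank phases of the dispatch DAG: Prepare -> Send -> RecvAssemble.
NUM_RANKS = 2
mailboxes = {r: [] for r in range(NUM_RANKS)}  # stands in for remote recv buffers

def prepare(rank, tokens):
    # Phase 1 (Prepare): route each token to a destination rank.
    return [(tok % NUM_RANKS, tok) for tok in tokens]

def send(rank, routed):
    # Phase 2 (Send): deliver each token to its destination's receive buffer.
    for dst, tok in routed:
        mailboxes[dst].append(tok)

def recv_assemble(rank):
    # Phase 3 (RecvAssemble): cumulative assembly of everything received.
    return sorted(mailboxes[rank])

for r in range(NUM_RANKS):
    send(r, prepare(r, tokens=[r * 10 + i for i in range(4)]))

print([recv_assemble(r) for r in range(NUM_RANKS)])
# [[0, 2, 10, 12], [1, 3, 11, 13]]
```

The real example additionally needs the counter-based completion handling from this commit, since RecvAssemble on one rank can only start once the Send phases of its peers have signalled.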
…dates - Updated the async_notify_demo to include a new NotifyWait kernel that registers a notification counter condition for inter-rank synchronization. - Modified the consumer kernel to depend on the completion of NotifyWait, ensuring it only executes when the notification counter is satisfied. - Enhanced orchestration logic to incorporate the NotifyWait phase, allowing for a more robust task dependency management. - Refactored kernel argument layouts to accommodate the new dependency token from NotifyWait. - Improved runtime handling by removing legacy notification wait mechanisms, streamlining the completion process.
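The reworked dependency chain from this commit, Producer then NotifyWait then Consumer, can be sketched as a tiny task graph. The task names follow the commit; the scheduler loop and data structures are illustrative assumptions.

```python
# NotifyWait as an explicit task in the DAG rather than a launch gate.
counter = {"v": 0}
order = []

def producer():
    counter["v"] += 1          # stands in for the remote TNOTIFY atomic
    order.append("Producer")

def notify_wait():
    # Completion condition: the notification counter is satisfied.
    assert counter["v"] >= 1
    order.append("NotifyWait")

def consumer():
    # Depends only on NotifyWait's completion, not on raw counter polling.
    order.append("Consumer")

tasks = {"Producer":   ([], producer),
         "NotifyWait": (["Producer"], notify_wait),
         "Consumer":   (["NotifyWait"], consumer)}

done = set()
while len(done) < len(tasks):
    for name, (deps, fn) in tasks.items():
        if name not in done and all(d in done for d in deps):
            fn()
            done.add(name)

print(order)  # ['Producer', 'NotifyWait', 'Consumer']
```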
Review comments (now resolved) on:
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
- ...sormap_and_ringbuffer/async_notify_demo/kernels/orchestration/async_notify_orchestration.cpp
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- Remove preemptive flush_deferred_releases guard and unused lambda from executor loop; rely on existing inline flush-on-full and idle-batch-flush paths (reviewer: poursoul)
- Clarify cache_invalidate_range comment: all current counter writers (SDMA flags, TNOTIFY RDMA atomics) bypass AICPU cache, so invalidation is always required (reviewer: uv-xiao)
- Add pto2_rt_submit_notification_wait_task() helper API to pto_orchestration_api.h, reducing NotifyWait boilerplate in orchestration code (reviewer: uv-xiao)
- Simplify async_notify_demo and moe_dispatch orchestration to use the new helper API
- Remove unused PTO2LocalReadyBuffer forward declaration (reviewer: uv-xiao)

Made-with: Cursor
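A rough sketch of what the new helper buys orchestration code. The real API is C (`pto2_rt_submit_notification_wait_task()` in `pto_orchestration_api.h`); the Python below is only a stand-in, and the signature, the dict-based task descriptor, and the integer dependency token are all assumptions, not the actual interface.

```python
# Hypothetical stand-in for the helper: bundle the NotifyWait descriptor
# construction (counter id, threshold) into one call and hand back a
# dependency token that downstream tasks can reference.
def submit_notification_wait_task(graph, counter_id, threshold):
    task = {"kind": "NotifyWait", "counter": counter_id, "threshold": threshold}
    graph.append(task)
    return len(graph) - 1  # dependency token: index of the wait task

graph = []
tok = submit_notification_wait_task(graph, counter_id=7, threshold=2)
graph.append({"kind": "Consumer", "deps": [tok]})

print(tok, graph[1]["deps"])  # 0 [0]
```

The point of the helper, per the commit, is exactly this boilerplate reduction: callers no longer assemble the NotifyWait descriptor by hand.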
Force-pushed from 8200078 to 45e631c.
uv-xiao left a comment:

src/a2a3/runtime/tensormap_and_ringbuffer/distribute is not live yet. I don't have other comments.
Run the two async distributed hardware test cases:
1. async_completion_demo
2. async_notify_demo

HCCL / sim communication and the existing PTO runtime C API for kernel execution.

Spawned by DistributedCodeRunner — not intended for direct invocation.
Should we mark both distributed_code_runner.py and distributed_worker.py specially? They sit alongside run_example.py but they shouldn't be run directly. The current file placement is kind of confusing.
Async_runtime implementation