
Async distributed runtime #334

Open
PKUZHOU wants to merge 9 commits into hw-native-sys:main from PKUZHOU:async_dev

Conversation

@PKUZHOU

@PKUZHOU PKUZHOU commented Mar 20, 2026

Async_runtime implementation

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive framework for distributed (multi-rank) kernel execution, enabling the development and testing of collective operations across multiple devices. It includes a new backend-neutral C API for inter-rank communication, a Python-based distributed test runner, and enhancements to the CI script to support multi-device task scheduling. Additionally, it adds support for asynchronous task completion in the tensormap_and_ringbuffer runtime, allowing for more flexible and efficient task scheduling.

Highlights

  • Distributed Communication Framework: Introduced a new backend-neutral C API for multi-rank communication, with implementations for HCCL (onboard) and POSIX shared memory (simulation), enabling collective operations across multiple devices.
  • Distributed Test Runner: Added a Python-based distributed test runner (DistributedCodeRunner and distributed_worker.py) to streamline the compilation, data preparation, execution, and verification of multi-process distributed kernel tests.
  • Asynchronous Task Completion: Implemented support for deferred task completion in the tensormap_and_ringbuffer runtime, allowing kernels to signal completion asynchronously through direct flag polling or indirect SDMA event handle polling.
  • Multi-Device CI Script Enhancements: Updated the ci.sh script to support flexible device specifications (ranges, comma-separated lists) and to dynamically handle multi-device tasks by querying required device counts from kernel_config.py.
  • New Distributed Examples: Added allreduce_distributed examples for aicpu_build_graph, host_build_graph, and tensormap_and_ringbuffer runtimes, demonstrating multi-rank AllReduce operations.
  • New Asynchronous Completion Demo: Included an async_completion_demo example for the tensormap_and_ringbuffer runtime, showcasing producer-consumer tasks with asynchronous completion.
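The deferred-completion model from the highlights above — a task retires only when its completion flag is observed set, not when its kernel returns — can be sketched minimally. Everything here (`DemoTask`, `poll_deferred_completions`, a plain list standing in for device memory) is illustrative, not the PR's actual runtime API.

```python
# Illustrative sketch of deferred task completion (not the PR's real API):
# the scheduler records a completion-flag location per task and retires the
# task on a later polling sweep, once the kernel/SDMA has set the flag.

class DemoTask:
    def __init__(self, task_id, flags, slot):
        self.task_id = task_id
        self.flags = flags        # shared flag array (stand-in for device memory)
        self.slot = slot          # which flag this task's kernel will set
        self.completed = False

def poll_deferred_completions(tasks):
    """One scheduler sweep: retire every task whose flag is now set."""
    retired = []
    for t in tasks:
        if not t.completed and t.flags[t.slot]:
            t.completed = True
            retired.append(t.task_id)
    return retired
```

A polling loop in the scheduler would call `poll_deferred_completions` each iteration; tasks whose kernels signal asynchronously show up as retired on a later sweep rather than at launch-return time.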
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-architected implementation for asynchronous and distributed test execution. The changes include a new distributed test runner, a backend-neutral communication API with both HCCL and simulation implementations, and a robust mechanism for handling asynchronous task completion in the scheduler. The addition of comprehensive examples for different runtimes is also a great contribution.

My review found a few minor opportunities for code cleanup by removing unused variables in the new kernel files. The provided rule regarding cache line size for structs does not apply to these comments. Overall, the changes are of high quality and significantly enhance the testing capabilities of the project.

@PKUZHOU PKUZHOU changed the title Async dev Async runtime Mar 20, 2026
@PKUZHOU PKUZHOU changed the title Async runtime Async distributed runtime Mar 20, 2026
@PKUZHOU PKUZHOU force-pushed the async_dev branch 3 times, most recently from ff85027 to 14b9757 on March 24, 2026 12:43
- Introduced async_completion_demo with producer-consumer model for single and two-card hardware paths.
- Implemented async_notify_demo utilizing TNOTIFY for inter-rank notifications, ensuring consumer launch-gating based on notification counters.
- Added kernel configurations and orchestration for both demos, supporting distributed execution.
- Created golden scripts for input generation and result validation in both demos.
- Enhanced distributed code runner to facilitate multi-card kernel execution.
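The launch-gating described for async_notify_demo — the consumer only becomes launchable once its notification counter reaches the expected count — might look roughly like this. The function names and the list-based launch queue are assumptions for the sketch, not the runtime's real scheduler structures.

```python
# Hedged sketch of consumer launch-gating on a TNOTIFY-style notification
# counter: gated tasks stay queued until enough inter-rank notifications
# have arrived. Purely illustrative names.

def consumer_ready(notify_counter, expected_notifies):
    """Launch gate: true once the counter has reached the expected count."""
    return notify_counter >= expected_notifies

def drain_launch_queue(pending, notify_counter, expected_notifies):
    """One scheduling pass: return (launchable, still_gated)."""
    if consumer_ready(notify_counter, expected_notifies):
        return list(pending), []
    return [], list(pending)
```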
uv-xiao added 3 commits March 31, 2026 00:38
- Add the runtime async design note as local reference material
- Add the PR alignment note so later refactor design changes are easy to compare against the current branch state
- Record the problems with scheduler-side pre-launch gating
- Compare launch-gating and explicit wait-task designs for notification waits
- Recommend converging on deferred completion with a proxy wait task
- Call out the required counter-polling cache-invalidation fix for remote notify
- Explain that tensormap_and_ringbuffer needs a token tensor because dependencies are discovered through TensorMap producer edges
- Add orch/worker API sketches for async_notify_demo and allreduce/barrier flows
- Note briefly that aicpu_build_graph can use explicit wait-task dependencies instead
@uv-xiao
Copy link
Copy Markdown
Contributor

uv-xiao commented Mar 31, 2026

I updated the design note in docs/pr_async_completion_and_notification.md and tightened the proposed refactor direction.

Key points:

One implementation detail I called out explicitly: if we move notify waiting onto deferred COUNTER completion, the counter polling path needs cache invalidation for remote-notify-updated memory. See: https://github.com/PKUZHOU/simpler/blob/55b7182/docs/pr_async_completion_and_notification.md#L581-L590
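The hazard being called out — remote notifies update the counter in memory while the poller keeps re-reading a stale cached line — can be modeled in a few lines. The explicit cached copy below stands in for a CPU cache line, and the `invalidate` flag stands in for a `cache_invalidate_range`-style call; both are illustrative, not the runtime's actual code.

```python
# Model of the counter-polling cache hazard: writers (SDMA flags, TNOTIFY
# RDMA atomics) bypass the AICPU cache, so a poll that reads a cached copy
# never observes the update; invalidating before the read fixes it.

class CounterPoller:
    def __init__(self, memory, addr):
        self.memory = memory          # stand-in for device/host memory
        self.addr = addr
        self._cached = memory[addr]   # stale copy, as a cache line would hold

    def poll(self, threshold, invalidate):
        if invalidate:
            # stand-in for cache_invalidate_range(addr, size) before reading
            self._cached = self.memory[self.addr]
        return self._cached >= threshold
```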

echo_stone added 4 commits March 31, 2026 15:00
- Introduced a new MOE Dispatch V2 example featuring an 8-rank multi-expert dispatch system.
- Added orchestration for a 3-phase task DAG: Prepare, Send, and RecvAssemble.
- Implemented kernels for each phase, including token routing, data sending, and cumulative assembly.
- Created supporting Python scripts for input generation and validation.
- Enhanced runtime with updated completion handling and notification mechanisms.
- Refactored existing completion APIs to unify handling of counter-based completions.
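The 3-phase DAG in the MoE Dispatch V2 commit could be wired up roughly as below: per rank, Prepare feeds Send, and each rank's RecvAssemble waits on every rank's Send. The dict-of-dependencies shape and task naming are assumptions for the sketch, not the example's real orchestration code.

```python
# Illustrative builder for the Prepare -> Send -> RecvAssemble task DAG
# across num_ranks ranks. Keys are (phase, rank); values are dependency lists.

def build_moe_dag(num_ranks):
    deps = {}
    for r in range(num_ranks):
        deps[("prepare", r)] = []                    # no upstream dependencies
        deps[("send", r)] = [("prepare", r)]         # send needs local prepare
    for r in range(num_ranks):
        # assembly on each rank waits for every rank's send to land
        deps[("recv_assemble", r)] = [("send", s) for s in range(num_ranks)]
    return deps
```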
…dates

- Updated the async_notify_demo to include a new NotifyWait kernel that registers a notification counter condition for inter-rank synchronization.
- Modified the consumer kernel to depend on the completion of NotifyWait, ensuring it only executes when the notification counter is satisfied.
- Enhanced orchestration logic to incorporate the NotifyWait phase, allowing for more robust task dependency management.
- Refactored kernel argument layouts to accommodate the new dependency token from NotifyWait.
- Improved runtime handling by removing legacy notification wait mechanisms, streamlining the completion process.
- Remove preemptive flush_deferred_releases guard and unused lambda
  from executor loop; rely on existing inline flush-on-full and
  idle-batch-flush paths (reviewer: poursoul)
- Clarify cache_invalidate_range comment: all current counter writers
  (SDMA flags, TNOTIFY RDMA atomics) bypass AICPU cache, so
  invalidation is always required (reviewer: uv-xiao)
- Add pto2_rt_submit_notification_wait_task() helper API to
  pto_orchestration_api.h, reducing NotifyWait boilerplate in
  orchestration code (reviewer: uv-xiao)
- Simplify async_notify_demo and moe_dispatch orchestration to use
  the new helper API
- Remove unused PTO2LocalReadyBuffer forward declaration (reviewer:
  uv-xiao)

Made-with: Cursor
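One way to picture what a helper like pto2_rt_submit_notification_wait_task() bundles: submitting a proxy wait task whose deferred completion condition is a notification counter, so orchestration code stops repeating that boilerplate. The signature, dict-based task record, and return token below are all assumptions, not the PR's real C API.

```python
# Hypothetical sketch of a notification-wait helper: append a no-op proxy
# task that completes (deferred) once *counter_addr >= expected, and return
# a token downstream tasks can list as a dependency.

def submit_notification_wait_task(task_list, counter_addr, expected, deps=()):
    task = {
        "kind": "notify_wait",
        "counter_addr": counter_addr,   # counter the runtime will poll
        "expected": expected,           # completion threshold
        "deps": tuple(deps),            # upstream task tokens
    }
    task_list.append(task)
    return len(task_list) - 1           # dependency token for consumers
```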
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
@PKUZHOU PKUZHOU force-pushed the async_dev branch 2 times, most recently from 8200078 to 45e631c on March 31, 2026 13:42
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
@uv-xiao uv-xiao self-requested a review April 2, 2026 10:41
Contributor

@uv-xiao uv-xiao left a comment


src/a2a3/runtime/tensormap_and_ringbuffer/distribute is not live yet. I don't have other comments.

Comment on lines +18 to +20
Run the two async distributed hardware test cases:
1. async_completion_demo
2. async_notify_demo
Contributor


MoE not included?

HCCL / sim communication and the existing PTO runtime C API for kernel
execution.

Spawned by DistributedCodeRunner — not intended for direct invocation.
Contributor


Should we mark both distributed_code_runner.py and distributed_worker.py specially? They sit alongside run_example.py but shouldn't be run directly. The current file placement is somewhat confusing.
