
Async distributed runtime #334

Open
PKUZHOU wants to merge 9 commits into hw-native-sys:main from PKUZHOU:async_dev

Conversation

@PKUZHOU

@PKUZHOU PKUZHOU commented Mar 20, 2026

Async_runtime implementation

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive framework for distributed (multi-rank) kernel execution, enabling the development and testing of collective operations across multiple devices. It includes a new backend-neutral C API for inter-rank communication, a Python-based distributed test runner, and enhancements to the CI script to support multi-device task scheduling. Additionally, it adds support for asynchronous task completion in the tensormap_and_ringbuffer runtime, allowing for more flexible and efficient task scheduling.

Highlights

  • Distributed Communication Framework: Introduced a new backend-neutral C API for multi-rank communication, with implementations for HCCL (onboard) and POSIX shared memory (simulation), enabling collective operations across multiple devices.
  • Distributed Test Runner: Added a Python-based distributed test runner (DistributedCodeRunner and distributed_worker.py) to streamline the compilation, data preparation, execution, and verification of multi-process distributed kernel tests.
  • Asynchronous Task Completion: Implemented support for deferred task completion in the tensormap_and_ringbuffer runtime, allowing kernels to signal completion asynchronously through direct flag polling or indirect SDMA event handle polling.
  • Multi-Device CI Script Enhancements: Updated the ci.sh script to support flexible device specifications (ranges, comma-separated lists) and to dynamically handle multi-device tasks by querying required device counts from kernel_config.py.
  • New Distributed Examples: Added allreduce_distributed examples for aicpu_build_graph, host_build_graph, and tensormap_and_ringbuffer runtimes, demonstrating multi-rank AllReduce operations.
  • New Asynchronous Completion Demo: Included an async_completion_demo example for the tensormap_and_ringbuffer runtime, showcasing producer-consumer tasks with asynchronous completion.
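The deferred-completion model from the highlights above — a task retires only when its completion flag is observed set, not when its kernel returns — can be sketched minimally. Everything here (`DemoTask`, `poll_deferred_completions`, a plain list standing in for device memory) is illustrative, not the PR's actual runtime API.

```python
# Illustrative sketch of deferred task completion (not the PR's real API):
# the scheduler records a completion-flag location per task and retires the
# task on a later polling sweep, once the kernel/SDMA has set the flag.

class DemoTask:
    def __init__(self, task_id, flags, slot):
        self.task_id = task_id
        self.flags = flags        # shared flag array (stand-in for device memory)
        self.slot = slot          # which flag this task's kernel will set
        self.completed = False

def poll_deferred_completions(tasks):
    """One scheduler sweep: retire every task whose flag is now set."""
    retired = []
    for t in tasks:
        if not t.completed and t.flags[t.slot]:
            t.completed = True
            retired.append(t.task_id)
    return retired
```

A polling loop in the scheduler would call `poll_deferred_completions` each iteration; tasks whose kernels signal asynchronously show up as retired on a later sweep rather than at launch-return time.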
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-architected implementation for asynchronous and distributed test execution. The changes include a new distributed test runner, a backend-neutral communication API with both HCCL and simulation implementations, and a robust mechanism for handling asynchronous task completion in the scheduler. The addition of comprehensive examples for different runtimes is also a great contribution.

My review found a few minor opportunities for code cleanup by removing unused variables in the new kernel files. The provided rule regarding cache line size for structs does not apply to these comments. Overall, the changes are of high quality and significantly enhance the testing capabilities of the project.

@PKUZHOU PKUZHOU changed the title Async dev Async runtime Mar 20, 2026
@PKUZHOU PKUZHOU changed the title Async runtime Async distributed runtime Mar 20, 2026
@PKUZHOU PKUZHOU force-pushed the async_dev branch 3 times, most recently from ff85027 to 14b9757 on March 24, 2026 12:43
- Introduced async_completion_demo with producer-consumer model for single and two-card hardware paths.
- Implemented async_notify_demo utilizing TNOTIFY for inter-rank notifications, ensuring consumer launch-gating based on notification counters.
- Added kernel configurations and orchestration for both demos, supporting distributed execution.
- Created golden scripts for input generation and result validation in both demos.
- Enhanced distributed code runner to facilitate multi-card kernel execution.
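The launch-gating described for async_notify_demo — the consumer only becomes launchable once its notification counter reaches the expected count — might look roughly like this. The function names and the list-based launch queue are assumptions for the sketch, not the runtime's real scheduler structures.

```python
# Hedged sketch of consumer launch-gating on a TNOTIFY-style notification
# counter: gated tasks stay queued until enough inter-rank notifications
# have arrived. Purely illustrative names.

def consumer_ready(notify_counter, expected_notifies):
    """Launch gate: true once the counter has reached the expected count."""
    return notify_counter >= expected_notifies

def drain_launch_queue(pending, notify_counter, expected_notifies):
    """One scheduling pass: return (launchable, still_gated)."""
    if consumer_ready(notify_counter, expected_notifies):
        return list(pending), []
    return [], list(pending)
```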
uv-xiao added 3 commits March 31, 2026 00:38
- Add the runtime async design note as local reference material
- Add the PR alignment note so later refactor design changes are easy to compare against the current branch state
- Record the problems with scheduler-side pre-launch gating
- Compare launch-gating and explicit wait-task designs for notification waits
- Recommend converging on deferred completion with a proxy wait task
- Call out the required counter-polling cache-invalidation fix for remote notify
- Explain that tensormap_and_ringbuffer needs a token tensor because dependencies are discovered through TensorMap producer edges
- Add orch/worker API sketches for async_notify_demo and allreduce/barrier flows
- Note briefly that aicpu_build_graph can use explicit wait-task dependencies instead
@uv-xiao
Copy link
Copy Markdown
Contributor

uv-xiao commented Mar 31, 2026

I updated the design note in docs/pr_async_completion_and_notification.md and tightened the proposed refactor direction.

Key points:

One implementation detail I called out explicitly: if we move notify waiting onto deferred COUNTER completion, the counter polling path needs cache invalidation for remote-notify-updated memory. See: https://github.com/PKUZHOU/simpler/blob/55b7182/docs/pr_async_completion_and_notification.md#L581-L590
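The hazard being called out — remote notifies update the counter in memory while the poller keeps re-reading a stale cached line — can be modeled in a few lines. The explicit cached copy below stands in for a CPU cache line, and the `invalidate` flag stands in for a `cache_invalidate_range`-style call; both are illustrative, not the runtime's actual code.

```python
# Model of the counter-polling cache hazard: writers (SDMA flags, TNOTIFY
# RDMA atomics) bypass the AICPU cache, so a poll that reads a cached copy
# never observes the update; invalidating before the read fixes it.

class CounterPoller:
    def __init__(self, memory, addr):
        self.memory = memory          # stand-in for device/host memory
        self.addr = addr
        self._cached = memory[addr]   # stale copy, as a cache line would hold

    def poll(self, threshold, invalidate):
        if invalidate:
            # stand-in for cache_invalidate_range(addr, size) before reading
            self._cached = self.memory[self.addr]
        return self._cached >= threshold
```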

echo_stone added 4 commits March 31, 2026 15:00
- Introduced a new MOE Dispatch V2 example featuring an 8-rank multi-expert dispatch system.
- Added orchestration for a 3-phase task DAG: Prepare, Send, and RecvAssemble.
- Implemented kernels for each phase, including token routing, data sending, and cumulative assembly.
- Created supporting Python scripts for input generation and validation.
- Enhanced runtime with updated completion handling and notification mechanisms.
- Refactored existing completion APIs to unify handling of counter-based completions.
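The 3-phase DAG in the MoE Dispatch V2 commit could be wired up roughly as below: per rank, Prepare feeds Send, and each rank's RecvAssemble waits on every rank's Send. The dict-of-dependencies shape and task naming are assumptions for the sketch, not the example's real orchestration code.

```python
# Illustrative builder for the Prepare -> Send -> RecvAssemble task DAG
# across num_ranks ranks. Keys are (phase, rank); values are dependency lists.

def build_moe_dag(num_ranks):
    deps = {}
    for r in range(num_ranks):
        deps[("prepare", r)] = []                    # no upstream dependencies
        deps[("send", r)] = [("prepare", r)]         # send needs local prepare
    for r in range(num_ranks):
        # assembly on each rank waits for every rank's send to land
        deps[("recv_assemble", r)] = [("send", s) for s in range(num_ranks)]
    return deps
```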
…dates

- Updated the async_notify_demo to include a new NotifyWait kernel that registers a notification counter condition for inter-rank synchronization.
- Modified the consumer kernel to depend on the completion of NotifyWait, ensuring it only executes when the notification counter is satisfied.
- Enhanced orchestration logic to incorporate the NotifyWait phase, allowing for more robust task dependency management.
- Refactored kernel argument layouts to accommodate the new dependency token from NotifyWait.
- Improved runtime handling by removing legacy notification wait mechanisms, streamlining the completion process.
- Remove preemptive flush_deferred_releases guard and unused lambda
  from executor loop; rely on existing inline flush-on-full and
  idle-batch-flush paths (reviewer: poursoul)
- Clarify cache_invalidate_range comment: all current counter writers
  (SDMA flags, TNOTIFY RDMA atomics) bypass AICPU cache, so
  invalidation is always required (reviewer: uv-xiao)
- Add pto2_rt_submit_notification_wait_task() helper API to
  pto_orchestration_api.h, reducing NotifyWait boilerplate in
  orchestration code (reviewer: uv-xiao)
- Simplify async_notify_demo and moe_dispatch orchestration to use
  the new helper API
- Remove unused PTO2LocalReadyBuffer forward declaration (reviewer:
  uv-xiao)

Made-with: Cursor
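One way to picture what a helper like pto2_rt_submit_notification_wait_task() bundles: submitting a proxy wait task whose deferred completion condition is a notification counter, so orchestration code stops repeating that boilerplate. The signature, dict-based task record, and return token below are all assumptions, not the PR's real C API.

```python
# Hypothetical sketch of a notification-wait helper: append a no-op proxy
# task that completes (deferred) once *counter_addr >= expected, and return
# a token downstream tasks can list as a dependency.

def submit_notification_wait_task(task_list, counter_addr, expected, deps=()):
    task = {
        "kind": "notify_wait",
        "counter_addr": counter_addr,   # counter the runtime will poll
        "expected": expected,           # completion threshold
        "deps": tuple(deps),            # upstream task tokens
    }
    task_list.append(task)
    return len(task_list) - 1           # dependency token for consumers
```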
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
@PKUZHOU PKUZHOU force-pushed the async_dev branch 2 times, most recently from 8200078 to 45e631c on March 31, 2026 13:42
PKUZHOU added a commit to PKUZHOU/simpler that referenced this pull request Mar 31, 2026
@uv-xiao uv-xiao self-requested a review April 2, 2026 10:41
Contributor

@uv-xiao uv-xiao left a comment


src/a2a3/runtime/tensormap_and_ringbuffer/distribute is not live yet. I don't have other comments.

Comment on lines +18 to +20
Run the two async distributed hardware test cases:
1. async_completion_demo
2. async_notify_demo
Contributor


MoE not included?

HCCL / sim communication and the existing PTO runtime C API for kernel
execution.

Spawned by DistributedCodeRunner — not intended for direct invocation.
Contributor


Should we mark both distributed_code_runner.py and distributed_worker.py specially? They sit alongside run_example.py but shouldn't be run directly. The current file placement is somewhat confusing.
