
Add unified PTO runtime package (L2/L3)#355

Draft
hw-native-sys-bot wants to merge 4 commits into hw-native-sys:main from hw-native-sys-bot:feat/unified-runtime-l3

Conversation

@hw-native-sys-bot
Collaborator

Summary

  • Adds python/pto/ package with unified Runtime API that routes by level ("chip" → L2, "host" → L3)
  • L2Runtime wraps existing bindings.py ctypes interface behind register()/run()/close()
  • L3Runtime manages per-chip worker processes (one process per device due to DeviceRunner singleton), a handle-based task DAG with eager dispatch, and Python orchestration functions
  • Includes pto.compile() with SHA256-based caching for compiled artifacts
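The level-based routing described above can be sketched as a small factory function. This is a hypothetical illustration only: the class names mirror the file layout in this PR, but the constructors and internals shown here are assumptions, not the actual implementation.

```python
class L2Runtime:
    """Placeholder single-chip runtime (sketch; real one wraps bindings.py)."""
    def __init__(self, platform, device=0, **kwargs):
        self.platform = platform
        self.device = device

class L3Runtime:
    """Placeholder multi-chip runtime (sketch; real one manages chip workers)."""
    def __init__(self, platform, devices=(), **kwargs):
        self.platform = platform
        self.devices = list(devices)

def Runtime(level, **kwargs):
    """Route to the level-specific runtime: "chip" -> L2, "host" -> L3."""
    if level == "chip":
        return L2Runtime(**kwargs)
    if level == "host":
        return L3Runtime(**kwargs)
    raise ValueError(f"unknown level: {level!r}")
```

A factory keeps the public entry point stable while the two backends evolve independently, which matches how the Usage section below constructs both variants through a single `pto.Runtime(...)` call.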

File layout

python/pto/
  __init__.py       — Package exports: Runtime, Arg, compile
  types.py          — Arg, TensorHandle, CompiledPackage, ParamType
  compiler.py       — pto.compile() + caching
  runtime.py        — Unified Runtime entry (routes by level)
  l2_runtime.py     — Single-chip runtime (wraps bindings.py)
  l3_runtime.py     — Multi-chip runtime (workers + DAG + orch)
  l3_context.py     — L3OrchestratorContext (ctx for Python orch)
  l3_worker.py      — ChipWorker subprocess management
  dag.py            — TaskDAG with handle-based dependency inference
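The SHA256-based caching in compiler.py presumably needs a deterministic key over the compile inputs. A minimal sketch of such a key derivation, assuming the real `pto.compile()` hashes something like platform, orchestration source, and kernel list (the actual inputs it hashes, e.g. file contents and compiler flags, are not specified in this PR):

```python
import hashlib
import json

def cache_key(platform: str, orch: str, kernels: list[str]) -> str:
    """Derive a deterministic SHA256 cache key from compile inputs.

    Hypothetical helper: sorts the kernel list and serializes with
    sort_keys=True so logically identical inputs hash identically.
    """
    payload = json.dumps(
        {"platform": platform, "orch": orch, "kernels": sorted(kernels)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to an input produces a different key, so stale artifacts are never reused, while reordering the kernel list does not invalidate the cache.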

Usage

import pto

# L2 — single chip
rt = pto.Runtime(level="chip", platform="a2a3", device=0)
rt.register("vector_add", orch="orch.cpp", kernels=[...])
rt.run("vector_add", args=[pto.Arg.input(x), pto.Arg.output(y)])
rt.close()

# L3 — multi chip
rt = pto.Runtime(level="host", platform="a2a3", devices=[0, 1, 2, 3])
pkg = pto.compile(platform="a2a3", orch="orch.cpp", kernels=[...])
rt.register("pipeline", orch=my_orch_func, kernels={"compute": pkg})
rt.run("pipeline", args={"input": data})
rt.close()
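The handle-based dependency inference in dag.py can be illustrated with a toy model: a task that reads a tensor handle depends on the task that last produced it. This is a sketch under assumed semantics; the real TaskDAG, TensorHandle, and eager-dispatch machinery in this PR may differ.

```python
import itertools

class TensorHandle:
    """Opaque handle identifying a tensor produced by some task (sketch)."""
    _ids = itertools.count()

    def __init__(self):
        self.id = next(TensorHandle._ids)

class TaskDAG:
    """Toy DAG: dependencies are inferred from handle producers, not
    declared explicitly by the caller."""

    def __init__(self):
        self.tasks = []     # list of (name, sorted list of dep task indices)
        self._producer = {} # handle id -> index of the task that wrote it

    def add_task(self, name, inputs=(), outputs=()):
        # A task depends on whichever tasks produced the handles it reads.
        deps = sorted({self._producer[h.id] for h in inputs
                       if h.id in self._producer})
        idx = len(self.tasks)
        self.tasks.append((name, deps))
        for h in outputs:
            self._producer[h.id] = idx
        return idx
```

Because edges fall out of the read/write pattern, orchestration code only states what each task consumes and produces; ordering constraints for eager dispatch follow automatically.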

Test plan

  • Verify L2 path: pto.Runtime(level="chip") runs existing single-chip examples
  • Verify L3 path: multi-chip DAG dispatch with sim platform
  • Unit tests for TaskDAG dependency inference
  • Unit tests for L3OrchestratorContext

🤖 Generated with Claude Code

ChaoWao and others added 4 commits March 20, 2026 21:35
Integrate AscendC operators into PTO runtime via single-TU compile + link:

- Add AscendCToolchain to toolchain.py with --cce-aicore-lang dialect,
  SDK include paths from bisheng_intf.cmake, and auto-sync flags
- Add ascendc_compiler.py with single-TU approach: merge kernel_entry
  wrapper + user source into one translation unit, compile with AscendC
  flags, link with ld.lld to resolve block-local relocations
- Dispatch compiler='ascendc' kernels in code_runner.py
- Add ascendc_vector_example device test (z=x+y via AscendC, w=z*z via
  PTO) under tests/device_tests/ to avoid sim CI discovery
- Add unit tests for wrapper generation, merged source generation,
  artifact extraction, and toolchain flags
- Remove source compilation path from ascendc_compiler.py (no more
  _compile_kernel_source, use_ascendc_compiler, ascendc_kernel_source)
- compile_ascendc_kernel() now requires pre-compiled .o bytes directly
- Update code_runner.py to reject non-.o sources for ascendc kernels
- kernel_config.py now references add_custom.o (externally compiled)
- add_custom.cpp is reference-only documentation (wrapped in #if 0)
- Update tests to match simplified API
- Add pre-compiled add_custom.o (AscendC AddCustom, float32, dav-c220-vec)
  with two critical adaptations for PTO dispatch:
  1. No __global__ attribute (causes hang under PTO subroutine dispatch)
  2. No GetBlockNum()/GetBlockIdx() (PTO dispatches to single cores)
- Add build_add_custom.sh to reproduce the .o from source
- Fix __gm__ address space qualifier in wrapper forward declaration:
  workspace and tiling params use plain uint8_t* (not __gm__) since
  PTO's ccec treats __gm__ as a real address-space qualifier
- Update reference source and kernel_config docs
- Device test passes: z (add) and w (mul) match golden on hardware
Implements the python/pto/ package with a unified Runtime API that
routes by level ("chip" for L2, "host" for L3). L3 manages per-chip
worker processes, a handle-based task DAG with eager dispatch, and
Python orchestration functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a unified Python Tensor Orchestration (PTO) runtime package, enabling both single-chip (L2) and multi-chip (L3) execution on Ascend NPU devices. It provides a high-level API for registering and running computations, manages task dependencies, and incorporates caching mechanisms for efficient compilation. The changes also include support for integrating pre-compiled AscendC kernels, expanding the range of supported operations.

Highlights

  • Unified PTO Runtime: Introduces a unified python/pto/ package providing a consistent Runtime API for both L2 (single-chip) and L3 (multi-chip) execution on Ascend NPU devices.
  • L2 and L3 Runtime Implementations: Includes L2Runtime (wraps existing bindings.py) and L3Runtime (manages per-chip worker processes and a handle-based task DAG).
  • AscendC Kernel Integration: Adds support for integrating pre-compiled AscendC kernels into the PTO runtime, including a new AscendCCompiler class and workflow.
  • Compilation and Caching: Implements pto.compile() with SHA256-based caching for compiled artifacts, improving build times.

@gemini-code-assist

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@ChaoWao ChaoWao marked this pull request as draft March 25, 2026 01:51
