Add unified PTO runtime package (L2/L3) #355
hw-native-sys-bot wants to merge 4 commits into hw-native-sys:main
Integrate AscendC operators into PTO runtime via single-TU compile + link:
- Add AscendCToolchain to toolchain.py with --cce-aicore-lang dialect, SDK include paths from bisheng_intf.cmake, and auto-sync flags
- Add ascendc_compiler.py with single-TU approach: merge kernel_entry wrapper + user source into one translation unit, compile with AscendC flags, link with ld.lld to resolve block-local relocations
- Dispatch compiler='ascendc' kernels in code_runner.py
- Add ascendc_vector_example device test (z=x+y via AscendC, w=z*z via PTO) under tests/device_tests/ to avoid sim CI discovery
- Add unit tests for wrapper generation, merged source generation, artifact extraction, and toolchain flags

- Remove source compilation path from ascendc_compiler.py (no more _compile_kernel_source, use_ascendc_compiler, ascendc_kernel_source)
- compile_ascendc_kernel() now requires pre-compiled .o bytes directly
- Update code_runner.py to reject non-.o sources for ascendc kernels
- kernel_config.py now references add_custom.o (externally compiled)
- add_custom.cpp is reference-only documentation (wrapped in #if 0)
- Update tests to match simplified API

- Add pre-compiled add_custom.o (AscendC AddCustom, float32, dav-c220-vec) with two critical adaptations for PTO dispatch:
  1. No __global__ attribute (causes hang under PTO subroutine dispatch)
  2. No GetBlockNum()/GetBlockIdx() (PTO dispatches to single cores)
- Add build_add_custom.sh to reproduce the .o from source
- Fix __gm__ address space qualifier in wrapper forward declaration: workspace and tiling params use plain uint8_t* (not __gm__) since PTO's ccec treats __gm__ as a real address-space qualifier
- Update reference source and kernel_config docs
- Device test passes: z (add) and w (mul) match golden on hardware
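
For reviewers, the "reject non-.o sources" rule from the second commit can be pictured with the sketch below. It is not the code in code_runner.py: `load_precompiled_object` and its error messages are hypothetical names chosen for illustration, and the extra ELF magic check is an assumption about what a reasonable guard might add.

```python
from pathlib import Path

# Every ELF file, including a relocatable .o, starts with these four bytes.
ELF_MAGIC = b"\x7fELF"


def load_precompiled_object(source: str) -> bytes:
    """Return the bytes of a pre-compiled kernel object, rejecting anything else.

    Hypothetical helper: the name and messages are illustrative only.
    """
    path = Path(source)
    # Only externally compiled objects are accepted; .cpp sources are refused.
    if path.suffix != ".o":
        raise ValueError(f"ascendc kernels must be pre-compiled .o files, got {path.name!r}")
    data = path.read_bytes()
    # Cheap sanity check so a renamed text file fails early.
    if not data.startswith(ELF_MAGIC):
        raise ValueError(f"{path.name!r} does not look like an ELF object file")
    return data
```

Failing at load time keeps a bad artifact from reaching the link step, where the resulting error would be harder to trace back to the kernel config.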
Implements the python/pto/ package with a unified Runtime API that
routes by level ("chip" for L2, "host" for L3). L3 manages per-chip
worker processes, a handle-based task DAG with eager dispatch, and
Python orchestration functions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
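
To make "handle-based task DAG with eager dispatch" concrete, here is a minimal sketch of the idea. It is not the python/pto/ implementation: `TaskGraph`, `submit`, and `close` are hypothetical names, and a thread pool stands in for the per-chip worker processes. Each submission returns a handle immediately, and a task is dispatched as soon as the last handle it depends on completes.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class TaskGraph:
    """Toy handle-based DAG (hypothetical; threads stand in for chip workers)."""

    def __init__(self, workers: int = 2):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, fn, *deps: Future) -> Future:
        handle = Future()          # returned to the caller right away
        remaining = len(deps)      # dependencies that have not finished yet
        lock = threading.Lock()

        def dispatch():
            def run():
                try:
                    # Dependency results become the task's arguments.
                    handle.set_result(fn(*(d.result() for d in deps)))
                except Exception as exc:
                    # A failed dependency or task marks this handle as failed.
                    handle.set_exception(exc)
            self._pool.submit(run)

        def on_dep_done(_):
            nonlocal remaining
            with lock:
                remaining -= 1
                ready = remaining == 0
            if ready:
                dispatch()         # eager: fire when the last dependency lands

        if not deps:
            dispatch()
        for d in deps:
            d.add_done_callback(on_dep_done)
        return handle

    def close(self):
        self._pool.shutdown(wait=True)


graph = TaskGraph()
x = graph.submit(lambda: 3)
y = graph.submit(lambda: 4)
z = graph.submit(lambda a, b: a + b, x, y)  # z = x + y
w = graph.submit(lambda v: v * v, z)        # w = z * z
print(w.result())  # 49
graph.close()
```

The example mirrors the z = x + y, w = z * z shape of the device test. Because dependencies are expressed as handles, the caller never has to wait between submissions.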
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a unified Python Tensor Orchestration (PTO) runtime package, enabling both single-chip (L2) and multi-chip (L3) execution on Ascend NPU devices. It provides a high-level API for registering and running computations, manages task dependencies, and incorporates caching mechanisms for efficient compilation. The changes also include support for integrating pre-compiled AscendC kernels, expanding the range of supported operations.
Warning: Gemini encountered an error creating the review.
Summary
- python/pto/ package with unified Runtime API that routes by level ("chip" → L2, "host" → L3)
- bindings.py ctypes interface behind register()/run()/close()
- pto.compile() with SHA256-based caching for compiled artifacts

File layout
Usage
Test plan
- pto.Runtime(level="chip") runs existing single-chip examples

🤖 Generated with Claude Code
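
As a note on the SHA256-based caching listed in the summary, one common way to build such a cache is sketched below. This is an assumption about the general technique, not a description of pto.compile(): `cache_key`, `compile_cached`, the `.bin` suffix, and the `build` callback are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Sequence


def cache_key(source: bytes, flags: Sequence[str]) -> str:
    """Hash everything that can change the compiled output."""
    h = hashlib.sha256()
    h.update(source)
    for flag in flags:
        h.update(b"\0")            # separator so ["-a", "b"] differs from ["-ab"]
        h.update(flag.encode())
    return h.hexdigest()


def compile_cached(
    source: bytes,
    flags: Sequence[str],
    cache_dir: Path,
    build: Callable[[bytes, Sequence[str]], bytes],
) -> bytes:
    """Return a cached artifact if present, otherwise build and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    artifact = cache_dir / f"{cache_key(source, flags)}.bin"
    if artifact.exists():
        return artifact.read_bytes()   # cache hit: skip the compiler entirely
    data = build(source, flags)        # cache miss: run the real build step
    artifact.write_bytes(data)
    return data
```

Keying on both the source bytes and the flags means that changing either one produces a fresh artifact instead of a stale hit.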