diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md b/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
new file mode 100644
index 00000000..7ca172c0
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
@@ -0,0 +1,137 @@
+# Benchmark Scene Tests
+
+This directory contains benchmark scene tests for the `tensormap_and_ringbuffer` runtime on the A2/A3 platform. These tests are designed to systematically characterize runtime performance across two dimensions: **dispatch overhead** and **graph topology**.
+
+All tests use trivial kernels (noop or increment-by-one) to isolate runtime scheduling overhead from compute. Results are collected via `tools/benchmark_rounds.sh`.
+
+## Scene 1: Dispatch & Scheduling Overhead
+
+These tests isolate and quantify the runtime's "scheduling tax" — framework overhead independent of kernel computation.
+
+### dispatch-independent (Task Scaling)
+
+**Intent**: Measure how dispatch overhead grows with task count when tasks are fully independent (no inter-task data dependencies).
+
+Each task writes `1.0` to its own cache-line-aligned slot (stride = 16 float32 = 64 bytes) in a shared output tensor, avoiding false sharing across non-coherent AICore L1 caches.
+
+| Parameter | Values |
+| --------- | ------ |
+| num_tasks | 100, 500, 1000, 2000 |
+| mode | AIC-only, AIV-only, AIC+AIV alternating |
+
+**What to look for**: Linear growth in total dispatch time vs. task count. Super-linear growth indicates a scheduling bottleneck (e.g., O(N^2) dependency tracking).
+
+### dispatch-serial (Dispatch Throughput)
+
+**Intent**: Measure maximum scheduler throughput under serial task submission with accumulation dependencies.
+
+All N tasks write to the same counter (AIC counter or AIV counter), forming a serial dependency chain. The final counter value equals N, validating correctness.
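The accumulation semantics can be modeled host-side in a few lines of Python (an illustrative sketch of what the golden script checks, not the device kernels; `mode` follows the convention used throughout these tests: 0 = AIC-only, 1 = AIV-only, 2 = alternating with even-indexed tasks on AIC):

```python
def expected_counters(num_tasks: int, mode: int) -> tuple:
    """Model the serial chain: every task performs counter += 1.0."""
    out_aic = out_aiv = 0.0
    for i in range(num_tasks):
        if mode == 0 or (mode == 2 and i % 2 == 0):
            out_aic += 1.0  # each task waits on the previous write to this counter
        else:
            out_aiv += 1.0
    return out_aic, out_aiv
```

In alternating mode the two counters each form their own dependency chain, so `expected_counters(2000, 2)` yields 1000.0 per counter.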
+
+| Parameter | Values |
+| --------- | ------ |
+| num_tasks | 100, 500, 1000, 2000 |
+| mode | AIC-only, AIV-only, AIC+AIV alternating |
+
+**What to look for**: Per-task dispatch latency (total time / N). Compare with `dispatch-independent` to quantify the overhead of serial dependencies vs. independent dispatch.
+
+## Scene 2: Graph Topology Patterns
+
+These tests stress-test the scheduler with different DAG dependency structures. Each topology exercises a different aspect of dependency resolution.
+
+### graph-chain_n (Linear Chain)
+
+**Intent**: Measure serial dependency resolution overhead as chain length increases.
+
+```text
+seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result(N.0)
+```
+
+Each task is an AIV increment kernel (`out = in + 1.0`). The result equals the chain length, validating that every link executed.
+
+| Parameter | Values |
+| --------- | ------ |
+| chain_len | 4, 8, 16, 32, 64 |
+
+**What to look for**: End-to-end latency vs. chain length. Ideally linear; deviation reveals per-hop scheduling overhead.
+
+### graph-fanout_n (Wide Fan-Out)
+
+**Intent**: Test parallel dispatch capability — can the runtime simultaneously issue N independent tasks from a single source?
+
+```text
+seed -> [Source] -> intermediate -> [Consumer_0] -> result[0]
+                                 -> [Consumer_1] -> result[1]
+                                 -> ...
+                                 -> [Consumer_{N-1}] -> result[N-1]
+```
+
+Consumer output slots are cache-line-aligned to avoid false sharing. Each consumer reads the same source output and writes `source + 1.0`.
+
+| Parameter | Values |
+| --------- | ------ |
+| fanout_width | 2, 4, 8, 15 |
+
+**What to look for**: Whether fan-out width impacts total latency. An ideal runtime dispatches all consumers in parallel, so latency should plateau rather than grow linearly.
+
+### graph-fanin_n (Convergence Barrier)
+
+**Intent**: Measure dependency convergence overhead — how efficiently the runtime tracks N predecessors for a single barrier task.
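The cost being measured is the scheduler's predecessor bookkeeping. A toy model of that bookkeeping (hypothetical names; the real dependency tracking is internal to the runtime):

```python
class BarrierTask:
    """Release a barrier only after all N predecessors have completed."""

    def __init__(self, num_predecessors: int) -> None:
        self.pending = num_predecessors

    def on_predecessor_done(self) -> bool:
        """Called once per completed producer; True means the barrier is ready."""
        self.pending -= 1
        return self.pending == 0
```

With `fanin_width = 8`, only the eighth completion releases the barrier; the test measures how this per-completion accounting scales with N.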
+
+```text
+seed -> [Producer_0] -> prod_out_0 -.
+seed -> [Producer_1] -> prod_out_1 -+-> [Barrier] -> result(1.0)
+...                                 |
+seed -> [Producer_{N-1}] -> ...    -'
+```
+
+Each producer writes independently; the barrier depends on all N producer outputs.
+
+| Parameter | Values |
+| --------- | ------ |
+| fanin_width | 2, 4, 8, 15 |
+
+**What to look for**: Barrier wait overhead vs. fan-in width. Measures the cost of tracking and synchronizing N predecessor completions.
+
+### graph-diamond (Fork-Join)
+
+**Intent**: Test the most common real-world DAG pattern — fan-out followed by fan-in (fork-join).
+
+```text
+seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -.
+                            -> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result(1.0)
+                            -> ...                      |
+                            -> [Branch B_{W-1}] -> ...  -'
+```
+
+Three branch modes exercise different core-type scheduling paths:
+
+- **mode=0**: All AIV branches
+- **mode=1**: All AIC branches
+- **mode=2**: Mixed AIC+AIV (even=AIC, odd=AIV)
+
+| Parameter | Values |
+| --------- | ------ |
+| width | 2, 4, 8, 15 |
+| mode | AIV-only, AIC-only, Mixed AIC+AIV |
+
+**What to look for**: Combined fan-out + fan-in overhead. Compare with isolated fanout/fanin tests to check for compounding effects. Mixed mode reveals cross-core-type scheduling costs.
+
+## Also Updated: benchmark_bgemm
+
+The existing `benchmark_bgemm` test was extended with structured parameter sweeps:
+
+- **Tile size sweep** (16, 32, 64, 128) at fixed batch and grid_k
+- **Batch/group sweep** (1, 4, 16, 64 groups) at fixed tile size
+- **Grid-K sweep** (1, 2, 4, 8) at fixed tile and batch
+- **In-core loop sweep** (1, 4, 16) at fixed tile, batch, and grid_k
+
+These complement the original 5 cases with systematic single-variable sweeps for identifying performance cliffs.
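Because each sweep varies exactly one parameter, the case tables can be generated mechanically. A sketch of the tile sweep (the helper name is illustrative; the authoritative case definitions live in `benchmark_bgemm/golden.py`, where `matmul_add_task_num = num_groups * grid_k`):

```python
def tile_sweep_cases(tiles=(16, 32, 64, 128), num_groups=16, grid_k=2, incore_loop=4):
    """One case per tile size; all other parameters held fixed."""
    return {
        f"Tile{t}": {
            "matmul_add_task_num": num_groups * grid_k,
            "incore_task_granularity": {"incore_data_size": t, "incore_loop": incore_loop},
            "grid_k": grid_k,
        }
        for t in tiles
    }
```

The batch/group, grid-K, and in-core loop sweeps follow the same single-variable shape.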
+
+## Running
+
+```bash
+# Run all benchmark scene tests (100 rounds each, default)
+./tools/benchmark_rounds.sh
+
+# Customize
+./tools/benchmark_rounds.sh -n 50 -d 0 -p a2a3 -r tensormap_and_ringbuffer -v
+```
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
index 444b2997..3e698cf4 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
@@ -1,3 +1,11 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
 """
 Golden test specification for BGEMM (tensormap_and_ringbuffer Runtime).
@@ -24,7 +32,7 @@
 SUPPORTED_INCORE_DATA_SIZES = {16, 32, 64, 128}
 
 ALL_CASES = {
-    "Case0": {
+    "Case1": {
         "matmul_add_task_num": 500,
         "incore_task_granularity": {
             "incore_data_size": 128,
             "incore_loop": 4,
         },
         "grid_k": 2,
     },
@@ -32,41 +40,73 @@
-    "Case1": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    # --- Tile Size Sweep (fixed: num_groups=16, grid_k=2, incore_loop=4) ---
+    "Tile16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 16, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case2": {
-        "matmul_add_task_num": 256,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    "Tile32": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 32, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case3": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 16,
-        },
+    "Tile64": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 64, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Tile128": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case4": {
+    # --- Batch/Group Sweep (fixed: tile=128, grid_k=2, incore_loop=4) ---
+    "Batch1": {
+        "matmul_add_task_num": 2,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch4": {
+        "matmul_add_task_num": 8,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch64": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    # --- K Dimension Sweep (fixed: tile=128, num_groups=16, incore_loop=4) ---
+    "K1": {
+        "matmul_add_task_num": 16,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 1,
+    },
+    "K4": {
         "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 4,
     },
+    "K8": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 8,
+    },
+    # --- In-Core Loop Sweep (fixed: tile=128, num_groups=16, grid_k=2) ---
+    "Loop1": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 1},
+        "grid_k": 2,
+    },
+    "Loop16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 16},
+        "grid_k": 2,
+    },
 }
 
-DEFAULT_CASE = "Case0"
+DEFAULT_CASE = "Case1"
@@ -80,18 +120,14 @@ def generate_inputs(params: dict) -> list:
     # --- constraint checks ---
     if tile_size not in SUPPORTED_INCORE_DATA_SIZES:
         raise ValueError(
-            f"incore_data_size={tile_size} is not supported. "
-            f"Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
+            f"incore_data_size={tile_size} is not supported. Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
         )
     if incore_loop <= 0:
         raise ValueError(f"incore_loop must be positive, got {incore_loop}")
     if grid_k <= 0:
         raise ValueError(f"grid_k must be positive, got {grid_k}")
     if matmul_add_task_num % grid_k != 0:
-        raise ValueError(
-            f"matmul_add_task_num ({matmul_add_task_num}) must be "
-            f"divisible by grid_k ({grid_k})."
-        )
+        raise ValueError(f"matmul_add_task_num ({matmul_add_task_num}) must be divisible by grid_k ({grid_k}).")
 
     num_groups = matmul_add_task_num // grid_k
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py
new file mode 100644
index 00000000..d860d732
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py
@@ -0,0 +1,79 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for task_scaling test.
+
+Measures dispatch overhead vs task count. Submits N independent noop tasks,
+each writing 1.0 to a separate cache-line-aligned slot. Output tensor is
+padded so each task's slot sits on its own cache line (stride = 16 float32
+elements = 64 bytes), avoiding false sharing across non-coherent AICore L1
+caches.
+
+Cases parameterize task count (100→2000) and core type:
+    AIC-only sweep: 100, 500, 1000, 2000 tasks
+    AIV-only sweep: 100, 500, 1000, 2000 tasks
+    AIC+AIV sweep:  100, 500, 1000, 2000 tasks
+
+Args layout: [output, num_tasks, mode]
+"""
+
+import ctypes
+
+import torch
+
+__outputs__ = ["output"]
+
+RTOL = 1e-5
+ATOL = 1e-5
+
+# Each task writes to a separate cache line to avoid false sharing
+# across non-coherent AICore L1 caches (64B = 16 float32 elements).
+CACHE_LINE_ELEMS = 16
+
+ALL_CASES = {
+    # AIC-only (mode=0)
+    "Case1": {"num_tasks": 100, "mode": 0},
+    "Case2": {"num_tasks": 500, "mode": 0},
+    "Case3": {"num_tasks": 1000, "mode": 0},
+    "Case4": {"num_tasks": 2000, "mode": 0},
+    # AIV-only (mode=1)
+    "Case5": {"num_tasks": 100, "mode": 1},
+    "Case6": {"num_tasks": 500, "mode": 1},
+    "Case7": {"num_tasks": 1000, "mode": 1},
+    "Case8": {"num_tasks": 2000, "mode": 1},
+    # AIC+AIV alternating (mode=2)
+    "Case9": {"num_tasks": 100, "mode": 2},
+    "Case10": {"num_tasks": 500, "mode": 2},
+    "Case11": {"num_tasks": 1000, "mode": 2},
+    "Case12": {"num_tasks": 2000, "mode": 2},
+}
+
+DEFAULT_CASE = "Case2"
+
+
+def generate_inputs(params: dict) -> list:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    output = torch.zeros(num_tasks * CACHE_LINE_ELEMS, dtype=torch.float32)
+
+    return [
+        ("output", output),
+        ("num_tasks", ctypes.c_int64(num_tasks)),
+        ("mode", ctypes.c_int64(mode)),
+    ]
+
+
+def compute_golden(tensors: dict, params: dict) -> None:
+    num_tasks = params["num_tasks"]
+    output = torch.as_tensor(tensors["output"])
+
+    # Each independent task writes 1.0 to its cache-line-aligned slot
+    for i in range(num_tasks):
+        output[i * CACHE_LINE_ELEMS] = 1.0
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp
new file mode 100644
index 00000000..ca0e7669
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIC Kernel for Task Scaling
+ *
+ * Minimal cube kernel that performs a trivial write. Each task writes 1.0
+ * at its designated position in the output tensor, proving the task executed.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element per task
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..e0e6d4ae
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Task Scaling
+ *
+ * Minimal vector kernel that performs a trivial write. Each task writes 1.0
+ * at its designated position in the output tensor, proving the task executed.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element per task
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py
new file mode 100644
index 00000000..4973d254
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py
@@ -0,0 +1,48 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for task_scaling test (tensormap_and_ringbuffer).
+
+Measures dispatch overhead growth as task count scales from 100 to 2000.
+
+Kernels:
+    func_id=0: kernel_noop_aic (AIC) - trivial write kernel
+    func_id=1: kernel_noop_aiv (AIV) - trivial write kernel
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "task_scaling_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "NOOP_AIC",
+        "source": str(_KERNELS_ROOT / "aic" / "kernel_noop_aic.cpp"),
+        "core_type": "aic",
+    },
+    {
+        "func_id": 1,
+        "name": "NOOP_AIV",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_noop_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp
new file mode 100644
index 00000000..658ce245
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Task Scaling Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Measures dispatch overhead growth as task count scales. Submits N
+ * independent noop tasks, each writing 1.0 to a separate slot in the
+ * output tensor. Tasks are independent (no inter-task data dependency)
+ * to isolate pure scheduling overhead from serialization effects.
+ *
+ * Three modes:
+ *   mode=0: AIC-only — N independent AIC noop tasks
+ *   mode=1: AIV-only — N independent AIV noop tasks
+ *   mode=2: AIC+AIV — alternating AIC/AIV independent noop tasks
+ *
+ * Arg layout: [output, num_tasks, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_NOOP_AIC 0
+#define FUNC_NOOP_AIV 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+    (void)orch_args;
+    return PTO2OrchestrationConfig{
+        .expected_arg_count = 3,
+    };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+    if (orch_thread_index != 0) {
+        return;
+    }
+
+    Tensor output = from_tensor_arg(orch_args.tensor(0));
+    int num_tasks = static_cast<int>(orch_args.scalar(0));
+    int mode = static_cast<int>(orch_args.scalar(1));
+
+    LOG_ALWAYS("[task_scaling] num_tasks=%d, mode=%d (0=AIC, 1=AIV, 2=AIC+AIV)", num_tasks, mode);
+
+    // Each task writes to a separate cache line (64B = 16 float32 elements)
+    // to avoid false sharing across non-coherent AICore L1 caches.
+    constexpr uint32_t CACHE_LINE_ELEMS = 16;
+    uint32_t slot_shapes[1] = {1};
+
+    for (int i = 0; i < num_tasks; i++) {
+        uint32_t view_offsets[1] = {static_cast<uint32_t>(i * CACHE_LINE_ELEMS)};
+        Tensor slot = output.view(slot_shapes, view_offsets);
+
+        if (mode == 0) {
+            Arg params;
+            params.add_inout(slot);
+            pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+        } else if (mode == 1) {
+            Arg params;
+            params.add_inout(slot);
+            pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+        } else {
+            if (i % 2 == 0) {
+                Arg params;
+                params.add_inout(slot);
+                pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+            } else {
+                Arg params;
+                params.add_inout(slot);
+                pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+            }
+        }
+    }
+
+    LOG_ALWAYS("[task_scaling] Submitted %d independent tasks", num_tasks);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py
new file mode 100644
index 00000000..2a506f99
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py
@@ -0,0 +1,96 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for dispatch_throughput test.
+
+Measures scheduler throughput by submitting N noop tasks serially.
+Each task increments a counter, so the final output equals N (or N/2
+for each core type in AIC+AIV mode).
+
+Cases sweep across task counts and core types:
+    Case1:  100 AIC-only tasks
+    Case2:  500 AIC-only tasks
+    Case3:  1000 AIC-only tasks
+    Case4:  2000 AIC-only tasks
+    Case5:  100 AIV-only tasks
+    Case6:  500 AIV-only tasks
+    Case7:  1000 AIV-only tasks
+    Case8:  2000 AIV-only tasks
+    Case9:  100 AIC+AIV alternating tasks
+    Case10: 500 AIC+AIV alternating tasks
+    Case11: 1000 AIC+AIV alternating tasks
+    Case12: 2000 AIC+AIV alternating tasks
+
+Args layout: [out_aic, out_aiv, num_tasks, mode]
+"""
+
+import ctypes
+
+import torch
+
+__outputs__ = ["out_aic", "out_aiv"]
+
+RTOL = 1e-3
+ATOL = 1e-1  # Accumulated float additions may drift slightly
+
+ALL_CASES = {
+    # AIC-only (mode=0)
+    "Case1": {"num_tasks": 100, "mode": 0},
+    "Case2": {"num_tasks": 500, "mode": 0},
+    "Case3": {"num_tasks": 1000, "mode": 0},
+    "Case4": {"num_tasks": 2000, "mode": 0},
+    # AIV-only (mode=1)
+    "Case5": {"num_tasks": 100, "mode": 1},
+    "Case6": {"num_tasks": 500, "mode": 1},
+    "Case7": {"num_tasks": 1000, "mode": 1},
+    "Case8": {"num_tasks": 2000, "mode": 1},
+    # AIC+AIV alternating (mode=2)
+    "Case9": {"num_tasks": 100, "mode": 2},
+    "Case10": {"num_tasks": 500, "mode": 2},
+    "Case11": {"num_tasks": 1000, "mode": 2},
+    "Case12": {"num_tasks": 2000, "mode": 2},
+}
+
+DEFAULT_CASE = "Case2"
+
+
+def generate_inputs(params: dict) -> list:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    out_aic = torch.zeros(1, dtype=torch.float32)
+    out_aiv = torch.zeros(1, dtype=torch.float32)
+
+    return [
+        ("out_aic", out_aic),
+        ("out_aiv", out_aiv),
+        ("num_tasks", ctypes.c_int64(num_tasks)),
+        ("mode", ctypes.c_int64(mode)),
+    ]
+
+
+def compute_golden(tensors: dict, params: dict) -> None:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    out_aic = torch.as_tensor(tensors["out_aic"])
+    out_aiv = torch.as_tensor(tensors["out_aiv"])
+
+    if mode == 0:
+        # AIC-only: all N tasks increment out_aic
+        out_aic[0] = float(num_tasks)
+    elif mode == 1:
+        # AIV-only: all N tasks increment out_aiv
+        out_aiv[0] = float(num_tasks)
+    elif mode == 2:
+        # AIC+AIV alternating: even tasks → AIC, odd tasks → AIV
+        aic_count = (num_tasks + 1) // 2
+        aiv_count = num_tasks // 2
+        out_aic[0] = float(aic_count)
+        out_aiv[0] = float(aiv_count)
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp
new file mode 100644
index 00000000..e5c13811
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIC Kernel for Dispatch Throughput
+ *
+ * Minimal cube kernel that writes a single scalar to prove execution.
+ * The kernel reads the current accumulated value, adds 1.0, and writes back.
+ * With N tasks, the final output should be N.0.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..2015e0ed
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Dispatch Throughput
+ *
+ * Minimal vector kernel that writes a single scalar to prove execution.
+ * The kernel reads the current accumulated value, adds 1.0, and writes back.
+ * With N tasks, the final output should be N.0.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py
new file mode 100644
index 00000000..c6813dcd
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py
@@ -0,0 +1,48 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for dispatch_throughput test (tensormap_and_ringbuffer).
+
+Measures scheduler throughput by submitting N noop tasks serially.
+
+Kernels:
+    func_id=0: kernel_noop_aic (AIC) - empty cube kernel, increments counter
+    func_id=1: kernel_noop_aiv (AIV) - empty vector kernel, increments counter
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "dispatch_throughput_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "NOOP_AIC",
+        "source": str(_KERNELS_ROOT / "aic" / "kernel_noop_aic.cpp"),
+        "core_type": "aic",
+    },
+    {
+        "func_id": 1,
+        "name": "NOOP_AIV",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_noop_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp
new file mode 100644
index 00000000..1817b994
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Dispatch Throughput Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Measures scheduler throughput by submitting N noop tasks serially.
+ * Each task increments a counter by 1.0, so the final output equals N.
+ *
+ * Three modes:
+ *   mode=0: AIC-only — N AIC noop tasks
+ *   mode=1: AIV-only — N AIV noop tasks
+ *   mode=2: AIC+AIV — alternating AIC/AIV noop tasks (N total)
+ *
+ * All tasks are chained through the same output tensor (INOUT) to enforce
+ * serial execution order — each task must wait for the previous one to
+ * complete before it can read the accumulated value.
+ *
+ * Arg layout: [out_aic, out_aiv, num_tasks, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_NOOP_AIC 0
+#define FUNC_NOOP_AIV 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 4,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor out_aic = from_tensor_arg(orch_args.tensor(0));
+  Tensor out_aiv = from_tensor_arg(orch_args.tensor(1));
+  int num_tasks = static_cast<int>(orch_args.scalar(0));
+  int mode = static_cast<int>(orch_args.scalar(1));
+
+  LOG_ALWAYS("[dispatch_throughput] num_tasks=%d, mode=%d (0=AIC, 1=AIV, 2=AIC+AIV)", num_tasks, mode);
+
+  for (int i = 0; i < num_tasks; i++) {
+    if (mode == 0) {
+      Arg params;
+      params.add_inout(out_aic);
+      pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+    } else if (mode == 1) {
+      Arg params;
+      params.add_inout(out_aiv);
+      pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+    } else {
+      // Alternating AIC/AIV
+      if (i % 2 == 0) {
+        Arg params;
+        params.add_inout(out_aic);
+        pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+      } else {
+        Arg params;
+        params.add_inout(out_aiv);
+        pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+      }
+    }
+  }
+
+  LOG_ALWAYS("[dispatch_throughput] Submitted %d tasks", num_tasks);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py
new file mode 100644
index 00000000..104aae5d
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py
@@ -0,0 +1,59 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for chain_N test (linear dependency chain).
+
+Builds a chain of N tasks where each adds 1.0 to its input:
+  seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result
+  result = N.0
+
+Measures dependency chain resolution overhead vs chain length.
+
+Cases sweep chain length: 4, 8, 16, 32, 64.
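The chain semantics can be checked on the host without the runtime; a minimal sketch:

```python
from functools import reduce


def simulate_chain(chain_len: int, seed: float = 0.0) -> float:
    # Each link computes out = in + 1.0, so the result is seed + chain_len.
    return reduce(lambda acc, _: acc + 1.0, range(chain_len), seed)


for n in (4, 8, 16, 32, 64):
    assert simulate_chain(n) == float(n)
```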
+ +Args layout: [seed, result, chain_len] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + "Chain4": {"chain_len": 4}, + "Chain8": {"chain_len": 8}, + "Chain16": {"chain_len": 16}, + "Chain32": {"chain_len": 32}, + "Chain64": {"chain_len": 64}, +} + +DEFAULT_CASE = "Chain32" + + +def generate_inputs(params: dict) -> list: + chain_len = params["chain_len"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("chain_len", ctypes.c_int64(chain_len)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + chain_len = params["chain_len"] + result = torch.as_tensor(tensors["result"]) + result[0] = float(chain_len) diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp new file mode 100644 index 00000000..e2231252 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp @@ -0,0 +1,43 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Increment AIV Kernel for Graph Topology Tests
+ *
+ * Minimal vector kernel: reads one input scalar, writes output = input + 1.0.
+ * Used to build DAG chains with verifiable accumulated values.
+ *
+ * Args:
+ *   args[0] = input tensor (INPUT) - single float32 element
+ *   args[1] = output tensor (OUTPUT/INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* in_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[1]);
+  __gm__ float* in_ptr = reinterpret_cast<__gm__ float*>(in_tensor->buffer.addr) + in_tensor->start_offset;
+  __gm__ float* out_ptr = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out_ptr = *in_ptr + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..7c0ff340
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Graph Topology Tests
+ *
+ * Minimal vector kernel that increments first arg by 1.0 (INOUT pattern).
+ * Additional args beyond args[0] are ignored by the kernel but create
+ * runtime dependencies for barrier/merge tasks in fan-in and diamond topologies.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ *   args[1..N] = dependency inputs (INPUT, ignored by kernel)
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py
new file mode 100644
index 00000000..e5cbc61a
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py
@@ -0,0 +1,42 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for chain_N test (tensormap_and_ringbuffer).
+
+Linear dependency chain: seed -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result.
+Uses a single AIV increment kernel (out = in + 1.0).
+
+Kernels:
+  func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "chain_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "INC",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_inc_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp
new file mode 100644
index 00000000..9e7e597f
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Chain Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a linear dependency chain of N tasks:
+ *   seed -> [Task_0] -> intermediate_0 -> [Task_1] -> ... -> [Task_{N-1}] -> result
+ *
+ * Each task reads its input and writes output = input + 1.0.
+ * After N tasks, result = N.0 (starting from seed = 0.0).
+ *
+ * Tasks 0..N-2 produce runtime-allocated intermediate tensors (OUTPUT).
+ * Task N-1 writes to the external result tensor (INOUT).
+ * This tests the runtime's INPUT->OUTPUT dependency resolution across a chain.
+ *
+ * Arg layout: [seed, result, chain_len]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int chain_len = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[chain_N] chain_len=%d", chain_len);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  if (chain_len == 1) {
+    // Single task: seed -> result
+    Arg params;
+    params.add_input(seed);
+    params.add_inout(result);
+    pto2_rt_submit_aiv_task(FUNC_INC, params);
+  } else {
+    // First task: seed -> intermediate_0
+    Arg first_params;
+    first_params.add_input(seed);
+    first_params.add_output(ci);
+    TaskOutputTensors prev = pto2_rt_submit_aiv_task(FUNC_INC, first_params);
+
+    // Middle tasks: intermediate_{i-1} -> intermediate_i
+    for (int i = 1; i < chain_len - 1; i++) {
+      Arg params;
+      params.add_input(prev.get_ref(0));
+      params.add_output(ci);
+      prev = pto2_rt_submit_aiv_task(FUNC_INC, params);
+    }
+
+    // Last task: intermediate_{N-2} -> result
+    Arg last_params;
+    last_params.add_input(prev.get_ref(0));
+    last_params.add_inout(result);
+    pto2_rt_submit_aiv_task(FUNC_INC, last_params);
+  }
+
+  LOG_ALWAYS("[chain_N] Submitted %d chained tasks", chain_len);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py
new file mode 100644
index 00000000..1147a734
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py
@@ -0,0 +1,80 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for diamond test (fork-join topology).
+ +Diamond DAG: A -> {B_0, B_1, ..., B_{W-1}} -> D + + seed(0.0) -> [Source A] -> a_out(1.0) -> [Branch B_0] -> b_out_0(2.0) -. + -> [Branch B_1] -> b_out_1(2.0) -+-> [Merge D] -> result(1.0) + -> ... | + -> [Branch B_{W-1}] -> ... -' + +Source A and branches each add 1.0 to their input. Merge D increments +result from 0.0 to 1.0 (using the noop kernel, which only touches result). + +Three branch modes: + mode=0: All AIV branches + mode=1: All AIC branches + mode=2: Mixed AIC+AIV branches (even=AIC, odd=AIV) + +Cases sweep branch width (2/4/8/15) x mode (AIV/AIC/mixed). + +Args layout: [seed, result, width, mode] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + # All-AIV branches + "W2_AIV": {"width": 2, "mode": 0}, + "W4_AIV": {"width": 4, "mode": 0}, + "W8_AIV": {"width": 8, "mode": 0}, + "W15_AIV": {"width": 15, "mode": 0}, + # All-AIC branches + "W2_AIC": {"width": 2, "mode": 1}, + "W4_AIC": {"width": 4, "mode": 1}, + "W8_AIC": {"width": 8, "mode": 1}, + "W15_AIC": {"width": 15, "mode": 1}, + # Mixed AIC+AIV branches + "W2_Mixed": {"width": 2, "mode": 2}, + "W4_Mixed": {"width": 4, "mode": 2}, + "W8_Mixed": {"width": 8, "mode": 2}, + "W15_Mixed": {"width": 15, "mode": 2}, +} + +DEFAULT_CASE = "W15_AIV" + + +def generate_inputs(params: dict) -> list: + width = params["width"] + mode = params["mode"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("width", ctypes.c_int64(width)), + ("mode", ctypes.c_int64(mode)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + result = torch.as_tensor(tensors["result"]) + # Merge D increments result from 0.0 to 1.0 + result[0] = 1.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp new file mode 100644 index 
00000000..823dd38e --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp @@ -0,0 +1,43 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Increment AIC Kernel for Diamond Topology Test + * + * Minimal cube kernel: reads one input scalar, writes output = input + 1.0. + * AIC counterpart of kernel_inc_aiv.cpp for mixed AIC+AIV branch testing. 
+ *
+ * Args:
+ *   args[0] = input tensor (INPUT) - single float32 element
+ *   args[1] = output tensor (OUTPUT/INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* in_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[1]);
+  __gm__ float* in_ptr = reinterpret_cast<__gm__ float*>(in_tensor->buffer.addr) + in_tensor->start_offset;
+  __gm__ float* out_ptr = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out_ptr = *in_ptr + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py
new file mode 100644
index 00000000..649f7040
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py
@@ -0,0 +1,57 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for diamond test (tensormap_and_ringbuffer).
+ +Fork-join topology: A -> {B_0, ..., B_{W-1}} -> D. +Supports all-AIV, all-AIC, and mixed AIC+AIV branch modes. + +Kernels: + func_id=0: kernel_inc_aic (AIC) - reads input, writes output = input + 1.0 + func_id=1: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 + func_id=2: kernel_noop_aiv (AIV) - increments INOUT by 1.0 (merge kernel) +""" + +from pathlib import Path + +_KERNELS_ROOT = Path(__file__).parent +_CHAIN_KERNELS = _KERNELS_ROOT / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(_KERNELS_ROOT / "orchestration" / "diamond_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC_AIC", + "source": str(_KERNELS_ROOT / "aic" / "kernel_inc_aic.cpp"), + "core_type": "aic", + }, + { + "func_id": 1, + "name": "INC_AIV", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, + { + "func_id": 2, + "name": "NOOP_AIV", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_noop_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp new file mode 100644 index 00000000..78ebb8de --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp @@ -0,0 +1,109 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Diamond (Fork-Join) Orchestration (tensormap_and_ringbuffer Runtime) + * + * Builds a diamond DAG: A -> {B_0, B_1, ..., B_{W-1}} -> D + * + * seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -. + * -> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result + * -> ... | + * -> [Branch B_{W-1}] -> b_out_{W-1} -' + * + * Source A is always AIV. Merge D is always AIV (NOOP kernel). + * Branch tasks vary by mode: + * mode=0: All AIV branches + * mode=1: All AIC branches + * mode=2: Alternating AIC/AIV branches (even=AIC, odd=AIV) + * + * Tests: fan-out + fan-in combined, the most common real DAG pattern. + * With mixed modes, also tests cross-core-type dependency coordination. 
+ *
+ * Arg layout: [seed, result, width, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC_AIC 0
+#define FUNC_INC_AIV 1
+#define FUNC_NOOP_AIV 2
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 4,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int width = static_cast<int>(orch_args.scalar(0));
+  int mode = static_cast<int>(orch_args.scalar(1));
+
+  LOG_ALWAYS("[diamond] width=%d, mode=%d (0=AIV, 1=AIC, 2=mixed)", width, mode);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Source A (always AIV): seed -> a_out
+  Arg src_params;
+  src_params.add_input(seed);
+  src_params.add_output(ci);
+  TaskOutputTensors source_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, src_params);
+
+  // Build merge args incrementally
+  Arg merge_params;
+  merge_params.add_inout(result);
+
+  // Branch tasks: each reads source output, produces branch output
+  for (int i = 0; i < width; i++) {
+    Arg bp;
+    bp.add_input(source_outs.get_ref(0));
+    bp.add_output(ci);
+
+    TaskOutputTensors branch_outs;
+    if (mode == 0) {
+      // All AIV
+      branch_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, bp);
+    } else if (mode == 1) {
+      // All AIC
+      branch_outs = pto2_rt_submit_aic_task(FUNC_INC_AIC, bp);
+    } else {
+      // Mixed: even=AIC, odd=AIV
+      if (i % 2 == 0) {
+        branch_outs = pto2_rt_submit_aic_task(FUNC_INC_AIC, bp);
+      } else {
+        branch_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, bp);
+      }
+    }
+
+    merge_params.add_input(branch_outs.get_ref(0));
+  }
+
+  // Merge D (always AIV): waits for all branches, increments result
+  pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, merge_params);
+
+  LOG_ALWAYS("[diamond] Submitted 1 source + %d branches + 1 merge task", width);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py
new file mode 100644
index 00000000..653f69fe
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py
@@ -0,0 +1,65 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for fanin_N test (convergence barrier topology).
+
+N independent producer tasks converge into 1 barrier task:
+  seed(0.0) -> [Producer_0]     -> prod_out_0     -.
+  seed(0.0) -> [Producer_1]     -> prod_out_1     -+-> [Barrier] -> result = 1.0
+  ...                                               |
+  seed(0.0) -> [Producer_{N-1}] -> prod_out_{N-1} -'
+
+Each producer writes to an independent runtime tensor (no inter-producer deps).
+The barrier task depends on all N producer outputs (via INPUT args) and writes
+to the result tensor (INOUT), adding 1.0.
+
+Tests: dependency convergence overhead — how efficiently the runtime tracks
+N predecessors for a single barrier task.
+
+Cases sweep fan-in width: 2, 4, 8, 15.
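The expected values follow directly from the topology; a host-side sketch (illustrative, independent of the runtime API):

```python
def simulate_fanin(fanin_width: int, seed: float = 0.0):
    # Producers are independent: each computes seed + 1.0 into its own tensor.
    producer_outs = [seed + 1.0 for _ in range(fanin_width)]
    # The barrier kernel ignores its N INPUT deps and only increments
    # the result tensor (0.0 -> 1.0), regardless of fan-in width.
    result = 0.0 + 1.0
    return producer_outs, result

outs, result = simulate_fanin(15)
assert outs == [1.0] * 15
assert result == 1.0
```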
+ +Args layout: [seed, result, fanin_width] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + "Fanin2": {"fanin_width": 2}, + "Fanin4": {"fanin_width": 4}, + "Fanin8": {"fanin_width": 8}, + "Fanin15": {"fanin_width": 15}, +} + +DEFAULT_CASE = "Fanin15" + + +def generate_inputs(params: dict) -> list: + fanin_width = params["fanin_width"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("fanin_width", ctypes.c_int64(fanin_width)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + result = torch.as_tensor(tensors["result"]) + # Barrier increments result from 0.0 to 1.0 + result[0] = 1.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py new file mode 100644 index 00000000..1c84467d --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py @@ -0,0 +1,49 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +""" +Kernel configuration for fanin_N test (tensormap_and_ringbuffer). + +Fan-in topology: N producers -> 1 barrier. 
+Reuses chain_N's AIV kernels for both increment and noop operations. + +Kernels: + func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 + func_id=1: kernel_noop_aiv (AIV) - increments INOUT by 1.0 (barrier kernel) +""" + +from pathlib import Path + +_CHAIN_KERNELS = Path(__file__).parent / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(Path(__file__).parent / "orchestration" / "fanin_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, + { + "func_id": 1, + "name": "NOOP", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_noop_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp new file mode 100644 index 00000000..c140342c --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp @@ -0,0 +1,86 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Fan-In Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a fan-in DAG: N independent producer tasks -> 1 barrier task.
+ *
+ *   seed -> [Producer_0]     -> prod_out_0     -.
+ *   seed -> [Producer_1]     -> prod_out_1     -+-> [Barrier] -> result
+ *   ...                                          |
+ *   seed -> [Producer_{N-1}] -> prod_out_{N-1} -'
+ *
+ * Each producer reads seed (INPUT) and writes to an independent runtime
+ * tensor (OUTPUT). The barrier task reads all N producer outputs (INPUT
+ * for dependency tracking) and writes to result (INOUT).
+ *
+ * The barrier kernel (FUNC_NOOP) only uses args[0] (result INOUT).
+ * Producer output refs at args[1..N] are unused by the kernel but create
+ * runtime dependencies that force the barrier to wait for all producers.
+ *
+ * Tests: dependency convergence overhead, tracking N predecessors efficiently.
+ *
+ * Arg layout: [seed, result, fanin_width]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+#define FUNC_NOOP 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int fanin_width = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[fanin_N] fanin_width=%d", fanin_width);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Build barrier args incrementally: result (INOUT) + all producer outputs (INPUT)
+  Arg barrier_params;
+  barrier_params.add_inout(result);
+
+  // Submit N independent producers, collecting their output refs
+  for (int i = 0; i < fanin_width; i++) {
+    Arg p;
+    p.add_input(seed);
+    p.add_output(ci);
+    TaskOutputTensors outs = pto2_rt_submit_aiv_task(FUNC_INC, p);
+    barrier_params.add_input(outs.get_ref(0));
+  }
+
+  // Barrier task: waits for all producers, then increments result
+  pto2_rt_submit_aiv_task(FUNC_NOOP, barrier_params);
+
+  LOG_ALWAYS("[fanin_N] Submitted %d producers + 1 barrier task", fanin_width);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py
new file mode 100644
index 00000000..a1eb11fe
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py
@@ -0,0 +1,70 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for fanout_N test (wide fan-out topology).
+
+1 source task fans out to N independent consumer tasks:
+  seed(0.0) -> [Source] -> intermediate(1.0) -> [Consumer_0] -> result[0] = 2.0
+                                             -> [Consumer_1] -> result[1] = 2.0
+                                             -> ...
+ -> [Consumer_{N-1}] -> result[N-1] = 2.0 + +Tests parallel dispatch capability: can the runtime simultaneously issue +N independent tasks that all read from the same source output? + +Consumer output slots are cache-line aligned (64B = 16 float32 elements) +to avoid false sharing. + +Cases sweep fan-out width: 2, 4, 8, 15. + +Args layout: [seed, result, fanout_width] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +CACHE_LINE_ELEMS = 16 + +ALL_CASES = { + "Fanout2": {"fanout_width": 2}, + "Fanout4": {"fanout_width": 4}, + "Fanout8": {"fanout_width": 8}, + "Fanout15": {"fanout_width": 15}, +} + +DEFAULT_CASE = "Fanout15" + + +def generate_inputs(params: dict) -> list: + fanout_width = params["fanout_width"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(fanout_width * CACHE_LINE_ELEMS, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("fanout_width", ctypes.c_int64(fanout_width)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + fanout_width = params["fanout_width"] + result = torch.as_tensor(tensors["result"]) + + # Source output = seed(0.0) + 1.0 = 1.0 + # Each consumer output = source(1.0) + 1.0 = 2.0 + for i in range(fanout_width): + result[i * CACHE_LINE_ELEMS] = 2.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py new file mode 100644 index 00000000..4e30a29c --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py @@ -0,0 +1,42 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. 
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +""" +Kernel configuration for fanout_N test (tensormap_and_ringbuffer). + +Fan-out topology: 1 source -> N consumers. +Reuses chain_N's AIV increment kernel. + +Kernels: + func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 +""" + +from pathlib import Path + +_CHAIN_KERNELS = Path(__file__).parent / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(Path(__file__).parent / "orchestration" / "fanout_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp new file mode 100644 index 00000000..20824f62 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp @@ -0,0 +1,86 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Fan-Out Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a fan-out DAG: 1 source task -> N independent consumer tasks.
+ *
+ *   seed -> [Source] -> intermediate -> [Consumer_0]     -> result[0]
+ *                                    -> [Consumer_1]     -> result[1]
+ *                                    -> ...
+ *                                    -> [Consumer_{N-1}] -> result[N-1]
+ *
+ * Source produces a runtime tensor via OUTPUT. All N consumers read
+ * that tensor (INPUT) and write to separate cache-line-aligned slots
+ * in the result tensor (INOUT).
+ *
+ * Tests: parallel dispatch capability, core utilization with N independent
+ * ready-to-run tasks.
+ *
+ * Arg layout: [seed, result, fanout_width]
+ */
+
+#include <cstdint>
+#include <cstdio>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int fanout_width = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[fanout_N] fanout_width=%d", fanout_width);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Source task: seed -> intermediate
+  Arg src_params;
+  src_params.add_input(seed);
+  src_params.add_output(ci);
+  TaskOutputTensors source_outs = pto2_rt_submit_aiv_task(FUNC_INC, src_params);
+
+  // Consumer tasks: each reads source output, writes to separate result slot
+  constexpr uint32_t CACHE_LINE_ELEMS = 16;
+  uint32_t slot_shape[1] = {1};
+
+  for (int i = 0; i < fanout_width; i++) {
+    uint32_t view_offsets[1] = {static_cast<uint32_t>(i * CACHE_LINE_ELEMS)};
+    Tensor result_slot = result.view(slot_shape, view_offsets);
+
+    Arg params;
+    params.add_input(source_outs.get_ref(0));
+    params.add_inout(result_slot);
+    pto2_rt_submit_aiv_task(FUNC_INC, params);
+  }
+
+  LOG_ALWAYS("[fanout_N] Submitted 1 source + %d consumer tasks", fanout_width);
+}
+
+}  // extern "C"
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh
index 64b283e8..43e86e5d 100755
--- a/tools/benchmark_rounds.sh
+++ b/tools/benchmark_rounds.sh
@@ -1,4 +1,12 @@
 #!/usr/bin/env bash
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
 # Benchmark wrapper: run examples on hardware,
 # then parse device-log timing lines to report per-round latency.
 #
@@ -25,12 +33,24 @@ declare -A TMR_EXAMPLE_CASES=(
     [benchmark_bgemm]=""
     [paged_attention_unroll]="Case1,Case2"
     [batch_paged_attention]=""
+    [dispatch-independent]=""
+    [dispatch-serial]=""
+    [graph-chain_n]=""
+    [graph-fanin_n]=""
+    [graph-fanout_n]=""
+    [graph-diamond]=""
 )
 TMR_EXAMPLE_ORDER=(
     alternating_matmul_add
     benchmark_bgemm
     paged_attention_unroll
     batch_paged_attention
+    dispatch-independent
+    dispatch-serial
+    graph-chain_n
+    graph-fanin_n
+    graph-fanout_n
+    graph-diamond
 )

 # --- aicpu_build_graph ---
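The fan-in golden script is not part of this hunk. For orientation, here is a minimal sketch of what one might look like, mirroring the structure of `graph-fanout_n/golden.py` above. The case names and widths are illustrative only, and the sketch assumes (per `kernel_config.py`) that the NOOP barrier kernel increments its INOUT arg by exactly 1.0 and runs once regardless of fan-in width:

```python
import ctypes

import torch

__outputs__ = ["result"]

RTOL = 1e-5
ATOL = 1e-5

# Hypothetical cases; the actual sweep may differ.
ALL_CASES = {
    "Fanin4": {"fanin_width": 4},
    "Fanin8": {"fanin_width": 8},
}

DEFAULT_CASE = "Fanin8"


def generate_inputs(params: dict) -> list:
    # seed feeds every producer; result is the single barrier INOUT scalar.
    seed = torch.zeros(1, dtype=torch.float32)
    result = torch.zeros(1, dtype=torch.float32)
    return [
        ("seed", seed),
        ("result", result),
        ("fanin_width", ctypes.c_int64(params["fanin_width"])),
    ]


def compute_golden(tensors: dict, params: dict) -> None:
    # The barrier NOOP kernel executes once regardless of fan-in width,
    # incrementing the INOUT result exactly once: 0.0 + 1.0 = 1.0.
    result = torch.as_tensor(tensors["result"])
    result[0] = 1.0
```

Note that unlike the fan-out golden, the expected value is independent of width: the producers only create dependencies, and correctness is validated by the barrier firing after all of them.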