137 changes: 137 additions & 0 deletions tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
@@ -0,0 +1,137 @@
# Benchmark Scene Tests

This directory contains benchmark scene tests for the `tensormap_and_ringbuffer` runtime on the A2/A3 platform. These tests are designed to systematically characterize runtime performance across two dimensions: **dispatch overhead** and **graph topology**.

All tests use trivial kernels (noop or increment-by-one) to isolate runtime scheduling overhead from compute. Results are collected via `tools/benchmark_rounds.sh`.

## Scene 1: Dispatch & Scheduling Overhead

These tests isolate and quantify the runtime's "scheduling tax" — framework overhead independent of kernel computation.

### dispatch-independent (Task Scaling)

**Intent**: Measure how dispatch overhead grows with task count when tasks are fully independent (no inter-task data dependencies).

Each task writes `1.0` to its own cache-line-aligned slot (stride = 16 float32 = 64 bytes) in a shared output tensor, avoiding false sharing across non-coherent AICore L1 caches.

| Parameter | Values |
| --------- | ------ |
| num_tasks | 100, 500, 1000, 2000 |
| mode | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Linear growth in total dispatch time vs. task count. Super-linear growth indicates a scheduling bottleneck (e.g., O(N^2) dependency tracking).
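The super-linear check can be made concrete by fitting the measured totals in log-log space. The helper below is a sketch using hypothetical timings, not part of the test suite:

```python
import math

def scaling_exponent(task_counts, total_times):
    """Fit total_time ~ a * N^b in log-log space and return the exponent b.

    b near 1.0 means dispatch cost grows linearly with task count;
    b well above 1.0 points at a super-linear scheduling bottleneck
    (e.g. O(N^2) dependency tracking).
    """
    xs = [math.log(n) for n in task_counts]
    ys = [math.log(t) for t in total_times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical totals (microseconds) for the num_tasks sweep:
counts = [100, 500, 1000, 2000]
times = [110.0, 540.0, 1090.0, 2150.0]
print(round(scaling_exponent(counts, times), 2))  # close to 1.0 -> linear
```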

### dispatch-serial (Dispatch Throughput)

**Intent**: Measure maximum scheduler throughput under serial task submission with accumulation dependencies.

All N tasks write to the same counter (AIC counter or AIV counter), forming a serial dependency chain. The final counter value equals N, validating correctness.

| Parameter | Values |
| --------- | ------ |
| num_tasks | 100, 500, 1000, 2000 |
| mode | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Per-task dispatch latency (total time / N). Compare with `dispatch-independent` to quantify the overhead of serial dependencies vs. independent dispatch.
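With made-up timings, the comparison against `dispatch-independent` reduces to per-task latencies; the numbers below are illustrative only:

```python
def per_task_latency_us(total_time_us: float, num_tasks: int) -> float:
    """Average dispatch latency per task (total time / N)."""
    return total_time_us / num_tasks

# Hypothetical measurements for N = 1000 tasks:
serial_lat = per_task_latency_us(3200.0, 1000)   # serial accumulation chain
indep_lat = per_task_latency_us(1100.0, 1000)    # fully independent tasks
serial_dependency_cost = serial_lat - indep_lat  # extra cost per serial hop
print(serial_lat, indep_lat, round(serial_dependency_cost, 2))
```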

## Scene 2: Graph Topology Patterns

These tests stress-test the scheduler with different DAG dependency structures. Each topology exercises a different aspect of dependency resolution.

### graph-chain_n (Linear Chain)

**Intent**: Measure serial dependency resolution overhead as chain length increases.

```text
seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result(N.0)
```

Each task is an AIV increment kernel (`out = in + 1.0`). The result equals the chain length, validating every link executed.

| Parameter | Values |
| --------- | ------ |
| chain_len | 4, 8, 16, 32, 64 |

**What to look for**: End-to-end latency vs. chain length. Ideally linear; deviation reveals per-hop scheduling overhead.
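The expected result can be written as a trivial reference model mirroring the increment kernel described above:

```python
def chain_golden(chain_len: int, seed: float = 0.0) -> float:
    """Reference model: each link applies out = in + 1.0."""
    x = seed
    for _ in range(chain_len):
        x += 1.0
    return x

# The result equals the chain length, validating every link executed.
print(chain_golden(64))  # -> 64.0
```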

### graph-fanout_n (Wide Fan-Out)

**Intent**: Test parallel dispatch capability — can the runtime simultaneously issue N independent tasks from a single source?

```text
seed -> [Source] -> intermediate -> [Consumer_0] -> result[0]
-> [Consumer_1] -> result[1]
-> ...
-> [Consumer_{N-1}] -> result[N-1]
```

Consumer output slots are cache-line-aligned to avoid false sharing. Each consumer reads the same source output and writes `source + 1.0`.

| Parameter | Values |
| --------- | ------ |
| fanout_width | 2, 4, 8, 15 |

**What to look for**: Whether fan-out width impacts total latency. Ideal runtime dispatches all consumers in parallel, so latency should plateau rather than grow linearly.
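A reference model of the consumer outputs might look like the sketch below; the source output value is an assumption, since this document does not pin it down:

```python
CACHE_LINE_ELEMS = 16  # 64 B cache line / 4 B per float32

def fanout_golden(fanout_width: int, source_val: float) -> list:
    """Each consumer writes source + 1.0 to its own cache-line-aligned slot."""
    out = [0.0] * (fanout_width * CACHE_LINE_ELEMS)
    for i in range(fanout_width):
        out[i * CACHE_LINE_ELEMS] = source_val + 1.0
    return out

golden = fanout_golden(4, source_val=1.0)
print(golden[0], golden[16])  # slots hold source + 1.0; padding stays 0.0
```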

### graph-fanin_n (Convergence Barrier)

**Intent**: Measure dependency convergence overhead — how efficiently the runtime tracks N predecessors for a single barrier task.

```text
seed -> [Producer_0] -> prod_out_0 -.
seed -> [Producer_1] -> prod_out_1 -+-> [Barrier] -> result(1.0)
... |
seed -> [Producer_{N-1}] -> ... -'
```

Each producer writes independently; the barrier depends on all N producer outputs.

| Parameter | Values |
| --------- | ------ |
| fanin_width | 2, 4, 8, 15 |

**What to look for**: Barrier wait overhead vs. fan-in width. Measures the cost of tracking and synchronizing N predecessor completions.
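What "tracking N predecessor completions" amounts to can be sketched as a countdown; this is a hypothetical model, not the runtime's actual bookkeeping:

```python
class BarrierTask:
    """Countdown over predecessor completions; ready once all N are done."""

    def __init__(self, num_preds: int):
        self.remaining = num_preds
        self.ready = False

    def on_predecessor_done(self) -> None:
        self.remaining -= 1
        if self.remaining == 0:
            self.ready = True

barrier = BarrierTask(8)
for _ in range(8):
    barrier.on_predecessor_done()
print(barrier.ready)  # -> True
```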

### graph-diamond (Fork-Join)

**Intent**: Test the most common real-world DAG pattern — fan-out followed by fan-in (fork-join).

```text
seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -.
-> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result(1.0)
-> ... |
-> [Branch B_{W-1}] -> ... -'
```

Three branch modes exercise different core-type scheduling paths:

- **mode=0**: All AIV branches
- **mode=1**: All AIC branches
- **mode=2**: Mixed AIC+AIV (even=AIC, odd=AIV)

| Parameter | Values |
| --------- | ------ |
| width | 2, 4, 8, 15 |
| mode | AIV-only, AIC-only, Mixed AIC+AIV |

**What to look for**: Combined fan-out + fan-in overhead. Compare with isolated fanout/fanin tests to check for compounding effects. Mixed mode reveals cross-core-type scheduling costs.
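The three branch modes map to core types as follows (a direct restatement of the mode list above):

```python
def branch_core_types(width: int, mode: int) -> list:
    """Core type per branch: 0 = all AIV, 1 = all AIC, 2 = even AIC / odd AIV."""
    if mode == 0:
        return ["AIV"] * width
    if mode == 1:
        return ["AIC"] * width
    return ["AIC" if i % 2 == 0 else "AIV" for i in range(width)]

print(branch_core_types(4, 2))  # -> ['AIC', 'AIV', 'AIC', 'AIV']
```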

## Also Updated: benchmark_bgemm

The existing `benchmark_bgemm` test was extended with structured parameter sweeps:

- **Tile size sweep** (16, 32, 64, 128) at fixed batch and grid_k
- **Batch/group sweep** (1, 4, 16, 64 groups) at fixed tile size
- **Grid-K sweep** (1, 2, 4, 8) at fixed tile and batch
- **In-core loop sweep** (1, 4, 16) at fixed tile, batch, and grid_k

These complement the original 5 cases with systematic single-variable sweeps for identifying performance cliffs.
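The sweep cases are tied together by the constraint checked in `golden.py`: `matmul_add_task_num` must be divisible by `grid_k`, and the group count is their quotient. For example:

```python
def num_groups(matmul_add_task_num: int, grid_k: int) -> int:
    """Group count as derived in golden.py."""
    if matmul_add_task_num % grid_k != 0:
        raise ValueError("matmul_add_task_num must be divisible by grid_k")
    return matmul_add_task_num // grid_k

# Batch64: 128 tasks with grid_k=2 -> 64 groups
print(num_groups(128, 2))  # -> 64
# K8: 128 tasks with grid_k=8 -> the fixed 16 groups
print(num_groups(128, 8))  # -> 16
```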

## Running

```bash
# Run all benchmark scene tests (100 rounds each, default)
./tools/benchmark_rounds.sh

# Customize
./tools/benchmark_rounds.sh -n 50 -d 0 -p a2a3 -r tensormap_and_ringbuffer -v
```
98 changes: 67 additions & 31 deletions tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
@@ -1,3 +1,11 @@
# Copyright (c) PyPTO Contributors.
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
# CANN Open Software License Agreement Version 2.0 (the "License").
# Please refer to the License for details. You may not use this file except in compliance with the License.
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
"""
Golden test specification for BGEMM (tensormap_and_ringbuffer Runtime).

@@ -24,49 +32,81 @@
SUPPORTED_INCORE_DATA_SIZES = {16, 32, 64, 128}

ALL_CASES = {
    "Case1": {
        "matmul_add_task_num": 500,
        "incore_task_granularity": {
            "incore_data_size": 128,
            "incore_loop": 4,
        },
        "grid_k": 2,
    },
    # --- Tile Size Sweep (fixed: num_groups=16, grid_k=2, incore_loop=4) ---
    "Tile16": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 16, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile32": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 32, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile64": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 64, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile128": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    # --- Batch/Group Sweep (fixed: tile=128, grid_k=2, incore_loop=4) ---
    "Batch1": {
        "matmul_add_task_num": 2,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    "Batch4": {
        "matmul_add_task_num": 8,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    "Batch64": {
        "matmul_add_task_num": 128,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    # --- K Dimension Sweep (fixed: tile=128, num_groups=16, incore_loop=4) ---
    "K1": {
        "matmul_add_task_num": 16,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 1,
    },
    "K4": {
        "matmul_add_task_num": 64,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 4,
    },
    "K8": {
        "matmul_add_task_num": 128,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 8,
    },
    # --- In-Core Loop Sweep (fixed: tile=128, num_groups=16, grid_k=2) ---
    "Loop1": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 1},
        "grid_k": 2,
    },
    "Loop16": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 16},
        "grid_k": 2,
    },
}

DEFAULT_CASE = "Case1"


def generate_inputs(params: dict) -> list:
@@ -80,18 +120,14 @@ def generate_inputs(params: dict) -> list:
# --- constraint checks ---
if tile_size not in SUPPORTED_INCORE_DATA_SIZES:
raise ValueError(
f"incore_data_size={tile_size} is not supported. Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
)
if incore_loop <= 0:
raise ValueError(f"incore_loop must be positive, got {incore_loop}")
if grid_k <= 0:
raise ValueError(f"grid_k must be positive, got {grid_k}")
if matmul_add_task_num % grid_k != 0:
raise ValueError(f"matmul_add_task_num ({matmul_add_task_num}) must be divisible by grid_k ({grid_k}).")

num_groups = matmul_add_task_num // grid_k

@@ -0,0 +1,79 @@
# Copyright (c) PyPTO Contributors.
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
# CANN Open Software License Agreement Version 2.0 (the "License").
# Please refer to the License for details. You may not use this file except in compliance with the License.
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
"""
Golden script for task_scaling test.

Measures dispatch overhead vs task count. Submits N independent noop tasks,
each writing 1.0 to a separate cache-line-aligned slot. Output tensor is
padded so each task's slot sits on its own cache line (stride = 16 float32
elements = 64 bytes), avoiding false sharing across non-coherent AICore L1
caches.

Cases parameterize task count (100→2000) and core type:
AIC-only sweep: 100, 500, 1000, 2000 tasks
AIV-only sweep: 100, 500, 1000, 2000 tasks
AIC+AIV sweep: 100, 500, 1000, 2000 tasks

Args layout: [output, num_tasks, mode]
"""

import ctypes

import torch

__outputs__ = ["output"]

RTOL = 1e-5
ATOL = 1e-5

# Each task writes to a separate cache line to avoid false sharing
# across non-coherent AICore L1 caches (64B = 16 float32 elements).
CACHE_LINE_ELEMS = 16

ALL_CASES = {
# AIC-only (mode=0)
"Case1": {"num_tasks": 100, "mode": 0},
"Case2": {"num_tasks": 500, "mode": 0},
"Case3": {"num_tasks": 1000, "mode": 0},
"Case4": {"num_tasks": 2000, "mode": 0},
# AIV-only (mode=1)
"Case5": {"num_tasks": 100, "mode": 1},
"Case6": {"num_tasks": 500, "mode": 1},
"Case7": {"num_tasks": 1000, "mode": 1},
"Case8": {"num_tasks": 2000, "mode": 1},
# AIC+AIV alternating (mode=2)
"Case9": {"num_tasks": 100, "mode": 2},
"Case10": {"num_tasks": 500, "mode": 2},
"Case11": {"num_tasks": 1000, "mode": 2},
"Case12": {"num_tasks": 2000, "mode": 2},
}

DEFAULT_CASE = "Case2"


def generate_inputs(params: dict) -> list:
num_tasks = params["num_tasks"]
mode = params["mode"]

output = torch.zeros(num_tasks * CACHE_LINE_ELEMS, dtype=torch.float32)

return [
("output", output),
("num_tasks", ctypes.c_int64(num_tasks)),
("mode", ctypes.c_int64(mode)),
]


def compute_golden(tensors: dict, params: dict) -> None:
num_tasks = params["num_tasks"]
output = torch.as_tensor(tensors["output"])

# Each independent task writes 1.0 to its cache-line-aligned slot
for i in range(num_tasks):
output[i * CACHE_LINE_ELEMS] = 1.0
@@ -0,0 +1,40 @@
/*
* Copyright (c) PyPTO Contributors.
* This program is free software, you can redistribute it and/or modify it under the terms and conditions of
* CANN Open Software License Agreement Version 2.0 (the "License").
* Please refer to the License for details. You may not use this file except in compliance with the License.
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
* INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
* See LICENSE in the root of the software repository for the full text of the License.
* -----------------------------------------------------------------------------------------------------------
*/
/**
* No-op AIC Kernel for Task Scaling
*
 * Minimal cube kernel that performs a trivial write. Each task writes 1.0
 * at its designated position in the output tensor, confirming it executed.
*
* Args:
* args[0] = output tensor (INOUT) - single float32 element per task
*/

#include <cstdint>
#include <pto/pto-inst.hpp>

#include "tensor.h"

using namespace pto; // NOLINT(build/namespaces)

#ifndef __gm__
#define __gm__
#endif

#ifndef __aicore__
#define __aicore__ [aicore] // NOLINT(whitespace/braces)
#endif

extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
__gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
__gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
*out = 1.0f;
}