diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md b/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
new file mode 100644
index 00000000..7ca172c0
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
@@ -0,0 +1,137 @@
+# Benchmark Scene Tests
+
+This directory contains benchmark scene tests for the `tensormap_and_ringbuffer` runtime on the A2/A3 platform. These tests are designed to systematically characterize runtime performance across two dimensions: **dispatch overhead** and **graph topology**.
+
+All tests use trivial kernels (noop or increment-by-one) to isolate runtime scheduling overhead from compute. Results are collected via `tools/benchmark_rounds.sh`.
+
+## Scene 1: Dispatch & Scheduling Overhead
+
+These tests isolate and quantify the runtime's "scheduling tax" — framework overhead independent of kernel computation.
+
+### dispatch-independent (Task Scaling)
+
+**Intent**: Measure how dispatch overhead grows with task count when tasks are fully independent (no inter-task data dependencies).
+
+Each task writes `1.0` to its own cache-line-aligned slot (stride = 16 float32 = 64 bytes) in a shared output tensor, avoiding false sharing across non-coherent AICore L1 caches.
+
+| Parameter | Values |
+| --------- | ------ |
+| num_tasks | 100, 500, 1000, 2000 |
+| mode | AIC-only, AIV-only, AIC+AIV alternating |
+
+**What to look for**: Linear growth in total dispatch time vs. task count. Super-linear growth indicates a scheduling bottleneck (e.g., O(N^2) dependency tracking).
+
+### dispatch-serial (Dispatch Throughput)
+
+**Intent**: Measure maximum scheduler throughput under serial task submission with accumulation dependencies.
+
+All N tasks write to the same counter (AIC counter or AIV counter), forming a serial dependency chain. The final counter value equals N, validating correctness.
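The accumulation semantics can be modeled host-side in a few lines of Python (an illustrative sketch of what the golden script checks, not the device kernels; `mode` follows the convention used throughout these tests: 0 = AIC-only, 1 = AIV-only, 2 = alternating with even-indexed tasks on AIC):

```python
def expected_counters(num_tasks: int, mode: int) -> tuple:
    """Model the serial chain: every task performs counter += 1.0."""
    out_aic = out_aiv = 0.0
    for i in range(num_tasks):
        if mode == 0 or (mode == 2 and i % 2 == 0):
            out_aic += 1.0  # each task waits on the previous write to this counter
        else:
            out_aiv += 1.0
    return out_aic, out_aiv
```

In alternating mode the two counters each form their own dependency chain, so `expected_counters(2000, 2)` yields 1000.0 per counter.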
+
+| Parameter | Values |
+| --------- | ------ |
+| num_tasks | 100, 500, 1000, 2000 |
+| mode | AIC-only, AIV-only, AIC+AIV alternating |
+
+**What to look for**: Per-task dispatch latency (total time / N). Compare with `dispatch-independent` to quantify the overhead of serial dependencies vs. independent dispatch.
+
+## Scene 2: Graph Topology Patterns
+
+These tests stress-test the scheduler with different DAG dependency structures. Each topology exercises a different aspect of dependency resolution.
+
+### graph-chain_n (Linear Chain)
+
+**Intent**: Measure serial dependency resolution overhead as chain length increases.
+
+```text
+seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result(N.0)
+```
+
+Each task is an AIV increment kernel (`out = in + 1.0`). The result equals the chain length, validating that every link executed.
+
+| Parameter | Values |
+| --------- | ------ |
+| chain_len | 4, 8, 16, 32, 64 |
+
+**What to look for**: End-to-end latency vs. chain length. Ideally linear; deviation reveals per-hop scheduling overhead.
+
+### graph-fanout_n (Wide Fan-Out)
+
+**Intent**: Test parallel dispatch capability — can the runtime simultaneously issue N independent tasks from a single source?
+
+```text
+seed -> [Source] -> intermediate -> [Consumer_0] -> result[0]
+                                 -> [Consumer_1] -> result[1]
+                                 -> ...
+                                 -> [Consumer_{N-1}] -> result[N-1]
+```
+
+Consumer output slots are cache-line-aligned to avoid false sharing. Each consumer reads the same source output and writes `source + 1.0`.
+
+| Parameter | Values |
+| --------- | ------ |
+| fanout_width | 2, 4, 8, 15 |
+
+**What to look for**: Whether fan-out width impacts total latency. An ideal runtime dispatches all consumers in parallel, so latency should plateau rather than grow linearly.
+
+### graph-fanin_n (Convergence Barrier)
+
+**Intent**: Measure dependency convergence overhead — how efficiently the runtime tracks N predecessors for a single barrier task.
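The cost being measured is the scheduler's predecessor bookkeeping. A toy model of that bookkeeping (hypothetical names; the real dependency tracking is internal to the runtime):

```python
class BarrierTask:
    """Release a barrier only after all N predecessors have completed."""

    def __init__(self, num_predecessors: int) -> None:
        self.pending = num_predecessors

    def on_predecessor_done(self) -> bool:
        """Called once per completed producer; True means the barrier is ready."""
        self.pending -= 1
        return self.pending == 0
```

With `fanin_width = 8`, only the eighth completion releases the barrier; the test measures how this per-completion accounting scales with N.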
+
+```text
+seed -> [Producer_0] -> prod_out_0 -.
+seed -> [Producer_1] -> prod_out_1 -+-> [Barrier] -> result(1.0)
+...                                 |
+seed -> [Producer_{N-1}] -> ...    -'
+```
+
+Each producer writes independently; the barrier depends on all N producer outputs.
+
+| Parameter | Values |
+| --------- | ------ |
+| fanin_width | 2, 4, 8, 15 |
+
+**What to look for**: Barrier wait overhead vs. fan-in width. Measures the cost of tracking and synchronizing N predecessor completions.
+
+### graph-diamond (Fork-Join)
+
+**Intent**: Test the most common real-world DAG pattern — fan-out followed by fan-in (fork-join).
+
+```text
+seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -.
+                            -> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result(1.0)
+                            -> ...                      |
+                            -> [Branch B_{W-1}] -> ...  -'
+```
+
+Three branch modes exercise different core-type scheduling paths:
+
+- **mode=0**: All AIV branches
+- **mode=1**: All AIC branches
+- **mode=2**: Mixed AIC+AIV (even=AIC, odd=AIV)
+
+| Parameter | Values |
+| --------- | ------ |
+| width | 2, 4, 8, 15 |
+| mode | AIV-only, AIC-only, Mixed AIC+AIV |
+
+**What to look for**: Combined fan-out + fan-in overhead. Compare with isolated fanout/fanin tests to check for compounding effects. Mixed mode reveals cross-core-type scheduling costs.
+
+## Also Updated: benchmark_bgemm
+
+The existing `benchmark_bgemm` test was extended with structured parameter sweeps:
+
+- **Tile size sweep** (16, 32, 64, 128) at fixed batch and grid_k
+- **Batch/group sweep** (1, 4, 16, 64 groups) at fixed tile size
+- **Grid-K sweep** (1, 2, 4, 8) at fixed tile and batch
+- **In-core loop sweep** (1, 4, 16) at fixed tile, batch, and grid_k
+
+These complement the original 5 cases with systematic single-variable sweeps for identifying performance cliffs.
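Because each sweep varies exactly one parameter, the case tables can be generated mechanically. A sketch of the tile sweep (the helper name is illustrative; the authoritative case definitions live in `benchmark_bgemm/golden.py`, where `matmul_add_task_num = num_groups * grid_k`):

```python
def tile_sweep_cases(tiles=(16, 32, 64, 128), num_groups=16, grid_k=2, incore_loop=4):
    """One case per tile size; all other parameters held fixed."""
    return {
        f"Tile{t}": {
            "matmul_add_task_num": num_groups * grid_k,
            "incore_task_granularity": {"incore_data_size": t, "incore_loop": incore_loop},
            "grid_k": grid_k,
        }
        for t in tiles
    }
```

The batch/group, grid-K, and in-core loop sweeps follow the same single-variable shape.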
+
+## Running
+
+```bash
+# Run all benchmark scene tests (100 rounds each, default)
+./tools/benchmark_rounds.sh
+
+# Customize
+./tools/benchmark_rounds.sh -n 50 -d 0 -p a2a3 -r tensormap_and_ringbuffer -v
+```
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
index 444b2997..3e698cf4 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
@@ -1,3 +1,11 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
 """
 Golden test specification for BGEMM (tensormap_and_ringbuffer Runtime).
@@ -24,7 +32,7 @@
 SUPPORTED_INCORE_DATA_SIZES = {16, 32, 64, 128}
 
 ALL_CASES = {
-    "Case0": {
+    "Case1": {
         "matmul_add_task_num": 500,
         "incore_task_granularity": {
             "incore_data_size": 128,
             "incore_loop": 4,
         },
         "grid_k": 2,
     },
@@ -32,41 +40,73 @@
-    "Case1": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    # --- Tile Size Sweep (fixed: num_groups=16, grid_k=2, incore_loop=4) ---
+    "Tile16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 16, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case2": {
-        "matmul_add_task_num": 256,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    "Tile32": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 32, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case3": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 16,
-        },
+    "Tile64": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 64, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Tile128": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case4": {
+    # --- Batch/Group Sweep (fixed: tile=128, grid_k=2, incore_loop=4) ---
+    "Batch1": {
+        "matmul_add_task_num": 2,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch4": {
+        "matmul_add_task_num": 8,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch64": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    # --- K Dimension Sweep (fixed: tile=128, num_groups=16, incore_loop=4) ---
+    "K1": {
+        "matmul_add_task_num": 16,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 1,
+    },
+    "K4": {
         "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 4,
     },
+    "K8": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 8,
+    },
+    # --- In-Core Loop Sweep (fixed: tile=128, num_groups=16, grid_k=2) ---
+    "Loop1": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 1},
+        "grid_k": 2,
+    },
+    "Loop16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 16},
+        "grid_k": 2,
+    },
 }
 
-DEFAULT_CASE = "Case0"
+DEFAULT_CASE = "Case1"
@@ -80,18 +120,14 @@ def generate_inputs(params: dict) -> list:
     # --- constraint checks ---
     if tile_size not in SUPPORTED_INCORE_DATA_SIZES:
         raise ValueError(
-            f"incore_data_size={tile_size} is not supported. "
-            f"Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
+            f"incore_data_size={tile_size} is not supported. Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
         )
     if incore_loop <= 0:
         raise ValueError(f"incore_loop must be positive, got {incore_loop}")
     if grid_k <= 0:
         raise ValueError(f"grid_k must be positive, got {grid_k}")
     if matmul_add_task_num % grid_k != 0:
-        raise ValueError(
-            f"matmul_add_task_num ({matmul_add_task_num}) must be "
-            f"divisible by grid_k ({grid_k})."
-        )
+        raise ValueError(f"matmul_add_task_num ({matmul_add_task_num}) must be divisible by grid_k ({grid_k}).")
 
     num_groups = matmul_add_task_num // grid_k
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py
new file mode 100644
index 00000000..d860d732
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/golden.py
@@ -0,0 +1,79 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for task_scaling test.
+
+Measures dispatch overhead vs task count. Submits N independent noop tasks,
+each writing 1.0 to a separate cache-line-aligned slot. Output tensor is
+padded so each task's slot sits on its own cache line (stride = 16 float32
+elements = 64 bytes), avoiding false sharing across non-coherent AICore L1
+caches.
+
+Cases parameterize task count (100→2000) and core type:
+    AIC-only sweep: 100, 500, 1000, 2000 tasks
+    AIV-only sweep: 100, 500, 1000, 2000 tasks
+    AIC+AIV sweep:  100, 500, 1000, 2000 tasks
+
+Args layout: [output, num_tasks, mode]
+"""
+
+import ctypes
+
+import torch
+
+__outputs__ = ["output"]
+
+RTOL = 1e-5
+ATOL = 1e-5
+
+# Each task writes to a separate cache line to avoid false sharing
+# across non-coherent AICore L1 caches (64B = 16 float32 elements).
+CACHE_LINE_ELEMS = 16
+
+ALL_CASES = {
+    # AIC-only (mode=0)
+    "Case1": {"num_tasks": 100, "mode": 0},
+    "Case2": {"num_tasks": 500, "mode": 0},
+    "Case3": {"num_tasks": 1000, "mode": 0},
+    "Case4": {"num_tasks": 2000, "mode": 0},
+    # AIV-only (mode=1)
+    "Case5": {"num_tasks": 100, "mode": 1},
+    "Case6": {"num_tasks": 500, "mode": 1},
+    "Case7": {"num_tasks": 1000, "mode": 1},
+    "Case8": {"num_tasks": 2000, "mode": 1},
+    # AIC+AIV alternating (mode=2)
+    "Case9": {"num_tasks": 100, "mode": 2},
+    "Case10": {"num_tasks": 500, "mode": 2},
+    "Case11": {"num_tasks": 1000, "mode": 2},
+    "Case12": {"num_tasks": 2000, "mode": 2},
+}
+
+DEFAULT_CASE = "Case2"
+
+
+def generate_inputs(params: dict) -> list:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    output = torch.zeros(num_tasks * CACHE_LINE_ELEMS, dtype=torch.float32)
+
+    return [
+        ("output", output),
+        ("num_tasks", ctypes.c_int64(num_tasks)),
+        ("mode", ctypes.c_int64(mode)),
+    ]
+
+
+def compute_golden(tensors: dict, params: dict) -> None:
+    num_tasks = params["num_tasks"]
+    output = torch.as_tensor(tensors["output"])
+
+    # Each independent task writes 1.0 to its cache-line-aligned slot
+    for i in range(num_tasks):
+        output[i * CACHE_LINE_ELEMS] = 1.0
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp
new file mode 100644
index 00000000..ca0e7669
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aic/kernel_noop_aic.cpp
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIC Kernel for Task Scaling
+ *
+ * Minimal cube kernel that performs a trivial write. Each task writes 1.0
+ * at its designated position in the output tensor, proving the task executed.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element per task
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..e0e6d4ae
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Task Scaling
+ *
+ * Minimal vector kernel that performs a trivial write. Each task writes 1.0
+ * at its designated position in the output tensor, proving the task executed.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element per task
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py
new file mode 100644
index 00000000..4973d254
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/kernel_config.py
@@ -0,0 +1,48 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for task_scaling test (tensormap_and_ringbuffer).
+
+Measures dispatch overhead growth as task count scales from 100 to 2000.
+
+Kernels:
+    func_id=0: kernel_noop_aic (AIC) - trivial write kernel
+    func_id=1: kernel_noop_aiv (AIV) - trivial write kernel
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "task_scaling_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "NOOP_AIC",
+        "source": str(_KERNELS_ROOT / "aic" / "kernel_noop_aic.cpp"),
+        "core_type": "aic",
+    },
+    {
+        "func_id": 1,
+        "name": "NOOP_AIV",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_noop_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp
new file mode 100644
index 00000000..658ce245
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-independent/kernels/orchestration/task_scaling_orch.cpp
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Task Scaling Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Measures dispatch overhead growth as task count scales. Submits N
+ * independent noop tasks, each writing 1.0 to a separate slot in the
+ * output tensor. Tasks are independent (no inter-task data dependency)
+ * to isolate pure scheduling overhead from serialization effects.
+ *
+ * Three modes:
+ *   mode=0: AIC-only — N independent AIC noop tasks
+ *   mode=1: AIV-only — N independent AIV noop tasks
+ *   mode=2: AIC+AIV — alternating AIC/AIV independent noop tasks
+ *
+ * Arg layout: [output, num_tasks, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_NOOP_AIC 0
+#define FUNC_NOOP_AIV 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+    (void)orch_args;
+    return PTO2OrchestrationConfig{
+        .expected_arg_count = 3,
+    };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+    if (orch_thread_index != 0) {
+        return;
+    }
+
+    Tensor output = from_tensor_arg(orch_args.tensor(0));
+    int num_tasks = static_cast<int>(orch_args.scalar(0));
+    int mode = static_cast<int>(orch_args.scalar(1));
+
+    LOG_ALWAYS("[task_scaling] num_tasks=%d, mode=%d (0=AIC, 1=AIV, 2=AIC+AIV)", num_tasks, mode);
+
+    // Each task writes to a separate cache line (64B = 16 float32 elements)
+    // to avoid false sharing across non-coherent AICore L1 caches.
+    constexpr uint32_t CACHE_LINE_ELEMS = 16;
+    uint32_t slot_shapes[1] = {1};
+
+    for (int i = 0; i < num_tasks; i++) {
+        uint32_t view_offsets[1] = {static_cast<uint32_t>(i * CACHE_LINE_ELEMS)};
+        Tensor slot = output.view(slot_shapes, view_offsets);
+
+        if (mode == 0) {
+            Arg params;
+            params.add_inout(slot);
+            pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+        } else if (mode == 1) {
+            Arg params;
+            params.add_inout(slot);
+            pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+        } else {
+            if (i % 2 == 0) {
+                Arg params;
+                params.add_inout(slot);
+                pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+            } else {
+                Arg params;
+                params.add_inout(slot);
+                pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+            }
+        }
+    }
+
+    LOG_ALWAYS("[task_scaling] Submitted %d independent tasks", num_tasks);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py
new file mode 100644
index 00000000..2a506f99
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/golden.py
@@ -0,0 +1,96 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for dispatch_throughput test.
+
+Measures scheduler throughput by submitting N noop tasks serially.
+Each task increments a counter, so the final output equals N (or N/2
+for each core type in AIC+AIV mode).
+
+Cases sweep across task counts and core types:
+    Case1:  100 AIC-only tasks
+    Case2:  500 AIC-only tasks
+    Case3:  1000 AIC-only tasks
+    Case4:  2000 AIC-only tasks
+    Case5:  100 AIV-only tasks
+    Case6:  500 AIV-only tasks
+    Case7:  1000 AIV-only tasks
+    Case8:  2000 AIV-only tasks
+    Case9:  100 AIC+AIV alternating tasks
+    Case10: 500 AIC+AIV alternating tasks
+    Case11: 1000 AIC+AIV alternating tasks
+    Case12: 2000 AIC+AIV alternating tasks
+
+Args layout: [out_aic, out_aiv, num_tasks, mode]
+"""
+
+import ctypes
+
+import torch
+
+__outputs__ = ["out_aic", "out_aiv"]
+
+RTOL = 1e-3
+ATOL = 1e-1  # Accumulated float additions may drift slightly
+
+ALL_CASES = {
+    # AIC-only (mode=0)
+    "Case1": {"num_tasks": 100, "mode": 0},
+    "Case2": {"num_tasks": 500, "mode": 0},
+    "Case3": {"num_tasks": 1000, "mode": 0},
+    "Case4": {"num_tasks": 2000, "mode": 0},
+    # AIV-only (mode=1)
+    "Case5": {"num_tasks": 100, "mode": 1},
+    "Case6": {"num_tasks": 500, "mode": 1},
+    "Case7": {"num_tasks": 1000, "mode": 1},
+    "Case8": {"num_tasks": 2000, "mode": 1},
+    # AIC+AIV alternating (mode=2)
+    "Case9": {"num_tasks": 100, "mode": 2},
+    "Case10": {"num_tasks": 500, "mode": 2},
+    "Case11": {"num_tasks": 1000, "mode": 2},
+    "Case12": {"num_tasks": 2000, "mode": 2},
+}
+
+DEFAULT_CASE = "Case2"
+
+
+def generate_inputs(params: dict) -> list:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    out_aic = torch.zeros(1, dtype=torch.float32)
+    out_aiv = torch.zeros(1, dtype=torch.float32)
+
+    return [
+        ("out_aic", out_aic),
+        ("out_aiv", out_aiv),
+        ("num_tasks", ctypes.c_int64(num_tasks)),
+        ("mode", ctypes.c_int64(mode)),
+    ]
+
+
+def compute_golden(tensors: dict, params: dict) -> None:
+    num_tasks = params["num_tasks"]
+    mode = params["mode"]
+
+    out_aic = torch.as_tensor(tensors["out_aic"])
+    out_aiv = torch.as_tensor(tensors["out_aiv"])
+
+    if mode == 0:
+        # AIC-only: all N tasks increment out_aic
+        out_aic[0] = float(num_tasks)
+    elif mode == 1:
+        # AIV-only: all N tasks increment out_aiv
+        out_aiv[0] = float(num_tasks)
+    elif mode == 2:
+        # AIC+AIV alternating: even tasks → AIC, odd tasks → AIV
+        aic_count = (num_tasks + 1) // 2
+        aiv_count = num_tasks // 2
+        out_aic[0] = float(aic_count)
+        out_aiv[0] = float(aiv_count)
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp
new file mode 100644
index 00000000..e5c13811
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aic/kernel_noop_aic.cpp
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIC Kernel for Dispatch Throughput
+ *
+ * Minimal cube kernel that writes a single scalar to prove execution.
+ * The kernel reads the current accumulated value, adds 1.0, and writes back.
+ * With N tasks, the final output should be N.0.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..2015e0ed
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,41 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Dispatch Throughput
+ *
+ * Minimal vector kernel that writes a single scalar to prove execution.
+ * The kernel reads the current accumulated value, adds 1.0, and writes back.
+ * With N tasks, the final output should be N.0.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+    __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+    __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+    *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py
new file mode 100644
index 00000000..c6813dcd
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/kernel_config.py
@@ -0,0 +1,48 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for dispatch_throughput test (tensormap_and_ringbuffer).
+
+Measures scheduler throughput by submitting N noop tasks serially.
+
+Kernels:
+    func_id=0: kernel_noop_aic (AIC) - empty cube kernel, increments counter
+    func_id=1: kernel_noop_aiv (AIV) - empty vector kernel, increments counter
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "dispatch_throughput_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "NOOP_AIC",
+        "source": str(_KERNELS_ROOT / "aic" / "kernel_noop_aic.cpp"),
+        "core_type": "aic",
+    },
+    {
+        "func_id": 1,
+        "name": "NOOP_AIV",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_noop_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp
new file mode 100644
index 00000000..1817b994
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/dispatch-serial/kernels/orchestration/dispatch_throughput_orch.cpp
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Dispatch Throughput Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Measures scheduler throughput by submitting N noop tasks serially.
+ * Each task increments a counter by 1.0, so the final output equals N.
+ *
+ * Three modes:
+ *   mode=0: AIC-only — N AIC noop tasks
+ *   mode=1: AIV-only — N AIV noop tasks
+ *   mode=2: AIC+AIV — alternating AIC/AIV noop tasks (N total)
+ *
+ * All tasks are chained through the same output tensor (INOUT) to enforce
+ * serial execution order — each task must wait for the previous one to
+ * complete before it can read the accumulated value.
+ *
+ * Arg layout: [out_aic, out_aiv, num_tasks, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_NOOP_AIC 0
+#define FUNC_NOOP_AIV 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 4,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor out_aic = from_tensor_arg(orch_args.tensor(0));
+  Tensor out_aiv = from_tensor_arg(orch_args.tensor(1));
+  int num_tasks = static_cast<int>(orch_args.scalar(0));
+  int mode = static_cast<int>(orch_args.scalar(1));
+
+  LOG_ALWAYS("[dispatch_throughput] num_tasks=%d, mode=%d (0=AIC, 1=AIV, 2=AIC+AIV)", num_tasks, mode);
+
+  for (int i = 0; i < num_tasks; i++) {
+    if (mode == 0) {
+      Arg params;
+      params.add_inout(out_aic);
+      pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+    } else if (mode == 1) {
+      Arg params;
+      params.add_inout(out_aiv);
+      pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+    } else {
+      // Alternating AIC/AIV
+      if (i % 2 == 0) {
+        Arg params;
+        params.add_inout(out_aic);
+        pto2_rt_submit_aic_task(FUNC_NOOP_AIC, params);
+      } else {
+        Arg params;
+        params.add_inout(out_aiv);
+        pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, params);
+      }
+    }
+  }
+
+  LOG_ALWAYS("[dispatch_throughput] Submitted %d tasks", num_tasks);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py
new file mode 100644
index 00000000..104aae5d
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/golden.py
@@ -0,0 +1,59 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for chain_N test (linear dependency chain).
+
+Builds a chain of N tasks where each adds 1.0 to its input:
+  seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result
+  result = N.0
+
+Measures dependency chain resolution overhead vs chain length.
+
+Cases sweep chain length: 4, 8, 16, 32, 64.
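The chain semantics can be checked on the host without the runtime; a minimal sketch:

```python
from functools import reduce


def simulate_chain(chain_len: int, seed: float = 0.0) -> float:
    # Each link computes out = in + 1.0, so the result is seed + chain_len.
    return reduce(lambda acc, _: acc + 1.0, range(chain_len), seed)


for n in (4, 8, 16, 32, 64):
    assert simulate_chain(n) == float(n)
```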
+ +Args layout: [seed, result, chain_len] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + "Chain4": {"chain_len": 4}, + "Chain8": {"chain_len": 8}, + "Chain16": {"chain_len": 16}, + "Chain32": {"chain_len": 32}, + "Chain64": {"chain_len": 64}, +} + +DEFAULT_CASE = "Chain32" + + +def generate_inputs(params: dict) -> list: + chain_len = params["chain_len"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("chain_len", ctypes.c_int64(chain_len)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + chain_len = params["chain_len"] + result = torch.as_tensor(tensors["result"]) + result[0] = float(chain_len) diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp new file mode 100644 index 00000000..e2231252 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_inc_aiv.cpp @@ -0,0 +1,43 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Increment AIV Kernel for Graph Topology Tests
+ *
+ * Minimal vector kernel: reads one input scalar, writes output = input + 1.0.
+ * Used to build DAG chains with verifiable accumulated values.
+ *
+ * Args:
+ *   args[0] = input tensor (INPUT) - single float32 element
+ *   args[1] = output tensor (OUTPUT/INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* in_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[1]);
+  __gm__ float* in_ptr = reinterpret_cast<__gm__ float*>(in_tensor->buffer.addr) + in_tensor->start_offset;
+  __gm__ float* out_ptr = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out_ptr = *in_ptr + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp
new file mode 100644
index 00000000..7c0ff340
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/aiv/kernel_noop_aiv.cpp
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * No-op AIV Kernel for Graph Topology Tests
+ *
+ * Minimal vector kernel that increments first arg by 1.0 (INOUT pattern).
+ * Additional args beyond args[0] are ignored by the kernel but create
+ * runtime dependencies for barrier/merge tasks in fan-in and diamond topologies.
+ *
+ * Args:
+ *   args[0] = output tensor (INOUT) - single float32 element
+ *   args[1..N] = dependency inputs (INPUT, ignored by kernel)
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out = *out + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py
new file mode 100644
index 00000000..e5cbc61a
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/kernel_config.py
@@ -0,0 +1,42 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for chain_N test (tensormap_and_ringbuffer).
+
+Linear dependency chain: seed -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result.
+Uses a single AIV increment kernel (out = in + 1.0).
+
+Kernels:
+  func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0
+"""
+
+from pathlib import Path
+
+_KERNELS_ROOT = Path(__file__).parent
+
+ORCHESTRATION = {
+    "source": str(_KERNELS_ROOT / "orchestration" / "chain_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "INC",
+        "source": str(_KERNELS_ROOT / "aiv" / "kernel_inc_aiv.cpp"),
+        "core_type": "aiv",
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "orch_thread_num": 1,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp
new file mode 100644
index 00000000..9e7e597f
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-chain_n/kernels/orchestration/chain_orch.cpp
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Chain Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a linear dependency chain of N tasks:
+ *   seed -> [Task_0] -> intermediate_0 -> [Task_1] -> ... -> [Task_{N-1}] -> result
+ *
+ * Each task reads its input and writes output = input + 1.0.
+ * After N tasks, result = N.0 (starting from seed = 0.0).
+ *
+ * Tasks 0..N-2 produce runtime-allocated intermediate tensors (OUTPUT).
+ * Task N-1 writes to the external result tensor (INOUT).
+ * This tests the runtime's INPUT->OUTPUT dependency resolution across a chain.
+ *
+ * Arg layout: [seed, result, chain_len]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int chain_len = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[chain_N] chain_len=%d", chain_len);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  if (chain_len == 1) {
+    // Single task: seed -> result
+    Arg params;
+    params.add_input(seed);
+    params.add_inout(result);
+    pto2_rt_submit_aiv_task(FUNC_INC, params);
+  } else {
+    // First task: seed -> intermediate_0
+    Arg first_params;
+    first_params.add_input(seed);
+    first_params.add_output(ci);
+    TaskOutputTensors prev = pto2_rt_submit_aiv_task(FUNC_INC, first_params);
+
+    // Middle tasks: intermediate_{i-1} -> intermediate_i
+    for (int i = 1; i < chain_len - 1; i++) {
+      Arg params;
+      params.add_input(prev.get_ref(0));
+      params.add_output(ci);
+      prev = pto2_rt_submit_aiv_task(FUNC_INC, params);
+    }
+
+    // Last task: intermediate_{N-2} -> result
+    Arg last_params;
+    last_params.add_input(prev.get_ref(0));
+    last_params.add_inout(result);
+    pto2_rt_submit_aiv_task(FUNC_INC, last_params);
+  }
+
+  LOG_ALWAYS("[chain_N] Submitted %d chained tasks", chain_len);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py
new file mode 100644
index 00000000..1147a734
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/golden.py
@@ -0,0 +1,80 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for diamond test (fork-join topology).
+ +Diamond DAG: A -> {B_0, B_1, ..., B_{W-1}} -> D + + seed(0.0) -> [Source A] -> a_out(1.0) -> [Branch B_0] -> b_out_0(2.0) -. + -> [Branch B_1] -> b_out_1(2.0) -+-> [Merge D] -> result(1.0) + -> ... | + -> [Branch B_{W-1}] -> ... -' + +Source A and branches each add 1.0 to their input. Merge D increments +result from 0.0 to 1.0 (using the noop kernel, which only touches result). + +Three branch modes: + mode=0: All AIV branches + mode=1: All AIC branches + mode=2: Mixed AIC+AIV branches (even=AIC, odd=AIV) + +Cases sweep branch width (2/4/8/15) x mode (AIV/AIC/mixed). + +Args layout: [seed, result, width, mode] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + # All-AIV branches + "W2_AIV": {"width": 2, "mode": 0}, + "W4_AIV": {"width": 4, "mode": 0}, + "W8_AIV": {"width": 8, "mode": 0}, + "W15_AIV": {"width": 15, "mode": 0}, + # All-AIC branches + "W2_AIC": {"width": 2, "mode": 1}, + "W4_AIC": {"width": 4, "mode": 1}, + "W8_AIC": {"width": 8, "mode": 1}, + "W15_AIC": {"width": 15, "mode": 1}, + # Mixed AIC+AIV branches + "W2_Mixed": {"width": 2, "mode": 2}, + "W4_Mixed": {"width": 4, "mode": 2}, + "W8_Mixed": {"width": 8, "mode": 2}, + "W15_Mixed": {"width": 15, "mode": 2}, +} + +DEFAULT_CASE = "W15_AIV" + + +def generate_inputs(params: dict) -> list: + width = params["width"] + mode = params["mode"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("width", ctypes.c_int64(width)), + ("mode", ctypes.c_int64(mode)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + result = torch.as_tensor(tensors["result"]) + # Merge D increments result from 0.0 to 1.0 + result[0] = 1.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp new file mode 100644 index 
00000000..823dd38e --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/aic/kernel_inc_aic.cpp @@ -0,0 +1,43 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Increment AIC Kernel for Diamond Topology Test + * + * Minimal cube kernel: reads one input scalar, writes output = input + 1.0. + * AIC counterpart of kernel_inc_aiv.cpp for mixed AIC+AIV branch testing. 
+ *
+ * Args:
+ *   args[0] = input tensor (INPUT) - single float32 element
+ *   args[1] = output tensor (OUTPUT/INOUT) - single float32 element
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "tensor.h"
+
+using namespace pto;  // NOLINT(build/namespaces)
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]  // NOLINT(whitespace/braces)
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
+  __gm__ Tensor* in_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
+  __gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[1]);
+  __gm__ float* in_ptr = reinterpret_cast<__gm__ float*>(in_tensor->buffer.addr) + in_tensor->start_offset;
+  __gm__ float* out_ptr = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
+  *out_ptr = *in_ptr + 1.0f;
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py
new file mode 100644
index 00000000..649f7040
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/kernel_config.py
@@ -0,0 +1,57 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Kernel configuration for diamond test (tensormap_and_ringbuffer).
+ +Fork-join topology: A -> {B_0, ..., B_{W-1}} -> D. +Supports all-AIV, all-AIC, and mixed AIC+AIV branch modes. + +Kernels: + func_id=0: kernel_inc_aic (AIC) - reads input, writes output = input + 1.0 + func_id=1: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 + func_id=2: kernel_noop_aiv (AIV) - increments INOUT by 1.0 (merge kernel) +""" + +from pathlib import Path + +_KERNELS_ROOT = Path(__file__).parent +_CHAIN_KERNELS = _KERNELS_ROOT / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(_KERNELS_ROOT / "orchestration" / "diamond_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC_AIC", + "source": str(_KERNELS_ROOT / "aic" / "kernel_inc_aic.cpp"), + "core_type": "aic", + }, + { + "func_id": 1, + "name": "INC_AIV", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, + { + "func_id": 2, + "name": "NOOP_AIV", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_noop_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp new file mode 100644 index 00000000..78ebb8de --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-diamond/kernels/orchestration/diamond_orch.cpp @@ -0,0 +1,109 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Diamond (Fork-Join) Orchestration (tensormap_and_ringbuffer Runtime) + * + * Builds a diamond DAG: A -> {B_0, B_1, ..., B_{W-1}} -> D + * + * seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -. + * -> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result + * -> ... | + * -> [Branch B_{W-1}] -> b_out_{W-1} -' + * + * Source A is always AIV. Merge D is always AIV (NOOP kernel). + * Branch tasks vary by mode: + * mode=0: All AIV branches + * mode=1: All AIC branches + * mode=2: Alternating AIC/AIV branches (even=AIC, odd=AIV) + * + * Tests: fan-out + fan-in combined, the most common real DAG pattern. + * With mixed modes, also tests cross-core-type dependency coordination. 
+ *
+ * Arg layout: [seed, result, width, mode]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC_AIC 0
+#define FUNC_INC_AIV 1
+#define FUNC_NOOP_AIV 2
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 4,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int width = static_cast<int>(orch_args.scalar(0));
+  int mode = static_cast<int>(orch_args.scalar(1));
+
+  LOG_ALWAYS("[diamond] width=%d, mode=%d (0=AIV, 1=AIC, 2=mixed)", width, mode);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Source A (always AIV): seed -> a_out
+  Arg src_params;
+  src_params.add_input(seed);
+  src_params.add_output(ci);
+  TaskOutputTensors source_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, src_params);
+
+  // Build merge args incrementally
+  Arg merge_params;
+  merge_params.add_inout(result);
+
+  // Branch tasks: each reads source output, produces branch output
+  for (int i = 0; i < width; i++) {
+    Arg bp;
+    bp.add_input(source_outs.get_ref(0));
+    bp.add_output(ci);
+
+    TaskOutputTensors branch_outs;
+    if (mode == 0) {
+      // All AIV
+      branch_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, bp);
+    } else if (mode == 1) {
+      // All AIC
+      branch_outs = pto2_rt_submit_aic_task(FUNC_INC_AIC, bp);
+    } else {
+      // Mixed: even=AIC, odd=AIV
+      if (i % 2 == 0) {
+        branch_outs = pto2_rt_submit_aic_task(FUNC_INC_AIC, bp);
+      } else {
+        branch_outs = pto2_rt_submit_aiv_task(FUNC_INC_AIV, bp);
+      }
+    }
+
+    merge_params.add_input(branch_outs.get_ref(0));
+  }
+
+  // Merge D (always AIV): waits for all branches, increments result
+  pto2_rt_submit_aiv_task(FUNC_NOOP_AIV, merge_params);
+
+  LOG_ALWAYS("[diamond] Submitted 1 source + %d branches + 1 merge task", width);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py
new file mode 100644
index 00000000..653f69fe
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/golden.py
@@ -0,0 +1,65 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for fanin_N test (convergence barrier topology).
+
+N independent producer tasks converge into 1 barrier task:
+  seed(0.0) -> [Producer_0]     -> prod_out_0     -.
+  seed(0.0) -> [Producer_1]     -> prod_out_1     -+-> [Barrier] -> result = 1.0
+  ...                                               |
+  seed(0.0) -> [Producer_{N-1}] -> prod_out_{N-1} -'
+
+Each producer writes to an independent runtime tensor (no inter-producer deps).
+The barrier task depends on all N producer outputs (via INPUT args) and writes
+to the result tensor (INOUT), adding 1.0.
+
+Tests: dependency convergence overhead — how efficiently the runtime tracks
+N predecessors for a single barrier task.
+
+Cases sweep fan-in width: 2, 4, 8, 15.
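The expected values follow directly from the topology; a host-side sketch (illustrative, independent of the runtime API):

```python
def simulate_fanin(fanin_width: int, seed: float = 0.0):
    # Producers are independent: each computes seed + 1.0 into its own tensor.
    producer_outs = [seed + 1.0 for _ in range(fanin_width)]
    # The barrier kernel ignores its N INPUT deps and only increments
    # the result tensor (0.0 -> 1.0), regardless of fan-in width.
    result = 0.0 + 1.0
    return producer_outs, result

outs, result = simulate_fanin(15)
assert outs == [1.0] * 15
assert result == 1.0
```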
+ +Args layout: [seed, result, fanin_width] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +ALL_CASES = { + "Fanin2": {"fanin_width": 2}, + "Fanin4": {"fanin_width": 4}, + "Fanin8": {"fanin_width": 8}, + "Fanin15": {"fanin_width": 15}, +} + +DEFAULT_CASE = "Fanin15" + + +def generate_inputs(params: dict) -> list: + fanin_width = params["fanin_width"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(1, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("fanin_width", ctypes.c_int64(fanin_width)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + result = torch.as_tensor(tensors["result"]) + # Barrier increments result from 0.0 to 1.0 + result[0] = 1.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py new file mode 100644 index 00000000..1c84467d --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/kernel_config.py @@ -0,0 +1,49 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +""" +Kernel configuration for fanin_N test (tensormap_and_ringbuffer). + +Fan-in topology: N producers -> 1 barrier. 
+Reuses chain_N's AIV kernels for both increment and noop operations. + +Kernels: + func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 + func_id=1: kernel_noop_aiv (AIV) - increments INOUT by 1.0 (barrier kernel) +""" + +from pathlib import Path + +_CHAIN_KERNELS = Path(__file__).parent / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(Path(__file__).parent / "orchestration" / "fanin_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, + { + "func_id": 1, + "name": "NOOP", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_noop_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp new file mode 100644 index 00000000..c140342c --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanin_n/kernels/orchestration/fanin_orch.cpp @@ -0,0 +1,86 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Fan-In Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a fan-in DAG: N independent producer tasks -> 1 barrier task.
+ *
+ *   seed -> [Producer_0]     -> prod_out_0     -.
+ *   seed -> [Producer_1]     -> prod_out_1     -+-> [Barrier] -> result
+ *   ...                                          |
+ *   seed -> [Producer_{N-1}] -> prod_out_{N-1} -'
+ *
+ * Each producer reads seed (INPUT) and writes to an independent runtime
+ * tensor (OUTPUT). The barrier task reads all N producer outputs (INPUT
+ * for dependency tracking) and writes to result (INOUT).
+ *
+ * The barrier kernel (FUNC_NOOP) only uses args[0] (result INOUT).
+ * Producer output refs at args[1..N] are unused by the kernel but create
+ * runtime dependencies that force the barrier to wait for all producers.
+ *
+ * Tests: dependency convergence overhead, tracking N predecessors efficiently.
+ *
+ * Arg layout: [seed, result, fanin_width]
+ */
+
+#include <cstdint>
+#include <cstddef>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+#define FUNC_NOOP 1
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int fanin_width = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[fanin_N] fanin_width=%d", fanin_width);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Build barrier args incrementally: result (INOUT) + all producer outputs (INPUT)
+  Arg barrier_params;
+  barrier_params.add_inout(result);
+
+  // Submit N independent producers, collecting their output refs
+  for (int i = 0; i < fanin_width; i++) {
+    Arg p;
+    p.add_input(seed);
+    p.add_output(ci);
+    TaskOutputTensors outs = pto2_rt_submit_aiv_task(FUNC_INC, p);
+    barrier_params.add_input(outs.get_ref(0));
+  }
+
+  // Barrier task: waits for all producers, then increments result
+  pto2_rt_submit_aiv_task(FUNC_NOOP, barrier_params);
+
+  LOG_ALWAYS("[fanin_N] Submitted %d producers + 1 barrier task", fanin_width);
+}
+
+}  // extern "C"
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py
new file mode 100644
index 00000000..a1eb11fe
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/golden.py
@@ -0,0 +1,70 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+"""
+Golden script for fanout_N test (wide fan-out topology).
+
+1 source task fans out to N independent consumer tasks:
+  seed(0.0) -> [Source] -> intermediate(1.0) -> [Consumer_0] -> result[0] = 2.0
+                                             -> [Consumer_1] -> result[1] = 2.0
+                                             -> ...
+ -> [Consumer_{N-1}] -> result[N-1] = 2.0 + +Tests parallel dispatch capability: can the runtime simultaneously issue +N independent tasks that all read from the same source output? + +Consumer output slots are cache-line aligned (64B = 16 float32 elements) +to avoid false sharing. + +Cases sweep fan-out width: 2, 4, 8, 15. + +Args layout: [seed, result, fanout_width] +""" + +import ctypes + +import torch + +__outputs__ = ["result"] + +RTOL = 1e-5 +ATOL = 1e-5 + +CACHE_LINE_ELEMS = 16 + +ALL_CASES = { + "Fanout2": {"fanout_width": 2}, + "Fanout4": {"fanout_width": 4}, + "Fanout8": {"fanout_width": 8}, + "Fanout15": {"fanout_width": 15}, +} + +DEFAULT_CASE = "Fanout15" + + +def generate_inputs(params: dict) -> list: + fanout_width = params["fanout_width"] + + seed = torch.zeros(1, dtype=torch.float32) + result = torch.zeros(fanout_width * CACHE_LINE_ELEMS, dtype=torch.float32) + + return [ + ("seed", seed), + ("result", result), + ("fanout_width", ctypes.c_int64(fanout_width)), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + fanout_width = params["fanout_width"] + result = torch.as_tensor(tensors["result"]) + + # Source output = seed(0.0) + 1.0 = 1.0 + # Each consumer output = source(1.0) + 1.0 = 2.0 + for i in range(fanout_width): + result[i * CACHE_LINE_ELEMS] = 2.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py new file mode 100644 index 00000000..4e30a29c --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/kernel_config.py @@ -0,0 +1,42 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. 
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +""" +Kernel configuration for fanout_N test (tensormap_and_ringbuffer). + +Fan-out topology: 1 source -> N consumers. +Reuses chain_N's AIV increment kernel. + +Kernels: + func_id=0: kernel_inc_aiv (AIV) - reads input, writes output = input + 1.0 +""" + +from pathlib import Path + +_CHAIN_KERNELS = Path(__file__).parent / ".." / ".." / "graph-chain_n" / "kernels" + +ORCHESTRATION = { + "source": str(Path(__file__).parent / "orchestration" / "fanout_orch.cpp"), + "function_name": "aicpu_orchestration_entry", +} + +KERNELS = [ + { + "func_id": 0, + "name": "INC", + "source": str(_CHAIN_KERNELS / "aiv" / "kernel_inc_aiv.cpp"), + "core_type": "aiv", + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp new file mode 100644 index 00000000..20824f62 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/graph-fanout_n/kernels/orchestration/fanout_orch.cpp @@ -0,0 +1,86 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Fan-Out Orchestration (tensormap_and_ringbuffer Runtime)
+ *
+ * Builds a fan-out DAG: 1 source task -> N independent consumer tasks.
+ *
+ *   seed -> [Source] -> intermediate -> [Consumer_0]     -> result[0]
+ *                                    -> [Consumer_1]     -> result[1]
+ *                                    -> ...
+ *                                    -> [Consumer_{N-1}] -> result[N-1]
+ *
+ * Source produces a runtime tensor via OUTPUT. All N consumers read
+ * that tensor (INPUT) and write to separate cache-line-aligned slots
+ * in the result tensor (INOUT).
+ *
+ * Tests: parallel dispatch capability, core utilization with N independent
+ * ready-to-run tasks.
+ *
+ * Arg layout: [seed, result, fanout_width]
+ */
+
+#include <cstdint>
+#include <cstdio>
+
+#include "pto_orchestration_api.h"  // NOLINT(build/include_subdir)
+
+#define FUNC_INC 0
+
+extern "C" {
+
+__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config(
+    const ChipStorageTaskArgs& orch_args) {
+  (void)orch_args;
+  return PTO2OrchestrationConfig{
+      .expected_arg_count = 3,
+  };
+}
+
+__attribute__((visibility("default"))) void aicpu_orchestration_entry(
+    const ChipStorageTaskArgs& orch_args, int orch_thread_num, int orch_thread_index) {
+  if (orch_thread_index != 0) {
+    return;
+  }
+
+  Tensor seed = from_tensor_arg(orch_args.tensor(0));
+  Tensor result = from_tensor_arg(orch_args.tensor(1));
+  int fanout_width = static_cast<int>(orch_args.scalar(0));
+
+  LOG_ALWAYS("[fanout_N] fanout_width=%d", fanout_width);
+
+  uint32_t scalar_shape[1] = {1};
+  TensorCreateInfo ci(scalar_shape, 1, DataType::FLOAT32);
+
+  // Source task: seed -> intermediate
+  Arg src_params;
+  src_params.add_input(seed);
+  src_params.add_output(ci);
+  TaskOutputTensors source_outs = pto2_rt_submit_aiv_task(FUNC_INC, src_params);
+
+  // Consumer tasks: each reads source output, writes to separate result slot
+  constexpr uint32_t CACHE_LINE_ELEMS = 16;
+  uint32_t slot_shape[1] = {1};
+
+  for (int i = 0; i < fanout_width; i++) {
+    uint32_t view_offsets[1] = {static_cast<uint32_t>(i * CACHE_LINE_ELEMS)};
+    Tensor result_slot = result.view(slot_shape, view_offsets);
+
+    Arg params;
+    params.add_input(source_outs.get_ref(0));
+    params.add_inout(result_slot);
+    pto2_rt_submit_aiv_task(FUNC_INC, params);
+  }
+
+  LOG_ALWAYS("[fanout_N] Submitted 1 source + %d consumer tasks", fanout_width);
+}
+
+}  // extern "C"
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh
index 64b283e8..43e86e5d 100755
--- a/tools/benchmark_rounds.sh
+++ b/tools/benchmark_rounds.sh
@@ -1,4 +1,12 @@
 #!/usr/bin/env bash
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
 # Benchmark wrapper: run examples on hardware,
 # then parse device-log timing lines to report per-round latency.
 #
@@ -25,12 +33,24 @@ declare -A TMR_EXAMPLE_CASES=(
     [benchmark_bgemm]=""
     [paged_attention_unroll]="Case1,Case2"
     [batch_paged_attention]=""
+    [dispatch-independent]=""
+    [dispatch-serial]=""
+    [graph-chain_n]=""
+    [graph-fanin_n]=""
+    [graph-fanout_n]=""
+    [graph-diamond]=""
 )
 TMR_EXAMPLE_ORDER=(
     alternating_matmul_add
     benchmark_bgemm
     paged_attention_unroll
     batch_paged_attention
+    dispatch-independent
+    dispatch-serial
+    graph-chain_n
+    graph-fanin_n
+    graph-fanout_n
+    graph-diamond
 )

 # --- aicpu_build_graph ---
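The fan-in golden script is not part of this hunk. For orientation, here is a minimal sketch of what one might look like, mirroring the structure of `graph-fanout_n/golden.py` above. The case names and widths are illustrative only, and the sketch assumes (per `kernel_config.py`) that the NOOP barrier kernel increments its INOUT arg by exactly 1.0 and runs once regardless of fan-in width:

```python
import ctypes

import torch

__outputs__ = ["result"]

RTOL = 1e-5
ATOL = 1e-5

# Hypothetical cases; the actual sweep may differ.
ALL_CASES = {
    "Fanin4": {"fanin_width": 4},
    "Fanin8": {"fanin_width": 8},
}

DEFAULT_CASE = "Fanin8"


def generate_inputs(params: dict) -> list:
    # seed feeds every producer; result is the single barrier INOUT scalar.
    seed = torch.zeros(1, dtype=torch.float32)
    result = torch.zeros(1, dtype=torch.float32)
    return [
        ("seed", seed),
        ("result", result),
        ("fanin_width", ctypes.c_int64(params["fanin_width"])),
    ]


def compute_golden(tensors: dict, params: dict) -> None:
    # The barrier NOOP kernel executes once regardless of fan-in width,
    # incrementing the INOUT result exactly once: 0.0 + 1.0 = 1.0.
    result = torch.as_tensor(tensors["result"])
    result[0] = 1.0
```

Note that unlike the fan-out golden, the expected value is independent of width: the producers only create dependencies, and correctness is validated by the barrier firing after all of them.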