
Commit 2cf31d2

Add: benchmark scene tests for dispatch and graph topologies
1 parent 1d97ac5 commit 2cf31d2

28 files changed

Lines changed: 1796 additions & 31 deletions

Lines changed: 137 additions & 0 deletions

# Benchmark Scene Tests

This directory contains benchmark scene tests for the `tensormap_and_ringbuffer` runtime on the A2/A3 platform. These tests are designed to systematically characterize runtime performance across two dimensions: **dispatch overhead** and **graph topology**.

All tests use trivial kernels (noop or increment-by-one) to isolate runtime scheduling overhead from compute. Results are collected via `tools/benchmark_rounds.sh`.

## Scene 1: Dispatch & Scheduling Overhead

These tests isolate and quantify the runtime's "scheduling tax" — framework overhead independent of kernel computation.

### dispatch-independent (Task Scaling)

**Intent**: Measure how dispatch overhead grows with task count when tasks are fully independent (no inter-task data dependencies).

Each task writes `1.0` to its own cache-line-aligned slot (stride = 16 float32 = 64 bytes) in a shared output tensor, avoiding false sharing across non-coherent AICore L1 caches.

| Parameter | Values                                  |
| --------- | --------------------------------------- |
| num_tasks | 100, 500, 1000, 2000                    |
| mode      | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Linear growth in total dispatch time vs. task count. Super-linear growth indicates a scheduling bottleneck (e.g., O(N^2) dependency tracking).
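The linearity check can be scripted from the sweep output with a log-log slope fit; a minimal sketch (the timing numbers below are placeholders, not measurements):

```python
import math

def dispatch_scaling_exponent(task_counts, total_times_us):
    """Least-squares slope of log(total time) vs log(num_tasks).
    A slope near 1.0 means linear scaling; clearly above 1.0
    (e.g. ~2.0 for O(N^2) dependency tracking) flags a bottleneck."""
    xs = [math.log(n) for n in task_counts]
    ys = [math.log(t) for t in total_times_us]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Placeholder timings for the num_tasks sweep (illustrative only):
exponent = dispatch_scaling_exponent([100, 500, 1000, 2000],
                                     [210.0, 1050.0, 2100.0, 4200.0])
assert abs(exponent - 1.0) < 1e-9  # perfectly proportional data -> slope 1.0
```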

### dispatch-serial (Dispatch Throughput)

**Intent**: Measure maximum scheduler throughput under serial task submission with accumulation dependencies.

All N tasks write to the same counter (AIC counter or AIV counter), forming a serial dependency chain. The final counter value equals N, validating correctness.

| Parameter | Values                                  |
| --------- | --------------------------------------- |
| num_tasks | 100, 500, 1000, 2000                    |
| mode      | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Per-task dispatch latency (total time / N). Compare with `dispatch-independent` to quantify the overhead of serial dependencies vs. independent dispatch.
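The comparison reduces to simple arithmetic on the two tests' totals; a sketch with placeholder numbers (not measurements):

```python
def per_task_latency_us(total_time_us: float, num_tasks: int) -> float:
    """Average dispatch cost per task for one run."""
    return total_time_us / num_tasks


def serial_premium_us(serial_total_us: float,
                      independent_total_us: float,
                      num_tasks: int) -> float:
    """Extra per-task cost attributable to the serial dependency chain:
    the gap between dispatch-serial and dispatch-independent at the same N."""
    return (serial_total_us - independent_total_us) / num_tasks


# Placeholder totals: 1000 tasks, 3000 us serial vs 2000 us independent.
assert per_task_latency_us(3000.0, 1000) == 3.0
assert serial_premium_us(3000.0, 2000.0, 1000) == 1.0
```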

## Scene 2: Graph Topology Patterns

These tests stress-test the scheduler with different DAG dependency structures. Each topology exercises a different aspect of dependency resolution.

### graph-chain_n (Linear Chain)

**Intent**: Measure serial dependency resolution overhead as chain length increases.

```text
seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result(N.0)
```

Each task is an AIV increment kernel (`out = in + 1.0`). The result equals the chain length, validating every link executed.

| Parameter | Values           |
| --------- | ---------------- |
| chain_len | 4, 8, 16, 32, 64 |

**What to look for**: End-to-end latency vs. chain length. Ideally linear; deviation reveals per-hop scheduling overhead.
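The validation rule above can be expressed as a tiny host-side golden (a sketch of the check, not this directory's actual golden script):

```python
def chain_golden(chain_len: int, seed: float = 0.0) -> float:
    """Apply the AIV increment kernel semantics (out = in + 1.0)
    chain_len times, mirroring the linear chain topology."""
    value = seed
    for _ in range(chain_len):
        value += 1.0
    return value


# The result equals the chain length for every swept value.
for n in (4, 8, 16, 32, 64):
    assert chain_golden(n) == float(n)
```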

### graph-fanout_n (Wide Fan-Out)

**Intent**: Test parallel dispatch capability — can the runtime simultaneously issue N independent tasks from a single source?

```text
seed -> [Source] -> intermediate -> [Consumer_0]     -> result[0]
                                 -> [Consumer_1]     -> result[1]
                                 -> ...
                                 -> [Consumer_{N-1}] -> result[N-1]
```

Consumer output slots are cache-line-aligned to avoid false sharing. Each consumer reads the same source output and writes `source + 1.0`.

| Parameter    | Values      |
| ------------ | ----------- |
| fanout_width | 2, 4, 8, 15 |

**What to look for**: Whether fan-out width impacts total latency. An ideal runtime dispatches all consumers in parallel, so latency should plateau rather than grow linearly.
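A host-side reference for the expected output might look like this (a sketch; a plain list stands in for the torch tensor the goldens in this directory use):

```python
CACHE_LINE_ELEMS = 16  # 64-byte cache line / 4-byte float32

def fanout_golden(fanout_width: int, source_out: float = 0.0) -> list:
    """Each consumer writes source + 1.0 into its own cache-line-aligned
    slot; the padding elements between slots stay zero."""
    result = [0.0] * (fanout_width * CACHE_LINE_ELEMS)
    for i in range(fanout_width):
        result[i * CACHE_LINE_ELEMS] = source_out + 1.0
    return result


r = fanout_golden(4)
assert r[0] == 1.0 and r[3 * CACHE_LINE_ELEMS] == 1.0
assert r[1] == 0.0  # padding untouched
```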

### graph-fanin_n (Convergence Barrier)

**Intent**: Measure dependency convergence overhead — how efficiently the runtime tracks N predecessors for a single barrier task.

```text
seed -> [Producer_0]     -> prod_out_0 -.
seed -> [Producer_1]     -> prod_out_1 -+-> [Barrier] -> result(1.0)
  ...                                   |
seed -> [Producer_{N-1}] -> ...        -'
```

Each producer writes independently; the barrier depends on all N producer outputs.

| Parameter   | Values      |
| ----------- | ----------- |
| fanin_width | 2, 4, 8, 15 |

**What to look for**: Barrier wait overhead vs. fan-in width. Measures the cost of tracking and synchronizing N predecessor completions.
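What this test prices is essentially a predecessor countdown; a minimal model of that bookkeeping (illustrative only, not the `tensormap_and_ringbuffer` runtime's actual mechanism):

```python
class BarrierTask:
    """Becomes ready only once all predecessors have completed."""

    def __init__(self, num_predecessors: int):
        self.remaining = num_predecessors
        self.ready = False

    def on_predecessor_done(self) -> None:
        # One decrement per producer completion; this per-predecessor
        # bookkeeping is what grows with fanin_width.
        self.remaining -= 1
        if self.remaining == 0:
            self.ready = True


barrier = BarrierTask(8)
for _ in range(7):
    barrier.on_predecessor_done()
assert not barrier.ready  # one producer still outstanding
barrier.on_predecessor_done()
assert barrier.ready
```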

### graph-diamond (Fork-Join)

**Intent**: Test the most common real-world DAG pattern — fan-out followed by fan-in (fork-join).

```text
seed -> [Source A] -> a_out -> [Branch B_0]     -> b_out_0 -.
                            -> [Branch B_1]     -> b_out_1 -+-> [Merge D] -> result(1.0)
                            -> ...                          |
                            -> [Branch B_{W-1}] -> ...     -'
```

Three branch modes exercise different core-type scheduling paths:

- **mode=0**: All AIV branches
- **mode=1**: All AIC branches
- **mode=2**: Mixed AIC+AIV (even=AIC, odd=AIV)

| Parameter | Values                            |
| --------- | --------------------------------- |
| width     | 2, 4, 8, 15                       |
| mode      | AIV-only, AIC-only, Mixed AIC+AIV |

**What to look for**: Combined fan-out + fan-in overhead. Compare with isolated fanout/fanin tests to check for compounding effects. Mixed mode reveals cross-core-type scheduling costs.
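The mode-to-core-type assignment described above, written out explicitly (a sketch of the mapping; the function name is illustrative, not an API from the test code):

```python
def branch_core_type(mode: int, branch_idx: int) -> str:
    """Which core type executes branch branch_idx under each diamond mode."""
    if mode == 0:
        return "AIV"  # mode=0: all AIV branches
    if mode == 1:
        return "AIC"  # mode=1: all AIC branches
    # mode=2: mixed, even branches on AIC, odd branches on AIV
    return "AIC" if branch_idx % 2 == 0 else "AIV"


assert [branch_core_type(2, i) for i in range(4)] == ["AIC", "AIV", "AIC", "AIV"]
```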
118+
119+
## Also Updated: benchmark_bgemm
120+
121+
The existing `benchmark_bgemm` test was extended with structured parameter sweeps:
122+
123+
- **Tile size sweep** (16, 32, 64, 128) at fixed batch and grid_k
124+
- **Batch/group sweep** (1, 4, 16, 64 groups) at fixed tile size
125+
- **Grid-K sweep** (1, 2, 4) at fixed tile and batch
126+
127+
These complement the original 5 cases with systematic single-variable sweeps for identifying performance cliffs.
128+
129+
## Running
130+
131+
```bash
132+
# Run all benchmark scene tests (100 rounds each, default)
133+
./tools/benchmark_rounds.sh
134+
135+
# Customize
136+
./tools/benchmark_rounds.sh -n 50 -d 0 -p a2a3 -r tensormap_and_ringbuffer -v
137+
```

tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py

Lines changed: 67 additions & 31 deletions
```diff
@@ -1,3 +1,11 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
 """
 Golden test specification for BGEMM (tensormap_and_ringbuffer Runtime).
@@ -24,49 +32,81 @@
 SUPPORTED_INCORE_DATA_SIZES = {16, 32, 64, 128}

 ALL_CASES = {
-    "Case0": {
+    "Case1": {
         "matmul_add_task_num": 500,
         "incore_task_granularity": {
             "incore_data_size": 128,
             "incore_loop": 4,
         },
         "grid_k": 2,
     },
-    "Case1": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    # --- Tile Size Sweep (fixed: num_groups=16, grid_k=2, incore_loop=4) ---
+    "Tile16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 16, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case2": {
-        "matmul_add_task_num": 256,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+    "Tile32": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 32, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case3": {
-        "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 16,
-        },
+    "Tile64": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 64, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Tile128": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 2,
     },
-    "Case4": {
+    # --- Batch/Group Sweep (fixed: tile=128, grid_k=2, incore_loop=4) ---
+    "Batch1": {
+        "matmul_add_task_num": 2,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch4": {
+        "matmul_add_task_num": 8,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    "Batch64": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 2,
+    },
+    # --- K Dimension Sweep (fixed: tile=128, num_groups=16, incore_loop=4) ---
+    "K1": {
+        "matmul_add_task_num": 16,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 1,
+    },
+    "K4": {
         "matmul_add_task_num": 64,
-        "incore_task_granularity": {
-            "incore_data_size": 128,
-            "incore_loop": 4,
-        },
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
         "grid_k": 4,
     },
+    "K8": {
+        "matmul_add_task_num": 128,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
+        "grid_k": 8,
+    },
+    # --- In-Core Loop Sweep (fixed: tile=128, num_groups=16, grid_k=2) ---
+    "Loop1": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 1},
+        "grid_k": 2,
+    },
+    "Loop16": {
+        "matmul_add_task_num": 32,
+        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 16},
+        "grid_k": 2,
+    },
 }

-DEFAULT_CASE = "Case0"
+DEFAULT_CASE = "Case1"


 def generate_inputs(params: dict) -> list:
@@ -80,18 +120,14 @@ def generate_inputs(params: dict) -> list:
     # --- constraint checks ---
     if tile_size not in SUPPORTED_INCORE_DATA_SIZES:
         raise ValueError(
-            f"incore_data_size={tile_size} is not supported. "
-            f"Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
+            f"incore_data_size={tile_size} is not supported. Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
         )
     if incore_loop <= 0:
         raise ValueError(f"incore_loop must be positive, got {incore_loop}")
     if grid_k <= 0:
         raise ValueError(f"grid_k must be positive, got {grid_k}")
     if matmul_add_task_num % grid_k != 0:
-        raise ValueError(
-            f"matmul_add_task_num ({matmul_add_task_num}) must be "
-            f"divisible by grid_k ({grid_k})."
-        )
+        raise ValueError(f"matmul_add_task_num ({matmul_add_task_num}) must be divisible by grid_k ({grid_k}).")

     num_groups = matmul_add_task_num // grid_k

```
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# Copyright (c) PyPTO Contributors.
2+
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
3+
# CANN Open Software License Agreement Version 2.0 (the "License").
4+
# Please refer to the License for details. You may not use this file except in compliance with the License.
5+
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
6+
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
7+
# See LICENSE in the root of the software repository for the full text of the License.
8+
# -----------------------------------------------------------------------------------------------------------
9+
"""
10+
Golden script for task_scaling test.
11+
12+
Measures dispatch overhead vs task count. Submits N independent noop tasks,
13+
each writing 1.0 to a separate cache-line-aligned slot. Output tensor is
14+
padded so each task's slot sits on its own cache line (stride = 16 float32
15+
elements = 64 bytes), avoiding false sharing across non-coherent AICore L1
16+
caches.
17+
18+
Cases parameterize task count (100→2000) and core type:
19+
AIC-only sweep: 100, 500, 1000, 2000 tasks
20+
AIV-only sweep: 100, 500, 1000, 2000 tasks
21+
AIC+AIV sweep: 100, 500, 1000, 2000 tasks
22+
23+
Args layout: [output, num_tasks, mode]
24+
"""
25+
26+
import ctypes
27+
28+
import torch
29+
30+
__outputs__ = ["output"]
31+
32+
RTOL = 1e-5
33+
ATOL = 1e-5
34+
35+
# Each task writes to a separate cache line to avoid false sharing
36+
# across non-coherent AICore L1 caches (64B = 16 float32 elements).
37+
CACHE_LINE_ELEMS = 16
38+
39+
ALL_CASES = {
40+
# AIC-only (mode=0)
41+
"Case1": {"num_tasks": 100, "mode": 0},
42+
"Case2": {"num_tasks": 500, "mode": 0},
43+
"Case3": {"num_tasks": 1000, "mode": 0},
44+
"Case4": {"num_tasks": 2000, "mode": 0},
45+
# AIV-only (mode=1)
46+
"Case5": {"num_tasks": 100, "mode": 1},
47+
"Case6": {"num_tasks": 500, "mode": 1},
48+
"Case7": {"num_tasks": 1000, "mode": 1},
49+
"Case8": {"num_tasks": 2000, "mode": 1},
50+
# AIC+AIV alternating (mode=2)
51+
"Case9": {"num_tasks": 100, "mode": 2},
52+
"Case10": {"num_tasks": 500, "mode": 2},
53+
"Case11": {"num_tasks": 1000, "mode": 2},
54+
"Case12": {"num_tasks": 2000, "mode": 2},
55+
}
56+
57+
DEFAULT_CASE = "Case2"
58+
59+
60+
def generate_inputs(params: dict) -> list:
61+
num_tasks = params["num_tasks"]
62+
mode = params["mode"]
63+
64+
output = torch.zeros(num_tasks * CACHE_LINE_ELEMS, dtype=torch.float32)
65+
66+
return [
67+
("output", output),
68+
("num_tasks", ctypes.c_int64(num_tasks)),
69+
("mode", ctypes.c_int64(mode)),
70+
]
71+
72+
73+
def compute_golden(tensors: dict, params: dict) -> None:
74+
num_tasks = params["num_tasks"]
75+
output = torch.as_tensor(tensors["output"])
76+
77+
# Each independent task writes 1.0 to its cache-line-aligned slot
78+
for i in range(num_tasks):
79+
output[i * CACHE_LINE_ELEMS] = 1.0
Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
/*
2+
* Copyright (c) PyPTO Contributors.
3+
* This program is free software, you can redistribute it and/or modify it under the terms and conditions of
4+
* CANN Open Software License Agreement Version 2.0 (the "License").
5+
* Please refer to the License for details. You may not use this file except in compliance with the License.
6+
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
7+
* INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
8+
* See LICENSE in the root of the software repository for the full text of the License.
9+
* -----------------------------------------------------------------------------------------------------------
10+
*/
11+
/**
12+
* No-op AIC Kernel for Task Scaling
13+
*
14+
* Minimal cube kernel that performs a trivial write. Each task writes 1.0
15+
* at its designated position in the output tensor, proving execution order.
16+
*
17+
* Args:
18+
* args[0] = output tensor (INOUT) - single float32 element per task
19+
*/
20+
21+
#include <cstdint>
22+
#include <pto/pto-inst.hpp>
23+
24+
#include "tensor.h"
25+
26+
using namespace pto; // NOLINT(build/namespaces)
27+
28+
#ifndef __gm__
29+
#define __gm__
30+
#endif
31+
32+
#ifndef __aicore__
33+
#define __aicore__ [aicore] // NOLINT(whitespace/braces)
34+
#endif
35+
36+
extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
37+
__gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
38+
__gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
39+
*out = 1.0f;
40+
}
