137 changes: 137 additions & 0 deletions tests/st/a2a3/tensormap_and_ringbuffer/BENCHMARK_SCENES.md
@@ -0,0 +1,137 @@
# Benchmark Scene Tests

This directory contains benchmark scene tests for the `tensormap_and_ringbuffer` runtime on the A2/A3 platform. These tests are designed to systematically characterize runtime performance across two dimensions: **dispatch overhead** and **graph topology**.

All tests use trivial kernels (noop or increment-by-one) to isolate runtime scheduling overhead from compute. Results are collected via `tools/benchmark_rounds.sh`.

## Scene 1: Dispatch & Scheduling Overhead

These tests isolate and quantify the runtime's "scheduling tax" — framework overhead independent of kernel computation.

### dispatch-independent (Task Scaling)

**Intent**: Measure how dispatch overhead grows with task count when tasks are fully independent (no inter-task data dependencies).

Each task writes `1.0` to its own cache-line-aligned slot (stride = 16 float32 = 64 bytes) in a shared output tensor, avoiding false sharing across non-coherent AICore L1 caches.

| Parameter | Values |
| --------- | ------ |
| num_tasks | 100, 500, 1000, 2000 |
| mode | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Linear growth in total dispatch time vs. task count. Super-linear growth indicates a scheduling bottleneck (e.g., O(N^2) dependency tracking).
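The super-linear check can be made concrete by fitting the measured totals in log-log space. The helper below is a sketch using hypothetical timings, not part of the test suite:

```python
import math

def scaling_exponent(task_counts, total_times):
    """Fit total_time ~ a * N^b in log-log space and return the exponent b.

    b near 1.0 means dispatch cost grows linearly with task count;
    b well above 1.0 points at a super-linear scheduling bottleneck
    (e.g. O(N^2) dependency tracking).
    """
    xs = [math.log(n) for n in task_counts]
    ys = [math.log(t) for t in total_times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical totals (microseconds) for the num_tasks sweep:
counts = [100, 500, 1000, 2000]
times = [110.0, 540.0, 1090.0, 2150.0]
print(round(scaling_exponent(counts, times), 2))  # close to 1.0 -> linear
```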

### dispatch-serial (Dispatch Throughput)

**Intent**: Measure maximum scheduler throughput under serial task submission with accumulation dependencies.

All N tasks write to the same counter (AIC counter or AIV counter), forming a serial dependency chain. The final counter value equals N, validating correctness.

| Parameter | Values |
| --------- | ------ |
| num_tasks | 100, 500, 1000, 2000 |
| mode | AIC-only, AIV-only, AIC+AIV alternating |

**What to look for**: Per-task dispatch latency (total time / N). Compare with `dispatch-independent` to quantify the overhead of serial dependencies vs. independent dispatch.
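With made-up timings, the comparison against `dispatch-independent` reduces to per-task latencies; the numbers below are illustrative only:

```python
def per_task_latency_us(total_time_us: float, num_tasks: int) -> float:
    """Average dispatch latency per task (total time / N)."""
    return total_time_us / num_tasks

# Hypothetical measurements for N = 1000 tasks:
serial_lat = per_task_latency_us(3200.0, 1000)   # serial accumulation chain
indep_lat = per_task_latency_us(1100.0, 1000)    # fully independent tasks
serial_dependency_cost = serial_lat - indep_lat  # extra cost per serial hop
print(serial_lat, indep_lat, round(serial_dependency_cost, 2))
```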

## Scene 2: Graph Topology Patterns

These tests stress-test the scheduler with different DAG dependency structures. Each topology exercises a different aspect of dependency resolution.

### graph-chain_n (Linear Chain)

**Intent**: Measure serial dependency resolution overhead as chain length increases.

```text
seed(0.0) -> Task_0 -> Task_1 -> ... -> Task_{N-1} -> result(N.0)
```

Each task is an AIV increment kernel (`out = in + 1.0`). The result equals the chain length, validating every link executed.

| Parameter | Values |
| --------- | ------ |
| chain_len | 4, 8, 16, 32, 64 |

**What to look for**: End-to-end latency vs. chain length. Ideally linear; deviation reveals per-hop scheduling overhead.
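The expected result can be written as a trivial reference model mirroring the increment kernel described above:

```python
def chain_golden(chain_len: int, seed: float = 0.0) -> float:
    """Reference model: each link applies out = in + 1.0."""
    x = seed
    for _ in range(chain_len):
        x += 1.0
    return x

# The result equals the chain length, validating every link executed.
print(chain_golden(64))  # -> 64.0
```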

### graph-fanout_n (Wide Fan-Out)

**Intent**: Test parallel dispatch capability — can the runtime simultaneously issue N independent tasks from a single source?

```text
seed -> [Source] -> intermediate -> [Consumer_0] -> result[0]
-> [Consumer_1] -> result[1]
-> ...
-> [Consumer_{N-1}] -> result[N-1]
```

Consumer output slots are cache-line-aligned to avoid false sharing. Each consumer reads the same source output and writes `source + 1.0`.

| Parameter | Values |
| --------- | ------ |
| fanout_width | 2, 4, 8, 15 |

**What to look for**: Whether fan-out width impacts total latency. Ideal runtime dispatches all consumers in parallel, so latency should plateau rather than grow linearly.
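A reference model of the consumer outputs might look like the sketch below; the source output value is an assumption, since this document does not pin it down:

```python
CACHE_LINE_ELEMS = 16  # 64 B cache line / 4 B per float32

def fanout_golden(fanout_width: int, source_val: float) -> list:
    """Each consumer writes source + 1.0 to its own cache-line-aligned slot."""
    out = [0.0] * (fanout_width * CACHE_LINE_ELEMS)
    for i in range(fanout_width):
        out[i * CACHE_LINE_ELEMS] = source_val + 1.0
    return out

golden = fanout_golden(4, source_val=1.0)
print(golden[0], golden[16])  # slots hold source + 1.0; padding stays 0.0
```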

### graph-fanin_n (Convergence Barrier)

**Intent**: Measure dependency convergence overhead — how efficiently the runtime tracks N predecessors for a single barrier task.

```text
seed -> [Producer_0] -> prod_out_0 -.
seed -> [Producer_1] -> prod_out_1 -+-> [Barrier] -> result(1.0)
... |
seed -> [Producer_{N-1}] -> ... -'
```

Each producer writes independently; the barrier depends on all N producer outputs.

| Parameter | Values |
| --------- | ------ |
| fanin_width | 2, 4, 8, 15 |

**What to look for**: Barrier wait overhead vs. fan-in width. Measures the cost of tracking and synchronizing N predecessor completions.
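What "tracking N predecessor completions" amounts to can be sketched as a countdown; this is a hypothetical model, not the runtime's actual bookkeeping:

```python
class BarrierTask:
    """Countdown over predecessor completions; ready once all N are done."""

    def __init__(self, num_preds: int):
        self.remaining = num_preds
        self.ready = False

    def on_predecessor_done(self) -> None:
        self.remaining -= 1
        if self.remaining == 0:
            self.ready = True

barrier = BarrierTask(8)
for _ in range(8):
    barrier.on_predecessor_done()
print(barrier.ready)  # -> True
```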

### graph-diamond (Fork-Join)

**Intent**: Test the most common real-world DAG pattern — fan-out followed by fan-in (fork-join).

```text
seed -> [Source A] -> a_out -> [Branch B_0] -> b_out_0 -.
-> [Branch B_1] -> b_out_1 -+-> [Merge D] -> result(1.0)
-> ... |
-> [Branch B_{W-1}] -> ... -'
```

Three branch modes exercise different core-type scheduling paths:

- **mode=0**: All AIV branches
- **mode=1**: All AIC branches
- **mode=2**: Mixed AIC+AIV (even=AIC, odd=AIV)

| Parameter | Values |
| --------- | ------ |
| width | 2, 4, 8, 15 |
| mode | AIV-only, AIC-only, Mixed AIC+AIV |

**What to look for**: Combined fan-out + fan-in overhead. Compare with isolated fanout/fanin tests to check for compounding effects. Mixed mode reveals cross-core-type scheduling costs.
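The three branch modes map to core types as follows (a direct restatement of the mode list above):

```python
def branch_core_types(width: int, mode: int) -> list:
    """Core type per branch: 0 = all AIV, 1 = all AIC, 2 = even AIC / odd AIV."""
    if mode == 0:
        return ["AIV"] * width
    if mode == 1:
        return ["AIC"] * width
    return ["AIC" if i % 2 == 0 else "AIV" for i in range(width)]

print(branch_core_types(4, 2))  # -> ['AIC', 'AIV', 'AIC', 'AIV']
```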

## Also Updated: benchmark_bgemm

The existing `benchmark_bgemm` test was extended with structured parameter sweeps:

- **Tile size sweep** (16, 32, 64, 128) at fixed batch and grid_k
- **Batch/group sweep** (1, 4, 16, 64 groups) at fixed tile size
- **Grid-K sweep** (1, 2, 4, 8) at fixed tile and batch
- **In-core loop sweep** (1, 4, 16) at fixed tile, batch, and grid_k

These complement the original 5 cases with systematic single-variable sweeps for identifying performance cliffs.
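The sweep cases are tied together by the constraint checked in `golden.py`: `matmul_add_task_num` must be divisible by `grid_k`, and the group count is their quotient. For example:

```python
def num_groups(matmul_add_task_num: int, grid_k: int) -> int:
    """Group count as derived in golden.py."""
    if matmul_add_task_num % grid_k != 0:
        raise ValueError("matmul_add_task_num must be divisible by grid_k")
    return matmul_add_task_num // grid_k

# Batch64: 128 tasks with grid_k=2 -> 64 groups
print(num_groups(128, 2))  # -> 64
# K8: 128 tasks with grid_k=8 -> the fixed 16 groups
print(num_groups(128, 8))  # -> 16
```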

## Running

```bash
# Run all benchmark scene tests (100 rounds each, default)
./tools/benchmark_rounds.sh

# Customize
./tools/benchmark_rounds.sh -n 50 -d 0 -p a2a3 -r tensormap_and_ringbuffer -v
```
98 changes: 67 additions & 31 deletions tests/st/a2a3/tensormap_and_ringbuffer/benchmark_bgemm/golden.py
@@ -1,3 +1,11 @@
# Copyright (c) PyPTO Contributors.
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
# CANN Open Software License Agreement Version 2.0 (the "License").
# Please refer to the License for details. You may not use this file except in compliance with the License.
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
"""
Golden test specification for BGEMM (tensormap_and_ringbuffer Runtime).

@@ -24,49 +32,81 @@
SUPPORTED_INCORE_DATA_SIZES = {16, 32, 64, 128}

ALL_CASES = {
    "Case1": {
        "matmul_add_task_num": 500,
        "incore_task_granularity": {
            "incore_data_size": 128,
            "incore_loop": 4,
        },
        "grid_k": 2,
    },
    # --- Tile Size Sweep (fixed: num_groups=16, grid_k=2, incore_loop=4) ---
    "Tile16": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 16, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile32": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 32, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile64": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 64, "incore_loop": 4},
        "grid_k": 2,
    },
    "Tile128": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    # --- Batch/Group Sweep (fixed: tile=128, grid_k=2, incore_loop=4) ---
    "Batch1": {
        "matmul_add_task_num": 2,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    "Batch4": {
        "matmul_add_task_num": 8,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    "Batch64": {
        "matmul_add_task_num": 128,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 2,
    },
    # --- K Dimension Sweep (fixed: tile=128, num_groups=16, incore_loop=4) ---
    "K1": {
        "matmul_add_task_num": 16,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 1,
    },
    "K4": {
        "matmul_add_task_num": 64,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 4,
    },
    "K8": {
        "matmul_add_task_num": 128,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 4},
        "grid_k": 8,
    },
    # --- In-Core Loop Sweep (fixed: tile=128, num_groups=16, grid_k=2) ---
    "Loop1": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 1},
        "grid_k": 2,
    },
    "Loop16": {
        "matmul_add_task_num": 32,
        "incore_task_granularity": {"incore_data_size": 128, "incore_loop": 16},
        "grid_k": 2,
    },
}

DEFAULT_CASE = "Case1"


def generate_inputs(params: dict) -> list:
@@ -80,18 +120,14 @@ def generate_inputs(params: dict) -> list:
# --- constraint checks ---
if tile_size not in SUPPORTED_INCORE_DATA_SIZES:
raise ValueError(
f"incore_data_size={tile_size} is not supported. Must be one of {sorted(SUPPORTED_INCORE_DATA_SIZES)}."
)
if incore_loop <= 0:
raise ValueError(f"incore_loop must be positive, got {incore_loop}")
if grid_k <= 0:
raise ValueError(f"grid_k must be positive, got {grid_k}")
if matmul_add_task_num % grid_k != 0:
raise ValueError(f"matmul_add_task_num ({matmul_add_task_num}) must be divisible by grid_k ({grid_k}).")

num_groups = matmul_add_task_num // grid_k

@@ -0,0 +1,79 @@
# Copyright (c) PyPTO Contributors.
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
# CANN Open Software License Agreement Version 2.0 (the "License").
# Please refer to the License for details. You may not use this file except in compliance with the License.
# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
# See LICENSE in the root of the software repository for the full text of the License.
# -----------------------------------------------------------------------------------------------------------
"""
Golden script for task_scaling test.

Measures dispatch overhead vs task count. Submits N independent noop tasks,
each writing 1.0 to a separate cache-line-aligned slot. Output tensor is
padded so each task's slot sits on its own cache line (stride = 16 float32
elements = 64 bytes), avoiding false sharing across non-coherent AICore L1
caches.

Cases parameterize task count (100→2000) and core type:
AIC-only sweep: 100, 500, 1000, 2000 tasks
AIV-only sweep: 100, 500, 1000, 2000 tasks
AIC+AIV sweep: 100, 500, 1000, 2000 tasks

Args layout: [output, num_tasks, mode]
"""

import ctypes

import torch

__outputs__ = ["output"]

RTOL = 1e-5
ATOL = 1e-5

# Each task writes to a separate cache line to avoid false sharing
# across non-coherent AICore L1 caches (64B = 16 float32 elements).
CACHE_LINE_ELEMS = 16

ALL_CASES = {
# AIC-only (mode=0)
"Case1": {"num_tasks": 100, "mode": 0},
"Case2": {"num_tasks": 500, "mode": 0},
"Case3": {"num_tasks": 1000, "mode": 0},
"Case4": {"num_tasks": 2000, "mode": 0},
# AIV-only (mode=1)
"Case5": {"num_tasks": 100, "mode": 1},
"Case6": {"num_tasks": 500, "mode": 1},
"Case7": {"num_tasks": 1000, "mode": 1},
"Case8": {"num_tasks": 2000, "mode": 1},
# AIC+AIV alternating (mode=2)
"Case9": {"num_tasks": 100, "mode": 2},
"Case10": {"num_tasks": 500, "mode": 2},
"Case11": {"num_tasks": 1000, "mode": 2},
"Case12": {"num_tasks": 2000, "mode": 2},
}

DEFAULT_CASE = "Case2"


def generate_inputs(params: dict) -> list:
num_tasks = params["num_tasks"]
mode = params["mode"]

output = torch.zeros(num_tasks * CACHE_LINE_ELEMS, dtype=torch.float32)

return [
("output", output),
("num_tasks", ctypes.c_int64(num_tasks)),
("mode", ctypes.c_int64(mode)),
]


def compute_golden(tensors: dict, params: dict) -> None:
num_tasks = params["num_tasks"]
output = torch.as_tensor(tensors["output"])

# Each independent task writes 1.0 to its cache-line-aligned slot
for i in range(num_tasks):
output[i * CACHE_LINE_ELEMS] = 1.0
@@ -0,0 +1,40 @@
/*
* Copyright (c) PyPTO Contributors.
* This program is free software, you can redistribute it and/or modify it under the terms and conditions of
* CANN Open Software License Agreement Version 2.0 (the "License").
* Please refer to the License for details. You may not use this file except in compliance with the License.
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
* INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
* See LICENSE in the root of the software repository for the full text of the License.
* -----------------------------------------------------------------------------------------------------------
*/
/**
* No-op AIC Kernel for Task Scaling
*
 * Minimal cube kernel that performs a trivial write. Each task writes 1.0
 * at its designated position in the output tensor, confirming it executed.
*
* Args:
* args[0] = output tensor (INOUT) - single float32 element per task
*/

#include <cstdint>
#include <pto/pto-inst.hpp>

#include "tensor.h"

using namespace pto; // NOLINT(build/namespaces)

#ifndef __gm__
#define __gm__
#endif

#ifndef __aicore__
#define __aicore__ [aicore] // NOLINT(whitespace/braces)
#endif

extern "C" __aicore__ void kernel_entry(__gm__ int64_t* args) {
__gm__ Tensor* out_tensor = reinterpret_cast<__gm__ Tensor*>(args[0]);
__gm__ float* out = reinterpret_cast<__gm__ float*>(out_tensor->buffer.addr) + out_tensor->start_offset;
*out = 1.0f;
}