Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
The Graph-fanin_N test case produces incorrect results when fanin_width > 16 (e.g., Fanin24, Fanin32), reporting:
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
The root cause is a dual overflow:
1. Silent dependency truncation (PTO2_MAX_INPUTS=16)
In pto_orchestrator.cpp, the fanin_states[] array used to collect fan-in dependencies is sized to PTO2_MAX_INPUTS (16). When a barrier task has more than 16 INPUT dependencies (producer tasks), the excess dependencies are silently discarded with no error or log message:
// pto_orchestrator.cpp:471-475
if (!already_added) {
    if (fanin_count < PTO2_MAX_INPUTS) {  // hard limit of 16
        fanin_states[fanin_count++] = prod_state;
    }
    // exceeds 16 → silently dropped, no error reported
}
This causes the barrier task to only wait for the first 16 producers instead of all N.
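For illustration, the truncation above could be turned into a reported failure. The following is a minimal sketch, not the runtime's actual code: `try_add_fanin` is a hypothetical helper, `PTO2TaskSlotState` is reduced to an empty stand-in struct, and the log format is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Stand-ins for illustration only; the real definitions live in the runtime headers.
constexpr std::size_t PTO2_MAX_INPUTS = 16;  // value quoted in this report
struct PTO2TaskSlotState {};

// Hypothetical helper: records prod_state as a fan-in dependency unless it is
// already present. Returns false and logs when the array is full, so the caller
// can fail the submission instead of silently dropping the dependency.
bool try_add_fanin(PTO2TaskSlotState* fanin_states[], std::size_t& fanin_count,
                   PTO2TaskSlotState* prod_state) {
    assert(prod_state != nullptr);
    for (std::size_t i = 0; i < fanin_count; ++i) {
        if (fanin_states[i] == prod_state) {
            return true;  // duplicate producer: already tracked
        }
    }
    if (fanin_count >= PTO2_MAX_INPUTS) {
        std::fprintf(stderr, "[ERROR] fan-in count exceeds PTO2_MAX_INPUTS=%zu\n",
                     PTO2_MAX_INPUTS);
        return false;
    }
    fanin_states[fanin_count++] = prod_state;
    return true;
}
```

With a guard of this shape, a Fanin24 submission would stop at the 17th distinct producer with an explicit error rather than running the barrier early.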
2. Tensor argument array out-of-bounds write (MAX_TENSOR_ARGS=16)
The barrier task's arguments consist of 1 INOUT (result) + N INPUTs (producer outputs). When N=24, the total is 25 tensor args, exceeding MAX_TENSOR_ARGS=16. When payload->init() writes into PTO2TaskPayload::tensors[MAX_TENSOR_ARGS], it causes an out-of-bounds write that corrupts the subsequent dispatch_args memory region, resulting in the barrier kernel receiving incorrect tensor pointers.
Relevant hardcoded constants (pto_types.h):
#define MAX_TENSOR_ARGS 16 // Barrier needs 1+N args; overflows when N>15
#define PTO2_MAX_INPUTS 16 // Dependency tracking limit
Fixed-size arrays in PTO2TaskPayload (pto_runtime2_types.h:378-380):
PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS]; // [16]
Tensor tensors[MAX_TENSOR_ARGS]; // [16]
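The argument-count arithmetic behind the second overflow can be written out as a small sketch. It assumes only what this report states (1 INOUT plus N INPUT tensor args, MAX_TENSOR_ARGS=16); the function names are hypothetical and are not part of the runtime.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MAX_TENSOR_ARGS = 16;  // value quoted in this report

// Tensor args of a barrier task: 1 INOUT (result) + N INPUTs (producer outputs).
constexpr std::size_t barrier_tensor_args(std::size_t fanin_width) {
    return 1 + fanin_width;
}

// True when all tensor args fit into tensors[MAX_TENSOR_ARGS].
constexpr bool fits_payload(std::size_t fanin_width) {
    return barrier_tensor_args(fanin_width) <= MAX_TENSOR_ARGS;
}

// Number of Tensor slots that would be written past the end of tensors[].
constexpr std::size_t overflow_slots(std::size_t fanin_width) {
    return barrier_tensor_args(fanin_width) > MAX_TENSOR_ARGS
               ? barrier_tensor_args(fanin_width) - MAX_TENSOR_ARGS
               : 0;
}

// Hypothetical debug-build guard a caller could place before payload->init().
inline void require_fits_payload(std::size_t fanin_width) {
    assert(fits_payload(fanin_width) && "tensor arg count exceeds MAX_TENSOR_ARGS");
}

static_assert(barrier_tensor_args(24) == 25, "Fanin24 needs 25 tensor args");
static_assert(overflow_slots(24) == 9, "Fanin24 writes 9 slots past the array");
```

Under these assumptions, Fanin24 writes 9 Tensor entries beyond the array and Fanin32 writes 17, which is consistent with the corruption of the adjacent dispatch_args region described above.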
Steps to Reproduce
# Fanin4 — passes
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin4
# Fanin24 — fails
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin24
# Fanin32 — fails
python examples/scripts/run_example.py \
-k tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/kernels \
-g tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/golden.py \
-p onboard --case Fanin32
Trigger condition: any single task whose fan-in dependency count (number of INPUT tensor args) exceeds 16.
| Case | Producers | Actually tracked deps | Result |
|---|---|---|---|
| Fanin4 | 4 | 4 | PASS |
| Fanin16 | 16 | 16 | PASS |
| Fanin24 | 24 | 16 (truncated) | FAIL |
| Fanin32 | 32 | 16 (truncated) | FAIL |
Expected Behavior
All fan-in cases (including Fanin24 and Fanin32) should pass correctly with output result=1.0 matching the golden value. Alternatively, when the runtime's capacity limit is exceeded, a clear error message should be reported instead of silently truncating dependencies.
Actual Behavior
[ERROR] TEST FAILED: Output 'result' does not match golden.
Mismatched elements: 1/1
rtol=1e-05, atol=1e-05
Silent dependency truncation combined with tensor arg array out-of-bounds write causes the barrier kernel to produce an incorrect result.
Git Commit ID
1d97ac5
Host Platform
Linux (aarch64)
Additional Context
Affected files:
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h:43 — PTO2_MAX_INPUTS definition
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp:471-475 — dependency truncation logic
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h:378-380 — payload fixed-size arrays
tests/st/a2a3/tensormap_and_ringbuffer/Graph-fanin_N/ — triggering test case
Possible fix directions:
- Raise the limits: increase PTO2_MAX_INPUTS, MAX_TENSOR_ARGS, etc. (increases per-task memory footprint)
- Multi-stage fan-in at orchestration layer: split N-way fan-in into a multi-level tree (e.g., 24 → 6 groups × 4-way → 1 × 6-way), ensuring each task stays within the 16-input limit
- Add bounds checking: emit an error in Arg::add_input() or during orchestrator submission when tensor arg count exceeds the limit, instead of silently truncating
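The multi-stage fan-in direction can be sketched as a planning step that computes how many intermediate tasks each tree level needs. This is an illustrative sketch only: `plan_fanin_levels` is a hypothetical helper, it uses one uniform width per level (the 24 → 6 × 4-way → 1 × 6-way shape mentioned above is another valid layout), and it does not create any tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical planner: returns the number of merge tasks at each level when an
// n-way fan-in is split into a tree whose nodes take at most max_width inputs.
// The last entry is always 1 (the final barrier); the result is empty for n <= 1.
std::vector<std::size_t> plan_fanin_levels(std::size_t n, std::size_t max_width) {
    assert(max_width >= 2 && "each tree node must merge at least 2 inputs");
    std::vector<std::size_t> levels;
    std::size_t remaining = n;  // values still waiting to be merged
    while (remaining > 1) {
        // Ceiling division: tasks needed to consume `remaining` inputs.
        const std::size_t tasks = (remaining + max_width - 1) / max_width;
        levels.push_back(tasks);
        remaining = tasks;  // each task produces one output for the next level
    }
    return levels;
}
```

For example, `plan_fanin_levels(24, 16)` yields two first-level tasks followed by one final task. Because each merge task also carries its own output tensor arg, a caller working under MAX_TENSOR_ARGS=16 may need to pass a smaller max_width such as 15; that choice is left to the caller in this sketch.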