
fix(handshake): add store-store barriers and sim bootstrap assertion#420

Open
yanghaoran29 wants to merge 1 commit into hw-native-sys:main from yanghaoran29:handshake-fix

Conversation

@yanghaoran29
Contributor

Five fixes to the AICPU-AICore handshake protocol, applied to both a2a3 and a5:

  1. AICore side (Phase 2a): add OUT_OF_ORDER_STORE_BARRIER() between writing physical_core_id and aicore_regs_ready=1 in all aicore_executor.cpp files. Without it, aarch64 weak memory ordering allows the signal store to become visible before physical_core_id, causing AICPU to index registers with an uninitialized value.

  2. AICPU side (Phase 1): add OUT_OF_ORDER_STORE_BARRIER() between writing task and aicpu_ready=1 in handshake_all_cores() for aicpu_build_graph and tensormap_and_ringbuffer (a2a3 and a5). Without it, AICore may cache a null payload pointer immediately after seeing aicpu_ready=1, causing all subsequent kernel dispatches to dereference a null pointer.

  3. AICPU emergency_shutdown: add OUT_OF_ORDER_STORE_BARRIER() before aicpu_regs_ready=1 in all three a2a3 runtimes (aicpu_build_graph, tensormap_and_ringbuffer, host_build_graph) to ensure any preceding stores are globally visible before the signal is observed by AICore.

  4. Sim bootstrap assertion: add #ifdef __CPU_SIM check that physical_core_id == worker_id in handshake_all_cores(). Sim assigns a 1:1 mapping; a mismatch indicates a broken platform bootstrap. __CPU_SIM is added to sim/aicpu/CMakeLists.txt for AICPU sim builds.

  5. memory_barrier.h: define OUT_OF_ORDER_STORE_BARRIER() (dmb ishst) for both a2a3 and a5 platforms so AICPU code can use the same macro as AICore.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the OUT_OF_ORDER_STORE_BARRIER() macro to ensure proper memory visibility during core handshakes and emergency shutdowns across multiple executor components. Additionally, it defines the __CPU_SIM macro for simulation builds and adds a validation check to enforce a 1:1 mapping between logical and physical core IDs in simulation mode. I have no feedback to provide.

@yanghaoran29 force-pushed the handshake-fix branch 2 times, most recently from 8f2cc0d to f5b95a7 on March 31, 2026 at 14:57
@yanghaoran29
Contributor Author

Handshake Struct Memory Ordering Fixes

Background

The Handshake struct is a shared-memory region used for initialization handshakes between AICPU and AICore. It resides in global memory (__gm__) and is accessed concurrently by two independent processors (AICPU thread / AICore hardware core).

Struct definition (using aicpu_build_graph as an example):

struct Handshake {
    volatile uint32_t aicpu_ready;         // AICPU -> AICore
    volatile uint64_t task;                // AICPU -> AICore (init only)
    volatile CoreType core_type;           // AICore -> AICPU
    volatile uint64_t perf_records_addr;   // AICPU -> AICore (profiling)
    volatile uint32_t physical_core_id;    // AICore -> AICPU
    volatile uint32_t aicpu_regs_ready;    // AICPU -> AICore
    volatile uint32_t aicore_regs_ready;   // AICore -> AICPU
    volatile uint32_t aicore_done;         // AICore -> AICPU
    ...
} __attribute__((aligned(64)));

The handshake protocol has four phases:

Phase     Action                                                       Writer   Reader
Phase 1   AICPU writes task -> writes aicpu_ready=1                    AICPU    AICore (polling)
Phase 2a  AICore writes physical_core_id -> writes aicore_regs_ready=1 AICore   AICPU (polling)
Phase 2b  AICPU writes aicpu_regs_ready=1                              AICPU    AICore (polling)
Phase 3   AICore writes core_type -> writes aicore_done                AICore   AICPU (polling)

Problem Manifestation

Excerpt from one runtime log:

[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 0: type=AIC, physical_id=0, reg_addr=0x3706a510
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 1: type=AIC, physical_id=1, reg_addr=0x3706aa10
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 2: type=AIC, physical_id=2, reg_addr=0x3706af10
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 3: type=AIC, physical_id=3, reg_addr=0x3706b410
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 4: type=AIC, physical_id=4, reg_addr=0x3706b910
[INFO] platform_aicpu_affinity_gate: [platform_aicpu_affinity.cpp:20] AICPU affinity gate (sim): thread idx=3 DROPPED (logical=3, launched=6)
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 5: type=AIC, physical_id=5, reg_addr=0x3706be10
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 6: type=AIC, physical_id=6, reg_addr=0x3706c310
[INFO] platform_aicpu_affinity_gate: [platform_aicpu_affinity.cpp:20] AICPU affinity gate (sim): thread idx=4 DROPPED (logical=3, launched=6)
[INFO] platform_aicpu_affinity_gate: [platform_aicpu_affinity.cpp:20] AICPU affinity gate (sim): thread idx=5 DROPPED (logical=3, launched=6)
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 7: type=AIC, physical_id=0, reg_addr=0x3706a510
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 8: type=AIC, physical_id=8, reg_addr=0x3706cd10
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 9: type=AIC, physical_id=9, reg_addr=0x3706d210
[INFO] handshake_all_cores: [aicpu_executor.cpp:373]   Core 10: type=AIC, physical_id=10, reg_addr=0x3706d710

physical_id for Core 7 should be 7, but the observed value is 0: AICPU read aicore_regs_ready=1 before Core 7's physical_core_id store became visible, consumed the stale initial value, and computed the same reg_addr as Core 0 (0x3706a510), causing downstream failures.

Fixes

Fix 1: Missing store-store barrier in Phase 1 (AICPU side, task -> aicpu_ready)

Affected files:

  • src/a2a3/runtime/aicpu_build_graph/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp

Before vs after:

// Before
// task must be written BEFORE aicpu_ready so AICore sees it after waking up
all_handshakes[i].task = reinterpret_cast<uint64_t>(&s_pto2_payload_per_core[i]);  // store A
all_handshakes[i].aicpu_ready = 1;   // store B (signal)

// After
all_handshakes[i].task = reinterpret_cast<uint64_t>(&s_pto2_payload_per_core[i]);  // store A
OUT_OF_ORDER_STORE_BARRIER();                                                        // <- added
all_handshakes[i].aicpu_ready = 1;   // store B (signal)

Reordering prevented:

store A (task = &s_pto2_payload_per_core[i])  -+
                                                | OUT_OF_ORDER_STORE_BARRIER() blocks this reorder
store B (aicpu_ready = 1, signal)             -+

That is, it prevents the CPU from making store B globally visible before store A.

Why code comments are not enough: The comment already says "task must be written BEFORE aicpu_ready", but a comment only constrains source-level intent and the compiler's emitted order. Under the weakly ordered AArch64 memory model, the CPU can still make store B observable to remote cores before store A. A dmb ishst (OUT_OF_ORDER_STORE_BARRIER()) is required to enforce, at the hardware level, that store A is visible to all observers before store B becomes visible.

Consequence: In Phase 1, once AICore sees aicpu_ready != 0, it immediately caches my_hank->task (payload = reinterpret_cast<...>(my_hank->task)). If store B is observed before store A, AICore may cache a null payload pointer and later dispatch all kernels through a null pointer, causing undefined behavior. This path is executed on every runtime initialization and does not require special timing to occur.

Fix 2: Missing store-store barrier in Phase 2a (AICore side across three runtimes)

Affected files:

  • src/a2a3/runtime/aicpu_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp

Before vs after:

// Before
my_hank->physical_core_id = get_physical_core_id();  // store A
my_hank->aicore_regs_ready = 1;                       // store B (signal)
dcci(&my_hank->aicore_regs_ready, SINGLE_CACHE_LINE, CACHELINE_OUT);

// After
my_hank->physical_core_id = get_physical_core_id();  // store A
OUT_OF_ORDER_STORE_BARRIER();                         // <- added
my_hank->aicore_regs_ready = 1;                       // store B (signal)
dcci(&my_hank->aicore_regs_ready, SINGLE_CACHE_LINE, CACHELINE_OUT);

Reordering prevented:

store A (physical_core_id)  -+
                              | OUT_OF_ORDER_STORE_BARRIER() blocks this reorder
store B (aicore_regs_ready)  -+

That is, it prevents the CPU from making store B globally visible before store A.

Why dcci is insufficient: dcci (seq_cst) is placed after store B. It enforces flush ordering after store B, but it cannot prevent other cores from observing store B before store A. To constrain the relative visibility of two stores, the fence must be placed between them.

Consequence: Under AArch64 weak ordering, after AICPU polls aicore_regs_ready != 0 it calls rmb() (dsb ld). But rmb() only orders AICPU's own loads; it cannot repair the order in which the two stores become visible from the AICore side. Without this barrier, AICPU may observe aicore_regs_ready=1 before physical_core_id is globally visible, then use a stale zero physical_core_id to index register addresses, leading to a wrong address mapping.

Fix 3: Missing store-store barrier in emergency_shutdown (AICPU side)

Affected files:

  • src/a2a3/runtime/aicpu_build_graph/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp

Before vs after:

// Before (emergency_shutdown)
hank->aicpu_regs_ready = 1;

// After
OUT_OF_ORDER_STORE_BARRIER();  // <- added (dmb ishst, symmetric with AICore side)
hank->aicpu_regs_ready = 1;

Reordering prevented:

any prior AICPU stores (all writes before the emergency_shutdown call)  -+
                                                                         | OUT_OF_ORDER_STORE_BARRIER() blocks this reorder
store (aicpu_regs_ready = 1, signal)                                    -+

Fix 4: Consistency check worker_id == physical_core_id in sim mode (AICPU side across three runtimes)

Affected files:

  • src/a2a3/runtime/aicpu_build_graph/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp

In handshake_all_cores() of all three runtimes, immediately after the range check physical_core_id >= max_physical_cores_count, a compile-time conditional check is added:

#ifdef __CPU_SIM
    if (physical_core_id != static_cast<uint32_t>(i)) {
        DEV_ERROR("Core %d reported physical_core_id=%u (expected %d): sim requires 1:1 mapping",
                  i, physical_core_id, i);
        handshake_failed = true;
        continue;
    }
#endif

Activation condition: __CPU_SIM is defined only for sim builds. It was already defined for AICore by src/a2a3/platform/sim/aicore/CMakeLists.txt. On AICPU side, target_compile_definitions(aicpu_kernel PRIVATE __CPU_SIM) is newly added to src/a2a3/platform/sim/aicpu/CMakeLists.txt, so the check is actually enabled in AICPU sim builds.

Failure path: handshake_failed = true -> after loop ends, emergency_shutdown(runtime) is triggered -> function returns -1 -> caller sets init_failed_ = true -> runtime terminates.
