Description
I have identified a logic issue in the Operand Collector (OC) arbitration when running in Sub-Core mode.
In the sub-core configuration, Collector Units (CUs) are hard-partitioned among sub-cores (schedulers). However, the OC input port arbitration logic (allocate_cu) selects the candidate instruction based on the global oldest timestamp across all sub-cores sharing that port.
If the "oldest" instruction belongs to a sub-core whose assigned CUs are full, the allocation fails, and the arbiter stops processing that port for the current cycle. This prevents younger instructions from other sub-cores (whose assigned CUs might be free) from being issued, causing unnecessary Head-of-Line (HoL) blocking.
Code Analysis
1. Global "Oldest" Selection
In opndcoll_rfu_t::allocate_cu, the code retrieves the ready register ID by calling get_ready_reg_id():
// gpgpu-sim/src/gpgpu-sim/shader.cc
void opndcoll_rfu_t::allocate_cu(unsigned port_num) {
  input_port_t &inp = m_in_ports[port_num];
  for (unsigned i = 0; i < inp.m_in.size(); i++) {
    if ((*inp.m_in[i]).has_ready()) {
      // <--- ISSUE HERE: This returns the oldest instruction regardless of sub-core ID
      unsigned reg_id = (*inp.m_in[i]).get_ready_reg_id();
      // ... (Calculation of cuLowerBound / cuUpperBound based on sched_id) ...
      for (unsigned k = cuLowerBound; k < cuUpperBound; k++) {
        // Try to allocate specific CUs...
      }
      // If allocation fails here (because this sub-core's CUs are full),
      // the loop continues to the next 'i' (next pipeline reg),
      // essentially skipping other ready instructions in the current register set.
    }
  }
}
2. The Selection Logic
register_set::get_ready_reg_id iterates through all registers (which map to different sub-cores) and picks the one with the lowest UID:
unsigned get_ready_reg_id() {
  // ...
  for (unsigned i = 0; i < regs.size(); i++) {
    if (not regs[i]->empty()) {
      if (ready and (*ready)->get_uid() < regs[i]->get_uid()) {
        // ready is oldest
      } else {
        ready = &regs[i];
        reg_id = i;
      }
    }
  }
  return reg_id; // Returns the index of the absolute oldest instruction
}
Scenario & Expected Behavior
Scenario:
- Sub-Core 0: Has a valid instruction (Warp 4). Its assigned CUs are Free.
- Sub-Core 1: Has a valid instruction (Warp 5). Its assigned CUs are Full.
- Timing: Warp 5 (Sub-Core 1) is slightly "older" (smaller UID) than Warp 4.
Current Behavior:
- get_ready_reg_id selects Warp 5 because it is older.
- allocate_cu checks Sub-Core 1's CUs. They are full.
- Allocation fails. The function exits or moves to the next pipeline register type, ignoring Warp 4.
- Warp 4 is stalled despite having available resources.
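The scenario above can be reproduced with a small standalone model. This is only a sketch and not GPGPU-Sim code: Slot, oldest_ready, allocate_oldest_only, and the free_cus vector (free CU count per sub-core) are hypothetical names introduced for illustration, and UIDs are assumed to be unique.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one slot of a pipeline register set.
struct Slot {
  bool valid;        // slot currently holds an instruction
  unsigned uid;      // smaller uid means older instruction
  unsigned subcore;  // sub-core (scheduler) that owns the instruction
};

// Index of the oldest valid slot, or -1 if no slot is valid.
// Like the selection described above, it ignores CU availability.
int oldest_ready(const std::vector<Slot>& slots) {
  int best = -1;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (!slots[i].valid) continue;
    if (best < 0 || slots[i].uid < slots[best].uid) best = static_cast<int>(i);
  }
  return best;
}

// One arbitration attempt on one port: pick the globally oldest slot, then
// look only at that slot's sub-core partition. Returns the slot index that
// received a CU, or -1 if nothing was allocated.
int allocate_oldest_only(const std::vector<Slot>& slots,
                         const std::vector<unsigned>& free_cus) {
  int pick = oldest_ready(slots);
  if (pick < 0) return -1;
  if (free_cus[slots[pick].subcore] == 0) return -1;  // head-of-line block
  return pick;
}
```

With slots {{true, 11, 0}, {true, 10, 1}} (Warp 4 on Sub-Core 0, Warp 5 on Sub-Core 1) and free_cus {2, 0}, the model selects index 1 (Warp 5) and allocates nothing, even though Sub-Core 0 has two free CUs.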
Expected Behavior:
Since sub-cores partition the Register File and Collector Units physically in hardware, they should operate independently. If the oldest instruction (Warp 5) cannot proceed due to resource contention specific to Sub-Core 1, the arbiter should check if instructions from other sub-cores (e.g., Warp 4 from Sub-Core 0) can proceed.
Suggested Fix
The allocate_cu logic should be modified to iterate through available sub-cores or continue searching for a candidate instruction if the "oldest" one fails to allocate a CU due to partition-specific constraints.
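One possible shape for this fix, sketched as a standalone model and not as a patch against shader.cc: choose the oldest instruction among those whose sub-core partition still has a free CU. Candidate, oldest_allocatable, and free_cus are hypothetical names; in the simulator the availability check would presumably be the existing scan over cuLowerBound..cuUpperBound for each candidate's sub-core.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one slot of a pipeline register set.
struct Candidate {
  bool valid;        // slot currently holds an instruction
  unsigned uid;      // smaller uid means older instruction
  unsigned subcore;  // sub-core (scheduler) that owns the instruction
};

// Index of the oldest valid candidate whose sub-core has at least one free
// collector unit, or -1 if no candidate can be allocated this cycle.
// free_cus[s] is the number of free CUs in sub-core s's partition.
int oldest_allocatable(const std::vector<Candidate>& slots,
                       const std::vector<unsigned>& free_cus) {
  int best = -1;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (!slots[i].valid) continue;
    if (free_cus[slots[i].subcore] == 0) continue;  // partition full, skip
    if (best < 0 || slots[i].uid < slots[best].uid) best = static_cast<int>(i);
  }
  return best;
}
```

This keeps oldest-first ordering within the set of instructions that can actually make progress, so a sub-core with a full partition no longer blocks the others. Whether this policy matches real hardware arbitration is an open question for the maintainers.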