
[Bug] Head-of-Line Blocking in Operand Collector Arbitration under Sub-Core Mode #334

@reoLantern

Description


I have identified a logic issue in the Operand Collector (OC) arbitration when running in Sub-Core mode.

In the sub-core configuration, Collector Units (CUs) are hard-partitioned among sub-cores (schedulers). However, the OC input port arbitration logic (allocate_cu) selects the candidate instruction by global age, i.e. the lowest UID across all sub-cores sharing that port, without checking whether that sub-core's CU partition has a free unit.

If the "oldest" instruction belongs to a sub-core whose assigned CUs are all occupied, the allocation fails and the arbiter makes no further attempt on that register set in the current cycle. Younger instructions from other sub-cores, whose assigned CUs may be free, are therefore not issued, which causes unnecessary Head-of-Line (HoL) blocking.

Code Analysis

1. Global "Oldest" Selection
In opndcoll_rfu_t::allocate_cu, the code retrieves the ready register ID by calling get_ready_reg_id():

// gpgpu-sim/src/gpgpu-sim/shader.cc

void opndcoll_rfu_t::allocate_cu(unsigned port_num) {
  input_port_t &inp = m_in_ports[port_num];
  for (unsigned i = 0; i < inp.m_in.size(); i++) {
    if ((*inp.m_in[i]).has_ready()) {
      // <--- ISSUE HERE: This returns the oldest instruction regardless of sub-core ID
      unsigned reg_id = (*inp.m_in[i]).get_ready_reg_id(); 
      
      // ... (Calculation of cuLowerBound / cuUpperBound based on sched_id) ...

      for (unsigned k = cuLowerBound; k < cuUpperBound; k++) {
         // Try to allocate specific CUs...
      }
      // If allocation fails here (because this sub-core's CUs are full),
      // the loop continues to the next 'i' (next pipeline reg), 
      // essentially skipping other ready instructions in the current register set.
    }
  }
}

2. The Selection Logic
register_set::get_ready_reg_id iterates through all registers (which map to different sub-cores) and picks the one with the lowest UID:

unsigned get_ready_reg_id() {
    // ...
    for (unsigned i = 0; i < regs.size(); i++) {
      if (not regs[i]->empty()) {
        if (ready and (*ready)->get_uid() < regs[i]->get_uid()) {
          // ready is oldest
        } else {
          ready = &regs[i];
          reg_id = i;
        }
      }
    }
    return reg_id; // Returns the index of the absolute oldest instruction
}

Scenario & Expected Behavior

Scenario:

  • Sub-Core 0: Has a valid instruction (Warp 4). Its assigned CUs are Free.
  • Sub-Core 1: Has a valid instruction (Warp 5). Its assigned CUs are Full.
  • Timing: Warp 5 (Sub-Core 1) is slightly "older" (smaller UID) than Warp 4.

Current Behavior:

  1. get_ready_reg_id selects Warp 5 because it is older.
  2. allocate_cu checks Sub-Core 1's CUs. They are full.
  3. Allocation fails. The loop moves on to the next pipeline register set without ever considering Warp 4.
  4. Warp 4 is stalled despite having available resources.
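
The scenario above can be reproduced in a small standalone model. This is only a sketch of the behavior as described in this issue, not GPGPU-Sim code: `Slot`, `oldest_ready`, `allocate_current`, `kNone`, and the `free_cus` vector are hypothetical names, and each sub-core is assumed to own exactly one slot in the shared register set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model for illustration only; none of these names exist in GPGPU-Sim.
struct Slot {
  bool valid;    // true if the slot holds a ready instruction
  unsigned uid;  // smaller uid means older instruction
};

const int kNone = -1;  // nothing selected / nothing issued

// Age-only selection, modeled on the get_ready_reg_id() snippet quoted above:
// CU availability is never consulted.
int oldest_ready(const std::vector<Slot>& slots) {
  int best = kNone;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (!slots[i].valid) continue;
    if (best == kNone || slots[i].uid < slots[best].uid) {
      best = static_cast<int>(i);
    }
  }
  return best;
}

// One allocation attempt per register set, as described in this issue.
// free_cus[s] is the number of free CUs in sub-core s's partition.
// Returns the sub-core index that issued, or kNone.
int allocate_current(const std::vector<Slot>& slots,
                     const std::vector<unsigned>& free_cus) {
  assert(slots.size() == free_cus.size());
  const int pick = oldest_ready(slots);
  if (pick == kNone) return kNone;
  if (free_cus[pick] == 0) return kNone;  // oldest is blocked; no one else is tried
  return pick;
}
```

With slot 0 holding Warp 4 (uid 11, two free CUs) and slot 1 holding Warp 5 (uid 10, no free CUs), `allocate_current` returns `kNone` even though sub-core 0 could have issued.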

Expected Behavior:
Since sub-cores partition the Register File and Collector Units physically in hardware, they should operate independently. If the oldest instruction (Warp 5) cannot proceed due to resource contention specific to Sub-Core 1, the arbiter should check if instructions from other sub-cores (e.g., Warp 4 from Sub-Core 0) can proceed.

Suggested Fix

The allocate_cu logic should be modified to iterate through available sub-cores or continue searching for a candidate instruction if the "oldest" one fails to allocate a CU due to partition-specific constraints.
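
One possible shape for such a fix, sketched under assumptions: instead of picking the oldest instruction first and checking CUs afterwards, the selection could skip any sub-core whose CU partition is full and return the oldest instruction among the rest. The names `Candidate`, `oldest_allocatable`, `kNoCandidate`, and `free_cus` are hypothetical and do not exist in GPGPU-Sim; a real patch would need to query the actual CU range for each scheduler inside allocate_cu.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one sub-core's entry in a shared register set.
struct Candidate {
  bool valid;    // true if the entry holds a ready instruction
  unsigned uid;  // smaller uid means older instruction
};

const int kNoCandidate = -1;

// Returns the index of the oldest ready instruction whose sub-core still has
// at least one free CU, or kNoCandidate if no such instruction exists.
// free_cus[s] is the number of free CUs in sub-core s's partition.
int oldest_allocatable(const std::vector<Candidate>& slots,
                       const std::vector<unsigned>& free_cus) {
  assert(slots.size() == free_cus.size());
  int best = kNoCandidate;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (!slots[i].valid) continue;    // nothing to issue from this sub-core
    if (free_cus[i] == 0) continue;   // partition full: skip instead of blocking
    if (best == kNoCandidate || slots[i].uid < slots[best].uid) {
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

This keeps oldest-first ordering among instructions that can actually be allocated, so behavior is unchanged whenever the globally oldest instruction has a free CU. It still issues at most one instruction per call; whether several sub-cores should be allowed to allocate in the same cycle is a separate design question.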
