
Unified HostWindow + DeviceWindow #990

Closed
dmwu wants to merge 3 commits into meta-pytorch:main from dmwu:export-D95014942

@dmwu dmwu commented Mar 8, 2026

Summary:
[pipes] Unified DeviceWindow + HostWindow

Introduce a unified DeviceWindow class and its host-side counterpart HostWindow,
replacing the fragmented DeviceWindowSignal/DeviceWindowBarrier/DeviceWindowMemory
hierarchy with a single flat device-side handle.

DeviceWindow (window/DeviceWindow.cuh):

  • Single device-side class holding all signal, barrier, counter, and data-transfer
    state directly (no sub-objects). Passed by value to kernels.
  • Uses MultiPeerDeviceHandle for NVL vs IBGDA transport dispatch.
  • Per-peer signals: one slot per (peer, signal_id). wait_signal() sums across all
    peer rows; wait_signal_from() reads one specific peer slot in O(1).
  • Flat barriers: per-peer-type accumulation (NVL peers via GPU atomics, IBGDA peers
    via RDMA atomics). barrier() synchronizes all peers.
  • Per-peer counters (IBGDA-only): local NIC completion tracking via companion QP
    loopback.
  • Pre-computed peer index maps: O(1) rankToNvlPeerIndex_[rank] and
    rankToIbgdaPeerIndex_[rank] lookups instead of linear scans.
  • Data transfer APIs: put(), send(), recv(), put_signal() dispatching to the
    appropriate transport.
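
The per-peer signal layout described above can be pictured with a small host-side model. This is only a sketch under stated assumptions: `SignalTable`, `post`, `readFrom`, and `readSum` are hypothetical names, the row-major `[peer][signal_id]` layout is assumed, and `std::atomic` stands in for the GPU and RDMA atomics that the real DeviceWindow would use. It shows why a per-peer read touches one slot while an aggregate read walks one slot per peer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a per-peer signal table.
// This is NOT the DeviceWindow API; it only illustrates the slot layout.
class SignalTable {
 public:
  SignalTable(std::size_t numPeers, std::size_t numSignals)
      : numPeers_(numPeers),
        numSignals_(numSignals),
        slots_(numPeers * numSignals) {
    for (auto& s : slots_) {
      s.store(0, std::memory_order_relaxed);
    }
  }

  // A peer adds `value` to its own slot for `signalId`.
  void post(std::size_t peer, std::size_t signalId, std::uint64_t value) {
    slots_[index(peer, signalId)].fetch_add(value, std::memory_order_release);
  }

  // O(1): read exactly one (peer, signalId) slot.
  std::uint64_t readFrom(std::size_t peer, std::size_t signalId) const {
    return slots_[index(peer, signalId)].load(std::memory_order_acquire);
  }

  // O(numPeers): sum the slot for `signalId` across every peer row.
  std::uint64_t readSum(std::size_t signalId) const {
    std::uint64_t total = 0;
    for (std::size_t p = 0; p < numPeers_; ++p) {
      total += readFrom(p, signalId);
    }
    return total;
  }

 private:
  // Assumed row-major layout: one row per peer, one column per signal id.
  std::size_t index(std::size_t peer, std::size_t signalId) const {
    assert(peer < numPeers_ && signalId < numSignals_);
    return peer * numSignals_ + signalId;
  }

  std::size_t numPeers_;
  std::size_t numSignals_;
  std::vector<std::atomic<std::uint64_t>> slots_;
};
```

In this model, a wait would poll `readSum` or `readFrom` until the value satisfies the caller's condition; the exact comparison used by wait_signal() and wait_signal_from() is not stated in this description.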

HostWindow (window/HostWindow.h/.cc):

  • Host-side RAII manager taking MultiPeerTransport& (one-way reference, no circular
    ownership) + WindowConfig.
  • Allocates dual NVL (GpuMemHandler/IPC) + IBGDA (cudaMalloc + RDMA registration)
    backing buffers for signals, barriers, and counters.
  • Optional user data buffer: if provided, registers and exchanges it on both NVL
    (IPC) and IBGDA (RDMA) sides.
  • getDeviceWindow() constructs the flat DeviceWindow in a single call.
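
The ownership shape described for HostWindow (an RAII object that holds a plain reference to the transport and releases its registrations on destruction) might look roughly like the following. This is a minimal sketch, not the actual class: `MockTransport`, `ScopedWindow`, `registerBuffer`, and `deregisterBuffer` are hypothetical stand-ins, and host memory replaces the GPU allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a transport that tracks buffer registrations.
struct MockTransport {
  int liveRegistrations = 0;
  int nextHandle = 0;

  int registerBuffer(void* ptr, std::size_t size) {
    assert(ptr != nullptr && size > 0);
    ++liveRegistrations;
    return nextHandle++;
  }

  void deregisterBuffer(int /*handle*/) {
    assert(liveRegistrations > 0);
    --liveRegistrations;
  }
};

// Hypothetical RAII manager: it refers to the transport but does not own it,
// and the transport knows nothing about the window (one-way reference).
class ScopedWindow {
 public:
  ScopedWindow(MockTransport& transport, std::size_t bytes)
      : transport_(transport),
        storage_(bytes),
        handle_(transport_.registerBuffer(storage_.data(), storage_.size())) {}

  // Release the registration before the backing storage is destroyed.
  ~ScopedWindow() { transport_.deregisterBuffer(handle_); }

  // Non-copyable so a registration is released exactly once.
  ScopedWindow(const ScopedWindow&) = delete;
  ScopedWindow& operator=(const ScopedWindow&) = delete;

 private:
  MockTransport& transport_;            // must outlive the window
  std::vector<unsigned char> storage_;  // stands in for the GPU buffers
  int handle_;
};
```

With a reference member, the caller is responsible for keeping the transport alive for as long as the window exists; that is the usual trade-off of a one-way reference compared with shared ownership.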

Cleanup:

  • Delete deprecated DeviceWindowSignal.cuh, DeviceWindowBarrier.cuh,
    DeviceWindowMemory.cuh, WindowMemory.{h,cc}.
  • Delete MultiPeerDeviceTransport.cuh (superseded by DeviceWindow).
  • Delete old tests: MultiPeerDeviceTransportTest, WindowMemoryTest.
  • Add new tests: DeviceWindowTest, HostWindowTest.

Reviewed By: siyengar

Differential Revision: D95014942

@meta-cla meta-cla bot added the CLA Signed label Mar 8, 2026

meta-codesync bot commented Mar 8, 2026

@dmwu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95014942.

@dmwu dmwu force-pushed the export-D95014942 branch from b54ac38 to be47a15 March 9, 2026 16:09
dmwu added 3 commits March 9, 2026 20:58
Differential Revision: D94633071
Summary:
[pipes] MultiPeerTransport: buffer exchange and topology APIs

Extend MultiPeerTransport to be a self-contained service that provides all
buffer management APIs the window layer (HostWindow) needs. Previously,
buffer registration and exchange were scattered across MultipeerIbgdaTransport
and MultiPeerNvlTransport internals. Now MultiPeerTransport exposes a clean
public interface for both IBGDA and NVL buffer workflows.

New MultiPeerTransport APIs:
- registerIbgdaBuffer(ptr, size): RDMA-register a GPU buffer, returns
  IbgdaLocalBuffer with lkey for local RDMA operations.
- deregisterIbgdaBuffer(localBuf): RDMA-deregister a previously registered
  buffer.
- exchangeIbgdaBuffer(localBuf): All-to-all exchange of RDMA buffer metadata
  across ranks, returns vector<IbgdaRemoteBuffer> with rkeys for remote access.
- nvl_bootstrap(): Expose the NVL bootstrap adapter for NVLink IPC exchange.
- exchangeNvlBuffer(ptr, size): IPC exchange of a GPU buffer across NVL peers,
  returns mapped peer pointers for direct NVLink access.
- unmapNvlBuffers(mappedPtrs): Unmap previously IPC-exchanged NVL buffers.
- get_device_handle(): Returns the lightweight MultiPeerDeviceHandle for kernels.
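
To make the result shape of the all-to-all exchange concrete, here is a single-process simulation. It is a sketch under assumptions: `BufferMeta` and `simulateAllToAll` are hypothetical, the real IbgdaLocalBuffer and IbgdaRemoteBuffer layouts are not shown in this description, and whether the real returned vector is indexed by rank or includes an entry for the local rank is assumed here rather than confirmed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical metadata record for a registered buffer.
struct BufferMeta {
  std::uint64_t addr;  // base address of the registered buffer
  std::uint32_t key;   // access key that remote ranks would use
};

// Simulates an all-to-all exchange inside one process: contributed[r] is what
// rank r publishes, and result[r] is the table rank r ends up holding.
// Assumption: every rank receives the full table, indexed by source rank.
std::vector<std::vector<BufferMeta>> simulateAllToAll(
    const std::vector<BufferMeta>& contributed) {
  assert(!contributed.empty());
  return std::vector<std::vector<BufferMeta>>(contributed.size(), contributed);
}
```

After such an exchange, a rank that wants to write into rank r's buffer would look up entry r in its own table to find the remote address and key.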

Other changes:
- MultiPeerNvlTransport: Remove getMultiPeerDeviceTransport() — the device
  transport wrapper is superseded by the unified DeviceWindow.
- Delete MultiPeerDeviceTransport.cuh: no longer needed as DeviceWindow provides
  all transport, signal, barrier, counter, and data transfer APIs directly.
- Update BUCK: remove multi_peer_device_transport target and dependencies.

Differential Revision: D95281294

Reviewed By: siyengar
Summary:
Pull Request resolved: meta-pytorch#990

[pipes] Unified DeviceWindow + HostWindow

Introduce a unified DeviceWindow class and its host-side counterpart HostWindow,
replacing the fragmented DeviceWindowSignal/DeviceWindowBarrier/DeviceWindowMemory
hierarchy with a single flat device-side handle.

DeviceWindow (window/DeviceWindow.cuh):
- Single device-side class holding all signal, barrier, counter, and data-transfer
  state directly (no sub-objects). Passed by value to kernels.
- Uses MultiPeerDeviceHandle for NVL vs IBGDA transport dispatch.
- Per-peer signals: one slot per (peer, signal_id). wait_signal() sums across all
  peer rows; wait_signal_from() reads one specific peer slot in O(1).
- Flat barriers: per-peer-type accumulation (NVL peers via GPU atomics, IBGDA peers
  via RDMA atomics). barrier() synchronizes all peers.
- Per-peer counters (IBGDA-only): local NIC completion tracking via companion QP
  loopback.
- Pre-computed peer index maps: O(1) rankToNvlPeerIndex_[rank] and
  rankToIbgdaPeerIndex_[rank] lookups instead of linear scans.
- Data transfer APIs: put(), send(), recv(), put_signal() dispatching to the
  appropriate transport.
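
The pre-computed peer index maps mentioned above can be sketched as a dense array built once on the host. The names `buildRankToPeerIndex` and `kNotAPeer` are hypothetical, and the sentinel value for ranks that are not peers on a given transport is an assumption; the description only states that lookups such as rankToNvlPeerIndex_[rank] are O(1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed sentinel for ranks that are not reachable on this transport.
constexpr int kNotAPeer = -1;

// Builds a dense rank -> peer-index table so later lookups are a single
// array read instead of a linear scan over peerRanks.
std::vector<int> buildRankToPeerIndex(
    int worldSize, const std::vector<int>& peerRanks) {
  assert(worldSize > 0);
  std::vector<int> rankToPeerIndex(static_cast<std::size_t>(worldSize), kNotAPeer);
  for (std::size_t i = 0; i < peerRanks.size(); ++i) {
    const int rank = peerRanks[i];
    assert(rank >= 0 && rank < worldSize);
    rankToPeerIndex[static_cast<std::size_t>(rank)] = static_cast<int>(i);
  }
  return rankToPeerIndex;
}
```

The table costs one integer per rank, which is typically small next to the signal and barrier buffers, and it removes a data-dependent loop from every per-peer operation.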

HostWindow (window/HostWindow.h/.cc):
- Host-side RAII manager taking MultiPeerTransport& (one-way reference, no circular
  ownership) + WindowConfig.
- Allocates dual NVL (GpuMemHandler/IPC) + IBGDA (cudaMalloc + RDMA registration)
  backing buffers for signals, barriers, and counters.
- Optional user data buffer: if provided, registers and exchanges it on both NVL
  (IPC) and IBGDA (RDMA) sides.
- getDeviceWindow() constructs the flat DeviceWindow in a single call.

Cleanup:
- Delete deprecated DeviceWindowSignal.cuh, DeviceWindowBarrier.cuh,
  DeviceWindowMemory.cuh, WindowMemory.{h,cc}.
- Delete MultiPeerDeviceTransport.cuh (superseded by DeviceWindow).
- Delete old tests: MultiPeerDeviceTransportTest, WindowMemoryTest.
- Add new tests: DeviceWindowTest, HostWindowTest.

Reviewed By: siyengar

Differential Revision: D95014942

meta-codesync bot commented Mar 10, 2026

This pull request has been merged in 65f3782.


Labels

CLA Signed, fb-exported, Merged, meta-exported
