Differential Revision: D94633071
Summary:
[pipes] MultiPeerTransport: buffer exchange and topology APIs

Extend MultiPeerTransport to be a self-contained service that provides all buffer management APIs the window layer (HostWindow) needs. Previously, buffer registration and exchange were scattered across MultipeerIbgdaTransport and MultiPeerNvlTransport internals. Now MultiPeerTransport exposes a clean public interface for both IBGDA and NVL buffer workflows.

New MultiPeerTransport APIs:
- registerIbgdaBuffer(ptr, size): RDMA-register a GPU buffer, returns IbgdaLocalBuffer with lkey for local RDMA operations.
- deregisterIbgdaBuffer(localBuf): RDMA-deregister a previously registered buffer.
- exchangeIbgdaBuffer(localBuf): All-to-all exchange of RDMA buffer metadata across ranks, returns vector<IbgdaRemoteBuffer> with rkeys for remote access.
- nvl_bootstrap(): Expose the NVL bootstrap adapter for NVLink IPC exchange.
- exchangeNvlBuffer(ptr, size): IPC exchange of a GPU buffer across NVL peers, returns mapped peer pointers for direct NVLink access.
- unmapNvlBuffers(mappedPtrs): Unmap previously IPC-exchanged NVL buffers.
- get_device_handle(): Returns the lightweight MultiPeerDeviceHandle for kernels.

Other changes:
- MultiPeerNvlTransport: Remove getMultiPeerDeviceTransport(); the device transport wrapper is superseded by the unified DeviceWindow.
- Delete MultiPeerDeviceTransport.cuh: no longer needed as DeviceWindow provides all transport, signal, barrier, counter, and data transfer APIs directly.
- Update BUCK: remove multi_peer_device_transport target and dependencies.

Differential Revision: D95281294
Reviewed By: siyengar
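The all-to-all exchange described for exchangeIbgdaBuffer can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the transport's implementation: `RemoteBufModel` and `allToAllExchangeModel` are hypothetical names, the fields of the real IbgdaRemoteBuffer are not shown in this PR, and the model assumes the returned vector is indexed by rank and includes the caller's own entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for IbgdaRemoteBuffer: an address plus the rkey
// a peer needs for remote access. The real field layout is an assumption.
struct RemoteBufModel {
  std::uintptr_t addr;
  std::uint32_t rkey;
};

// Simulates an all-to-all metadata exchange in a single process.
// published[p] is what rank p advertises about its registered buffer.
// The result views[r][p] is rank r's handle to rank p's buffer, so after
// the exchange every rank holds one entry per rank.
std::vector<std::vector<RemoteBufModel>> allToAllExchangeModel(
    const std::vector<RemoteBufModel>& published) {
  assert(!published.empty());
  const std::size_t nRanks = published.size();
  std::vector<std::vector<RemoteBufModel>> views(
      nRanks, std::vector<RemoteBufModel>(nRanks));
  for (std::size_t r = 0; r < nRanks; ++r) {
    for (std::size_t p = 0; p < nRanks; ++p) {
      views[r][p] = published[p];  // every rank receives every rank's metadata
    }
  }
  return views;
}
```

In this model, calling `allToAllExchangeModel` with three published entries yields three per-rank views, each holding the same three (addr, rkey) pairs. A real exchange would move this metadata over a bootstrap channel rather than copying it in memory.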
Summary:
Pull Request resolved: meta-pytorch#990

[pipes] Unified DeviceWindow + HostWindow

Introduce a unified DeviceWindow class and its host-side counterpart HostWindow, replacing the fragmented DeviceWindowSignal/DeviceWindowBarrier/DeviceWindowMemory hierarchy with a single flat device-side handle.

DeviceWindow (window/DeviceWindow.cuh):
- Single device-side class holding all signal, barrier, counter, and data-transfer state directly (no sub-objects). Passed by value to kernels.
- Uses MultiPeerDeviceHandle for NVL vs IBGDA transport dispatch.
- Per-peer signals: one slot per (peer, signal_id). wait_signal() sums across all peer rows; wait_signal_from() reads one specific peer slot in O(1).
- Flat barriers: per-peer-type accumulation (NVL peers via GPU atomics, IBGDA peers via RDMA atomics). barrier() synchronizes all peers.
- Per-peer counters (IBGDA-only): local NIC completion tracking via companion QP loopback.
- Pre-computed peer index maps: O(1) rankToNvlPeerIndex_[rank] and rankToIbgdaPeerIndex_[rank] lookups instead of linear scans.
- Data transfer APIs: put(), send(), recv(), put_signal() dispatching to the appropriate transport.

HostWindow (window/HostWindow.h/.cc):
- Host-side RAII manager taking MultiPeerTransport& (one-way reference, no circular ownership) + WindowConfig.
- Allocates dual NVL (GpuMemHandler/IPC) + IBGDA (cudaMalloc + RDMA registration) backing buffers for signals, barriers, and counters.
- Optional user data buffer: if provided, registers and exchanges it on both NVL (IPC) and IBGDA (RDMA) sides.
- getDeviceWindow() constructs the flat DeviceWindow in a single call.

Cleanup:
- Delete deprecated DeviceWindowSignal.cuh, DeviceWindowBarrier.cuh, DeviceWindowMemory.cuh, WindowMemory.{h,cc}.
- Delete MultiPeerDeviceTransport.cuh (superseded by DeviceWindow).
- Delete old tests: MultiPeerDeviceTransportTest, WindowMemoryTest.
- Add new tests: DeviceWindowTest, HostWindowTest.

Reviewed By: siyengar
Differential Revision: D95014942
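The per-peer signal layout and the pre-computed peer index maps described above can be modelled on the host in plain C++. This is a minimal sketch, not the DeviceWindow code: `SignalTableModel`, `arrive`, `readFrom`, `readSum`, and `buildPeerIndexMap` are hypothetical names, the row-major slot layout and the -1 sentinel for non-peer ranks are assumptions, and the model only shows the reads that wait_signal_from() and wait_signal() would perform, without any device-side waiting or atomics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of "one slot per (peer, signal_id)".
// Assumed layout: row-major, slots[peer * numSignals + signalId].
struct SignalTableModel {
  int numPeers;
  int numSignals;
  std::vector<std::uint64_t> slots;

  SignalTableModel(int peers, int signals)
      : numPeers(peers),
        numSignals(signals),
        slots(static_cast<std::size_t>(peers) * signals, 0) {}

  std::size_t index(int peer, int signalId) const {
    assert(peer >= 0 && peer < numPeers);
    assert(signalId >= 0 && signalId < numSignals);
    return static_cast<std::size_t>(peer) * numSignals + signalId;
  }

  // A peer bumps its own slot for the given signal.
  void arrive(int peer, int signalId, std::uint64_t value) {
    slots[index(peer, signalId)] += value;
  }

  // The value a wait on one specific peer would inspect: a single O(1) read.
  std::uint64_t readFrom(int peer, int signalId) const {
    return slots[index(peer, signalId)];
  }

  // The value a wait across all peers would inspect: a sum over peer rows.
  std::uint64_t readSum(int signalId) const {
    std::uint64_t total = 0;
    for (int peer = 0; peer < numPeers; ++peer) {
      total += slots[index(peer, signalId)];
    }
    return total;
  }
};

// Builds a rank -> peer-index table once, so later lookups are O(1)
// instead of a linear scan over peerRanks. Non-peer ranks map to -1.
std::vector<int> buildPeerIndexMap(int worldSize,
                                   const std::vector<int>& peerRanks) {
  std::vector<int> rankToPeerIndex(static_cast<std::size_t>(worldSize), -1);
  for (std::size_t i = 0; i < peerRanks.size(); ++i) {
    assert(peerRanks[i] >= 0 && peerRanks[i] < worldSize);
    rankToPeerIndex[static_cast<std::size_t>(peerRanks[i])] =
        static_cast<int>(i);
  }
  return rankToPeerIndex;
}
```

Keeping one slot per peer means a wait on a single sender never has to disentangle contributions from other peers, while the summed read still supports waiting on an aggregate count. The index table trades a small amount of memory (one entry per rank) for constant-time lookups.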
This pull request has been merged in 65f3782.