MultiPeerTransport: buffer exchange + topology APIs #991
Closed
dmwu wants to merge 1 commit into meta-pytorch:main
Summary:
[pipes] MultiPeerTransport: buffer exchange and topology APIs

Extend MultiPeerTransport to be a self-contained service that provides all buffer management APIs the window layer (HostWindow) needs. Previously, buffer registration and exchange were scattered across MultipeerIbgdaTransport and MultiPeerNvlTransport internals. Now MultiPeerTransport exposes a clean public interface for both IBGDA and NVL buffer workflows.

New MultiPeerTransport APIs:
- registerIbgdaBuffer(ptr, size): RDMA-register a GPU buffer; returns an IbgdaLocalBuffer with the lkey for local RDMA operations.
- deregisterIbgdaBuffer(localBuf): RDMA-deregister a previously registered buffer.
- exchangeIbgdaBuffer(localBuf): All-to-all exchange of RDMA buffer metadata across ranks; returns a vector<IbgdaRemoteBuffer> with rkeys for remote access.
- nvl_bootstrap(): Expose the NVL bootstrap adapter for NVLink IPC exchange.
- exchangeNvlBuffer(ptr, size): IPC exchange of a GPU buffer across NVL peers; returns mapped peer pointers for direct NVLink access.
- unmapNvlBuffers(mappedPtrs): Unmap previously IPC-exchanged NVL buffers.
- get_device_handle(): Returns the lightweight MultiPeerDeviceHandle for kernels.

Other changes:
- MultiPeerNvlTransport: Remove getMultiPeerDeviceTransport(); the device transport wrapper is superseded by the unified DeviceWindow.
- Delete MultiPeerDeviceTransport.cuh: no longer needed, as DeviceWindow provides all transport, signal, barrier, counter, and data transfer APIs directly.
- Update BUCK: remove the multi_peer_device_transport target and its dependencies.

Reviewed By: siyengar
Differential Revision: D95281294
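For readers who want to see how the buffer APIs listed above might fit together, here is a minimal host-only sketch. It is not the real implementation: MockMultiPeerTransport, the struct fields, the exact signatures, and the runBufferWorkflow helper are all assumptions made for illustration. Only the method names and the rough return shapes come from the description above. The mock fabricates keys and pointers instead of touching RDMA or NVLink.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real types; field names are assumptions.
struct IbgdaLocalBuffer {
  void* ptr;
  std::size_t size;
  std::uint32_t lkey;  // local key, used for local RDMA operations
};

struct IbgdaRemoteBuffer {
  void* ptr;
  std::uint32_t rkey;  // remote key, used to access a peer's buffer
};

// Host-only mock. Method names follow the PR description; signatures and
// behaviour are guesses, and no RDMA or IPC work is performed.
class MockMultiPeerTransport {
 public:
  explicit MockMultiPeerTransport(int nRanks) : nRanks_(nRanks) {}

  IbgdaLocalBuffer registerIbgdaBuffer(void* ptr, std::size_t size) {
    ++registered_;
    return IbgdaLocalBuffer{ptr, size, nextKey_++};
  }

  void deregisterIbgdaBuffer(const IbgdaLocalBuffer& /*localBuf*/) {
    --registered_;
  }

  // The real call is described as an all-to-all exchange across ranks;
  // the mock simply fabricates one descriptor per rank.
  std::vector<IbgdaRemoteBuffer> exchangeIbgdaBuffer(
      const IbgdaLocalBuffer& localBuf) {
    std::vector<IbgdaRemoteBuffer> remote;
    for (int r = 0; r < nRanks_; ++r) {
      remote.push_back(IbgdaRemoteBuffer{
          localBuf.ptr, localBuf.lkey + 1000u + static_cast<std::uint32_t>(r)});
    }
    return remote;
  }

  // The real call is described as returning mapped peer pointers; the mock
  // returns the local pointer once per rank.
  std::vector<void*> exchangeNvlBuffer(void* ptr, std::size_t /*size*/) {
    mapped_ += nRanks_;
    return std::vector<void*>(static_cast<std::size_t>(nRanks_), ptr);
  }

  void unmapNvlBuffers(const std::vector<void*>& mappedPtrs) {
    mapped_ -= static_cast<int>(mappedPtrs.size());
  }

  int liveRegistrations() const { return registered_; }
  int liveMappings() const { return mapped_; }

 private:
  int nRanks_;
  int registered_ = 0;
  int mapped_ = 0;
  std::uint32_t nextKey_ = 1;
};

struct WorkflowResult {
  std::size_t remoteCount;  // IBGDA remote descriptors received
  std::size_t peerCount;    // NVL peer pointers mapped
};

// Hypothetical helper showing one plausible call order:
// register -> exchange -> (use) -> unmap / deregister.
inline WorkflowResult runBufferWorkflow(MockMultiPeerTransport& transport,
                                        void* buf, std::size_t size) {
  IbgdaLocalBuffer local = transport.registerIbgdaBuffer(buf, size);
  std::vector<IbgdaRemoteBuffer> remote = transport.exchangeIbgdaBuffer(local);
  std::vector<void*> peers = transport.exchangeNvlBuffer(buf, size);

  WorkflowResult result{remote.size(), peers.size()};

  // Tear down in reverse order of setup.
  transport.unmapNvlBuffers(peers);
  transport.deregisterIbgdaBuffer(local);
  return result;
}
```

The teardown mirrors the setup so that every registration and mapping is released. Whether the real transport requires that exact ordering is not stated in the description.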
This pull request has been merged in 5a0cd8a.
alexk101 added a commit to alexk101/torchcomms that referenced this pull request on Mar 16, 2026:
Removed the MultiPeerDeviceTransport.cuh file from the RCCLX CMakeLists, which was deprecated in meta-pytorch#991