MultiPeerTransport: buffer exchange + topology APIs #991

Closed
dmwu wants to merge 1 commit into meta-pytorch:main from dmwu:export-D95281294

@dmwu dmwu commented Mar 8, 2026

Summary:
[pipes] MultiPeerTransport: buffer exchange and topology APIs

Extend MultiPeerTransport to be a self-contained service that provides all
buffer management APIs the window layer (HostWindow) needs. Previously,
buffer registration and exchange were scattered across MultipeerIbgdaTransport
and MultiPeerNvlTransport internals. Now MultiPeerTransport exposes a clean
public interface for both IBGDA and NVL buffer workflows.

New MultiPeerTransport APIs:

  • registerIbgdaBuffer(ptr, size): RDMA-register a GPU buffer, returns
    IbgdaLocalBuffer with lkey for local RDMA operations.
  • deregisterIbgdaBuffer(localBuf): RDMA-deregister a previously registered
    buffer.
  • exchangeIbgdaBuffer(localBuf): All-to-all exchange of RDMA buffer metadata
    across ranks, returns vector<IbgdaRemoteBuffer> with rkeys for remote access.
  • nvl_bootstrap(): Expose the NVL bootstrap adapter for NVLink IPC exchange.
  • exchangeNvlBuffer(ptr, size): IPC exchange of a GPU buffer across NVL peers,
    returns mapped peer pointers for direct NVLink access.
  • unmapNvlBuffers(mappedPtrs): Unmap previously IPC-exchanged NVL buffers.
  • get_device_handle(): Returns the lightweight MultiPeerDeviceHandle for kernels.
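
To make the intended call order concrete, here is a minimal single-process sketch of the two host-side workflows listed above: register, exchange, and deregister for IBGDA, and exchange then unmap for NVL. It is not the code from this PR. `MockMultiPeerTransport`, the struct field layouts, every signature, and the one-entry-per-rank return shape are assumptions made for illustration, and a host `std::vector<char>` stands in for GPU memory. `nvl_bootstrap()` and `get_device_handle()` are left out because nothing in the summary describes their return types in enough detail to mock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: field layouts are assumptions, not the real structs.
struct IbgdaLocalBuffer {
  void* ptr;
  std::size_t size;
  std::uint32_t lkey;  // key for local RDMA operations
};

struct IbgdaRemoteBuffer {
  std::uintptr_t addr;
  std::size_t size;
  std::uint32_t rkey;  // key peers use for remote access
};

// Single-process mock. Method names follow the PR summary; signatures are guesses.
class MockMultiPeerTransport {
 public:
  explicit MockMultiPeerTransport(int nRanks) : nRanks_(nRanks) {}

  IbgdaLocalBuffer registerIbgdaBuffer(void* ptr, std::size_t size) {
    ++liveRegistrations_;
    return IbgdaLocalBuffer{ptr, size, nextKey_++};
  }

  void deregisterIbgdaBuffer(const IbgdaLocalBuffer& /*localBuf*/) {
    --liveRegistrations_;
  }

  // The real call is an all-to-all across ranks. The mock fabricates one
  // entry per rank (self included, which is an assumption).
  std::vector<IbgdaRemoteBuffer> exchangeIbgdaBuffer(
      const IbgdaLocalBuffer& localBuf) const {
    std::vector<IbgdaRemoteBuffer> remote;
    for (int r = 0; r < nRanks_; ++r) {
      remote.push_back(IbgdaRemoteBuffer{
          reinterpret_cast<std::uintptr_t>(localBuf.ptr), localBuf.size,
          0x1000u + static_cast<std::uint32_t>(r)});
    }
    return remote;
  }

  // The real call maps peer memory over IPC. The mock repeats the local pointer.
  std::vector<void*> exchangeNvlBuffer(void* ptr, std::size_t /*size*/) {
    mappedCount_ += nRanks_;
    return std::vector<void*>(static_cast<std::size_t>(nRanks_), ptr);
  }

  void unmapNvlBuffers(const std::vector<void*>& mappedPtrs) {
    mappedCount_ -= static_cast<int>(mappedPtrs.size());
  }

  // True when every registration and mapping has been released.
  bool clean() const { return liveRegistrations_ == 0 && mappedCount_ == 0; }

 private:
  int nRanks_;
  int liveRegistrations_ = 0;
  int mappedCount_ = 0;
  std::uint32_t nextKey_ = 1;
};

// IBGDA workflow: register -> exchange -> (use rkeys) -> deregister.
std::size_t demoIbgdaLifecycle(int nRanks) {
  MockMultiPeerTransport transport(nRanks);
  std::vector<char> buffer(4096);  // host memory standing in for a GPU buffer
  IbgdaLocalBuffer local =
      transport.registerIbgdaBuffer(buffer.data(), buffer.size());
  std::vector<IbgdaRemoteBuffer> remote = transport.exchangeIbgdaBuffer(local);
  transport.deregisterIbgdaBuffer(local);
  assert(transport.clean());
  return remote.size();
}

// NVL workflow: exchange -> (use mapped peer pointers) -> unmap.
std::size_t demoNvlLifecycle(int nRanks) {
  MockMultiPeerTransport transport(nRanks);
  std::vector<char> buffer(4096);
  std::vector<void*> mapped =
      transport.exchangeNvlBuffer(buffer.data(), buffer.size());
  std::size_t count = mapped.size();
  transport.unmapNvlBuffers(mapped);
  assert(transport.clean());
  return count;
}
```

Calling `demoIbgdaLifecycle(4)` or `demoNvlLifecycle(8)` from a `main` walks one lifecycle each. The point of the sketch is the pairing: each register or exchange call has a matching deregister or unmap call, which a caller such as the window layer would presumably own.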

Other changes:

  • MultiPeerNvlTransport: Remove getMultiPeerDeviceTransport() — the device
    transport wrapper is superseded by the unified DeviceWindow.
  • Delete MultiPeerDeviceTransport.cuh: no longer needed as DeviceWindow provides
    all transport, signal, barrier, counter, and data transfer APIs directly.
  • Update BUCK: remove multi_peer_device_transport target and dependencies.

Reviewed By: siyengar

Differential Revision: D95281294

@meta-cla meta-cla Bot added the CLA Signed label Mar 8, 2026

meta-codesync Bot commented Mar 8, 2026

@dmwu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95281294.

@dmwu dmwu force-pushed the export-D95281294 branch from 5c62e2c to 3858f97 Compare March 9, 2026 16:09

meta-codesync Bot commented Mar 10, 2026

This pull request has been merged in 5a0cd8a.

alexk101 added a commit to alexk101/torchcomms that referenced this pull request Mar 16, 2026
Removed the MultiPeerDeviceTransport.cuh file from the RCCLX CMakeLists, which was deprecated in meta-pytorch#991

Labels

CLA Signed, fb-exported, Merged, meta-exported
