@Sa4dUs
Contributor

@Sa4dUs Sa4dUs commented Oct 21, 2025

This PR implements the minimal mechanisms required to run a small subset of arbitrary offload kernels without relying on hardcoded names or metadata.

  • offload(kernel, (..args)): an intrinsic that generates the necessary host-side LLVM-IR code.
  • rustc_offload_kernel: a builtin attribute that marks device kernels to be handled appropriately.

Example usage (pseudocode):

fn kernel(x: *mut [f64; 128]) {
    core::intrinsics::offload(kernel_1, (x,))
}

#[cfg(target_os = "linux")]
extern "C" {
    pub fn kernel_1(array_b: *mut [f64; 128]);
}

#[cfg(not(target_os = "linux"))]
#[rustc_offload_kernel]
extern "gpu-kernel" fn kernel_1(x: *mut [f64; 128]) {
    unsafe { (*x)[0] = 21.0 };
}

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 21, 2025
@ZuseZ4 ZuseZ4 self-assigned this Oct 21, 2025
@Sa4dUs Sa4dUs force-pushed the offload-intrinsic branch from 9118683 to 23722aa Compare October 21, 2025 19:45
@ZuseZ4 ZuseZ4 added the F-gpu_offload `#![feature(gpu_offload)]` label Oct 22, 2025
}

pub fn from_ty<'tcx>(tcx: TyCtxt<'tcx>, ty: Ty<'tcx>) -> Self {
    OffloadMetadata { payload_size: get_payload_size(tcx, ty), mode: TransferKind::Both }
Member

Since you already have the code here, I would add a small check: & or by-value arguments imply mode ToGPU, while &mut implies Both.

In the future we hope to analyze the & or by-value case further: if we never read from the argument before writing to it, we could use a new mode 4, which allocates directly on the GPU.
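
A minimal sketch of the check suggested above, assuming a simplified `ArgKind` enum in place of the compiler's real type representation. `ArgKind`, `transfer_kind_for`, and the `ToGpu` variant name are hypothetical; only `TransferKind::Both` appears in the diff.

```rust
/// Direction in which a kernel argument is copied.
/// Only `Both` appears in the PR diff; `ToGpu` is an assumed variant name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferKind {
    ToGpu,
    Both,
}

/// Hypothetical stand-in for how an argument is passed to the kernel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArgKind {
    SharedRef,
    MutRef,
    ByValue,
}

/// Picks a transfer mode from the way the argument is passed.
pub fn transfer_kind_for(arg: ArgKind) -> TransferKind {
    match arg {
        // Assumed not to be written by the kernel, so nothing is copied back.
        ArgKind::SharedRef | ArgKind::ByValue => TransferKind::ToGpu,
        // The kernel may write through `&mut`, so copy in both directions.
        ArgKind::MutRef => TransferKind::Both,
    }
}
```

With this mapping, `transfer_kind_for(ArgKind::MutRef)` yields `TransferKind::Both`, which matches the mode the diff currently hardcodes for every argument.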

@ZuseZ4 ZuseZ4 mentioned this pull request Oct 24, 2025
5 tasks

@bors
Collaborator

bors commented Nov 5, 2025

☔ The latest upstream changes (presumably #148507) made this pull request unmergeable. Please resolve the merge conflicts.

@Sa4dUs Sa4dUs force-pushed the offload-intrinsic branch from e0fd7be to 97a8e96 Compare November 7, 2025 15:37

@bors
Collaborator

bors commented Nov 9, 2025

☔ The latest upstream changes (presumably #148721) made this pull request unmergeable. Please resolve the merge conflicts.

@rustbot rustbot added the A-attributes Area: Attributes (`#[…]`, `#![…]`) label Nov 11, 2025
let i32_0 = cx.get_const_i32(0);
for index in 0..types.len() {
let v = unsafe { llvm::LLVMGetOperand(kernel_call, index as u32).unwrap() };
for index in 0..num_args {
Member

can you iterate directly over args now?
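
One way to read this suggestion, sketched on plain slices rather than LLVM values. `describe_args` and its string arguments are hypothetical and stand in for the real operands; `enumerate` keeps the index available in case it is still needed.

```rust
/// Hypothetical helper: pairs each argument with its position
/// without indexing into the collection by hand.
pub fn describe_args(args: &[&str]) -> Vec<String> {
    args.iter()
        .enumerate()
        .map(|(index, arg)| format!("arg {index}: {arg}"))
        .collect()
}
```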

return Ok(());
}
sym::offload => {
// FIXME(Sa4dUs): emit error when offload is not enabled
Member

I'd rather have that as a TODO, i.e. fixed before we land it. You can just copy the check I had above (and potentially error if you find the intrinsic but the flag is not set). The current main already requires it, and we shouldn't change nightly behaviour because of an experimental feature without a user setting a feature flag.
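
A rough sketch of the gating described here, assuming a simplified options struct. `Options`, `offload_enabled`, and `check_offload_intrinsic` are hypothetical names and not the compiler's actual session API.

```rust
/// Hypothetical stand-in for the compiler options relevant to offload.
pub struct Options {
    pub offload_enabled: bool,
}

/// Returns an error if the intrinsic is used while offload is not enabled.
pub fn check_offload_intrinsic(opts: &Options) -> Result<(), String> {
    if opts.offload_enabled {
        Ok(())
    } else {
        Err("the `offload` intrinsic was used, but offload is not enabled".to_string())
    }
}
```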

Comment on lines 89 to 95
| ty::FnDef(_, _)
| ty::FnPtr(_, _)
| ty::Closure(_, _)
| ty::CoroutineClosure(_, _)
| ty::Coroutine(_, _)
| ty::CoroutineWitness(_, _)
| ty::Never
Member

@ZuseZ4 ZuseZ4 Nov 14, 2025

I'd expect all of these to not be handled correctly by offload without further work, so I'd just error.

The same goes for

            | ty::Bound(_, _)
            | ty::Placeholder(_)
            | ty::Infer(_)
            | ty::Error(_)
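
A minimal sketch of "just error" for unsupported kinds, assuming a heavily simplified type enum. `SimpleTy` and `payload_size` are hypothetical and only mirror the shape of the match in the diff; they are not the compiler's `ty::TyKind` or `get_payload_size`.

```rust
/// Hypothetical, heavily simplified stand-in for the compiler's type kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleTy {
    F64,
    Array(Box<SimpleTy>, u64),
    FnPtr,
    Closure,
    Never,
}

/// Computes the number of bytes to transfer, or an error for kinds
/// that offload is not expected to handle without further work.
pub fn payload_size(ty: &SimpleTy) -> Result<u64, String> {
    match ty {
        SimpleTy::F64 => Ok(8),
        // An array's payload is its element payload times its length.
        SimpleTy::Array(elem, len) => Ok(payload_size(elem)? * len),
        // Reject instead of silently assigning a size.
        SimpleTy::FnPtr | SimpleTy::Closure | SimpleTy::Never => {
            Err(format!("offload does not support arguments of type {ty:?}"))
        }
    }
}
```

Returning a `Result` keeps the unsupported cases visible to the caller, which can then turn them into a proper diagnostic.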

@Sa4dUs Sa4dUs marked this pull request as ready for review November 16, 2025 10:27
@rustbot
Collaborator

rustbot commented Nov 16, 2025

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Nov 16, 2025

#[rustc_intrinsic]
pub const fn autodiff<F, G, T: crate::marker::Tuple, R>(f: F, df: G, args: T) -> R;

/// Generates the LLVM body of a wrapper function to offload a kernel `f`.
Member

We have other backends besides LLVM, so intrinsics typically should be described in terms of what they do, not implementation details. Is that possible here?

Contributor Author

this intrinsic only makes sense in LLVM right now because it relies directly on LLVM's offload feature. that's why i wanted to specify the backend

Contributor Author

if there's a better way to proceed, please let me know

Member

@RalfJung RalfJung Nov 16, 2025

Hm, seems like we did something similar for the autodiff intrinsic. I would assume that the concept of offloading is independent of LLVM, but maybe we don't have to figure out that full story at this point.

Are there docs for the LLVM offload feature you could link to?

Contributor Author

i'd say https://clang.llvm.org/docs/OffloadingDesign.html contains all the relevant details

ping @ZuseZ4 in case he has something better

Member

@ZuseZ4 ZuseZ4 Nov 16, 2025

Yes, your link is probably the best overview. Offload grew out of OpenMP, which is also supported by other compilers like GCC. LLVM just put in some effort to split Offloading and OpenMP, so that the former is easier to use independently. https://gcc.gnu.org/projects/gomp/

ping @antoyo just for awareness.

With respect to a high-level explanation of this intrinsic:
We use a single-source, two-pass compilation approach. First, we compile all functions that should be offloaded, and which are marked by our intrinsic, for the device (e.g. nvptx64, amdgcn-amd-amdhsa, Intel in the future). We then compile the code for the host (e.g. x86-64), where most of the offloading logic happens. On the host side, we generate calls to the OpenMP offload runtime to inform it about the layout of the types (a simplified version of the autodiff TypeTrees). We also use the type system to figure out whether kernel arguments have to be moved only to the device (e.g. &[f32;1024]), only from the device, or both (e.g. &mut [f64]). We then launch the kernel, after which we tell the runtime to end this environment and move data back (as far as needed).

There are obviously a lot of features and optimizations which we want to add in the future. The Rust frontend currently also mostly uses the OpenMP API, since it was more stable back when I started working on it. We intend to move over to the newer offload API, which is slightly lower level.
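
The ordering of the host-side steps in the high-level explanation above can be sketched as follows. This only models the sequence (copy in, launch, copy back) by recording steps in a vector; it does not call any real runtime, and `HostStep`, `plan_offload`, and the variant names are hypothetical. Device allocation for arguments that are only copied back is left out.

```rust
/// Direction of a transfer, following the three cases in the explanation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferKind {
    ToDevice,
    FromDevice,
    Both,
}

/// One step the host performs; recorded instead of calling a real runtime.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostStep {
    /// Copy the argument with this index to the device.
    CopyToDevice(usize),
    LaunchKernel,
    /// Copy the argument with this index back to the host.
    CopyFromDevice(usize),
}

/// Builds the ordered list of host-side steps for one kernel launch.
pub fn plan_offload(args: &[TransferKind]) -> Vec<HostStep> {
    let mut steps = Vec::new();
    // Before the launch: move everything the kernel reads.
    for (i, kind) in args.iter().enumerate() {
        if matches!(kind, TransferKind::ToDevice | TransferKind::Both) {
            steps.push(HostStep::CopyToDevice(i));
        }
    }
    steps.push(HostStep::LaunchKernel);
    // After the launch: move back only what the kernel may have written.
    for (i, kind) in args.iter().enumerate() {
        if matches!(kind, TransferKind::FromDevice | TransferKind::Both) {
            steps.push(HostStep::CopyFromDevice(i));
        }
    }
    steps
}
```

For a kernel taking a `&[f32; 1024]` and a `&mut [f64]`, this plan copies both arguments to the device, launches, and copies only the second one back.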

Labels

A-attributes Area: Attributes (`#[…]`, `#![…]`) A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. F-gpu_offload `#![feature(gpu_offload)]` S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

6 participants