Replies: 2 comments
Thanks for driving this RFC. I'm strongly aligned with the goal of broadening accessibility beyond TPU. Some specific points (sharing my own views only):
Thanks again for spearheading this effort.
Thank you @yhtang for the valuable input, and +1 on everything! For bullet point 2, I have a separate RFC, because McJax can be device agnostic and both TPU and GPU can benefit from it. Let's also keep in mind that, with OSS Pathways on track, single-controller JAX and McJax each provide unique advantages in different areas; let's make sure Tunix achieves the best performance on both setups. For bullet point 3, I'm thinking of having an abstract attention adapter layer that provides a unified interface to the model, with various implementations or imports behind it. For bullet point 4, absolutely! Blackwell > Hopper > Ampere is the way to go.
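To illustrate the abstract attention adapter idea mentioned above, here is a minimal, hypothetical sketch; the class and method names are illustrative only and not part of the current Tunix API. A Flash-Attention or vendor-kernel backend would implement the same interface.

```python
# Hypothetical sketch of an abstract attention adapter layer; names are
# illustrative only, not the actual Tunix interface.
from abc import ABC, abstractmethod

import jax
import jax.numpy as jnp


class AttentionAdapter(ABC):
    """Unified attention interface the model calls; backends plug in behind it."""

    @abstractmethod
    def __call__(self, q, k, v, mask=None):
        ...


class XlaAttentionAdapter(AttentionAdapter):
    """Reference backend using plain XLA ops (runs on both TPU and GPU)."""

    def __call__(self, q, k, v, mask=None):
        # q, k, v: [..., seq, heads, head_dim]
        scores = jnp.einsum("...qhd,...khd->...hqk", q, k) / jnp.sqrt(q.shape[-1])
        if mask is not None:
            scores = jnp.where(mask, scores, jnp.finfo(scores.dtype).min)
        probs = jax.nn.softmax(scores, axis=-1)
        return jnp.einsum("...hqk,...khd->...qhd", probs, v)
```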
Status: Draft for community feedback
Authors: Lance Wang
Tunix focuses on TPU-first training/inference for SFT and RL (DPO/PPO/GRPO/GSPO/...). Community users and contributors have expressed strong interest in running Tunix on GPUs (A100/H100, 4090/4080, L4, etc.). GPU support broadens accessibility, enables on‑prem/smaller‑scale runs, and increases contributor velocity.
Goals
Modular design: a clear boundary between GPU-specific and TPU-specific code.
Parity path: Run the full Tunix SFT + RL stack on single- and multi‑GPU via XLA:GPU (PJRT), progressing from unit tests to the notebook/script examples (https://github.com/google/tunix/tree/main/examples); see the smoke-test sketch after this list.
Performance-minded: Competitive throughput/latency using bf16/fp16, Flash‑Attention, and fused optimizers.
Simple install: Docker and Conda instructions; reproducible envs for CUDA 12.x/13.x + cuDNN 9 or ROCm.
CI coverage: Smoke tests on single‑GPU; nightly correctness on multi‑GPU via self-hosted runners.
Docs: A clear “Getting Started on GPU” guide and troubleshooting tips.
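A minimal sketch (not Tunix code) of the kind of single‑GPU smoke test the parity goal implies: confirm that JAX sees the GPU through PJRT and that a bf16 op compiles under XLA:GPU.

```python
# Minimal single-GPU smoke test (illustrative; assumes jax[cuda] is installed).
import jax
import jax.numpy as jnp

# PJRT should report the GPU backend, e.g. on an A100/H100/4090.
assert jax.default_backend() == "gpu", f"Expected XLA:GPU, got {jax.default_backend()}"
print("Devices:", jax.devices())

# bf16 matmul compiled through XLA:GPU.
x = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
y = jax.jit(lambda a: a @ a.T)(x)
print("bf16 matmul OK:", y.dtype, y.shape)
```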
Non‑Goals
Non‑JAX training backends (e.g., PyTorch) for core Tunix trainers.
Perfect performance parity with TPU on day one.
US1 (Single‑GPU dev): A researcher with a 4090 runs SFT and minimal GRPO locally.
US2 (Multi‑GPU node): A lab with an 8xH100 node trains 1B–30B models using FSDP + TP (see the mesh sketch after these user stories).
US3 (Multi‑host): A cluster with multiple H100 nodes runs distributed RL with NCCL over InfiniBand/RoCE.
US4 (Eval/Serve): Run RL learning and eval loops on GPU without TPU dependencies.
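For US2, a sketch of what an FSDP + TP layout on a single 8‑GPU node might look like with jax.sharding; the axis names, mesh shape, and weight shape are assumptions for illustration, not Tunix configuration.

```python
# Illustrative FSDP + TP mesh for one 8-GPU node; assumes 8 visible devices.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((4, 2))   # 4-way FSDP x 2-way TP
mesh = Mesh(devices, axis_names=("fsdp", "tp"))

# Shard a hypothetical weight matrix: rows over FSDP, columns over TP.
w = jnp.zeros((8192, 8192), dtype=jnp.bfloat16)
w = jax.device_put(w, NamedSharding(mesh, P("fsdp", "tp")))
print(w.sharding)
```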
R1: Support CUDA 12/13; cuDNN 9.x.
R2: Support bf16 on Ampere/Hopper; fp16 fallback on consumer GPUs if needed.
R3: PJRT runtime on XLA:GPU; NCCL collectives for dp/fsdp/tp (see the multi-host sketch after this list).
R4: Flash‑Attention path (jax-labs/pallas or vendor kernels) for attention-heavy models.
R5: CI smoke tests on single GPU; perf sanity benchmarks; correctness parity on small models.
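For R3, a hedged sketch of multi-host initialization on XLA:GPU; JAX routes cross-device collectives through NCCL, and the coordinator address and process counts below are placeholders to be set per cluster.

```python
# Multi-host setup sketch (run one process per host; values are placeholders).
import jax
import jax.numpy as jnp

jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder: first host's address
    num_processes=2,                      # placeholder: number of hosts
    process_id=0,                         # placeholder: unique per host
)

# Data-parallel all-reduce across every GPU on every host (runs over NCCL).
x = jnp.ones((jax.local_device_count(),))
total = jax.pmap(lambda v: jax.lax.psum(v, axis_name="dp"), axis_name="dp")(x)
print(total)  # each entry equals the global GPU count
```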