
[XPU] add fused_dropout_add XPU kernel and remove Python fallback#78629

Open
YqGe585 wants to merge 2 commits into PaddlePaddle:develop from YqGe585:xpu-worker1/GEY-23-xpu-precision

Conversation


@YqGe585 YqGe585 commented Apr 10, 2026

PR Category

Custom Device

PR Types

New features

Description

paddle.incubate.nn.functional.fused_dropout_add produces incorrect results on XPU devices. There are two root causes:

Issue 1: the Python-level fallback gives GPU and XPU independent PRNG states

The installed Paddle's fused_dropout_add.py contains a temporary Python fallback that forces both GPU and XPU to execute through paddle.nn.functional.dropout. Each device advances its own random-number state, so the two devices draw different dropout masks, producing a maximum absolute error of 0.66748 (threshold: 0.05). The fallback also emits the warning: "Currently, fused_dropout_add maybe has precision problem, so it falls back to dropout + add."

Issue 2: XPU lacks a fused_dropout_add C++ operator

With the Python fallback removed, calling _C_ops.fused_dropout_add on an XPU device fails with NotFound: kernel fused_dropout_add is not registered, because the operator was never implemented for XPU.

Changes in this PR:

  • Add an XPU forward kernel (fused_dropout_add_kernel.cc): uses xpu::dropout() + xpu::add(), resolves the seed and stores it in seed_offset so the backward pass can reproduce the mask
  • Add an XPU backward kernel (fused_dropout_add_grad_kernel.cc): restores the seed from seed_offset and regenerates the mask to compute gradients
  • Register fused_dropout_add and fused_dropout_add_grad in the XPU2 and XPU3 op lists (FLOAT32, FLOAT16)
  • Remove the Python fallback from fused_dropout_add.py so that all devices call the C++ kernel directly

Verification: after the fix, the XPU kernel executes correctly; the kernel-not-found error and the fallback warning are both gone.
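The seed_offset handshake between the forward and backward kernels can be illustrated with a small self-contained Python sketch (the function names and the plain-Python PRNG here are illustrative stand-ins, not the actual XPU API): the forward pass records the resolved seed, and the backward pass reseeds the same generator to regenerate an identical mask.

```python
import random

def fused_dropout_add_fwd(x, y, p, seed):
    # Hypothetical sketch of the forward pass: draw the dropout mask from a
    # seeded PRNG, scale the kept elements, then add y. The resolved seed is
    # returned (playing the role of seed_offset) so backward can replay it.
    rng = random.Random(seed)
    mask = [1.0 if rng.random() >= p else 0.0 for _ in x]
    scale = 1.0 / (1.0 - p)
    out = [xi * m * scale + yi for xi, m, yi in zip(x, mask, y)]
    return out, seed  # seed saved as "seed_offset" for backward

def fused_dropout_add_bwd(grad_out, p, seed_offset):
    # Backward pass: reseed the same PRNG, regenerate the identical mask,
    # and scale the incoming gradient for d(out)/d(x); d(out)/d(y) is 1.
    rng = random.Random(seed_offset)
    mask = [1.0 if rng.random() >= p else 0.0 for _ in grad_out]
    scale = 1.0 / (1.0 - p)
    grad_x = [g * m * scale for g, m in zip(grad_out, mask)]
    grad_y = list(grad_out)
    return grad_x, grad_y
```

Because backward reconstructs the mask from the saved seed instead of storing the mask tensor, the fused op avoids keeping the mask alive between passes while still producing gradients consistent with the forward dropout pattern.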

Does this change numerical results?

Yes. With the Python fallback removed, XPU devices use the native XPU C++ kernel (xpu::dropout()) instead of the paddle.nn.functional.dropout fallback.
Because GPU uses the Philox4 PRNG while XPU uses the XPU library's own PRNG, the two devices generate different dropout masks even for the same seed. This is an inherent property of stochastic operators across devices and does not affect correctness on a single device.
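The cross-device divergence can be demonstrated in plain Python (both generators below are simple stand-ins, not the actual Philox4 or XPU-library implementations): two different PRNG algorithms seeded identically still emit different number streams, and therefore different dropout masks.

```python
import random

def mask_mt(seed, n, p=0.5):
    # Mersenne Twister (Python's default Random) as a stand-in for
    # one device's PRNG.
    rng = random.Random(seed)
    return [int(rng.random() >= p) for _ in range(n)]

def mask_lcg(seed, n, p=0.5):
    # A 32-bit linear congruential generator as a stand-in for a
    # different device library's PRNG: same seed, different algorithm.
    state = seed & 0xFFFFFFFF
    bits = []
    for _ in range(n):
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        bits.append(int(state / 2**32 >= p))
    return bits

# Same seed, different algorithm: each mask is reproducible on its own
# "device", but the two streams (almost surely) disagree element-wise,
# so a dropout op cannot match bit-for-bit across devices.
print(mask_mt(2026, 16))
print(mask_lcg(2026, 16))
```

This is why tests for stochastic fused ops compare each device against its own reference path (e.g. dropout + add on the same device with the same seed) rather than comparing GPU output to XPU output element-wise.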

YqGe585 added 2 commits April 9, 2026 23:16
…and remove Python fallback

- Add XPU forward/backward kernels for fused_dropout_add
  (paddle/phi/kernels/fusion/xpu/fused_dropout_add_kernel.cc and
   paddle/phi/kernels/fusion/xpu/fused_dropout_add_grad_kernel.cc)
- Register fused_dropout_add in XPU2 and XPU3 op lists (FLOAT32, FLOAT16)
- Remove Python-level fallback in fused_dropout_add.py that was causing
  both GPU and XPU to use paddle.nn.functional.dropout with independent
  PRNG state, producing non-comparable stochastic results

The XPU kernel uses xpu::dropout() with a resolved seed and adds the result
to y. Note: element-wise results differ from GPU due to different PRNG
algorithms (XPU library vs GPU Philox4) — expected for stochastic ops.

paddle-bot bot commented Apr 10, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

