
[XPU] add fused_dropout_add XPU kernel and remove Python fallback#78629

Open
YqGe585 wants to merge 2 commits into PaddlePaddle:develop from YqGe585:xpu-worker1/GEY-23-xpu-precision

Conversation


@YqGe585 YqGe585 commented Apr 10, 2026

PR Category

Custom Device

PR Types

New features

Description

paddle.incubate.nn.functional.fused_dropout_add produces incorrect results on XPU devices. There are two root causes:

Issue 1: the Python-level fallback gives GPU and XPU independent PRNG states

The installed Paddle's fused_dropout_add.py contains a temporary Python fallback that forces both GPU and XPU to execute through paddle.nn.functional.dropout. Each device advances its own random-number state, so the two devices draw different dropout masks, producing a maximum absolute error of 0.66748 (threshold: 0.05). The fallback also emits the warning: "Currently, fused_dropout_add maybe has precision problem, so it falls back to dropout + add."

Issue 2: XPU lacks a fused_dropout_add C++ operator

With the Python fallback removed, calling _C_ops.fused_dropout_add on an XPU device fails with NotFound: kernel fused_dropout_add is not registered, because the operator was never implemented for XPU.

Changes in this PR:

  • Add an XPU forward kernel (fused_dropout_add_kernel.cc): uses xpu::dropout() + xpu::add(), resolves the seed and stores it in seed_offset so the backward pass can reproduce the mask
  • Add an XPU backward kernel (fused_dropout_add_grad_kernel.cc): restores the seed from seed_offset and regenerates the mask to compute gradients
  • Register fused_dropout_add and fused_dropout_add_grad in the XPU2 and XPU3 op lists (FLOAT32, FLOAT16)
  • Remove the Python fallback from fused_dropout_add.py so that all devices call the C++ kernel directly

Verification: after the fix, the XPU kernel executes correctly; the kernel-not-found error and the fallback warning are both gone.
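The seed_offset handshake between the forward and backward kernels can be illustrated with a small self-contained Python sketch (the function names and the plain-Python PRNG here are illustrative stand-ins, not the actual XPU API): the forward pass records the resolved seed, and the backward pass reseeds the same generator to regenerate an identical mask.

```python
import random

def fused_dropout_add_fwd(x, y, p, seed):
    # Hypothetical sketch of the forward pass: draw the dropout mask from a
    # seeded PRNG, scale the kept elements, then add y. The resolved seed is
    # returned (playing the role of seed_offset) so backward can replay it.
    rng = random.Random(seed)
    mask = [1.0 if rng.random() >= p else 0.0 for _ in x]
    scale = 1.0 / (1.0 - p)
    out = [xi * m * scale + yi for xi, m, yi in zip(x, mask, y)]
    return out, seed  # seed saved as "seed_offset" for backward

def fused_dropout_add_bwd(grad_out, p, seed_offset):
    # Backward pass: reseed the same PRNG, regenerate the identical mask,
    # and scale the incoming gradient for d(out)/d(x); d(out)/d(y) is 1.
    rng = random.Random(seed_offset)
    mask = [1.0 if rng.random() >= p else 0.0 for _ in grad_out]
    scale = 1.0 / (1.0 - p)
    grad_x = [g * m * scale for g, m in zip(grad_out, mask)]
    grad_y = list(grad_out)
    return grad_x, grad_y
```

Because backward reconstructs the mask from the saved seed instead of storing the mask tensor, the fused op avoids keeping the mask alive between passes while still producing gradients consistent with the forward dropout pattern.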

Does this change numerical results?

Yes. With the Python fallback removed, XPU devices use the native XPU C++ kernel (xpu::dropout()) instead of the paddle.nn.functional.dropout fallback.
Because GPU uses the Philox4 PRNG while XPU uses the XPU library's own PRNG, the two devices generate different dropout masks even for the same seed. This is an inherent property of stochastic operators across devices and does not affect correctness on a single device.
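The cross-device divergence can be demonstrated in plain Python (both generators below are simple stand-ins, not the actual Philox4 or XPU-library implementations): two different PRNG algorithms seeded identically still emit different number streams, and therefore different dropout masks.

```python
import random

def mask_mt(seed, n, p=0.5):
    # Mersenne Twister (Python's default Random) as a stand-in for
    # one device's PRNG.
    rng = random.Random(seed)
    return [int(rng.random() >= p) for _ in range(n)]

def mask_lcg(seed, n, p=0.5):
    # A 32-bit linear congruential generator as a stand-in for a
    # different device library's PRNG: same seed, different algorithm.
    state = seed & 0xFFFFFFFF
    bits = []
    for _ in range(n):
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        bits.append(int(state / 2**32 >= p))
    return bits

# Same seed, different algorithm: each mask is reproducible on its own
# "device", but the two streams (almost surely) disagree element-wise,
# so a dropout op cannot match bit-for-bit across devices.
print(mask_mt(2026, 16))
print(mask_lcg(2026, 16))
```

This is why tests for stochastic fused ops compare each device against its own reference path (e.g. dropout + add on the same device with the same seed) rather than comparing GPU output to XPU output element-wise.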

YqGe585 added 2 commits April 9, 2026 23:16
…and remove Python fallback

- Add XPU forward/backward kernels for fused_dropout_add
  (paddle/phi/kernels/fusion/xpu/fused_dropout_add_kernel.cc and
   paddle/phi/kernels/fusion/xpu/fused_dropout_add_grad_kernel.cc)
- Register fused_dropout_add in XPU2 and XPU3 op lists (FLOAT32, FLOAT16)
- Remove Python-level fallback in fused_dropout_add.py that was causing
  both GPU and XPU to use paddle.nn.functional.dropout with independent
  PRNG state, producing non-comparable stochastic results

The XPU kernel uses xpu::dropout() with a resolved seed and adds the result
to y. Note: element-wise results differ from GPU due to different PRNG
algorithms (XPU library vs GPU Philox4) — expected for stochastic ops.

paddle-bot bot commented Apr 10, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

