Flagging this for visibility:
The custom checkpoint helper in this repo re-runs the forward pass during backprop without restoring the RNG state. Every stochastic layer inside the checkpointed block (e.g. dropout) therefore sees a different random mask during recomputation, so the recomputed activations, and the gradients derived from them, no longer match the loss that was actually computed. As a result, enabling gradient checkpointing with non-zero dropout makes the loss diverge.
Code link: nn.py#L124
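Here is a minimal sketch of the failure mode, independent of this repo's helper (plain PyTorch, CPU RNG only): re-running a dropout layer without restoring the RNG state draws a different mask than the original forward pass, while restoring the saved state reproduces it exactly.

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)
x = torch.randn(8)

y_fwd = drop(x)          # mask used in the real forward pass
y_recomputed = drop(x)   # naive recomputation: fresh randomness, new mask
print(torch.equal(y_fwd, y_recomputed))  # almost surely False

# Saving and restoring the RNG state reproduces the exact mask:
state = torch.get_rng_state()
y1 = drop(x)
torch.set_rng_state(state)
y2 = drop(x)
print(torch.equal(y1, y2))  # True
```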
The Colab notebook below isolates the issue using only code from this repo:
Colab notebook
I wrote up more details after running into this while training a large model:
https://almutwakel.com/blog/divergence
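For reference, a minimal sketch of one possible fix, assuming a `torch.autograd.Function`-style helper (the name `RNGSafeCheckpoint` is hypothetical, not the repo's code): capture the RNG state in `forward` and restore it around the recomputation in `backward`, which is what PyTorch's own `torch.utils.checkpoint` does with `preserve_rng_state=True` (its default). This handles the CPU RNG only for brevity; a real fix would also save and restore the per-device CUDA RNG state.

```python
import torch

class RNGSafeCheckpoint(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_fn, x):
        ctx.run_fn = run_fn
        ctx.rng_state = torch.get_rng_state()  # snapshot RNG before the forward
        ctx.save_for_backward(x)
        with torch.no_grad():
            return run_fn(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Fork so restoring the old state doesn't disturb the global RNG,
        # then replay the forward pass under the saved state so dropout
        # applies the exact same mask as the original forward.
        with torch.random.fork_rng(devices=[]):
            torch.set_rng_state(ctx.rng_state)
            with torch.enable_grad():
                x = x.detach().requires_grad_(True)
                out = ctx.run_fn(x)
            (grad_in,) = torch.autograd.grad(out, x, grad_out)
        return None, grad_in
```

Usage would look like `out = RNGSafeCheckpoint.apply(block, x)`, where `block` is the module or function being checkpointed.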