The wiki recommends a batch size of 128 for 'stable training'.
It would be helpful to have the option to accumulate gradients, so that bicleaner-ai training with a larger "effective batch size" would be possible on GPUs with a relatively small amount of RAM.
Fairseq calls this option "--update-freq"
Sockeye calls this option "--update-interval"
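For illustration, here is a minimal conceptual sketch of what such an option does (plain Python, not bicleaner-ai's actual TensorFlow code; the function names and the `update_freq` parameter are made up for this example). Gradients are computed per micro-batch and accumulated, and the optimizer step is applied only every `update_freq` micro-batches using the averaged gradient, which mimics training on one batch `update_freq` times larger:

```python
# Hypothetical sketch of gradient accumulation; not bicleaner-ai code.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error d/dw mean((w*x - y)^2) for one batch."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def train(w, micro_batches, lr, update_freq):
    """SGD where the update is applied every `update_freq` micro-batches."""
    acc, count = 0.0, 0
    for xs, ys in micro_batches:
        acc += grad_mse(w, xs, ys)  # accumulate instead of stepping now
        count += 1
        if count == update_freq:
            w -= lr * (acc / update_freq)  # apply the averaged gradient
            acc, count = 0.0, 0
    return w

# Two micro-batches of 2 with update_freq=2 match one full batch of 4:
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w_full = train(1.0, [(xs, ys)], lr=0.01, update_freq=1)
w_accum = train(1.0, [(xs[:2], ys[:2]), (xs[2:], ys[2:])], lr=0.01, update_freq=2)
assert abs(w_full - w_accum) < 1e-9
```

With equal-sized micro-batches the averaged accumulated gradient equals the full-batch gradient, so the memory footprint per step shrinks while the effective batch size stays the same.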