Problem
When trainer.algorithm.max_seq_len is not explicitly set, validate_cfg() computes it as:
max_seq_len = cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
This is incorrect for multi-turn conversations because max_generate_length is a per-turn max_tokens parameter passed to the inference engine for each generation call, not the total response token budget across all turns.
For example, with max_turns=5, max_input_length=2048, and max_generate_length=4096:
- Current auto-calculation: max_seq_len = 2048 + 4096 = 6144
- Actual possible sequence length: up to 2048 + 5 * 4096 = 22528 (prompt plus 5 turns of generation, plus tool outputs/observations between turns)
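A quick arithmetic sketch of the gap (plain Python, using the values from the example above; the true worst case is even larger once tool-output tokens are counted):

max_input_length = 2048
max_generate_length = 4096  # per-turn max_tokens passed to the engine
max_turns = 5

auto_calculated = max_input_length + max_generate_length               # 6144
multi_turn_bound = max_input_length + max_turns * max_generate_length  # 22528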
This means the seq_mean_token_sum_norm (Dr. GRPO) loss reduction normalizes by a value that is too small for multi-turn, which would scale up the loss magnitude for multi-turn sequences relative to what Dr. GRPO intends.
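To make the effect concrete, here is a minimal sketch of a Dr. GRPO-style reduction that sums token losses per sequence and divides by a fixed constant; the signature and names are illustrative, not SkyRL's actual implementation:

import torch

def seq_mean_token_sum_norm(token_loss: torch.Tensor, mask: torch.Tensor, max_seq_len: int) -> torch.Tensor:
    # Sum the per-token loss over each sequence, normalize by a constant
    # (not the sequence's own length), then average over the batch.
    per_seq = (token_loss * mask).sum(dim=-1) / max_seq_len
    return per_seq.mean()

With the auto-calculated constant, a multi-turn rollout that actually spans ~22528 tokens gets divided by 6144, inflating its contribution roughly 22528 / 6144 ≈ 3.7x relative to normalizing by the true bound.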
Where it happens
skyrl/train/utils/utils.py in validate_cfg() (and the mirrored skyrl-train/skyrl_train/utils/utils.py):
if cfg.trainer.algorithm.max_seq_len is None:
    # TODO(Charlie): This calculation is not correct for multi-turn...
    if isinstance(cfg, DictConfig):
        new_cfg = OmegaConf.create(cfg.trainer.algorithm)
        new_cfg.max_seq_len = cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
        cfg.trainer.algorithm = new_cfg
    else:
        cfg.trainer.algorithm.max_seq_len = (
            cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
        )
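Note that the branch only fires when the field is unset, so an explicit value bypasses the auto-calculation entirely. A minimal repro of the computation using OmegaConf directly (hypothetical values, not SkyRL code):

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "trainer": {"algorithm": {"max_seq_len": None}},
    "generator": {
        "max_input_length": 2048,
        "sampling_params": {"max_generate_length": 4096},
    },
})

if cfg.trainer.algorithm.max_seq_len is None:
    cfg.trainer.algorithm.max_seq_len = (
        cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
    )
print(cfg.trainer.algorithm.max_seq_len)  # 6144, regardless of max_turns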
Current workaround
Users can set trainer.algorithm.max_seq_len explicitly (#1153). For Harbor, this is already done via trainer.algorithm.max_seq_len=$MAX_MODEL_LEN in the example scripts.
Possible fixes
- Require explicit max_seq_len when loss_reduction=seq_mean_token_sum_norm: remove the auto-calculation entirely and error if not set. This is the safest option, since the "correct" value depends on the use case.
- Use the vLLM engine's max_model_len as the default, since that represents the true context window limit.
- Factor max_turns into the auto-calculation, e.g. max_input_length + max_turns * max_generate_length. This is still an approximation, since tool outputs between turns add tokens too, but it's closer to correct.
Option 1 seems cleanest — the auto-calculation was always a convenience heuristic, and for the loss reduction that actually uses this value, users should think about what the right normalization constant is for their setup.
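For illustration, the Option 1 check might look something like this inside validate_cfg(); the config path for loss_reduction and the error wording are assumptions, not a tested patch:

if cfg.trainer.algorithm.max_seq_len is None:
    if cfg.trainer.algorithm.loss_reduction == "seq_mean_token_sum_norm":
        raise ValueError(
            "trainer.algorithm.max_seq_len must be set explicitly when "
            "loss_reduction=seq_mean_token_sum_norm; the previous auto-calculation "
            "(max_input_length + max_generate_length) undercounts multi-turn rollouts."
        )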