Problem
When trainer.algorithm.max_seq_len is not explicitly set, validate_cfg() computes it as:
max_seq_len = cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
This is incorrect for multi-turn conversations because max_generate_length is a per-turn max_tokens parameter passed to the inference engine for each generation call, not the total response token budget across all turns.
For example, with max_turns=5, max_input_length=2048, and max_generate_length=4096:
- Current auto-calculation: max_seq_len = 2048 + 4096 = 6144
- Actual possible sequence length: up to 2048 + 5 * 4096 = 22528 (prompt plus 5 turns of generation, plus tool outputs/observations between turns)
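A quick arithmetic sketch of the gap (plain Python, using the values from the example above; the true worst case is even larger once tool-output tokens are counted):

max_input_length = 2048
max_generate_length = 4096  # per-turn max_tokens passed to the engine
max_turns = 5

auto_calculated = max_input_length + max_generate_length               # 6144
multi_turn_bound = max_input_length + max_turns * max_generate_length  # 22528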
This means the seq_mean_token_sum_norm (Dr. GRPO) loss reduction normalizes by a value that is too small for multi-turn, which would scale up the loss magnitude for multi-turn sequences relative to what Dr. GRPO intends.
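To make the effect concrete, here is a minimal sketch of a Dr. GRPO-style reduction that sums token losses per sequence and divides by a fixed constant; the signature and names are illustrative, not SkyRL's actual implementation:

import torch

def seq_mean_token_sum_norm(token_loss: torch.Tensor, mask: torch.Tensor, max_seq_len: int) -> torch.Tensor:
    # Sum the per-token loss over each sequence, normalize by a constant
    # (not the sequence's own length), then average over the batch.
    per_seq = (token_loss * mask).sum(dim=-1) / max_seq_len
    return per_seq.mean()

With the auto-calculated constant, a multi-turn rollout that actually spans ~22528 tokens gets divided by 6144, inflating its contribution roughly 22528 / 6144 ≈ 3.7x relative to normalizing by the true bound.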
Where it happens
skyrl/train/utils/utils.py in validate_cfg() (and the mirrored skyrl-train/skyrl_train/utils/utils.py):
if cfg.trainer.algorithm.max_seq_len is None:
    # TODO(Charlie): This calculation is not correct for multi-turn...
    if isinstance(cfg, DictConfig):
        new_cfg = OmegaConf.create(cfg.trainer.algorithm)
        new_cfg.max_seq_len = cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
        cfg.trainer.algorithm = new_cfg
    else:
        cfg.trainer.algorithm.max_seq_len = (
            cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
        )
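Note that the branch only fires when the field is unset, so an explicit value bypasses the auto-calculation entirely. A minimal repro of the computation using OmegaConf directly (hypothetical values, not SkyRL code):

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "trainer": {"algorithm": {"max_seq_len": None}},
    "generator": {
        "max_input_length": 2048,
        "sampling_params": {"max_generate_length": 4096},
    },
})

if cfg.trainer.algorithm.max_seq_len is None:
    cfg.trainer.algorithm.max_seq_len = (
        cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length
    )
print(cfg.trainer.algorithm.max_seq_len)  # 6144, regardless of max_turns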
Current workaround
Users can set trainer.algorithm.max_seq_len explicitly (#1153). For Harbor, this is already done via trainer.algorithm.max_seq_len=$MAX_MODEL_LEN in the example scripts.
Possible fixes
- Require explicit max_seq_len when loss_reduction=seq_mean_token_sum_norm: remove the auto-calculation entirely and error if not set. This is the safest option, since the "correct" value depends on the use case.
- Use the vLLM engine's max_model_len as the default, since that represents the true context window limit.
- Factor max_turns into the auto-calculation, e.g. max_input_length + max_turns * max_generate_length. This is still an approximation, since tool outputs between turns add tokens too, but it's closer to correct.
Option 1 seems cleanest — the auto-calculation was always a convenience heuristic, and for the loss reduction that actually uses this value, users should think about what the right normalization constant is for their setup.
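For illustration, the Option 1 check might look something like this inside validate_cfg(); the config path for loss_reduction and the error wording are assumptions, not a tested patch:

if cfg.trainer.algorithm.max_seq_len is None:
    if cfg.trainer.algorithm.loss_reduction == "seq_mean_token_sum_norm":
        raise ValueError(
            "trainer.algorithm.max_seq_len must be set explicitly when "
            "loss_reduction=seq_mean_token_sum_norm; the previous auto-calculation "
            "(max_input_length + max_generate_length) undercounts multi-turn rollouts."
        )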