
Conversation

Contributor

@nil0x9 nil0x9 commented Dec 5, 2025

  1. use reduced token stats for exp_tgs;
  2. rename some variables (drop the reduced_ prefix; use total_ instead for clarity);
  3. drop maxvio stats in loss_log, as they are already covered by internal metrics.

self._reduced_consumed_tokens += reduced_step_consumed_tokens

self._exp_consumed_tokens += step_consumed_tokens
self._total_consumed_tokens += reduced_step_consumed_tokens
Collaborator
Be careful to keep the naming here consistent with how loss is named in the code and in the logs.
For example, total_loss means the total loss on a single rank (llm, balance, zloss), while reduced_llm_loss means the llm_loss aggregated across ranks.
So I'd suggest either keeping the original reduced_consumed_tokens name, or renaming the loss variables to match the convention used here.

Contributor Author

There is indeed an ambiguity here. In the current codebase, total_ is used for sums over the time dimension (e.g., total_[step|epoch|iter]) as well as for sums over the spatial dimension (total_GPU_per_node), and total_loss means the aggregate of several loss terms, which is rather confusing. It feels like we need a good naming convention that separates these consistently.

The _reduced_consumed_tokens here actually carries both meanings at once, accumulation over time and aggregation across ranks, which is why it feels odd. Let me think about how to improve this in the PR.

Contributor Author

> Be careful to keep the naming here consistent with how loss is named in the code and in the logs. For example, total_loss means the total loss on a single rank (llm, balance, zloss), while reduced_llm_loss means the llm_loss aggregated across ranks. So I'd suggest either keeping the original reduced_consumed_tokens name, or renaming the loss variables to match the convention used here.

I have now renamed total_loss to local_loss, and along the way also renamed a key called loss in the original extra_log_info (for pt/sft that value is the ce_loss before reduce_sum, which is very easy to misread on tensorboard).

Ideally, step vs. whole training run, single rank vs. all ranks, and individual term vs. sum would each get its own distinguishing prefix, but then every variable name would become very verbose. In my humble opinion the best we can do is settle for some reasonable omissions, as long as the context stays clear (a sketch of the resulting log keys follows this list):

  1. For loss, the sum of the individual terms can in principle just be called loss, and the individual terms are distinguished as xx_loss;
  2. For both loss and tokens, we need to distinguish the per-rank value from the global sum or mean; since per-rank logs are rare, a name without a special prefix always means the reduced version, and the per-rank value gets a local_ prefix;
  3. On the time axis we still need to distinguish per-step statistics from accumulated ones, mainly for tokens (loss has no such need): step_ means the current step and total_ means the whole training run.
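
To make this concrete, here is a minimal sketch of how the log dicts could be assembled under these rules. It assumes a torch.distributed process group is already initialized; the function and argument names (build_log_dicts, local_loss, etc.) are illustrative, not the actual identifiers in the codebase:

import torch
import torch.distributed as dist

def build_log_dicts(local_loss, llm_loss, step_consumed_tokens, total_consumed_tokens):
    # Spatial aggregation: average llm_loss over all ranks.
    reduced_llm_loss = llm_loss.detach().clone()
    dist.all_reduce(reduced_llm_loss.div_(dist.get_world_size()))

    # Spatial aggregation: sum this step's token count over all ranks.
    reduced_step_tokens = step_consumed_tokens.detach().clone()
    dist.all_reduce(reduced_step_tokens)

    loss_log = {
        "local_loss": local_loss.item(),      # per-rank sum of llm/balance/z loss
        "llm_loss": reduced_llm_loss.item(),  # no prefix, i.e. reduced (averaged) across ranks
    }
    other_log = {
        "step_consumed_tokens": int(reduced_step_tokens.item()),                           # this step, all ranks
        "total_consumed_tokens": total_consumed_tokens + int(reduced_step_tokens.item()),  # whole run, all ranks
    }
    return loss_log, other_log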

Collaborator

Good! To summarize, the prefix rules are:
Spatially, per-rank values use local_; the default (no prefix) means reduced.
Temporally, use step_ and total_.
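
A tiny illustrative helper (hypothetical, not part of the codebase) that encodes these two rules when building log keys:

def tag(metric, per_rank=False, cumulative=False):
    """Build a log key: local_ marks per-rank values, step_/total_ mark the time scope."""
    space = "local_" if per_rank else ""        # no prefix means reduced across ranks
    time = "total_" if cumulative else "step_"  # time scope, mainly used for token counters
    return f"{space}{time}{metric}"

assert tag("consumed_tokens") == "step_consumed_tokens"
assert tag("consumed_tokens", cumulative=True) == "total_consumed_tokens"
assert tag("consumed_tokens", per_rank=True) == "local_step_consumed_tokens"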

@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from b9d33dc to 1edbcea on December 5, 2025 at 12:20
dist.all_reduce(reduced_z_loss.div_(dist.get_world_size()))
loss_log["reduced_z_loss"] = reduced_z_loss.item()
other_log["consumed_tokens"] = step_consumed_tokens.item()
other_log["consumed_tokens"] = cast(int, step_consumed_tokens.item())
Collaborator

rename to other_log["step_consumed_tokens"] to keep consistent

dist.all_reduce(reduced_z_loss.div_(dist.get_world_size()))
loss_log["reduced_z_loss"] = reduced_z_loss.item()
other_log["consumed_tokens"] = step_consumed_tokens.item()
other_log["consumed_tokens"] = cast(int, step_consumed_tokens.item())
Collaborator

rename to other_log["step_consumed_tokens"]

self["max_ratio"] = torch.max(self["max_ratio"], dim=-1).values
max_ratio_value = self["max_ratio"].item()
return_dict["max_ratio"] = max_ratio_value
if "log_rank_loss" in self:
Collaborator

rename log_rank_loss to local_base_loss

Contributor Author

OK

@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from 1edbcea to 721275f on December 26, 2025 at 07:21
1. use reduced token stats for exp_tgs;
2. rename some variables (drop the reduced_ prefix; use total_ instead for clarity);
3. drop maxvio stats in loss_log, as they are already covered by internal metrics.
@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from f2410be to 9624cad on December 26, 2025 at 12:15
@jayhenry jayhenry merged commit 53ff052 into InternLM:main Dec 27, 2025
4 checks passed