
Conversation

Contributor

@nil0x9 nil0x9 commented Dec 5, 2025

  1. use reduced token stats for exp_tgs;
  2. rename some variables (drop the reduced_ prefix; use total_ instead for clarity);
  3. drop maxvio stats in loss_log, as they are already covered by internal metrics.

self._reduced_consumed_tokens += reduced_step_consumed_tokens

self._exp_consumed_tokens += step_consumed_tokens
self._total_consumed_tokens += reduced_step_consumed_tokens
Collaborator
Be careful to keep the naming here consistent with how loss is named in the code and in the logs.
For example, total_loss means the total loss on a single rank (llm, balance, zloss), while reduced_llm_loss means the llm_loss aggregated across ranks.
So I'd suggest either keeping the original reduced_consumed_tokens name, or renaming the loss variables to match the convention used here.

Contributor Author

There is indeed an ambiguity here. In the current codebase, total_ is used for sums over the time dimension (e.g., total_[step|epoch|iter]) as well as for sums over the spatial dimension (total_GPU_per_node), and total_loss means the aggregate of several loss terms, which is rather confusing. It feels like we need a good naming convention that separates these consistently.

The _reduced_consumed_tokens here actually carries both meanings at once, accumulation over time and aggregation across ranks, which is why it feels odd. Let me think about how to improve this in the PR.

Contributor Author

> Be careful to keep the naming here consistent with how loss is named in the code and in the logs. For example, total_loss means the total loss on a single rank (llm, balance, zloss), while reduced_llm_loss means the llm_loss aggregated across ranks. So I'd suggest either keeping the original reduced_consumed_tokens name, or renaming the loss variables to match the convention used here.

I have now renamed total_loss to local_loss, and along the way also renamed a key called loss in the original extra_log_info (for pt/sft that value is the ce_loss before reduce_sum, which is very easy to misread on tensorboard).

Ideally, step vs. whole training run, single rank vs. all ranks, and individual term vs. sum would each get its own distinguishing prefix, but then every variable name would become very verbose. In my humble opinion the best we can do is settle for some reasonable omissions, as long as the context stays clear (a sketch of the resulting log keys follows this list):

  1. For loss, the sum of the individual terms can in principle just be called loss, and the individual terms are distinguished as xx_loss;
  2. For both loss and tokens, we need to distinguish the per-rank value from the global sum or mean; since per-rank logs are rare, a name without a special prefix always means the reduced version, and the per-rank value gets a local_ prefix;
  3. On the time axis we still need to distinguish per-step statistics from accumulated ones, mainly for tokens (loss has no such need): step_ means the current step and total_ means the whole training run.
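
To make this concrete, here is a minimal sketch of how the log dicts could be assembled under these rules. It assumes a torch.distributed process group is already initialized; the function and argument names (build_log_dicts, local_loss, etc.) are illustrative, not the actual identifiers in the codebase:

import torch
import torch.distributed as dist

def build_log_dicts(local_loss, llm_loss, step_consumed_tokens, total_consumed_tokens):
    # Spatial aggregation: average llm_loss over all ranks.
    reduced_llm_loss = llm_loss.detach().clone()
    dist.all_reduce(reduced_llm_loss.div_(dist.get_world_size()))

    # Spatial aggregation: sum this step's token count over all ranks.
    reduced_step_tokens = step_consumed_tokens.detach().clone()
    dist.all_reduce(reduced_step_tokens)

    loss_log = {
        "local_loss": local_loss.item(),      # per-rank sum of llm/balance/z loss
        "llm_loss": reduced_llm_loss.item(),  # no prefix, i.e. reduced (averaged) across ranks
    }
    other_log = {
        "step_consumed_tokens": int(reduced_step_tokens.item()),                           # this step, all ranks
        "total_consumed_tokens": total_consumed_tokens + int(reduced_step_tokens.item()),  # whole run, all ranks
    }
    return loss_log, other_log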

Collaborator

Good! To summarize, the prefix rules are:
Spatially, per-rank values use local_; the default (no prefix) means reduced.
Temporally, use step_ and total_.
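
A tiny illustrative helper (hypothetical, not part of the codebase) that encodes these two rules when building log keys:

def tag(metric, per_rank=False, cumulative=False):
    """Build a log key: local_ marks per-rank values, step_/total_ mark the time scope."""
    space = "local_" if per_rank else ""        # no prefix means reduced across ranks
    time = "total_" if cumulative else "step_"  # time scope, mainly used for token counters
    return f"{space}{time}{metric}"

assert tag("consumed_tokens") == "step_consumed_tokens"
assert tag("consumed_tokens", cumulative=True) == "total_consumed_tokens"
assert tag("consumed_tokens", per_rank=True) == "local_step_consumed_tokens"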

@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from b9d33dc to 1edbcea on December 5, 2025 at 12:20
dist.all_reduce(reduced_z_loss.div_(dist.get_world_size()))
loss_log["reduced_z_loss"] = reduced_z_loss.item()
other_log["consumed_tokens"] = step_consumed_tokens.item()
other_log["consumed_tokens"] = cast(int, step_consumed_tokens.item())
Collaborator

rename to other_log["step_consumed_tokens"] to keep consistent

dist.all_reduce(reduced_z_loss.div_(dist.get_world_size()))
loss_log["reduced_z_loss"] = reduced_z_loss.item()
other_log["consumed_tokens"] = step_consumed_tokens.item()
other_log["consumed_tokens"] = cast(int, step_consumed_tokens.item())
Collaborator

rename to other_log["step_consumed_tokens"]

self["max_ratio"] = torch.max(self["max_ratio"], dim=-1).values
max_ratio_value = self["max_ratio"].item()
return_dict["max_ratio"] = max_ratio_value
if "log_rank_loss" in self:
Collaborator

rename log_rank_loss to local_base_loss

Contributor Author

OK

@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from 1edbcea to 721275f on December 26, 2025 at 07:21
1. use reduced token stats for exp_tgs;
2. rename some variables (drop the reduced_ prefix; use total_ instead for clarity);
3. drop maxvio stats in loss_log, as they are already covered by internal metrics.
@nil0x9 nil0x9 force-pushed the linty/dev-enhance-tok-logs branch from f2410be to 9624cad on December 26, 2025 at 12:15
@jayhenry jayhenry merged commit 53ff052 into InternLM:main Dec 27, 2025
4 checks passed