Commit 7984a23
[megatron] feat: add arg to offload bridged weights to CPU
`offload_bridge` is now a supported option: when enabled, the HF-format weights that Megatron exports for vLLM weight updates are stored in CPU main memory, reducing GPU memory usage. Default is False. Signed-off-by: Hollow Man <[email protected]>
1 parent 57cd99f commit 7984a23

File tree

14 files changed: +29 −1 lines changed

docs/source/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -710,6 +710,7 @@ App arguments inherit from [Deployment arguments](#部署参数), [Web-UI arguments](#Web-UI参数)
 - mcore_model: Path to the mcore-format model. Defaults to None.
 - mcore_adapters: List of adapter paths for the mcore-format model; defaults to an empty list.
 - thread_count: Number of model shards when `--to_mcore true` is set. Defaults to None, in which case it is set automatically based on the model size so that the largest shard is under 10GB.
+- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory to reduce GPU memory usage. Defaults to False.
 - 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Defaults to False.
 - test_convert_dtype: The dtype used for conversion precision testing; defaults to 'float32'.
 - 🔥push_to_hub: Whether to push to the hub; defaults to False. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh)

docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 6 additions & 0 deletions
@@ -159,6 +159,12 @@ The GRPO training framework supports integrating a high-performance inference engine (such as vLLM) to accelerate sampling
 --move_model_batches [number of batches]
 ```

+6. Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory to reduce GPU memory usage:
+
+```bash
+--offload_bridge true
+```
+
 ### 2. Async(External) Mode

 Training and inference resources are separated, and a standalone inference server is launched

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -258,6 +258,7 @@ LoRA training:
 - hub_token: Hub token. The ModelScope hub token can be found [here](https://modelscope.cn/my/myaccesstoken). Defaults to None.
 - merge_lora: Whether to store merged weights. Defaults to None; if `save_safetensors` is set to True, this parameter defaults to `True`, otherwise False. That is, by default LoRA is merged when storing in safetensors format and not merged when storing in torch_dist format.
 - max_shard_size: Maximum file size for safetensors-format storage; defaults to '5GB'.
+- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory to reduce GPU memory usage. Defaults to False.

 ## Training Parameters
263264

docs/source/Megatron-SWIFT/Mcore-Bridge.md

Lines changed: 2 additions & 0 deletions
@@ -190,6 +190,8 @@ swift infer \
 --stream true
 ```

+Tip: If you run into GPU OOM issues during vLLM weight updates, you can set `--offload_bridge true` to offload the tensors to the CPU and reduce GPU memory usage.
+
 ## Export and Conversion Precision Testing

 In addition to supporting safetensors conversion and saving during training, Mcore-Bridge also supports the `megatron export` command for standalone weight export. `megatron export` can test conversion precision during weight conversion, which is very helpful for verifying correctness when integrating new models. Typically, models already integrated into Megatron-SWIFT do not have precision misalignment issues, so you can safely set `--test_convert_precision false`

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -728,6 +728,7 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
 - mcore_model: Path to the mcore format model. Default is None.
 - mcore_adapters: List of paths to mcore format model adapters, default is an empty list.
 - thread_count: The number of model slices when `--to_mcore true` is set. Defaults to None, and is automatically configured based on the model size, ensuring that the largest slice is less than 10GB.
+- 🔥offload_bridge: Store Megatron exported HF format weights for vLLM updates in CPU main memory to reduce GPU memory usage. Default is False.
 - 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
 - test_convert_dtype: The dtype used for conversion precision testing, defaults to 'float32'.
 - 🔥push_to_hub: Whether to push to the hub, with the default being False. Examples can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 6 additions & 0 deletions
@@ -159,6 +159,12 @@ When running in Colocate mode, out-of-memory (OOM) issues may frequently occur.
 --move_model_batches [number of batches]
 ```

+6. Store Megatron exported HF format weights for vLLM updates in CPU main memory to reduce GPU memory usage:
+
+```bash
+--offload_bridge true
+```
+
 ### 2. Async(External) Mode

 Training and inference resources are separated, with a dedicated inference server deployed.
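The colocate-mode mitigations above can be stacked in a single launch. A minimal sketch, using only flags that appear in this commit's example scripts and docs; the model, dataset, parallelism settings, and the batch count are placeholders you would fill in for a real run:

```shell
# Hypothetical colocate GRPO launch combining the memory-saving options.
# Only the flags shown are taken from this commit; everything else
# (model, dataset, parallelism) must be supplied for a real run.
megatron rlhf \
    --loss_type grpo \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --offload_bridge true \
    --move_model_batches 4
```

Each flag trades step latency for peak GPU memory, so it may be worth enabling them one at a time until the OOM disappears.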

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -275,6 +275,7 @@ LoRA Training:
 - hub_token: Hub token. ModelScope hub token can be found [here](https://modelscope.cn/my/myaccesstoken). Default is None.
 - merge_lora: Whether to store merged weights. Defaults to None. If `save_safetensors` is set to True, this parameter defaults to `True`; otherwise, it defaults to False. That is, by default, LoRA will be merged when storing in safetensors format; LoRA will not be merged when storing in torch_dist format.
 - max_shard_size: Maximum file size for safetensors format storage, defaults to '5GB'.
+- 🔥offload_bridge: Store Megatron exported HF format weights for vLLM updates in CPU main memory to reduce GPU memory usage. Default is False.

 ## Training Parameters

docs/source_en/Megatron-SWIFT/Mcore-Bridge.md

Lines changed: 2 additions & 0 deletions
@@ -200,6 +200,8 @@ swift infer \
 --stream true
 ```

+Tip: If you encounter GPU OOM issues during weight synchronization with vLLM, you can set `--offload_bridge true` to offload intermediate tensors to the CPU and reduce GPU memory usage.
+
 ## Export and Conversion Precision Testing

 In addition to supporting safetensors conversion and saving during training, Mcore-Bridge also supports the `megatron export` command for standalone weight export. `megatron export` supports conversion precision testing during weight conversion, which is very helpful for verifying accuracy when integrating new models. Typically, models already integrated into Megatron-SWIFT will not have precision misalignment issues, so you can confidently set `--test_convert_precision false`.
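Since `offload_bridge` is listed among the export arguments in this commit, it can presumably also be passed to a standalone `megatron export` run. A hedged sketch, assuming that combination is supported; the checkpoint path is a placeholder:

```shell
# Hypothetical standalone export; the --mcore_model path is a placeholder.
# --test_convert_precision true is useful when integrating a new model.
megatron export \
    --mcore_model ./mcore-checkpoint \
    --offload_bridge true \
    --test_convert_precision true
```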

examples/megatron/grpo/dense_colocate.sh

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ megatron rlhf \
 --loss_type grpo \
 --sleep_level 2 \
 --offload_model true \
+--offload_bridge false \
 --offload_optimizer true \
 --log_interval 1 \
 --recompute_granularity selective \

examples/megatron/grpo/dense_server.sh

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ megatron rlhf \
 --loss_type grpo \
 --sleep_level 2 \
 --offload_model true \
+--offload_bridge false \
 --offload_optimizer true \
 --log_interval 1 \
 --recompute_granularity selective \
