1 change: 1 addition & 0 deletions docs/source/Instruction/Command-line-parameters.md
@@ -710,6 +710,7 @@ App parameters inherit from [deployment parameters](#部署参数) and [Web-UI parameters](#Web-UI参数)
- mcore_model: Path to the mcore-format model. Default is None.
- mcore_adapters: List of adapter paths for the mcore-format model. Defaults to an empty list.
- thread_count: Number of model shards when `--to_mcore true` is set. Defaults to None, in which case it is set automatically based on the model size so that the largest shard is smaller than 10GB.
- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False.
- 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
- test_convert_dtype: The dtype used for conversion precision testing. Defaults to 'float32'.
- 🔥push_to_hub: Whether to push to the hub. Default is False. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
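
For illustration, a minimal HF-to-mcore conversion using the parameters above might look like the sketch below. The model ID and output directory are placeholders, and `--model`/`--output_dir` are assumed standard `swift export` arguments; only `--to_mcore` and `--thread_count` are documented above.

```bash
# Convert an HF-format model to mcore format; thread_count caps the shard size.
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --to_mcore true \
    --thread_count 8 \
    --output_dir output/Qwen2.5-7B-Instruct-mcore
```
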
6 changes: 6 additions & 0 deletions docs/source/Instruction/GRPO/GetStarted/GRPO.md
@@ -159,6 +159,12 @@ The GRPO training framework supports integrating high-performance inference engines (such as vLLM) to accelerate sampling
--move_model_batches [num_batches]
```

6. Store the Megatron-exported HF-format weights used for vLLM updates in CPU main memory to reduce GPU memory usage (see the combined sketch after this list):

```bash
--offload_bridge true
```
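
Putting these together, a colocate-mode launch could enable several of the options above at once, as in the sketch below. This is illustrative only: `<model>` and `<dataset>` are placeholders, `--model`/`--dataset` are assumed standard arguments, and the remaining flags come from the list above and the example scripts shipped with this change.

```bash
# Colocate-mode GRPO with several memory-saving options enabled (illustrative).
megatron rlhf \
    --loss_type grpo \
    --model <model> \
    --dataset <dataset> \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --offload_bridge true \
    --move_model_batches 4
```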

### 2. Async(External) Mode

Training and inference resources are separated, and a dedicated inference server is launched.
1 change: 1 addition & 0 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -258,6 +258,7 @@ LoRA training:
- hub_token: Hub token. The ModelScope hub token can be found [here](https://modelscope.cn/my/myaccesstoken). Default is None.
- merge_lora: Whether to store the merged weights. Defaults to None: if `save_safetensors` is set to True, this parameter defaults to True, otherwise to False. That is, by default LoRA is merged when storing in safetensors format and is not merged when storing in torch_dist format.
- max_shard_size: Maximum file size for safetensors-format storage. Defaults to '5GB'.
- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False.

## Training Parameters

2 changes: 2 additions & 0 deletions docs/source/Megatron-SWIFT/Mcore-Bridge.md
@@ -190,6 +190,8 @@ swift infer \
--stream true
```

Tip: If you encounter GPU OOM during vLLM weight updates, you can set `--offload_bridge true` to offload tensors to the CPU and reduce GPU memory usage.

## Export and Conversion Precision Testing

In addition to converting and saving safetensors during training, Mcore-Bridge also provides the `megatron export` command for standalone weight export. `megatron export` can test conversion precision during weight conversion, which is very helpful for verifying correctness when integrating new models. Typically, models already integrated into Megatron-SWIFT do not exhibit precision misalignment, so you can safely set `--test_convert_precision false`.
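
For example, a standalone export with precision testing enabled might look like the sketch below. The model path is a placeholder, and `--model`/`--output_dir` are assumed standard arguments; `--to_mcore` and `--test_convert_precision` are the flags documented in the command-line parameters.

```bash
# Standalone weight export with a conversion-precision check.
megatron export \
    --model <hf-model-path> \
    --to_mcore true \
    --test_convert_precision true \
    --output_dir output/model-mcore
```
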
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -728,6 +728,7 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge arguments](#merge-arguments)
- mcore_model: Path to the mcore format model. Default is None.
- mcore_adapters: List of paths to mcore format model adapters, default is empty list.
- thread_count: The number of model slices when `--to_mcore true` is set. Defaults to None, and is automatically configured based on the model size, ensuring that the largest slice is less than 10GB.
- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False.
- 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
- test_convert_dtype: The dtype used for conversion precision testing, defaults to 'float32'.
- 🔥push_to_hub: Whether to push to the hub, with the default being False. Examples can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
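
As an illustration, a conversion run that also verifies precision might look like the following sketch. The model ID and output directory are placeholders, and `--model`/`--output_dir` are assumed standard `swift export` arguments; the remaining flags are documented above.

```bash
# HF -> mcore conversion, verifying the precision of the converted weights.
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --to_mcore true \
    --thread_count 8 \
    --test_convert_precision true \
    --output_dir output/Qwen2.5-7B-Instruct-mcore
```
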
6 changes: 6 additions & 0 deletions docs/source_en/Instruction/GRPO/GetStarted/GRPO.md
@@ -159,6 +159,12 @@ When running in Colocate mode, out-of-memory (OOM) issues may frequently occur.
--move_model_batches [num_batches]
```

6. Store the Megatron-exported HF-format weights used for vLLM updates in CPU main memory to reduce GPU memory usage (see the combined sketch after this list):

```bash
--offload_bridge true
```
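
Taken together, a colocate-mode launch could apply several of these options at once, as sketched below. This is illustrative only: `<model>` and `<dataset>` are placeholders, `--model`/`--dataset` are assumed standard arguments, and the remaining flags are drawn from the list above and the example scripts shipped with this change.

```bash
# Colocate-mode GRPO with several memory-saving options enabled (illustrative).
megatron rlhf \
    --loss_type grpo \
    --model <model> \
    --dataset <dataset> \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --offload_bridge true \
    --move_model_batches 4
```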

### 2. Async(External) Mode

Training and inference resources are separated, with a dedicated inference server deployed.
1 change: 1 addition & 0 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -275,6 +275,7 @@ LoRA Training:
- hub_token: Hub token. ModelScope hub token can be found [here](https://modelscope.cn/my/myaccesstoken). Default is None.
- merge_lora: Whether to store merged weights. Defaults to None. If `save_safetensors` is set to True, this parameter defaults to `True`; otherwise, it defaults to False. That is, by default, LoRA will be merged when storing in safetensors format; LoRA will not be merged when storing in torch_dist format.
- max_shard_size: Maximum file size for safetensors format storage, defaults to '5GB'.
- 🔥offload_bridge: Store the Megatron-exported HF-format weights used for vLLM weight updates in CPU main memory, reducing GPU memory usage. Default is False.


## Training Parameters
2 changes: 2 additions & 0 deletions docs/source_en/Megatron-SWIFT/Mcore-Bridge.md
@@ -200,6 +200,8 @@ swift infer \
--stream true
```

Tip: If you encounter GPU OOM issues during weight synchronization with vLLM, you can set `--offload_bridge true` to offload intermediate tensors to the CPU and reduce GPU memory usage.

## Export and Conversion Precision Testing

In addition to supporting safetensors conversion and saving during training, Mcore-Bridge also supports the `megatron export` command for standalone weight export. `megatron export` supports conversion precision testing during weight conversion, which is very helpful for verifying accuracy when integrating new models. Typically, models already integrated into Megatron-SWIFT will not have precision misalignment issues, so you can confidently set `--test_convert_precision false`.
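
For example, exporting from an mcore-format checkpoint with a precision check might look like the sketch below. The checkpoint path is a placeholder; `--mcore_model`, `--test_convert_precision`, and `--test_convert_dtype` are documented export arguments, and any additional target-format or output flags your setup requires are assumed.

```bash
# Export from an mcore checkpoint, testing conversion precision in float32.
megatron export \
    --mcore_model <mcore-checkpoint-path> \
    --test_convert_precision true \
    --test_convert_dtype float32
```
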
1 change: 1 addition & 0 deletions examples/megatron/grpo/dense_colocate.sh
@@ -45,6 +45,7 @@ megatron rlhf \
--loss_type grpo \
--sleep_level 2 \
--offload_model true \
--offload_bridge false \
--offload_optimizer true \
--log_interval 1 \
--recompute_granularity selective \
1 change: 1 addition & 0 deletions examples/megatron/grpo/dense_server.sh
@@ -52,6 +52,7 @@ megatron rlhf \
--loss_type grpo \
--sleep_level 2 \
--offload_model true \
--offload_bridge false \
--offload_optimizer true \
--log_interval 1 \
--recompute_granularity selective \
1 change: 1 addition & 0 deletions examples/megatron/grpo/moe_colocate_full.sh
@@ -36,6 +36,7 @@ megatron rlhf \
--loss_type grpo \
--sleep_level 2 \
--offload_model true \
--offload_bridge false \
--offload_optimizer true \
--optimizer_cpu_offload true \
--use_precision_aware_optimizer \
1 change: 1 addition & 0 deletions examples/megatron/grpo/moe_colocate_lora.sh
@@ -36,6 +36,7 @@ megatron rlhf \
--loss_type grpo \
--sleep_level 2 \
--offload_model true \
--offload_bridge false \
--offload_optimizer true \
--log_interval 1 \
--recompute_granularity selective \
1 change: 1 addition & 0 deletions swift/megatron/argument/megatron_args.py
@@ -63,6 +63,7 @@ class RLHFMegatronArgumentsMixin:
    sleep_level: Literal[0, 1, 2] = 0
    offload_optimizer: bool = False
    offload_model: bool = False
    offload_bridge: bool = False

    vllm_server_base_url: Optional[List[str]] = None
    vllm_server_host: Optional[List[str]] = None
5 changes: 4 additions & 1 deletion swift/megatron/trainers/grpo_trainer.py
@@ -222,8 +222,11 @@ def _export_and_load_weights(self):
        For server mode: Process weights in buckets to avoid memory spikes.
        """
        # Export weights returns an iterator
        target_device = None
        if self.args.offload_bridge:
            target_device = 'cpu'
        with profiling_context(self, 'export_weights'):
-            weight_iterator = self.bridge.export_weights(self.unwrapped_models)
+            weight_iterator = self.bridge.export_weights(self.unwrapped_models, target_device=target_device)

        if self.vllm_mode == 'colocate':
            # Colocate mode: load_weights supports iterator, pass directly