Commit 67719f4

[npu] add supplementary content to the npu quick start doc (#6727)
1 parent 0c41385 commit 67719f4

File tree

2 files changed: +156, -16 lines


docs/source/BestPractices/NPU-support.md

Lines changed: 93 additions & 16 deletions
@@ -1,25 +1,48 @@
# NPU Support

We have added Ascend NPU support to ms-swift, so users can fine-tune and run inference with models on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on Ascend NPUs.

## Installation

Base environment requirements:

| software  | version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For base environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
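The Python version constraint in the table above can be expressed as a small helper for preflight checks. This is an illustrative sketch; the function name `in_supported_range` is our own, not part of ms-swift:

```python
import sys

def in_supported_range(major: int, minor: int) -> bool:
    """Return True if a Python (major, minor) version satisfies >= 3.10, < 3.12."""
    return (3, 10) <= (major, minor) < (3, 12)

# Check the interpreter currently running this script.
print(in_supported_range(*sys.version_info[:2]))
```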
## Environment Preparation

Experimental environment: 8 * Ascend 910B3 64G (devices provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for supporting ModelScope and Swift~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (reduces memory usage; training speed drops somewhat)
pip install deepspeed

# If you need the evaluation functionality, install the following package
pip install evalscope[opencompass]
```

Test whether the environment is installed correctly and the NPU can be loaded:

```python
from transformers.utils import is_torch_npu_available
import torch
@@ -30,6 +53,7 @@ print(torch.randn(10, device='npu:0'))
```

Check the P2P connections between the NPUs. Here each NPU is interconnected with the other NPUs through 7 HCCS links:

```shell
(valle) root@valle:~/src# npu-smi info -t topo
       NPU0       NPU1       NPU2       NPU3       NPU4       NPU5       NPU6       NPU7       CPU Affinity
@@ -54,6 +78,7 @@ Legend:
```

Check the NPU status; for a detailed explanation of the npu-smi command, see the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668):

```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
@@ -89,19 +114,20 @@ Legend:
```

## Fine-tuning

The following describes LoRA fine-tuning; for full-parameter fine-tuning, simply set `--train_type full`.

| Model size | Number of NPUs | deepspeed type | Max memory usage |
| ---------- | -------------- | -------------- | ---------------- |
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |

### Single-Card Training

@@ -128,6 +154,7 @@ swift sft \


### Data Parallel Training

We use 4 of the cards for DDP training:

```shell
@@ -150,6 +177,7 @@ swift sft \
### Deepspeed Training

ZeRO2:

```shell
# Experimental environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28GB
@@ -168,6 +196,7 @@ swift sft \
```

ZeRO3:

```shell
# Experimental environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
@@ -189,13 +218,15 @@ swift sft \
## Inference

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
@@ -211,18 +242,64 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift infer \


## Deployment

NPUs do not support using vllm to accelerate inference/deployment, but you can deploy with native PyTorch.

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge LoRA and run inference
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
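A deployed `swift deploy` server can then be queried with an OpenAI-style chat request. The sketch below assumes the default address `127.0.0.1:8000` and the `/v1/chat/completions` route; both are assumptions for illustration, not values stated in this document:

```python
import json
import urllib.request

# Build an OpenAI-style chat-completions request for the deployed model.
# The endpoint URL and model name below are assumptions for illustration.
payload = {
    "model": "Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "max_tokens": 128,
}
request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, uncomment to send the request:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(request.full_url)
```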

## Current Support Status

### Table 1: SFT Algorithms

| algorithm | model families              | strategy              | hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| algorithm | model families      | strategy  | rollout engine | hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| item                                 |
| ------------------------------------ |
| Liger-kernel                         |
| Quantization / QLoRA                 |
| Megatron-related modules             |
| Using sglang as the inference engine |

docs/source_en/BestPractices/NPU-support.md

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,22 @@
# NPU Support

We have added Ascend NPU support to ms-swift, so you can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For detailed environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (the devices are provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for supporting ModelScope and Swift~)
@@ -17,6 +34,9 @@ pip install ms-swift -U
1734
pip install torch-npu decorator
1835
# If you want to use deepspeed (to control memory usage, training speed might decrease)
1936
pip install deepspeed
37+
38+
# If you need the evaluation functionality, please install the following package
39+
pip install evalscope[opencompass]
2040
```
2141

2242
Check if the test environment is installed correctly and whether the NPU can be loaded properly.
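To keep scripts portable between NPU and CPU hosts, the availability check can be wrapped in a small device-selection helper. This is a sketch; the name `pick_device` is hypothetical and not part of the document:

```python
import importlib.util

def pick_device() -> str:
    """Return 'npu:0' when torch_npu is installed and reports an available NPU,
    otherwise fall back to 'cpu'. A sketch; real scripts may also probe CUDA."""
    if importlib.util.find_spec("torch_npu") is not None:
        import torch
        import torch_npu  # noqa: F401  (registers the 'npu' device with torch)
        if torch.npu.is_available():
            return "npu:0"
    return "cpu"

print(pick_device())
```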
@@ -221,3 +241,46 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                                 |
| ------------------------------------ |
| Liger-kernel                         |
| Quantization / QLoRA                 |
| Megatron-related modules             |
| Using sglang as the inference engine |
