**docs/source/BestPractices/NPU-support.md** (93 additions, 16 deletions)
# NPU Support

We have added Ascend NPU support to ms-swift, so users can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy models on Ascend NPUs.

## Installation

Base environment requirements:

| Software | Version |
| --------- | --------------- |
| Python | >= 3.10, < 3.12 |
| CANN | == 8.3.RC1 |
| torch | == 2.7.1 |
| torch_npu | == 2.7.1 |


For base environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
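
After installing CANN, its environment variables must be loaded in each new shell before anything runs on the NPU. A minimal sketch, assuming the default installation path:

```shell
# Load CANN environment variables (default install path assumed)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Confirm the driver can see the NPUs
npu-smi info
```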


## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (devices provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for supporting ModelScope and Swift~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (reduces memory usage; training speed will drop somewhat)
pip install deepspeed

# If you need the evaluation functionality, install the following package
pip install evalscope[opencompass]
```
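
To confirm that the installed versions match the table above, a quick check (a sketch; `torch_npu` exposes `__version__` just like torch):

```shell
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"
```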

Check that the environment is installed correctly and that the NPU can be loaded properly:

```python
from transformers.utils import is_torch_npu_available
import torch

print(is_torch_npu_available())  # True
print(torch.npu.device_count())  # 8
print(torch.randn(10, device='npu:0'))
```

Check the P2P connections between the NPUs; here each NPU is connected to every other NPU through 7 HCCS links:

```shell
(valle) root@valle:~/src# npu-smi info -t topo
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
...
Legend:
...
```

Check the NPU status; for a detailed explanation of the npu-smi command, see the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668):

```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
...
```
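
The `ASCEND_RT_VISIBLE_DEVICES` variable used throughout this document restricts which NPUs a process can see, analogous to `CUDA_VISIBLE_DEVICES`. A quick sketch (the device indices are an example):

```shell
# Expose only NPU 0 and 1; torch.npu then reports 2 devices
ASCEND_RT_VISIBLE_DEVICES=0,1 python -c "import torch, torch_npu; print(torch.npu.device_count())"
```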

## Fine-tuning

The following describes LoRA fine-tuning; for full-parameter fine-tuning, simply set `--train_type full`.

| Model Size | #NPUs | Deepspeed Type | Max Memory Usage |
| -------- | ------- | ------------- | -------------- |
| 7B | 1 | None | 1 * 28 GB |
| 7B | 4 | None | 4 * 22 GB |
| 7B | 4 | zero2 | 4 * 28 GB |
| 7B | 4 | zero3 | 4 * 22 GB |
| 7B | 8 | None | 8 * 22 GB |
| 14B | 1 | None | 1 * 45 GB |
| 14B | 8 | None | 8 * 51 GB |
| 14B | 8 | zero2 | 8 * 49 GB |
| 14B | 8 | zero3 | 8 * 31 GB |

### Single-Card Training
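
A minimal single-card LoRA run can be sketched as follows; the dataset and hyperparameters are illustrative assumptions, not prescribed values (the memory figure comes from the table above):

```shell
# Experiment environment: 1 * Ascend 910B3
# Memory requirement: 1 * 28 GB
# NOTE: dataset and hyperparameters below are illustrative
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 1 \
    --output_dir output
```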



### Data-Parallel Training

We use 4 of the cards for DDP training:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# NOTE: illustrative sketch; dataset and hyperparameters are assumptions
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/blossom-math-v2 \
    --output_dir output
```

### Deepspeed Training

ZeRO2:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# NOTE: illustrative sketch; --deepspeed zero2 is the essential flag here
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/blossom-math-v2 \
    --deepspeed zero2 \
    --output_dir output
```

ZeRO3:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# NOTE: illustrative sketch; --deepspeed zero3 is the essential flag here
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --train_type lora \
    --dataset AI-ModelScope/blossom-math-v2 \
    --deepspeed zero3 \
    --output_dir output
```

## Inference

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen2-7B-Instruct \
--stream true --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--adapters xxx/checkpoint-xxx --load_data_args true \
    --stream true --max_new_tokens 2048

# merge-lora and infer (checkpoint paths are placeholders)
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xxx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift infer --model xxx/checkpoint-xxx-merged --stream true --max_new_tokens 2048
```


## Deployment

NPUs do not support vllm-accelerated inference/deployment, but you can deploy with native PyTorch.

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge-lora and deploy
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
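
The deployed service exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming the default port 8000 and that the served model name matches the deployed checkpoint:

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
```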

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families | Strategy | Hardware |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT | Qwen2.5-0.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-1.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-Omni-3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families | Strategy | Rollout Engine | Hardware |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **GRPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **DPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **DPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **PPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **PPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Fully Verified on NPUs

| Item |
| ---- |
| Liger-kernel |
| Quantization / QLoRA |
| Megatron-related |
| Using sglang as the inference engine |

**docs/source_en/BestPractices/NPU-support.md** (63 additions, 0 deletions)
# NPU Support

We have added Ascend NPU support to ms-swift, so you can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on NPUs.

## Installation

Base environment requirements:

| Software | Version |
| --------- | --------------- |
| Python | >= 3.10, < 3.12 |
| CANN | == 8.3.RC1 |
| torch | == 2.7.1 |
| torch_npu | == 2.7.1 |

For detailed environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (devices provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for supporting ModelScope and Swift~)
```shell
...
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (to control memory usage, training speed might decrease)
pip install deepspeed

# If you need the evaluation functionality, please install the following package
pip install evalscope[opencompass]
```

Check that the environment is installed correctly and that the NPU can be loaded properly:

```python
from transformers.utils import is_torch_npu_available
import torch

print(is_torch_npu_available())  # True
print(torch.npu.device_count())  # 8
print(torch.randn(10, device='npu:0'))
```

...

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge-lora and deploy
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
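
As a quick smoke test of a deployment, the OpenAI-compatible model-list endpoint can be queried; the port below assumes the default:

```shell
curl http://127.0.0.1:8000/v1/models
```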

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families | Strategy | Hardware |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT | Qwen2.5-0.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-1.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-Omni-3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families | Strategy | Rollout Engine | Hardware |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **GRPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **DPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **DPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **PPO** | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| **PPO** | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Fully Verified on NPUs

| Item |
| ------------------------ |
| Liger-kernel |
| Quantization/QLoRA |
| Megatron-related modules |
| Using sglang as the inference engine |