Commit 67719f4

[npu] add supplementary content to the npu quick start doc (#6727)
1 parent 0c41385 commit 67719f4

File tree

2 files changed: +156, -16 lines


docs/source/BestPractices/NPU-support.md

Lines changed: 93 additions & 16 deletions
@@ -1,25 +1,48 @@
# NPU Support

We have added Ascend NPU support to ms-swift, so users can fine-tune and run inference with models on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on Ascend NPUs.

## Installation

Base environment requirements:

| software  | version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For base environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
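The Python version constraint in the table above can be expressed as a small helper for preflight checks. This is an illustrative sketch; the function name `in_supported_range` is our own, not part of ms-swift:

```python
import sys

def in_supported_range(major: int, minor: int) -> bool:
    """Return True if a Python (major, minor) version satisfies >= 3.10, < 3.12."""
    return (3, 10) <= (major, minor) < (3, 12)

# Check the interpreter currently running this script.
print(in_supported_range(*sys.version_info[:2]))
```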
## Environment Preparation

Experimental environment: 8 * Ascend 910B3 64G (devices provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for supporting ModelScope and Swift~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (reduces memory usage; training speed drops somewhat)
pip install deepspeed

# If you need the evaluation functionality, install the following package
pip install evalscope[opencompass]
```

Test whether the environment is installed correctly and the NPU can be loaded:

```python
from transformers.utils import is_torch_npu_available
import torch
@@ -30,6 +53,7 @@ print(torch.randn(10, device='npu:0'))
```

Check the P2P connections between the NPUs. Here each NPU is interconnected with the other NPUs through 7 HCCS links:

```shell
(valle) root@valle:~/src# npu-smi info -t topo
       NPU0       NPU1       NPU2       NPU3       NPU4       NPU5       NPU6       NPU7       CPU Affinity
@@ -54,6 +78,7 @@ Legend:
```

Check the NPU status; for a detailed explanation of the npu-smi command, see the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668):

```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
@@ -89,19 +114,20 @@ Legend:
```

## Fine-tuning

The following describes LoRA fine-tuning; for full-parameter fine-tuning, simply set `--train_type full`.

| Model size | Number of NPUs | deepspeed type | Max memory usage |
| ---------- | -------------- | -------------- | ---------------- |
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |

### Single-Card Training

@@ -128,6 +154,7 @@ swift sft \


### Data Parallel Training

We use 4 of the cards for DDP training:

```shell
@@ -150,6 +177,7 @@ swift sft \
### Deepspeed Training

ZeRO2:

```shell
# Experimental environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28GB
@@ -168,6 +196,7 @@ swift sft \
```

ZeRO3:

```shell
# Experimental environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
@@ -189,13 +218,15 @@ swift sft \
## Inference

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
@@ -211,18 +242,64 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift infer \


## Deployment

NPUs do not support using vllm to accelerate inference/deployment, but you can deploy with native PyTorch.

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge LoRA and run inference
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
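A deployed `swift deploy` server can then be queried with an OpenAI-style chat request. The sketch below assumes the default address `127.0.0.1:8000` and the `/v1/chat/completions` route; both are assumptions for illustration, not values stated in this document:

```python
import json
import urllib.request

# Build an OpenAI-style chat-completions request for the deployed model.
# The endpoint URL and model name below are assumptions for illustration.
payload = {
    "model": "Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "max_tokens": 128,
}
request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running, uncomment to send the request:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(request.full_url)
```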

## Current Support Status

### Table 1: SFT Algorithms

| algorithm | model families              | strategy              | hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| algorithm | model families      | strategy  | rollout engine | hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| item                                 |
| ------------------------------------ |
| Liger-kernel                         |
| Quantization / QLoRA                 |
| Megatron-related modules             |
| Using sglang as the inference engine |

docs/source_en/BestPractices/NPU-support.md

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,22 @@
# NPU Support

We have added Ascend NPU support to ms-swift, so you can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For detailed environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (the devices are provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for supporting ModelScope and Swift~)
@@ -17,6 +34,9 @@ pip install ms-swift -U
1734
pip install torch-npu decorator
1835
# If you want to use deepspeed (to control memory usage, training speed might decrease)
1936
pip install deepspeed
37+
38+
# If you need the evaluation functionality, please install the following package
39+
pip install evalscope[opencompass]
2040
```
2141

2242
Check if the test environment is installed correctly and whether the NPU can be loaded properly.
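To keep scripts portable between NPU and CPU hosts, the availability check can be wrapped in a small device-selection helper. This is a sketch; the name `pick_device` is hypothetical and not part of the document:

```python
import importlib.util

def pick_device() -> str:
    """Return 'npu:0' when torch_npu is installed and reports an available NPU,
    otherwise fall back to 'cpu'. A sketch; real scripts may also probe CUDA."""
    if importlib.util.find_spec("torch_npu") is not None:
        import torch
        import torch_npu  # noqa: F401  (registers the 'npu' device with torch)
        if torch.npu.is_available():
            return "npu:0"
    return "cpu"

print(pick_device())
```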
@@ -221,3 +241,46 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                                 |
| ------------------------------------ |
| Liger-kernel                         |
| Quantization / QLoRA                 |
| Megatron-related modules             |
| Using sglang as the inference engine |
