
Commit b79c4e1: Address review comments
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent (6e09a4d)

File tree: 7 files changed (+111 −90 lines)

.claude/skills/deployment/SKILL.md

Lines changed: 10 additions & 47 deletions
@@ -1,6 +1,6 @@
 ---
 name: deployment
-description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
+description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
 license: Apache-2.0
 ---

@@ -79,32 +79,23 @@ Check the support matrix in `references/support-matrix.md` to confirm the model
 
 ### 3. Check the environment
 
-**GPU availability:**
+Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.
 
-```bash
-python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
-```
-
-**Framework installed?**
+Then check the **deployment framework** is installed:
 
 ```bash
-# vLLM
 python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
-
-# SGLang
 python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
-
-# TRT-LLM
 python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
 ```
 
-If the framework is not installed, consult `references/setup.md` for installation instructions.
+If not installed, consult `references/setup.md`.
 
-**GPU memory estimate:**
+**GPU memory estimate** (to determine tensor parallelism):
 
-- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
-- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
-- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
+- BF16: `params × 2 bytes` (8B ≈ 16 GB)
+- FP8: `params × 1 byte` (8B ≈ 8 GB)
+- FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
 - Add ~2-4 GB for KV cache and framework overhead
 
 If the model exceeds single GPU memory, use tensor parallelism (`-tp <num_gpus>`).
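The memory rule of thumb above can be sketched as a quick calculation. This is an illustrative helper, not part of the skill's scripts; `estimate_gb` and `min_tensor_parallel` are hypothetical names, and the overhead constant assumes the upper end of the ~2-4 GB range.

```python
# Rough GPU-memory estimate per the bytes-per-parameter rule of thumb.
# Hypothetical helper for illustration; not part of deploy.sh.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}
OVERHEAD_GB = 4  # KV cache + framework overhead (upper end of ~2-4 GB)

def estimate_gb(num_params: float, fmt: str) -> float:
    """Weights-only memory estimate in GB."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

def min_tensor_parallel(num_params: float, fmt: str, gpu_mem_gb: float) -> int:
    """Smallest tp such that per-GPU weight shard plus overhead fits one GPU."""
    tp = 1
    while estimate_gb(num_params, fmt) / tp + OVERHEAD_GB > gpu_mem_gb:
        tp *= 2  # tp is typically a power of two matching the GPU count
    return tp

print(estimate_gb(8e9, "fp8"))                   # 8.0 GB of weights
print(min_tensor_parallel(8e9, "bf16", 24.0))    # 1: 16 + 4 fits in 24 GB
print(min_tensor_parallel(70e9, "bf16", 80.0))   # 2: 140/2 + 4 fits in 80 GB
```

The result feeds directly into the `-tp <num_gpus>` flag mentioned above.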
@@ -179,24 +170,7 @@ curl -s http://localhost:8000/v1/completions \
 
 All checks must pass before reporting success to the user.
 
-### 6. Benchmark (optional)
-
-If the user wants throughput/latency numbers, run a quick benchmark:
-
-```bash
-# vLLM benchmark
-python -m vllm.entrypoints.openai.api_server ... & # if not already running
-
-python -m vllm.benchmark_serving \
-  --model <model_name> \
-  --port 8000 \
-  --num-prompts 100 \
-  --request-rate 10
-```
-
-Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).
-
-### 7. Remote deployment (SSH/SLURM)
+### 6. Remote deployment (SSH/SLURM)
 
 If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

@@ -219,18 +193,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
 
 3. **Deploy based on remote environment:**
 
-   - **SLURM** — write a job script that starts the server inside a container, then submit:
-
-     ```bash
-     srun --container-image="<container.sqsh>" \
-       --container-mounts="<data_root>:<data_root>" \
-       python -m vllm.entrypoints.openai.api_server \
-         --model <remote_checkpoint_path> \
-         --quantization modelopt \
-         --host 0.0.0.0 --port 8000
-     ```
-
-     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.
+   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
 
    - **Bare metal / Docker** — use `remote_run` to start the server directly:

.claude/skills/deployment/references/setup.md

Lines changed: 25 additions & 4 deletions
@@ -70,16 +70,37 @@ squeue -u $USER -o "%j %N %S" # Get the node name
 
 ## Docker Deployment
 
-### vLLM with ModelOpt
+### Official Images (recommended)
 
-A Dockerfile is available at `examples/vllm_serve/Dockerfile`:
+| Framework | Image | Source |
+|-----------|-------|--------|
+| vLLM | `vllm/vllm-openai:latest` | <https://hub.docker.com/r/vllm/vllm-openai> |
+| SGLang | `lmsysorg/sglang:latest` | <https://hub.docker.com/r/lmsysorg/sglang> |
+| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:latest` | <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/> |
+
+Example with the official vLLM image:
+
+```bash
+docker run --gpus all -p 8000:8000 \
+  -v /path/to/checkpoint:/model \
+  vllm/vllm-openai:latest \
+  --model /model \
+  --quantization modelopt \
+  --host 0.0.0.0 --port 8000
+```
+
+### Custom Image (optional)
+
+A Dockerfile is also available at `examples/vllm_serve/Dockerfile` if you need a custom build:
 
 ```bash
 docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .
 
-docker run --gpus all -p 8000:8000 vllm-modelopt \
+docker run --gpus all -p 8000:8000 \
+  -v /path/to/checkpoint:/model \
+  vllm-modelopt \
   python -m vllm.entrypoints.openai.api_server \
-  --model <checkpoint_path> \
+  --model /model \
   --quantization modelopt \
   --host 0.0.0.0 --port 8000
 ```

.claude/skills/deployment/references/support-matrix.md

Lines changed: 8 additions & 1 deletion
@@ -50,9 +50,16 @@
 | SGLang | `quantization="modelopt"` | `quantization="modelopt_fp4"` |
 | TRT-LLM | auto-detected from checkpoint | auto-detected from checkpoint |
 
+## Models not in this list
+
+This matrix covers officially validated combinations. For unlisted models:
+
+1. **Check the framework's own docs** — vLLM and SGLang support many HuggingFace models natively. Use WebSearch to check `vllm supported models` or `sglang supported models`.
+2. **Try it** — if the model uses standard `nn.Linear` layers and has `hf_quant_config.json`, vLLM/SGLang will likely work with `--quantization modelopt`.
+3. **Ask the user** — if unsure, ask: "This model isn't in the validated support matrix. Would you like to try deploying it anyway?"
+
 ## Notes
 
 - **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference.
 - INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
-- Other models/formats may work but are not officially validated.
 - Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`

.claude/skills/deployment/scripts/deploy.sh

Lines changed: 10 additions & 1 deletion
@@ -127,9 +127,18 @@ print(quant_algo)
     return 1
   }
 
-  if echo "$quant_algo" | grep -qi "fp4"; then
+  if echo "$quant_algo" | grep -qi "fp4\|nvfp4"; then
     echo "modelopt_fp4"
+  elif echo "$quant_algo" | grep -qi "fp8"; then
+    echo "modelopt"
+  elif echo "$quant_algo" | grep -qi "int4_awq\|w4a8_awq\|w4a16_awq\|w8a"; then
+    log_error "Quantization format '$quant_algo' is only supported by TRT-LLM, not vLLM/SGLang"
+    log_error "Use --framework trtllm or deploy with TRT-LLM directly"
+    return 1
+  elif [[ -z "$quant_algo" ]]; then
+    echo "none"
   else
+    log_warn "Unknown quant_algo '$quant_algo' — trying --quantization modelopt"
     echo "modelopt"
   fi
 elif [[ -f "$model_path/config.json" ]]; then
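The branch logic added above can be expressed in Python for reference. This sketch mirrors the shell case analysis (empty before the catch-all, `fp4` before `fp8` before the TRT-LLM-only formats); `vllm_quant_flag` is an illustrative name, not an API of the script.

```python
import re

# Formats only TRT-LLM can serve, per the support matrix.
TRTLLM_ONLY = re.compile(r"int4_awq|w4a8_awq|w4a16_awq|w8a", re.IGNORECASE)

def vllm_quant_flag(quant_algo: str) -> str:
    """Map hf_quant_config.json's quant_algo to a vLLM --quantization value,
    mirroring the case analysis in deploy.sh."""
    if not quant_algo:
        return "none"                      # unquantized checkpoint
    if re.search(r"fp4", quant_algo, re.IGNORECASE):
        return "modelopt_fp4"              # matches NVFP4 too
    if re.search(r"fp8", quant_algo, re.IGNORECASE):
        return "modelopt"
    if TRTLLM_ONLY.search(quant_algo):
        raise ValueError(f"{quant_algo!r} is only supported by TRT-LLM")
    return "modelopt"                      # unknown: warn and fall back
```

Note that the shell pattern `fp4\|nvfp4` is redundant (`nvfp4` already contains `fp4`), so a single `fp4` match suffices here.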
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+[
+  {
+    "name": "vllm-fp8-local",
+    "skills": ["deployment"],
+    "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
+    "files": [],
+    "expected_behavior": [
+      "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
+      "Reads hf_quant_config.json and detects FP8 quantization format",
+      "Confirms vLLM is the chosen framework",
+      "Checks vLLM is installed and version >= 0.10.1",
+      "Detects local GPU via nvidia-smi or torch.cuda",
+      "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
+      "Reads references/vllm.md for deployment instructions",
+      "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
+      "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
+      "Waits for server health check at /health endpoint",
+      "Verifies /v1/models lists the model",
+      "Sends test generation request to /v1/completions and confirms coherent output",
+      "Reports server URL (http://localhost:8000) and port to user"
+    ]
+  },
+  {
+    "name": "remote-slurm-deployment",
+    "skills": ["deployment"],
+    "query": "deploy my quantized model on the SLURM cluster",
+    "files": [],
+    "expected_behavior": [
+      "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
+      "Sources .claude/skills/common/remote_exec.sh",
+      "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
+      "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
+      "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
+      "Starts vLLM/SGLang server inside the container via srun",
+      "Gets allocated node hostname from squeue -j $JOBID -o %N",
+      "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
+      "Reports the remote endpoint (http://<node_hostname>:8000) and notes SLURM network restrictions",
+      "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
+    ]
+  },
+  {
+    "name": "unquantized-hf-model",
+    "skills": ["deployment"],
+    "query": "deploy Qwen/Qwen3-0.6B with vLLM",
+    "files": [],
+    "expected_behavior": [
+      "Identifies Qwen/Qwen3-0.6B as a HuggingFace model ID (not a local path)",
+      "Detects no quantization format in the model name — treats as unquantized (BF16)",
+      "Does not pass --quantization flag to vLLM",
+      "Checks vLLM is installed and GPU is available",
+      "Estimates memory: 0.6B params x 2 bytes = ~1.2 GB, fits single GPU",
+      "Starts vLLM server: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000",
+      "Waits for health check at /health endpoint",
+      "Tests generation via /v1/completions and confirms coherent output",
+      "Reports server URL to user"
+    ]
+  }
+]
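Test-case files of this shape can be sanity-checked with a few lines. A minimal sketch, assuming the schema inferred from the entries above (`name`, `skills`, `query`, `files`, `expected_behavior`); `validate_cases` is an illustrative name, not part of the repository:

```python
import json

REQUIRED_KEYS = {"name", "skills", "query", "files", "expected_behavior"}

def validate_cases(raw: str) -> list:
    """Parse a test-case JSON file and return the case names;
    raise ValueError on a malformed entry."""
    names = []
    for case in json.loads(raw):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"{case.get('name', '?')}: missing {sorted(missing)}")
        if not case["expected_behavior"]:
            raise ValueError(f"{case['name']}: expected_behavior is empty")
        names.append(case["name"])
    return names

sample = '[{"name": "t", "skills": ["deployment"], "query": "q", "files": [], "expected_behavior": ["x"]}]'
print(validate_cases(sample))  # ['t']
```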

.claude/skills/deployment/tests/remote-slurm-deployment.json

Lines changed: 0 additions & 17 deletions
This file was deleted.

.claude/skills/deployment/tests/vllm-fp8-local.json

Lines changed: 0 additions & 20 deletions
This file was deleted.

0 commit comments