.claude/skills/deployment/SKILL.md (10 additions & 47 deletions)
@@ -1,6 +1,6 @@
 ---
 name: deployment
-description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
+description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
 license: Apache-2.0
 ---
@@ -79,32 +79,23 @@ Check the support matrix in `references/support-matrix.md` to confirm the model
 ### 3. Check the environment

-**GPU availability:**
+Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.

-```bash
-python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
-```
-
-**Framework installed?**
+Then check that the **deployment framework** is installed:

 ```bash
-# vLLM
 python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
-
-# SGLang
 python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
-
-# TRT-LLM
 python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
 ```

-If the framework is not installed, consult `references/setup.md` for installation instructions.

 All checks must pass before reporting success to the user.
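The three per-framework probes above can be folded into a single selection step. A minimal sketch, assuming a preference order of vLLM, then SGLang, then TRT-LLM (the order is an assumption, not a project convention):

```shell
# Pick the first importable serving framework; preference order is an assumption.
FRAMEWORK=""
for mod in vllm sglang tensorrt_llm; do
    if python -c "import $mod" 2>/dev/null; then
        FRAMEWORK="$mod"
        break
    fi
done
echo "selected framework: ${FRAMEWORK:-none}"
```

On a machine where none of the three is installed, this prints `selected framework: none`, which maps to the "consult setup instructions" branch.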

-### 6. Benchmark (optional)
-
-If the user wants throughput/latency numbers, run a quick benchmark:
-
-```bash
-# vLLM benchmark
-python -m vllm.entrypoints.openai.api_server ... &  # if not already running
-
-python -m vllm.benchmark_serving \
-    --model <model_name> \
-    --port 8000 \
-    --num-prompts 100 \
-    --request-rate 10
-```
-
-Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).
-
-### 7. Remote deployment (SSH/SLURM)
+### 6. Remote deployment (SSH/SLURM)

 If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:
@@ -219,18 +193,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
 3. **Deploy based on remote environment:**

-- **SLURM** — write a job script that starts the server inside a container, then submit:
-
-```bash
-srun --container-image="<container.sqsh>" \
-     --container-mounts="<data_root>:<data_root>" \
-     python -m vllm.entrypoints.openai.api_server \
-         --model <remote_checkpoint_path> \
-         --quantization modelopt \
-         --host 0.0.0.0 --port 8000
-```
-
-Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.
+- **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
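End to end, the SLURM submit-and-locate flow can be sketched with plain `sbatch`/`squeue`. The job script name `serve.sbatch` and the port are hypothetical placeholders, and the block skips itself on hosts without SLURM:

```shell
# Submit the job script and capture the job ID; skip cleanly when SLURM is absent.
if command -v sbatch >/dev/null 2>&1; then
    JOBID=$(sbatch --parsable serve.sbatch)
    # Poll (bounded, not forever) until the job reaches the RUNNING state.
    for _ in $(seq 1 60); do
        [ "$(squeue -j "$JOBID" -h -o %T)" = "RUNNING" ] && break
        sleep 5
    done
    # Look up the allocated node, as in the step above.
    NODE=$(squeue -j "$JOBID" -h -o %N)
    echo "expect the endpoint at http://${NODE}:8000/v1"
else
    echo "no SLURM on this host"
fi
```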

 - **Bare metal / Docker** — use `remote_run` to start the server directly:
 | TRT-LLM | auto-detected from checkpoint | auto-detected from checkpoint |

+## Models not in this list
+
+This matrix covers officially validated combinations. For unlisted models:
+
+1. **Check the framework's own docs** — vLLM and SGLang support many HuggingFace models natively. Use WebSearch to check `vllm supported models` or `sglang supported models`.
+2. **Try it** — if the model uses standard `nn.Linear` layers and has `hf_quant_config.json`, vLLM/SGLang will likely work with `--quantization modelopt`.
+3. **Ask the user** — if unsure, ask: "This model isn't in the validated support matrix. Would you like to try deploying it anyway?"
+
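The `hf_quant_config.json` probe in step 2 can be scripted. A small sketch, where `CKPT` is a hypothetical environment variable pointing at the checkpoint directory:

```shell
# CKPT is a hypothetical path to the checkpoint directory under test.
CKPT="${CKPT:-./checkpoint}"
if [ -f "$CKPT/hf_quant_config.json" ]; then
    echo "ModelOpt quant config found; try --quantization modelopt"
else
    echo "no hf_quant_config.json; deploy as a plain HuggingFace checkpoint"
fi
```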
 ## Notes

 - **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference.
 - INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
-- Other models/formats may work but are not officially validated.
 - Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`
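The Blackwell requirement in the first note can be checked programmatically. A sketch, under the assumption that Blackwell parts report CUDA compute capability 10.x through `torch.cuda.get_device_capability` (torch may be absent, which the fallback handles):

```shell
# Report whether the local GPU can run NVFP4 inference; "Blackwell = sm_10x" is an assumption.
python - <<'EOF' 2>/dev/null || echo "torch not installed"
import torch
if not torch.cuda.is_available():
    print("no-gpu")
else:
    major, minor = torch.cuda.get_device_capability(0)
    print("nvfp4-ok" if major >= 10 else f"nvfp4-unsupported (sm_{major}{minor})")
EOF
```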