# Add Agent Deployment skill for model serving (#1133)
@@ -0,0 +1,268 @@
---
name: deployment
description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when the user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
license: Apache-2.0
---

# Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).
## Quick Start

Prefer `scripts/deploy.sh` for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment. The examples below assume the working directory is the skill directory; from the repository root, use `.claude/skills/deployment/scripts/deploy.sh`.
```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```
The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

## Decision Flow

### 0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
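One way to pick "the workspace from the previous run" is by modification time. This is a sketch only: it assumes each workspace is a plain subdirectory of `$MODELOPT_WORKSPACE_ROOT`, and `latest_workspace` is an illustrative helper, not part of the skill's scripts.

```python
from pathlib import Path
from typing import Optional

def latest_workspace(root: str) -> Optional[Path]:
    """Return the most recently modified workspace directory under root.

    Hypothetical layout: one subdirectory per workspace. Returns None when
    the root does not exist or contains no directories.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    dirs = [d for d in root_path.iterdir() if d.is_dir()]
    return max(dirs, key=lambda d: d.stat().st_mtime, default=None)
```

In practice the matching workspace may also be identified by name (e.g., a model or job id in the directory name) rather than by timestamp alone.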
### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.
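The detection order above can be sketched as a small helper. This is illustrative only (`deploy.sh` does its own detection), and the exact field names inside `hf_quant_config.json` are an assumption that may vary across ModelOpt versions.

```python
import json
from pathlib import Path

def detect_quant_format(checkpoint: str) -> str:
    """Classify a checkpoint per the detection order above.

    Returns a quantization algorithm name (e.g. "FP8", "NVFP4"),
    "modelopt" when only quant_method is recorded, or "unquantized".
    """
    ckpt = Path(checkpoint)
    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        cfg = json.loads(hf_quant.read_text())
        # Assumed layout: {"quantization": {"quant_algo": "FP8", ...}}
        return cfg.get("quantization", {}).get("quant_algo", "unknown")
    config = ckpt / "config.json"
    if config.is_file():
        qcfg = json.loads(config.read_text()).get("quantization_config", {})
        if qcfg.get("quant_method") == "modelopt":
            return qcfg.get("quant_algo", "modelopt")
    return "unquantized"
```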
### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| DeepSeek / Llama 4 models | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.
### 3. Check the environment

**GPU availability:**

```bash
python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
```

**Framework installed?**

```bash
# vLLM
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"

# SGLang
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"

# TRT-LLM
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If the framework is not installed, consult `references/setup.md` for installation instructions.

**GPU memory estimate:**

- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single GPU memory, use tensor parallelism (`--tp <num_gpus>` with deploy.sh, or the framework's tensor-parallel flag).
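The arithmetic above can be sketched as a quick calculation. Assumptions: 80 GB GPUs and the conservative 4 GB overhead figure; `estimate_gpus_needed` is an illustrative helper, and real memory use also depends on context length and KV cache configuration.

```python
import math

# Bytes per parameter for each weight format, per the estimates above
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def estimate_gpus_needed(num_params_b: float, fmt: str,
                         gpu_mem_gb: float = 80.0,
                         overhead_gb: float = 4.0) -> int:
    """Minimum tensor-parallel degree so weights plus overhead fit.

    num_params_b is in billions of parameters; overhead covers KV cache
    and framework buffers.
    """
    weights_gb = num_params_b * BYTES_PER_PARAM[fmt]
    return max(1, math.ceil((weights_gb + overhead_gb) / gpu_mem_gb))
```

For example, an 8B FP8 model (~8 GB weights + 4 GB overhead) fits one 80 GB GPU, while a 70B BF16 model (~140 GB weights) needs at least two.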
### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
  --model <checkpoint_path> \
  --quantization modelopt \
  --tensor-parallel-size <num_gpus> \
  --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
  --model-path <checkpoint_path> \
  --quantization modelopt \
  --tp <num_gpus> \
  --host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.
### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_name>",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }' | python -m json.tool
```

All checks must pass before reporting success to the user.
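When scripting this verification, the completion response can be validated programmatically. A minimal sketch against the standard OpenAI completions response shape (`choices[0].text`); `check_completion` is an illustrative helper, not part of the skill's scripts.

```python
import json

def check_completion(response_body: str) -> str:
    """Validate a /v1/completions response and return the generated text.

    Raises ValueError on missing choices or empty text, so a wrapper
    script fails loudly instead of reporting a broken server as healthy.
    """
    body = json.loads(response_body)
    choices = body.get("choices", [])
    if not choices:
        raise ValueError(f"no choices in response: {body}")
    text = choices[0].get("text", "")
    if not text.strip():
        raise ValueError("empty completion text")
    return text
```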
### 6. Benchmark (optional)

If the user wants throughput/latency numbers, run a quick benchmark:

```bash
# vLLM benchmark
python -m vllm.entrypoints.openai.api_server ... & # if not already running

python -m vllm.benchmark_serving \
  --model <model_name> \
  --port 8000 \
  --num-prompts 100 \
  --request-rate 10
```

Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).
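If summarizing raw per-request timings yourself rather than relying on the benchmark tool's report, p50/p99 are empirical percentiles. A minimal nearest-rank sketch (the timing values below are made-up example data):

```python
def percentile(values, pct):
    """Nearest-rank percentile; sufficient for a quick latency summary."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [120, 135, 128, 400, 131, 125, 142, 138, 129, 133]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The same function applies to TTFT samples; throughput is simply total generated tokens divided by wall-clock time.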
### 7. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

   ```bash
   source .claude/skills/common/remote_exec.sh
   remote_load_cluster
   remote_check_ssh
   remote_detect_env
   ```

2. **Sync the checkpoint** (only if it was produced locally):

   If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

   ```bash
   remote_sync_to <local_checkpoint_path> checkpoints/
   ```

3. **Deploy based on remote environment:**

   - **SLURM** — write a job script that starts the server inside a container, then submit:

     ```bash
     srun --container-image="<container.sqsh>" \
       --container-mounts="<data_root>:<data_root>" \
       python -m vllm.entrypoints.openai.api_server \
         --model <remote_checkpoint_path> \
         --quantization modelopt \
         --host 0.0.0.0 --port 8000
     ```
     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.

   - **Bare metal / Docker** — use `remote_run` to start the server directly:

     ```bash
     remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
     ```

4. **Verify remotely:**

   ```bash
   remote_run "curl -s http://localhost:8000/health"
   remote_run "curl -s http://localhost:8000/v1/models"
   ```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.
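A small sketch of what gets reported back. The SSH local-forward command is the usual pattern for reaching a cluster-internal port from outside; `report_endpoint` and the hostnames are hypothetical, not part of the skill's scripts.

```python
def report_endpoint(node: str, port: int = 8000, login_host: str = "") -> str:
    """Format the endpoint report; include an ssh -L tunnel command when a
    cluster login host is given, since the node port is cluster-internal."""
    lines = [f"Endpoint: http://{node}:{port}"]
    if login_host:
        lines.append(f"Tunnel:   ssh -L {port}:{node}:{port} {login_host}")
    return "\n".join(lines)
```

After tunneling, the user points their client at `http://localhost:8000`.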
For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.5 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
2. Model is listed at `/v1/models`
3. Test generation produces coherent output
4. Server URL and port are reported to the user
5. If benchmarking was requested, throughput/latency numbers are reported
@@ -0,0 +1,17 @@

```json
{
  "skills": ["deployment"],
  "query": "deploy my quantized model on the SLURM cluster",
  "files": [],
  "expected_behavior": [
    "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
    "Sources .claude/skills/common/remote_exec.sh",
    "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
    "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
    "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
    "Starts vLLM/SGLang server inside the container via srun",
    "Gets allocated node hostname from squeue -j $JOBID -o %N",
    "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
    "Reports the remote endpoint (http://<node_hostname>:8000) and notes SLURM network restrictions",
    "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
  ]
}
```
@@ -0,0 +1,20 @@

```json
{
  "skills": ["deployment"],
  "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
  "files": [],
  "expected_behavior": [
    "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
    "Reads hf_quant_config.json and detects FP8 quantization format",
    "Confirms vLLM is the chosen framework",
    "Checks vLLM is installed and version >= 0.10.1",
    "Detects local GPU via nvidia-smi or torch.cuda",
    "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
    "Reads references/vllm.md for deployment instructions",
    "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
    "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
    "Waits for server health check at /health endpoint",
    "Verifies /v1/models lists the model",
    "Sends test generation request to /v1/completions and confirms coherent output",
    "Reports server URL (http://localhost:8000) and port to user"
  ]
}
```
@@ -0,0 +1,85 @@

# Deployment Environment Setup
## Framework Installation

### vLLM

```bash
pip install vllm
```

Minimum version: 0.10.1

### SGLang

```bash
pip install "sglang[all]"
```

Minimum version: 0.5

### TRT-LLM

TRT-LLM is best installed via NVIDIA container:

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
```

Or via pip (requires CUDA toolkit):

```bash
pip install tensorrt-llm
```

Minimum version: 0.17.0
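Checking an installed version against these minimums can be automated with a plain numeric comparison. A sketch only: it handles dotted numeric versions and nothing else (no pre-release or post-release tags, for which a packaging library is more robust).

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, padding to 3 components
    so e.g. "0.17" and "0.17.0" compare equal."""
    def parse(v: str):
        parts = [int(p) for p in v.split(".")]
        while len(parts) < 3:
            parts.append(0)
        return tuple(parts)
    return parse(installed) >= parse(minimum)
```

For example, an installed SGLang 0.4.10 fails a 0.5 minimum because 4 < 5 in the second component, even though 10 > 5 lexically.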
## SLURM Deployment

For SLURM clusters, deploy inside a container. Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<num_gpus>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
  --container-image="<path/to/container.sqsh>" \
  --container-mounts="<data_root>:<data_root>" \
  --container-workdir="<workdir>" \
  --no-container-mount-home \
  bash -c "python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <num_gpus> \
    --host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S"  # Get the node name
# Then SSH tunnel or use the node's hostname directly
```
## Docker Deployment

### vLLM with ModelOpt

A Dockerfile is available at `examples/vllm_serve/Dockerfile`:
```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

# Mount the checkpoint directory so the path exists inside the container
docker run --gpus all -p 8000:8000 \
  -v <checkpoint_dir>:<checkpoint_dir> \
  vllm-modelopt \
  python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```