
Commit b79c4e1: Address review comments
Signed-off-by: Kai Xu <kaix@nvidia.com>
1 parent (6e09a4d)

File tree: 7 files changed (+111 −90 lines)

.claude/skills/deployment/SKILL.md

Lines changed: 10 additions & 47 deletions
@@ -1,6 +1,6 @@
 ---
 name: deployment
-description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
+description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).
 license: Apache-2.0
 ---

@@ -79,32 +79,23 @@ Check the support matrix in `references/support-matrix.md` to confirm the model
 
 ### 3. Check the environment
 
-**GPU availability:**
+Read `skills/common/environment-setup.md` for GPU detection, local vs remote, and SLURM/Docker/bare metal detection. After completing it you should know: GPU model/count, local or remote, and execution environment.
 
-```bash
-python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
-```
-
-**Framework installed?**
+Then check the **deployment framework** is installed:
 
 ```bash
-# vLLM
 python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"
-
-# SGLang
 python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"
-
-# TRT-LLM
 python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
 ```
 
-If the framework is not installed, consult `references/setup.md` for installation instructions.
+If not installed, consult `references/setup.md`.
 
-**GPU memory estimate:**
+**GPU memory estimate** (to determine tensor parallelism):
 
-- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
-- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
-- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
+- BF16: `params × 2 bytes` (8B ≈ 16 GB)
+- FP8: `params × 1 byte` (8B ≈ 8 GB)
+- FP4: `params × 0.5 bytes` (8B ≈ 4 GB)
 - Add ~2-4 GB for KV cache and framework overhead
 
 If the model exceeds single GPU memory, use tensor parallelism (`-tp <num_gpus>`).
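The memory rule of thumb above can be sketched as a quick calculation. This is an illustrative helper, not part of the skill's scripts; `estimate_gb` and `min_tensor_parallel` are hypothetical names, and the overhead constant assumes the upper end of the ~2-4 GB range.

```python
# Rough GPU-memory estimate per the bytes-per-parameter rule of thumb.
# Hypothetical helper for illustration; not part of deploy.sh.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}
OVERHEAD_GB = 4  # KV cache + framework overhead (upper end of ~2-4 GB)

def estimate_gb(num_params: float, fmt: str) -> float:
    """Weights-only memory estimate in GB."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

def min_tensor_parallel(num_params: float, fmt: str, gpu_mem_gb: float) -> int:
    """Smallest tp such that per-GPU weight shard plus overhead fits one GPU."""
    tp = 1
    while estimate_gb(num_params, fmt) / tp + OVERHEAD_GB > gpu_mem_gb:
        tp *= 2  # tp is typically a power of two matching the GPU count
    return tp

print(estimate_gb(8e9, "fp8"))                   # 8.0 GB of weights
print(min_tensor_parallel(8e9, "bf16", 24.0))    # 1: 16 + 4 fits in 24 GB
print(min_tensor_parallel(70e9, "bf16", 80.0))   # 2: 140/2 + 4 fits in 80 GB
```

The result feeds directly into the `-tp <num_gpus>` flag mentioned above.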
@@ -179,24 +170,7 @@ curl -s http://localhost:8000/v1/completions \
 
 All checks must pass before reporting success to the user.
 
-### 6. Benchmark (optional)
-
-If the user wants throughput/latency numbers, run a quick benchmark:
-
-```bash
-# vLLM benchmark
-python -m vllm.entrypoints.openai.api_server ... & # if not already running
-
-python -m vllm.benchmark_serving \
-  --model <model_name> \
-  --port 8000 \
-  --num-prompts 100 \
-  --request-rate 10
-```
-
-Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).
-
-### 7. Remote deployment (SSH/SLURM)
+### 6. Remote deployment (SSH/SLURM)
 
 If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

@@ -219,18 +193,7 @@ If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clust
 
 3. **Deploy based on remote environment:**
 
-   - **SLURM** — write a job script that starts the server inside a container, then submit:
-
-     ```bash
-     srun --container-image="<container.sqsh>" \
-       --container-mounts="<data_root>:<data_root>" \
-       python -m vllm.entrypoints.openai.api_server \
-         --model <remote_checkpoint_path> \
-         --quantization modelopt \
-         --host 0.0.0.0 --port 8000
-     ```
-
-     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.
+   - **SLURM** — see `skills/common/slurm-setup.md` for job script templates (container setup, account/partition discovery). The server command inside the container is the same as Step 4 (e.g., `python -m vllm.entrypoints.openai.api_server --model <path> --quantization modelopt`). Use `remote_submit_job` and `remote_poll_job` to manage the job. Get the node hostname from `squeue -j $JOBID -o %N`.
 
    - **Bare metal / Docker** — use `remote_run` to start the server directly:

.claude/skills/deployment/references/setup.md

Lines changed: 25 additions & 4 deletions
@@ -70,16 +70,37 @@ squeue -u $USER -o "%j %N %S" # Get the node name
 
 ## Docker Deployment
 
-### vLLM with ModelOpt
+### Official Images (recommended)
 
-A Dockerfile is available at `examples/vllm_serve/Dockerfile`:
+| Framework | Image | Source |
+|-----------|-------|--------|
+| vLLM | `vllm/vllm-openai:latest` | <https://hub.docker.com/r/vllm/vllm-openai> |
+| SGLang | `lmsysorg/sglang:latest` | <https://hub.docker.com/r/lmsysorg/sglang> |
+| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:latest` | <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/> |
+
+Example with the official vLLM image:
+
+```bash
+docker run --gpus all -p 8000:8000 \
+  -v /path/to/checkpoint:/model \
+  vllm/vllm-openai:latest \
+  --model /model \
+  --quantization modelopt \
+  --host 0.0.0.0 --port 8000
+```
+
+### Custom Image (optional)
+
+A Dockerfile is also available at `examples/vllm_serve/Dockerfile` if you need a custom build:
 
 ```bash
 docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .
 
-docker run --gpus all -p 8000:8000 vllm-modelopt \
+docker run --gpus all -p 8000:8000 \
+  -v /path/to/checkpoint:/model \
+  vllm-modelopt \
   python -m vllm.entrypoints.openai.api_server \
-  --model <checkpoint_path> \
+  --model /model \
   --quantization modelopt \
   --host 0.0.0.0 --port 8000
 ```

.claude/skills/deployment/references/support-matrix.md

Lines changed: 8 additions & 1 deletion
@@ -50,9 +50,16 @@
 | SGLang | `quantization="modelopt"` | `quantization="modelopt_fp4"` |
 | TRT-LLM | auto-detected from checkpoint | auto-detected from checkpoint |
 
+## Models not in this list
+
+This matrix covers officially validated combinations. For unlisted models:
+
+1. **Check the framework's own docs** — vLLM and SGLang support many HuggingFace models natively. Use WebSearch to check `vllm supported models` or `sglang supported models`.
+2. **Try it** — if the model uses standard `nn.Linear` layers and has `hf_quant_config.json`, vLLM/SGLang will likely work with `--quantization modelopt`.
+3. **Ask the user** — if unsure, ask: "This model isn't in the validated support matrix. Would you like to try deploying it anyway?"
+
 ## Notes
 
 - **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference.
 - INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
-- Other models/formats may work but are not officially validated.
 - Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`

.claude/skills/deployment/scripts/deploy.sh

Lines changed: 10 additions & 1 deletion
@@ -127,9 +127,18 @@ print(quant_algo)
     return 1
   }
 
-  if echo "$quant_algo" | grep -qi "fp4"; then
+  if echo "$quant_algo" | grep -qi "fp4\|nvfp4"; then
     echo "modelopt_fp4"
+  elif echo "$quant_algo" | grep -qi "fp8"; then
+    echo "modelopt"
+  elif echo "$quant_algo" | grep -qi "int4_awq\|w4a8_awq\|w4a16_awq\|w8a"; then
+    log_error "Quantization format '$quant_algo' is only supported by TRT-LLM, not vLLM/SGLang"
+    log_error "Use --framework trtllm or deploy with TRT-LLM directly"
+    return 1
+  elif [[ -z "$quant_algo" ]]; then
+    echo "none"
   else
+    log_warn "Unknown quant_algo '$quant_algo' — trying --quantization modelopt"
     echo "modelopt"
   fi
 elif [[ -f "$model_path/config.json" ]]; then
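The branch logic added above can be expressed in Python for reference. This sketch mirrors the shell case analysis (empty before the catch-all, `fp4` before `fp8` before the TRT-LLM-only formats); `vllm_quant_flag` is an illustrative name, not an API of the script.

```python
import re

# Formats only TRT-LLM can serve, per the support matrix.
TRTLLM_ONLY = re.compile(r"int4_awq|w4a8_awq|w4a16_awq|w8a", re.IGNORECASE)

def vllm_quant_flag(quant_algo: str) -> str:
    """Map hf_quant_config.json's quant_algo to a vLLM --quantization value,
    mirroring the case analysis in deploy.sh."""
    if not quant_algo:
        return "none"                      # unquantized checkpoint
    if re.search(r"fp4", quant_algo, re.IGNORECASE):
        return "modelopt_fp4"              # matches NVFP4 too
    if re.search(r"fp8", quant_algo, re.IGNORECASE):
        return "modelopt"
    if TRTLLM_ONLY.search(quant_algo):
        raise ValueError(f"{quant_algo!r} is only supported by TRT-LLM")
    return "modelopt"                      # unknown: warn and fall back
```

Note that the shell pattern `fp4\|nvfp4` is redundant (`nvfp4` already contains `fp4`), so a single `fp4` match suffices here.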
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+[
+  {
+    "name": "vllm-fp8-local",
+    "skills": ["deployment"],
+    "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
+    "files": [],
+    "expected_behavior": [
+      "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
+      "Reads hf_quant_config.json and detects FP8 quantization format",
+      "Confirms vLLM is the chosen framework",
+      "Checks vLLM is installed and version >= 0.10.1",
+      "Detects local GPU via nvidia-smi or torch.cuda",
+      "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
+      "Reads references/vllm.md for deployment instructions",
+      "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
+      "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
+      "Waits for server health check at /health endpoint",
+      "Verifies /v1/models lists the model",
+      "Sends test generation request to /v1/completions and confirms coherent output",
+      "Reports server URL (http://localhost:8000) and port to user"
+    ]
+  },
+  {
+    "name": "remote-slurm-deployment",
+    "skills": ["deployment"],
+    "query": "deploy my quantized model on the SLURM cluster",
+    "files": [],
+    "expected_behavior": [
+      "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
+      "Sources .claude/skills/common/remote_exec.sh",
+      "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
+      "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
+      "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
+      "Starts vLLM/SGLang server inside the container via srun",
+      "Gets allocated node hostname from squeue -j $JOBID -o %N",
+      "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
+      "Reports the remote endpoint (http://<node_hostname>:8000) and notes SLURM network restrictions",
+      "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
+    ]
+  },
+  {
+    "name": "unquantized-hf-model",
+    "skills": ["deployment"],
+    "query": "deploy Qwen/Qwen3-0.6B with vLLM",
+    "files": [],
+    "expected_behavior": [
+      "Identifies Qwen/Qwen3-0.6B as a HuggingFace model ID (not a local path)",
+      "Detects no quantization format in the model name — treats as unquantized (BF16)",
+      "Does not pass --quantization flag to vLLM",
+      "Checks vLLM is installed and GPU is available",
+      "Estimates memory: 0.6B params x 2 bytes = ~1.2 GB, fits single GPU",
+      "Starts vLLM server: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000",
+      "Waits for health check at /health endpoint",
+      "Tests generation via /v1/completions and confirms coherent output",
+      "Reports server URL to user"
+    ]
+  }
+]
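Test-case files of this shape can be sanity-checked with a few lines. A minimal sketch, assuming the schema inferred from the entries above (`name`, `skills`, `query`, `files`, `expected_behavior`); `validate_cases` is an illustrative name, not part of the repository:

```python
import json

REQUIRED_KEYS = {"name", "skills", "query", "files", "expected_behavior"}

def validate_cases(raw: str) -> list:
    """Parse a test-case JSON file and return the case names;
    raise ValueError on a malformed entry."""
    names = []
    for case in json.loads(raw):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"{case.get('name', '?')}: missing {sorted(missing)}")
        if not case["expected_behavior"]:
            raise ValueError(f"{case['name']}: expected_behavior is empty")
        names.append(case["name"])
    return names

sample = '[{"name": "t", "skills": ["deployment"], "query": "q", "files": [], "expected_behavior": ["x"]}]'
print(validate_cases(sample))  # ['t']
```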

.claude/skills/deployment/tests/remote-slurm-deployment.json

Lines changed: 0 additions & 17 deletions
This file was deleted.

.claude/skills/deployment/tests/vllm-fp8-local.json

Lines changed: 0 additions & 20 deletions
This file was deleted.

0 commit comments