# Add Agent Deployment skill for model serving (#1133)
@@ -0,0 +1,268 @@
---
name: deployment
description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when the user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint.
license: Apache-2.0
---

# Deployment Skill

Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).
## Quick Start

Prefer `scripts/deploy.sh` for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment. The examples below assume the working directory is the skill directory; from the repository root, use `.claude/skills/deployment/scripts/deploy.sh`.
```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```
The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing.

## Decision Flow

### 0. Check workspace (multi-user / Slack bot)

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.
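One way to pick "the workspace from the previous run" is by modification time. This is a sketch only: it assumes each workspace is a plain subdirectory of `$MODELOPT_WORKSPACE_ROOT`, and `latest_workspace` is an illustrative helper, not part of the skill's scripts.

```python
from pathlib import Path
from typing import Optional

def latest_workspace(root: str) -> Optional[Path]:
    """Return the most recently modified workspace directory under root.

    Hypothetical layout: one subdirectory per workspace. Returns None when
    the root does not exist or contains no directories.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    dirs = [d for d in root_path.iterdir() if d.is_dir()]
    return max(dirs, key=lambda d: d.stat().st_mtime, default=None)
```

In practice the matching workspace may also be identified by name (e.g., a model or job id in the directory name) rather than by timestamp alone.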
### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.
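The detection order above can be sketched as a small helper. This is illustrative only (`deploy.sh` does its own detection), and the exact field names inside `hf_quant_config.json` are an assumption that may vary across ModelOpt versions.

```python
import json
from pathlib import Path

def detect_quant_format(checkpoint: str) -> str:
    """Classify a checkpoint per the detection order above.

    Returns a quantization algorithm name (e.g. "FP8", "NVFP4"),
    "modelopt" when only quant_method is recorded, or "unquantized".
    """
    ckpt = Path(checkpoint)
    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        cfg = json.loads(hf_quant.read_text())
        # Assumed layout: {"quantization": {"quant_algo": "FP8", ...}}
        return cfg.get("quantization", {}).get("quant_algo", "unknown")
    config = ckpt / "config.json"
    if config.is_file():
        qcfg = json.loads(config.read_text()).get("quantization_config", {})
        if qcfg.get("quant_method") == "modelopt":
            return qcfg.get("quant_algo", "modelopt")
    return "unquantized"
```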
### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| DeepSeek / Llama 4 models | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.
### 3. Check the environment

**GPU availability:**

```bash
python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
```

**Framework installed?**

```bash
# vLLM
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"

# SGLang
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"

# TRT-LLM
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If the framework is not installed, consult `references/setup.md` for installation instructions.

**GPU memory estimate:**

- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single GPU memory, use tensor parallelism (`--tp <num_gpus>` with deploy.sh, or the framework's tensor-parallel flag).
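The arithmetic above can be sketched as a quick calculation. Assumptions: 80 GB GPUs and the conservative 4 GB overhead figure; `estimate_gpus_needed` is an illustrative helper, and real memory use also depends on context length and KV cache configuration.

```python
import math

# Bytes per parameter for each weight format, per the estimates above
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def estimate_gpus_needed(num_params_b: float, fmt: str,
                         gpu_mem_gb: float = 80.0,
                         overhead_gb: float = 4.0) -> int:
    """Minimum tensor-parallel degree so weights plus overhead fit.

    num_params_b is in billions of parameters; overhead covers KV cache
    and framework buffers.
    """
    weights_gb = num_params_b * BYTES_PER_PARAM[fmt]
    return max(1, math.ceil((weights_gb + overhead_gb) / gpu_mem_gb))
```

For example, an 8B FP8 model (~8 GB weights + 4 GB overhead) fits one 80 GB GPU, while a 70B BF16 model (~140 GB weights) needs at least two.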
### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
  --model <checkpoint_path> \
  --quantization modelopt \
  --tensor-parallel-size <num_gpus> \
  --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
  --model-path <checkpoint_path> \
  --quantization modelopt \
  --tp <num_gpus> \
  --host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.
### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model_name>",
    "prompt": "The capital of France is",
    "max_tokens": 32
  }' | python -m json.tool
```

All checks must pass before reporting success to the user.
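When scripting this verification, the completion response can be validated programmatically. A minimal sketch against the standard OpenAI completions response shape (`choices[0].text`); `check_completion` is an illustrative helper, not part of the skill's scripts.

```python
import json

def check_completion(response_body: str) -> str:
    """Validate a /v1/completions response and return the generated text.

    Raises ValueError on missing choices or empty text, so a wrapper
    script fails loudly instead of reporting a broken server as healthy.
    """
    body = json.loads(response_body)
    choices = body.get("choices", [])
    if not choices:
        raise ValueError(f"no choices in response: {body}")
    text = choices[0].get("text", "")
    if not text.strip():
        raise ValueError("empty completion text")
    return text
```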
### 6. Benchmark (optional)

If the user wants throughput/latency numbers, run a quick benchmark:

```bash
# vLLM benchmark
python -m vllm.entrypoints.openai.api_server ... & # if not already running

python -m vllm.benchmark_serving \
  --model <model_name> \
  --port 8000 \
  --num-prompts 100 \
  --request-rate 10
```

Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).
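If summarizing raw per-request timings yourself rather than relying on the benchmark tool's report, p50/p99 are empirical percentiles. A minimal nearest-rank sketch (the timing values below are made-up example data):

```python
def percentile(values, pct):
    """Nearest-rank percentile; sufficient for a quick latency summary."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [120, 135, 128, 400, 131, 125, 142, 138, 129, 133]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The same function applies to TTFT samples; throughput is simply total generated tokens divided by wall-clock time.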
### 7. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

   ```bash
   source .claude/skills/common/remote_exec.sh
   remote_load_cluster
   remote_check_ssh
   remote_detect_env
   ```

2. **Sync the checkpoint** (only if it was produced locally):

   If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

   ```bash
   remote_sync_to <local_checkpoint_path> checkpoints/
   ```

3. **Deploy based on remote environment:**

   - **SLURM** — write a job script that starts the server inside a container, then submit:

     ```bash
     srun --container-image="<container.sqsh>" \
       --container-mounts="<data_root>:<data_root>" \
       python -m vllm.entrypoints.openai.api_server \
         --model <remote_checkpoint_path> \
         --quantization modelopt \
         --host 0.0.0.0 --port 8000
     ```
     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.

   - **Bare metal / Docker** — use `remote_run` to start the server directly:

     ```bash
     remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <path> --port 8000 > deploy.log 2>&1 &"
     ```

4. **Verify remotely:**

   ```bash
   remote_run "curl -s http://localhost:8000/health"
   remote_run "curl -s http://localhost:8000/v1/models"
   ```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<node_hostname>:8000`). For SLURM, note that the port is only reachable from within the cluster network.
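A small sketch of what gets reported back. The SSH local-forward command is the usual pattern for reaching a cluster-internal port from outside; `report_endpoint` and the hostnames are hypothetical, not part of the skill's scripts.

```python
def report_endpoint(node: str, port: int = 8000, login_host: str = "") -> str:
    """Format the endpoint report; include an ssh -L tunnel command when a
    cluster login host is given, since the node port is cluster-internal."""
    lines = [f"Endpoint: http://{node}:{port}"]
    if login_host:
        lines.append(f"Tunnel:   ssh -L {port}:{node}:{port} {login_host}")
    return "\n".join(lines)
```

After tunneling, the user points their client at `http://localhost:8000`.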
For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.5 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
## Success Criteria

1. Server process is running and healthy (`/health` returns 200)
2. Model is listed at `/v1/models`
3. Test generation produces coherent output
4. Server URL and port are reported to the user
5. If benchmarking was requested, throughput/latency numbers are reported
@@ -0,0 +1,17 @@

```json
{
  "skills": ["deployment"],
  "query": "deploy my quantized model on the SLURM cluster",
  "files": [],
  "expected_behavior": [
    "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
    "Sources .claude/skills/common/remote_exec.sh",
    "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
    "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
    "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
    "Starts vLLM/SGLang server inside the container via srun",
    "Gets allocated node hostname from squeue -j $JOBID -o %N",
    "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
    "Reports the remote endpoint (http://<node_hostname>:8000) and notes SLURM network restrictions",
    "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
  ]
}
```
@@ -0,0 +1,20 @@

```json
{
  "skills": ["deployment"],
  "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
  "files": [],
  "expected_behavior": [
    "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
    "Reads hf_quant_config.json and detects FP8 quantization format",
    "Confirms vLLM is the chosen framework",
    "Checks vLLM is installed and version >= 0.10.1",
    "Detects local GPU via nvidia-smi or torch.cuda",
    "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
    "Reads references/vllm.md for deployment instructions",
    "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
    "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
    "Waits for server health check at /health endpoint",
    "Verifies /v1/models lists the model",
    "Sends test generation request to /v1/completions and confirms coherent output",
    "Reports server URL (http://localhost:8000) and port to user"
  ]
}
```
@@ -0,0 +1,85 @@

# Deployment Environment Setup
## Framework Installation

### vLLM

```bash
pip install vllm
```

Minimum version: 0.10.1

### SGLang

```bash
pip install "sglang[all]"
```

Minimum version: 0.5

### TRT-LLM

TRT-LLM is best installed via NVIDIA container:

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
```

Or via pip (requires CUDA toolkit):

```bash
pip install tensorrt-llm
```

Minimum version: 0.17.0
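Checking an installed version against these minimums can be automated with a plain numeric comparison. A sketch only: it handles dotted numeric versions and nothing else (no pre-release or post-release tags, for which a packaging library is more robust).

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, padding to 3 components
    so e.g. "0.17" and "0.17.0" compare equal."""
    def parse(v: str):
        parts = [int(p) for p in v.split(".")]
        while len(parts) < 3:
            parts.append(0)
        return tuple(parts)
    return parse(installed) >= parse(minimum)
```

For example, an installed SGLang 0.4.10 fails a 0.5 minimum because 4 < 5 in the second component, even though 10 > 5 lexically.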
## SLURM Deployment

For SLURM clusters, deploy inside a container. Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<num_gpus>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
  --container-image="<path/to/container.sqsh>" \
  --container-mounts="<data_root>:<data_root>" \
  --container-workdir="<workdir>" \
  --no-container-mount-home \
  bash -c "python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <num_gpus> \
    --host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S"  # Get the node name
# Then SSH tunnel or use the node's hostname directly
```
## Docker Deployment

### vLLM with ModelOpt

A Dockerfile is available at `examples/vllm_serve/Dockerfile`:
```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

# Mount the checkpoint directory so the path exists inside the container
docker run --gpus all -p 8000:8000 \
  -v <checkpoint_dir>:<checkpoint_dir> \
  vllm-modelopt \
  python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```