diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md new file mode 100644 index 0000000000..c621042953 --- /dev/null +++ b/.claude/skills/deployment/SKILL.md @@ -0,0 +1,268 @@ +--- +name: deployment +description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. +license: Apache-2.0 +--- + +# Deployment Skill + +Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy). + +## Quick Start + +Prefer `scripts/deploy.sh` for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment. + +```bash +# Start vLLM server with a ModelOpt checkpoint +scripts/deploy.sh start --model ./qwen3-0.6b-fp8 + +# Start with SGLang and tensor parallelism +scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4 + +# Start from HuggingFace hub +scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8 + +# Test the API +scripts/deploy.sh test + +# Check status +scripts/deploy.sh status + +# Stop +scripts/deploy.sh stop +``` + +The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing. + +## Decision Flow + +### 0. Check workspace (multi-user / Slack bot) + +If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. 
Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.

### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from the ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly.
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill.

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM-format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.

### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| DeepSeek / Llama 4 models | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.

### 3. Check the environment

**GPU availability:**

```bash
python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
```

**Framework installed?**

```bash
# vLLM
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"

# SGLang
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"

# TRT-LLM
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If the framework is not installed, consult `references/setup.md` for installation instructions.

**GPU memory estimate:**

- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single-GPU memory, use tensor parallelism (`--tp <N>`).

### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <N> \
    --host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "prompt": "The capital of France is",
        "max_tokens": 32
    }' | python -m json.tool
```

All checks must pass before reporting success to the user.

### 6. Benchmark (optional)

If the user wants throughput/latency numbers, run a quick benchmark:

```bash
# vLLM benchmark
python -m vllm.entrypoints.openai.api_server ... &  # if not already running

python -m vllm.benchmark_serving \
    --model <model_name> \
    --port 8000 \
    --num-prompts 100 \
    --request-rate 10
```

Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).

### 7. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

   ```bash
   source .claude/skills/common/remote_exec.sh
   remote_load_cluster
   remote_check_ssh
   remote_detect_env
   ```

2. **Sync the checkpoint** (only if it was produced locally):

   If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <remote_checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

   ```bash
   remote_sync_to <local_checkpoint_dir> checkpoints/
   ```

3. **Deploy based on the remote environment:**

   - **SLURM** — write a job script that starts the server inside a container, then submit:

     ```bash
     srun --container-image="<image>" \
         --container-mounts="<host_dir>:<container_dir>" \
         python -m vllm.entrypoints.openai.api_server \
         --model <checkpoint_path> \
         --quantization modelopt \
         --host 0.0.0.0 --port 8000
     ```

     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.

   - **Bare metal / Docker** — use `remote_run` to start the server directly:

     ```bash
     remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <checkpoint_path> --port 8000 > deploy.log 2>&1 &"
     ```

4. **Verify remotely:**

   ```bash
   remote_run "curl -s http://localhost:8000/health"
   remote_run "curl -s http://localhost:8000/v1/models"
   ```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<remote_host>:8000`). For SLURM, note that the port is only reachable from within the cluster network.

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.
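The endpoint-reporting step above can be sketched as a small helper: given the output of `squeue -j $JOBID -o %N` and the server port, build the URL to hand back to the user. The function name is illustrative, not part of the skill's scripts.

```python
def slurm_endpoint(squeue_output: str, port: int = 8000) -> str:
    """Build the server URL from `squeue -j <jobid> -o %N` output.

    squeue prints a header line ("NODELIST") followed by the allocated
    node name; take the first non-header, non-empty line.
    """
    lines = [ln.strip() for ln in squeue_output.splitlines() if ln.strip()]
    nodes = [ln for ln in lines if ln.upper() != "NODELIST"]
    if not nodes:
        raise ValueError("no allocated node in squeue output")
    return f"http://{nodes[0]}:{port}"

print(slurm_endpoint("NODELIST\nnode-042\n"))  # http://node-042:8000
```

As noted above, this URL is only reachable from within the cluster network; an SSH tunnel is needed from outside.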
+ +## Error Handling + +| Error | Cause | Fix | +|-------|-------|-----| +| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model | +| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 | +| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag | +| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors | +| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` | + +## Success Criteria + +1. Server process is running and healthy (`/health` returns 200) +2. Model is listed at `/v1/models` +3. Test generation produces coherent output +4. Server URL and port are reported to the user +5. If benchmarking was requested, throughput/latency numbers are reported diff --git a/.claude/skills/deployment/references/setup.md b/.claude/skills/deployment/references/setup.md new file mode 100644 index 0000000000..4209f08647 --- /dev/null +++ b/.claude/skills/deployment/references/setup.md @@ -0,0 +1,85 @@ +# Deployment Environment Setup + +## Framework Installation + +### vLLM + +```bash +pip install vllm +``` + +Minimum version: 0.10.1 + +### SGLang + +```bash +pip install "sglang[all]" +``` + +Minimum version: 0.4.10 + +### TRT-LLM + +TRT-LLM is best installed via NVIDIA container: + +```bash +docker pull nvcr.io/nvidia/tensorrt-llm/release: +``` + +Or via pip (requires CUDA toolkit): + +```bash +pip install tensorrt-llm +``` + +Minimum version: 0.17.0 + +## SLURM Deployment + +For SLURM clusters, deploy inside a container. 
Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
    --container-image="<image>" \
    --container-mounts="<host_dir>:<container_dir>" \
    --container-workdir="<workdir>" \
    --no-container-mount-home \
    bash -c "python -m vllm.entrypoints.openai.api_server \
        --model <checkpoint_path> \
        --quantization modelopt \
        --tensor-parallel-size <N> \
        --host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S"  # Get the node name
# Then SSH tunnel or use the node's hostname directly
```

## Docker Deployment

### vLLM with ModelOpt

A Dockerfile is available at `examples/vllm_serve/Dockerfile`:

```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 vllm-modelopt \
    python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```

diff --git a/.claude/skills/deployment/references/sglang.md b/.claude/skills/deployment/references/sglang.md
new file mode 100644
index 0000000000..62d5c57b59
--- /dev/null
+++ b/.claude/skills/deployment/references/sglang.md
@@ -0,0 +1,81 @@

# SGLang Deployment Reference

## Requirements

- SGLang >= 0.4.10
- `pip install "sglang[all]"`

## Server Deployment

### As OpenAI-compatible server

```bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <N> \
    --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
+ +### As Python API + +```python +import sglang as sgl + +llm = sgl.Engine(model_path="", quantization="modelopt") +# For FP4: quantization="modelopt_fp4" + +sampling_params = {"temperature": 0.8, "top_p": 0.95} +outputs = llm.generate(["Hello, my name is"], sampling_params) + +for output in outputs: + print(f"Generated: {output['text']}") +``` + +### From HuggingFace Hub + +```python +import sglang as sgl + +llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt") +outputs = llm.generate(["What is AI?"], {"temperature": 0.8}) +``` + +## Speculative Decoding + +SGLang supports speculative decoding with EAGLE and EAGLE3 models: + +```bash +python -m sglang.launch_server \ + --model-path \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --tp \ + --host 0.0.0.0 --port 8000 +``` + +Reference: `examples/specdec_bench/specdec_bench/models/sglang.py` + +## Key SGLang Flags + +| Flag | Description | +|------|-------------| +| `--model-path` | Path to checkpoint or HF model ID | +| `--quantization` | `modelopt` (FP8) or `modelopt_fp4` (FP4) | +| `--tp` | Tensor parallelism size | +| `--ep` | Expert parallelism (for MoE models) | +| `--enable-torch-compile` | Enable torch.compile for better perf | +| `--cuda-graph-max-bs` | Max batch size for CUDA graphs | +| `--attention-backend` | `flashinfer` (default) or `triton` | + +## Common Issues + +| Issue | Fix | +|-------|-----| +| `quantization="modelopt"` not recognized | Upgrade SGLang to >= 0.4.10 | +| DeepSeek FP4 not working | Check support matrix — SGLang FP4 support varies by model | +| OOM on startup | Increase `--tp` or reduce `--max-total-tokens` | diff --git a/.claude/skills/deployment/references/support-matrix.md b/.claude/skills/deployment/references/support-matrix.md new file mode 100644 index 0000000000..8d0a671537 --- /dev/null +++ b/.claude/skills/deployment/references/support-matrix.md @@ -0,0 
+1,58 @@ +# Deployment Support Matrix + +## Unified HF Checkpoint — Framework Compatibility + +| Model | Quant Format | TRT-LLM | vLLM | SGLang | +|-------|-------------|---------|------|--------| +| Llama 3.x | FP8 | yes | yes | yes | +| Llama 3.x | FP4 | yes | yes | yes | +| Llama 4 | FP8 | yes | — | yes | +| Llama 4 | FP4 | yes | — | — | +| DeepSeek R1 | FP8 | yes | yes | yes | +| DeepSeek R1 | FP4 | yes | yes | yes | +| DeepSeek V3 | FP8 | yes | yes | yes | +| DeepSeek V3 | FP4 | yes | yes | yes | +| Qwen 3 | FP8 | yes | yes | yes | +| Qwen 3 | FP4 | yes | yes | — | +| Qwen 3 MoE | FP8 | yes | yes | yes | +| Qwen 3 MoE | FP4 | yes | — | — | +| Qwen 2.5 | FP8 | yes | yes | yes | +| Qwen 2.5 | FP4 | yes | yes | — | +| QwQ-32B | FP8 | yes | yes | yes | +| QwQ-32B | FP4 | yes | yes | — | +| Mixtral 8x7B | FP8 | yes | yes | yes | +| Mixtral 8x7B | FP4 | yes | — | — | + +## Supported Quantization Formats + +| Format | Description | +|--------|-------------| +| FP8 | 8-bit floating point (E4M3) | +| FP8_PB | 8-bit floating point with per-block scaling | +| NVFP4 | NVIDIA 4-bit floating point | +| NVFP4_AWQ | NVIDIA 4-bit floating point with AWQ optimization | +| INT4_AWQ | 4-bit integer with AWQ (TRT-LLM only) | +| W4A8_AWQ | 4-bit weights, 8-bit activations with AWQ (TRT-LLM only) | + +## Minimum Framework Versions + +| Framework | Minimum Version | +|-----------|----------------| +| TensorRT-LLM | v0.17.0 | +| vLLM | v0.10.1 | +| SGLang | v0.4.10 | + +## Quantization Flag by Framework + +| Framework | FP8 flag | FP4 flag | +|-----------|----------|----------| +| vLLM | `quantization="modelopt"` | `quantization="modelopt_fp4"` | +| SGLang | `quantization="modelopt"` | `quantization="modelopt_fp4"` | +| TRT-LLM | auto-detected from checkpoint | auto-detected from checkpoint | + +## Notes + +- **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference. 
- INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
- Other models/formats may work but are not officially validated.
- Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`

diff --git a/.claude/skills/deployment/references/trtllm.md b/.claude/skills/deployment/references/trtllm.md
new file mode 100644
index 0000000000..5725bed3bf
--- /dev/null
+++ b/.claude/skills/deployment/references/trtllm.md
@@ -0,0 +1,109 @@

# TRT-LLM Deployment Reference

## Requirements

- TensorRT-LLM >= 0.17.0
- Typically installed via NVIDIA container: `nvcr.io/nvidia/tensorrt-llm/release:<version>`
- Or: `pip install tensorrt-llm`

## Direct LLM API (recommended for unified HF checkpoints)

### Python API

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>")
# Quantization format is auto-detected from hf_quant_config.json

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

### From HuggingFace Hub

```python
from tensorrt_llm import LLM

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
print(llm.generate(["What is AI?"]))
```

### With tensor parallelism

```python
from tensorrt_llm import LLM

llm = LLM(model="<checkpoint_path>", tensor_parallel_size=4)
```

## AutoDeploy (for AutoQuant / mixed-precision)

AutoDeploy automates graph transformations for optimized inference. Required for AutoQuant checkpoints.
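The `--effective_bits` target used below can be read as a weighted average of per-layer weight precisions. As a rough illustration only — not AutoQuant's actual layer-selection search — keeping a fraction `f` of weights in FP8 and quantizing the rest to NVFP4 yields `8f + 4(1 − f)` effective bits:

```python
def effective_bits(fp8_fraction: float) -> float:
    """Weighted-average weight precision when a fraction of weights stay
    in FP8 (8-bit) and the rest are quantized to NVFP4 (4-bit)."""
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return 8.0 * fp8_fraction + 4.0 * (1.0 - fp8_fraction)

# --effective_bits 4.5 corresponds to keeping ~12.5% of weights in FP8
print(effective_bits(0.125))  # 4.5
```

This is why raising `--effective_bits` gives sensitive layers more headroom: a higher target forces more layers to stay in the higher-precision format.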
### End-to-end script

```bash
# Quantize and deploy in one step
./examples/llm_autodeploy/scripts/run_auto_quant_and_deploy.sh \
    --hf_ckpt <hf_checkpoint_path> \
    --save_quantized_ckpt <output_path> \
    --quant fp8,nvfp4 \
    --effective_bits 4.5
```

Parameters:

- `--hf_ckpt`: Path to the unquantized HuggingFace checkpoint
- `--save_quantized_ckpt`: Output path for the quantized checkpoint
- `--quant`: Quantization formats (e.g., `fp8,nvfp4`)
- `--effective_bits`: Target precision (higher = more accuracy for sensitive layers)
- `--world_size`: Number of GPUs for tensor parallelism
- `--calib_batch_size`: Calibration batch size (reduce if OOM; default 8)

### AutoDeploy API server

```python
# examples/llm_autodeploy/api_server.py provides a FastAPI server
# with OpenAI-compatible endpoints using AutoDeploy
```

### Test AutoDeploy

```bash
python examples/llm_autodeploy/api_client.py --prompt "What is AI?" "What is golf?"
```

### Notes

- NVFP4 in AutoDeploy requires Blackwell GPUs
- For Hopper: remove `nvfp4` from `--quant` and set `--effective_bits` above 8.0
- AutoDeploy supports CUDA graphs, torch compile backends, and KV cache optimization

## Legacy TRT-LLM Checkpoint (deprecated)

The legacy export path using `export_tensorrt_llm_checkpoint()` is deprecated. Use the unified HF checkpoint format with `export_hf_checkpoint()` instead.

If you encounter a legacy checkpoint (no `hf_quant_config.json`, a `rank*.safetensors` file pattern), it needs the TRT-LLM build API to create an engine before deployment. See `docs/source/deployment/1_tensorrt_llm.rst`.
## Evaluation with TRT-LLM

```bash
# examples/llm_eval/lm_eval_tensorrt_llm.py
# Runs lm_evaluation_harness benchmarks with TRT-LLM
python examples/llm_eval/lm_eval_tensorrt_llm.py \
    --model_path <checkpoint_path> \
    --tasks gsm8k,mmlu
```

## Common Issues

| Issue | Fix |
|-------|-----|
| `No module named tensorrt_llm` | Install via container or pip |
| NVFP4 inference fails on Hopper | NVFP4 requires Blackwell GPUs for inference |
| Slow first inference | Engine compilation happens on the first run; subsequent runs are cached |
| OOM during engine build | Reduce `--max_batch_size` or increase TP |

diff --git a/.claude/skills/deployment/references/vllm.md b/.claude/skills/deployment/references/vllm.md
new file mode 100644
index 0000000000..89e06bde42
--- /dev/null
+++ b/.claude/skills/deployment/references/vllm.md
@@ -0,0 +1,91 @@

# vLLM Deployment Reference

## Requirements

- vLLM >= 0.10.1
- `pip install vllm`

## Realquant Deployment (recommended)

Realquant uses dedicated quantized kernels for maximum performance. This is the default path for ModelOpt-exported checkpoints.

### As OpenAI-compatible server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 --port 8000 \
    --served-model-name <model_name>
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
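Before launching, it helps to sanity-check GPU memory against the model size. A back-of-the-envelope sketch using the rule of thumb from the skill (weights at 2 bytes/param for BF16, 1 for FP8, 0.5 for FP4, plus a few GB of KV-cache/runtime overhead); the function name and the power-of-two TP assumption are illustrative:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def min_tp_size(num_params_b: float, fmt: str, gpu_mem_gb: float,
                overhead_gb: float = 4.0) -> int:
    """Smallest tensor-parallel size whose per-GPU share of the weights
    (plus a fixed overhead allowance) fits in one GPU's memory.

    num_params_b: model size in billions of parameters.
    """
    weights_gb = num_params_b * BYTES_PER_PARAM[fmt]
    tp = 1
    while weights_gb / tp + overhead_gb > gpu_mem_gb:
        tp *= 2  # TP sizes are typically powers of two
    return tp

# A 70B FP4 model is ~35 GB of weights, so it fits on a single 80 GB GPU
print(min_tp_size(70, "fp4", 80))
```

If the estimate calls for `tp > 1`, pass it via `--tensor-parallel-size`; treat the result as a starting point, since the KV cache grows with `--max-model-len` and batch size.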
### As Python API

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>", quantization="modelopt")
# For FP4: quantization="modelopt_fp4"

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

### From HuggingFace Hub

```python
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
outputs = llm.generate(["What is AI?"], SamplingParams(temperature=0.8))
```

## Fakequant Deployment (research)

Fakequant is 2-5x slower than realquant but doesn't require dedicated kernel support. Useful for research and for testing new quantization schemes.

Reference: `examples/vllm_serve/`

```bash
# Environment variables for configuration
export QUANT_CFG=NVFP4_DEFAULT_CFG    # Quantization format
export QUANT_CALIB_SIZE=512           # Calibration samples
export QUANT_DATASET=cnn_dailymail    # Calibration dataset

python examples/vllm_serve/vllm_serve_fakequant.py \
    <model_path> -tp <N> --host 0.0.0.0 --port 8000
```

## Benchmarking

```bash
# Start the server first, then benchmark
python -m vllm.benchmark_serving \
    --model <model_name> \
    --port 8000 \
    --num-prompts 100 \
    --request-rate 10
```

Or use lm_eval for accuracy:

```bash
lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=<model_name>,base_url=http://localhost:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=128
```

## Common Issues

| Issue | Fix |
|-------|-----|
| `quantization="modelopt"` not recognized | Upgrade vLLM to >= 0.10.1 |
| OOM on startup | Increase `--tensor-parallel-size` or reduce `--max-model-len` |
| AWQ checkpoints not loading | AWQ is not supported in vLLM via the modelopt path; use FP8 or NVFP4 |
| Mixed precision not working | Not supported for fakequant |

diff --git
a/.claude/skills/deployment/scripts/deploy.sh b/.claude/skills/deployment/scripts/deploy.sh new file mode 100755 index 0000000000..3147a73dfd --- /dev/null +++ b/.claude/skills/deployment/scripts/deploy.sh @@ -0,0 +1,581 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ModelOpt Deployment Script +# Deploy quantized or unquantized models via vLLM, SGLang, or TRT-LLM +# Supports ModelOpt FP8/FP4 checkpoints with automatic quantization flag detection + +set -eo pipefail + +# Default configuration +MODEL="" +PORT=8000 +HOST="0.0.0.0" +FRAMEWORK="vllm" +TP_SIZE=1 +VRAM=0.9 +MAX_WAIT=300 # 5 min for large models +QUANTIZATION="" # auto-detected from checkpoint + +# Paths +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +LOG_DIR="${LOG_DIR:-/tmp/modelopt-deploy}" +LOG_FILE="$LOG_DIR/server.log" +PID_FILE="$LOG_DIR/server.pid" +META_FILE="$LOG_DIR/server.meta" # persists model/framework/port for status + +# Colors +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +log_info() { printf "${BLUE}[INFO]${NC} %s\n" "$1" >&2; } +log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1" >&2; } +log_warn() { printf "${YELLOW}[WARN]${NC} %s\n" "$1" >&2; } +log_error() { printf "${RED}[ERROR]${NC} %s\n" "$1" >&2; } + +usage() { + cat < [OPTIONS] + +Commands: + start - Start the 
inference server + stop - Stop the inference server + test - Test the API endpoint + status - Show server status + restart - Restart the server + detect - Detect checkpoint format (without starting) + +Options: + --model PATH Model path or HF model ID (required for start) + --framework FRAMEWORK vllm, sglang, or trtllm (default: vllm) + --port PORT Server port (default: 8000) + --tp SIZE Tensor parallel size (default: 1) + --quantization QUANT Force quantization flag (modelopt, modelopt_fp4, or none) + --gpu-memory-utilization GPU memory utilization 0.0-1.0 (default: 0.9) + --log-dir DIR Log directory (default: /tmp/modelopt-deploy) + +Examples: + $0 start --model ./qwen3-0.6b-fp8 + $0 start --model ./llama-70b-nvfp4 --framework sglang --tp 4 + $0 start --model nvidia/Llama-3.1-8B-Instruct-FP8 --framework vllm + $0 test --port 8000 + $0 stop +EOF + exit 1 +} + +# ─── Checkpoint Detection ─────────────────────────────────────────── + +detect_quantization() { + local model_path="$1" + + # Skip detection for HF model IDs (no local path) + if [[ ! -d "$model_path" ]]; then + log_info "Model is a HF ID — using name-based heuristic for quantization flag" + # Best-effort: infer from model name. This is a fallback; local checkpoints + # use hf_quant_config.json which is reliable. + if echo "$model_path" | grep -qi "fp8"; then + log_info "HF model name contains 'fp8' — assuming modelopt quantization" + echo "modelopt" + elif echo "$model_path" | grep -qi "fp4\|nvfp4"; then + log_info "HF model name contains 'fp4/nvfp4' — assuming modelopt_fp4 quantization" + echo "modelopt_fp4" + else + log_info "No quantization format detected in model name — treating as unquantized" + echo "none" + fi + return + fi + + # Require python3 for JSON parsing + if ! 
command -v python3 &>/dev/null; then + log_error "python3 is required to detect quantization format but is not installed" + return 1 + fi + + # Local checkpoint: check hf_quant_config.json + local quant_config="$model_path/hf_quant_config.json" + if [[ -f "$quant_config" ]]; then + log_info "Found hf_quant_config.json" + + local quant_algo + quant_algo=$(python3 -c " +import json, sys +with open(sys.argv[1]) as f: + cfg = json.load(f) +quant_algo = cfg.get('quantization', {}).get('quant_algo', '') +print(quant_algo) +" "$quant_config" 2>&1) || { + log_error "Failed to parse hf_quant_config.json: $quant_algo" + return 1 + } + + if echo "$quant_algo" | grep -qi "fp4"; then + echo "modelopt_fp4" + else + echo "modelopt" + fi + elif [[ -f "$model_path/config.json" ]]; then + # Fallback: check config.json for quantization_config with quant_method=modelopt + local quant_method + quant_method=$(python3 -c " +import json, sys +with open(sys.argv[1]) as f: + cfg = json.load(f) +qc = cfg.get('quantization_config', {}) +if qc.get('quant_method') == 'modelopt': + print(qc.get('quant_algo', 'fp8')) +" "$model_path/config.json" 2>&1) || { + log_error "Failed to parse config.json: $quant_method" + return 1 + } + if [[ -n "$quant_method" ]]; then + log_info "Found quantization_config in config.json (quant_method=modelopt)" + if echo "$quant_method" | grep -qi "fp4"; then + echo "modelopt_fp4" + else + echo "modelopt" + fi + else + log_info "No quantization config found — treating as unquantized" + echo "none" + fi + else + log_info "No hf_quant_config.json or config.json found — treating as unquantized" + echo "none" + fi +} + +detect_gpu() { + if command -v nvidia-smi &>/dev/null; then + local gpu_count + gpu_count=$(nvidia-smi -L 2>/dev/null | wc -l) + local gpu_name + gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) + log_info "GPUs: ${gpu_count}x ${gpu_name}" + echo "$gpu_count" + else + log_error "No NVIDIA GPU detected (nvidia-smi not 
found)"
+        return 1
+    fi
+}
+
+# ─── Server Management ──────────────────────────────────────────────
+
+is_server_running() {
+    if [[ -f "$PID_FILE" ]]; then
+        local pid
+        pid=$(cat "$PID_FILE" 2>/dev/null)
+        if [[ ! "$pid" =~ ^[0-9]+$ ]]; then
+            rm -f "$PID_FILE"
+            return 1
+        fi
+        # Verify the PID is actually a Python/vLLM/SGLang process (not PID reuse)
+        local cmdline
+        cmdline=$(ps -p "$pid" -o args= 2>/dev/null) || { rm -f "$PID_FILE"; return 1; }
+        if echo "$cmdline" | grep -q "vllm\|sglang\|python"; then
+            return 0
+        fi
+        # PID exists but is not our server — stale PID file
+        rm -f "$PID_FILE"
+    fi
+    return 1
+}
+
+start_server() {
+    # Validate GPU availability and TP size
+    local gpu_count
+    gpu_count=$(detect_gpu) || exit 1
+    if [[ "$TP_SIZE" -gt "$gpu_count" ]]; then
+        log_error "Requested TP size ($TP_SIZE) exceeds available GPUs ($gpu_count)"
+        exit 1
+    fi
+
+    if [[ -z "$MODEL" ]]; then
+        log_error "--model is required"
+        usage
+    fi
+
+    if is_server_running; then
+        log_warn "Server already running (PID: $(cat "$PID_FILE"))"
+        return 0
+    fi
+
+    # Check if port is already in use
+    if ss -tlnp 2>/dev/null | grep -q ":${PORT} " || \
+       lsof -i ":${PORT}" -sTCP:LISTEN >/dev/null 2>&1; then
+        log_error "Port $PORT is already in use — stop the existing service or use --port <port>"
+        exit 1
+    fi
+
+    mkdir -p "$LOG_DIR"
+
+    # Auto-detect quantization if not forced
+    if [[ -z "$QUANTIZATION" ]]; then
+        if ! QUANTIZATION=$(detect_quantization "$MODEL"); then
+            log_error "Failed to detect quantization — fix the checkpoint or use --quantization to override"
+            exit 1
+        fi
+    fi
+    log_info "Quantization: $QUANTIZATION"
+
+    # Save metadata for status command (values single-quoted for safe reading)
+    cat >"$META_FILE" <<EOF
+FRAMEWORK='$FRAMEWORK'
+MODEL='$MODEL'
+PORT='$PORT'
+QUANTIZATION='$QUANTIZATION'
+TP_SIZE='$TP_SIZE'
+EOF
+
+    # Dispatch to the chosen framework, then wait for the health endpoint
+    case "$FRAMEWORK" in
+        vllm) start_vllm ;;
+        sglang) start_sglang ;;
+        trtllm) start_trtllm ;;
+        *) log_error "Unknown framework: $FRAMEWORK"; exit 1 ;;
+    esac
+
+    wait_for_server
+}
+
+start_vllm() {
+    log_info "Starting vLLM server..."
+
+    local -a cmd=(python3 -m vllm.entrypoints.openai.api_server
+        --model "$MODEL"
+        --host "$HOST" --port "$PORT"
+        --tensor-parallel-size "$TP_SIZE"
+        --gpu-memory-utilization "$VRAM")
+
+    if [[ "$QUANTIZATION" != "none" ]]; then
+        cmd+=(--quantization "$QUANTIZATION")
+    fi
+
+    log_info "Command: ${cmd[*]}"
+    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    echo $! >"$PID_FILE"
+
+    # Check for immediate crash (missing module, port conflict, CUDA error)
+    sleep 2
+    if ! ps -p "$(cat "$PID_FILE")" >/dev/null 2>&1; then
+        log_error "Server process exited immediately. Last log lines:"
+        tail -20 "$LOG_FILE" 2>/dev/null
+        rm -f "$PID_FILE"
+        exit 1
+    fi
+    log_success "vLLM started (PID: $(cat "$PID_FILE"))"
+}
+
+start_sglang() {
+    log_info "Starting SGLang server..."
+
+    local -a cmd=(python3 -m sglang.launch_server
+        --model-path "$MODEL"
+        --host "$HOST" --port "$PORT"
+        --tp "$TP_SIZE")
+
+    if [[ "$QUANTIZATION" != "none" ]]; then
+        cmd+=(--quantization "$QUANTIZATION")
+    fi
+
+    log_info "Command: ${cmd[*]}"
+    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    echo $! >"$PID_FILE"
+
+    # Check for immediate crash
+    sleep 2
+    if ! ps -p "$(cat "$PID_FILE")" >/dev/null 2>&1; then
+        log_error "Server process exited immediately. Last log lines:"
+        tail -20 "$LOG_FILE" 2>/dev/null
+        rm -f "$PID_FILE"
+        exit 1
+    fi
+    log_success "SGLang started (PID: $(cat "$PID_FILE"))"
+}
+
+start_trtllm() {
+    log_info "Starting TRT-LLM server..."
+    log_info "TRT-LLM serving is not automated by this script."
+    log_info "Options for TRT-LLM deployment:"
+
+    cat <<TRTEOF
+# Option 1: AutoDeploy (quantize + deploy in one step; see references/trtllm.md)
+<autodeploy command> \\
+    --quant fp8,nvfp4 \\
+    --effective_bits 4.5
+
+# Option 2: Python API
+python3 -c "
+from tensorrt_llm import LLM, SamplingParams
+llm = LLM(model='$MODEL')
+print(llm.generate(['Hello, my name is'], SamplingParams(temperature=0.8)))
+"
+TRTEOF
+
+    log_warn "TRT-LLM server mode not yet automated in this script."
+    log_warn "Use vLLM or SGLang for OpenAI-compatible serving of ModelOpt checkpoints."
+    return 1
+}
+
+wait_for_server() {
+    log_info "Waiting for server at http://localhost:$PORT ..."
+    local elapsed=0
+    while [[ $elapsed -lt $MAX_WAIT ]]; do
+        if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+            log_success "Server is ready! (${elapsed}s)"
+            return 0
+        fi
+
+        # Check if process died
+        if ! is_server_running; then
+            log_error "Server process died. Check logs: $LOG_FILE"
+            tail -20 "$LOG_FILE" 2>/dev/null
+            exit 1
+        fi
+
+        sleep 5
+        elapsed=$((elapsed + 5))
+        printf "."
+    done
+
+    echo ""
+    log_error "Server not ready after ${MAX_WAIT}s. Check logs: $LOG_FILE"
+    tail -20 "$LOG_FILE" 2>/dev/null
+    exit 1
+}
+
+stop_server() {
+    if ! is_server_running; then
+        log_warn "Server is not running"
+        return 0
+    fi
+
+    local pid
+    pid=$(cat "$PID_FILE")
+    log_info "Stopping server (PID: $pid)..."
+
+    # Kill the process group to catch child processes (vLLM/SGLang may fork)
+    kill -- -"$pid" 2>/dev/null || kill "$pid" 2>/dev/null || true
+
+    # Wait for graceful shutdown
+    for i in {1..15}; do
+        if ! ps -p "$pid" >/dev/null 2>&1; then
+            rm -f "$PID_FILE" "$META_FILE"
+            log_success "Server stopped"
+            return 0
+        fi
+        sleep 1
+    done
+
+    # Force kill
+    log_warn "Force killing..."
+    kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
+    sleep 1
+    if ps -p "$pid" >/dev/null 2>&1; then
+        log_error "Failed to kill server process $pid — manual intervention required"
+    fi
+    rm -f "$PID_FILE" "$META_FILE"
+
+    # Check for orphaned GPU worker processes (pgrep -f takes an ERE, so use an unescaped |)
+    local orphans
+    orphans=$(pgrep -f "vllm|sglang" 2>/dev/null | wc -l)
+    if [[ "$orphans" -gt 0 ]]; then
+        log_warn "Found $orphans potential orphaned server processes — run: pkill -f 'vllm|sglang'"
+    fi
+    log_success "Server stopped (forced)"
+}
+
+test_api() {
+    log_info "Testing API at http://localhost:$PORT ..."
+
+    # Health check
+    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+        log_error "Server not responding at port $PORT"
+        exit 1
+    fi
+    log_success "Health check passed"
+
+    # List models
+    log_info "Available models:"
+    curl -s "http://localhost:$PORT/v1/models" | python3 -m json.tool 2>/dev/null || true
+
+    # Test completion
+    log_info "Sending test request..."
+
+    local model_id
+    model_id=$(curl -s "http://localhost:$PORT/v1/models" | python3 -c "
+import sys, json
+data = json.load(sys.stdin)
+print(data['data'][0]['id'])
+" 2>/dev/null)
+
+    if [[ -z "$model_id" ]]; then
+        log_error "Could not determine model ID from /v1/models endpoint"
+        exit 1
+    fi
+
+    local payload
+    payload=$(python3 -c "
+import json, sys
+print(json.dumps({'model': sys.argv[1], 'prompt': 'The capital of France is', 'max_tokens': 32, 'temperature': 0.7}))
+" "$model_id")
+
+    local response
+    response=$(curl -s "http://localhost:$PORT/v1/completions" \
+        -H "Content-Type: application/json" \
+        -d "$payload")
+
+    echo "$response" | python3 -m json.tool 2>/dev/null || echo "$response"
+
+    local text
+    text=$(echo "$response" | python3 -c "
+import sys, json
+data = json.load(sys.stdin)
+print(data['choices'][0]['text'])
+" 2>/dev/null)
+
+    if [[ -n "$text" ]]; then
+        log_success "API test passed!"
+        printf "${GREEN}Response:${NC} %s\n" "$text"
+    else
+        log_error "No valid response from API"
+        exit 1
+    fi
+}
+
+show_status() {
+    echo "=== ModelOpt Deployment Status ==="
+    echo ""
+    if is_server_running; then
+        local pid
+        pid=$(cat "$PID_FILE")
+        log_success "Server running (PID: $pid)"
+
+        # Read saved metadata safely (no source — avoids shell injection)
+        if [[ -f "$META_FILE" ]]; then
+            FRAMEWORK=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            MODEL=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            PORT=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            QUANTIZATION=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            TP_SIZE=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+        fi
+
+        echo "  Framework: ${FRAMEWORK:-unknown}"
+        echo "  Model:     ${MODEL:-unknown}"
+        echo "  Endpoint:  http://localhost:${PORT:-8000}"
+        echo "  Logs:      $LOG_FILE"
+        echo ""
+        if [[ -f "$LOG_FILE" ]]; then
+            echo "Recent logs:"
+            tail -5 "$LOG_FILE"
+        fi
+    else
+        log_warn "Server is not running"
+        echo "  Start with: $0 start --model <path>"
+    fi
+}
+
+# ─── Argument Parsing ───────────────────────────────────────────────
+
+COMMAND=""
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --model|--framework|--port|--tp|--quantization|--gpu-memory-utilization|--log-dir)
+            if [[ -z "${2:-}" || "$2" == -* ]]; then
+                log_error "Option $1 requires a value"
+                usage
+            fi
+            ;;&
+        --model) MODEL="$2"; _CLI_MODEL=1; shift 2 ;;
+        --framework) FRAMEWORK="$2"; _CLI_FRAMEWORK=1; shift 2 ;;
+        --port) PORT="$2"; _CLI_PORT=1; shift 2 ;;
+        --tp) TP_SIZE="$2"; _CLI_TP=1; shift 2 ;;
+        --quantization) QUANTIZATION="$2"; _CLI_QUANT=1; shift 2 ;;
+        --gpu-memory-utilization) VRAM="$2"; shift 2 ;;
+        --log-dir) LOG_DIR="$2"; LOG_FILE="$LOG_DIR/server.log"; PID_FILE="$LOG_DIR/server.pid"; META_FILE="$LOG_DIR/server.meta"; shift 2 ;;
+        start|stop|test|status|restart|detect)
+            COMMAND="$1"; shift ;;
+        *)
+            log_error "Unknown option: $1"
+            usage ;;
+    esac
+done
+
+if [[ -z "$COMMAND" ]]; then
+    usage
+fi
+
+# Validate numeric arguments
+if [[ -n "$PORT" && ! "$PORT" =~ ^[0-9]+$ ]]; then
+    log_error "--port must be a number, got: $PORT"
+    exit 1
+fi
+if [[ -n "$TP_SIZE" && ! "$TP_SIZE" =~ ^[1-9][0-9]*$ ]]; then
+    log_error "--tp must be a positive integer, got: $TP_SIZE"
+    exit 1
+fi
+
+# Execute
+case "$COMMAND" in
+    start) start_server ;;
+    stop) stop_server ;;
+    test) test_api ;;
+    status) show_status ;;
+    restart)
+        # Load ALL fields from metadata, then let CLI args override
+        if [[ -f "$META_FILE" ]]; then
+            _saved_model=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_framework=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_port=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_quant=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_tp=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            # Apply saved values as defaults; CLI args (tracked via _CLI_*) win
+            [[ -z "${_CLI_MODEL:-}" ]] && MODEL="${_saved_model:-$MODEL}"
+            [[ -z "${_CLI_FRAMEWORK:-}" ]] && FRAMEWORK="${_saved_framework:-$FRAMEWORK}"
+            [[ -z "${_CLI_PORT:-}" ]] && PORT="${_saved_port:-$PORT}"
+            [[ -z "${_CLI_QUANT:-}" ]] && QUANTIZATION="${_saved_quant:-$QUANTIZATION}"
+            [[ -z "${_CLI_TP:-}" ]] && TP_SIZE="${_saved_tp:-$TP_SIZE}"
+        fi
+        stop_server; sleep 2; start_server ;;
+    detect)
+        if [[ -z "$MODEL" ]]; then
+            log_error "--model is required for detect"
+            exit 1
+        fi
+        if ! quant=$(detect_quantization "$MODEL"); then
+            exit 1
+        fi
+        echo "Detected quantization: $quant"
+        ;;
+    *) usage ;;
+esac
diff --git a/.claude/skills/deployment/tests/remote-slurm-deployment.json b/.claude/skills/deployment/tests/remote-slurm-deployment.json
new file mode 100644
index 0000000000..9a6aec688d
--- /dev/null
+++ b/.claude/skills/deployment/tests/remote-slurm-deployment.json
@@ -0,0 +1,17 @@
+{
+  "skills": ["deployment"],
+  "query": "deploy my quantized model on the SLURM cluster",
+  "files": [],
+  "expected_behavior": [
+    "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
+    "Sources .claude/skills/common/remote_exec.sh",
+    "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
+    "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
+    "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
+    "Starts vLLM/SGLang server inside the container via srun",
+    "Gets allocated node hostname from squeue -j $JOBID -o %N",
+    "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
+    "Reports the remote endpoint (http://<node>:8000) and notes SLURM network restrictions",
+    "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
+  ]
+}
diff --git a/.claude/skills/deployment/tests/vllm-fp8-local.json b/.claude/skills/deployment/tests/vllm-fp8-local.json
new file mode 100644
index 0000000000..42a636b304
--- /dev/null
+++ b/.claude/skills/deployment/tests/vllm-fp8-local.json
@@ -0,0 +1,20 @@
+{
+  "skills": ["deployment"],
+  "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
+  "files": [],
+  "expected_behavior": [
+    "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
+    "Reads hf_quant_config.json and detects FP8 quantization format",
+    "Confirms vLLM is the chosen framework",
+    "Checks vLLM is installed and version >= 0.10.1",
+    "Detects local GPU via nvidia-smi or torch.cuda",
+    "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
+    "Reads references/vllm.md for deployment instructions",
+    "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
+    "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
+    "Waits for server health check at /health endpoint",
+    "Verifies /v1/models lists the model",
+    "Sends test generation request to /v1/completions and confirms coherent output",
+    "Reports server URL (http://localhost:8000) and port to user"
+  ]
+}
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
index 4c5a690145..45bd2eb683 100644
--- a/.markdownlint-cli2.yaml
+++ b/.markdownlint-cli2.yaml
@@ -5,3 +5,8 @@ config:
   MD033: false # no-inline-html
   MD041: false # first-line-heading
   MD059: false # no-hard-tabs
+  MD029: false # ol-prefix - allow 1. 2. 3. style numbered lists
+  MD032: false # blanks-around-lists - don't force blank lines around lists
+  MD036: false # no-emphasis-as-heading - allow **bold** as section markers
+  MD005: false # list-indent - allow flexible list item indentation
+  MD007: false # ul-indent - allow unindented sub-lists under numbered lists
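
Reviewer note: the `server.meta` round trip in `deploy.sh` (values written single-quoted via heredoc in `start_server`, read back with `grep`/`cut`/`tr` rather than `source` so a crafted value cannot execute code) can be exercised in isolation. A minimal sketch, using a temp file and illustrative values in place of the script's `$META_FILE` and CLI-supplied settings:

```shell
#!/usr/bin/env bash
set -eu

# Illustrative stand-in for the script's $META_FILE (temp file, not the real path)
META_FILE=$(mktemp)

# Example values; in deploy.sh these come from CLI flags and auto-detection
FRAMEWORK="vllm"
MODEL="./qwen3-0.6b-fp8"
PORT=8000
QUANTIZATION="modelopt"
TP_SIZE=1

# Write: one KEY='value' pair per line, values single-quoted (as start_server does)
cat >"$META_FILE" <<EOF
FRAMEWORK='$FRAMEWORK'
MODEL='$MODEL'
PORT='$PORT'
QUANTIZATION='$QUANTIZATION'
TP_SIZE='$TP_SIZE'
EOF

# Read: grep/cut/tr instead of sourcing the file, mirroring show_status
read_meta() { grep "^$1=" "$META_FILE" | cut -d= -f2- | tr -d "'"; }

echo "framework=$(read_meta FRAMEWORK)"              # prints framework=vllm
echo "endpoint=http://localhost:$(read_meta PORT)"   # prints endpoint=http://localhost:8000
```

The same `grep '^KEY=' | cut -d= -f2- | tr -d "'"` pattern is what `show_status` and the `restart` branch use, which is why the writer must keep the strict `KEY='value'` one-pair-per-line shape.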