diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md new file mode 100644 index 0000000000..c621042953 --- /dev/null +++ b/.claude/skills/deployment/SKILL.md @@ -0,0 +1,268 @@ +--- +name: deployment +description: Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. +license: Apache-2.0 +--- + +# Deployment Skill + +Serve a model checkpoint as an OpenAI-compatible inference endpoint. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy). + +## Quick Start + +Prefer `scripts/deploy.sh` for standard local deployments — it handles quant detection, health checks, and server lifecycle. Use the raw framework commands in Step 4 when you need flags the script doesn't support, or for remote deployment. + +```bash +# Start vLLM server with a ModelOpt checkpoint +scripts/deploy.sh start --model ./qwen3-0.6b-fp8 + +# Start with SGLang and tensor parallelism +scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4 + +# Start from HuggingFace hub +scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8 + +# Test the API +scripts/deploy.sh test + +# Check status +scripts/deploy.sh status + +# Stop +scripts/deploy.sh stop +``` + +The script handles: GPU detection, quantization flag auto-detection (FP8 vs FP4), server lifecycle (start/stop/restart/status), health check polling, and API testing. + +## Decision Flow + +### 0. Check workspace (multi-user / Slack bot) + +If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. 
Before creating a new workspace, check for existing ones — especially if deploying a checkpoint from a prior PTQ run:

```bash
ls "$MODELOPT_WORKSPACE_ROOT/" 2>/dev/null
```

If the user says "deploy the model I just quantized" or references a previous PTQ, find the matching workspace and `cd` into it. The checkpoint should be in that workspace's output directory.

### 1. Identify the checkpoint

Determine what the user wants to deploy:

- **Local quantized checkpoint** (from the ptq skill or manual export): look for `hf_quant_config.json` in the directory. If coming from a prior PTQ run in the same workspace, check common output locations: `output/`, `outputs/`, `exported_model/`, or the `--export_path` used in the PTQ command.
- **HuggingFace model hub** (e.g., `nvidia/Llama-3.1-8B-Instruct-FP8`): use directly.
- **Unquantized model**: deploy as-is (BF16) or suggest quantizing first with the ptq skill.

> **Note:** This skill expects HF-format checkpoints (from PTQ with `--export_fmt hf`). TRT-LLM-format checkpoints should be deployed directly with TRT-LLM — see `references/trtllm.md`.

Check the quantization format if applicable:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null || echo "No hf_quant_config.json"
```

If not found, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither exists, the checkpoint is unquantized.

### 2. Choose the framework

If the user hasn't specified a framework, recommend based on this priority:

| Situation | Recommended | Why |
|-----------|-------------|-----|
| General use | **vLLM** | Widest ecosystem, easy setup, OpenAI-compatible |
| DeepSeek / Llama 4 models | **SGLang** | Strong DeepSeek/Llama 4 support |
| Maximum optimization | **TRT-LLM** | Best throughput via engine compilation |
| Mixed-precision / AutoQuant | **TRT-LLM AutoDeploy** | Only option for AutoQuant checkpoints |

Check the support matrix in `references/support-matrix.md` to confirm the model + format + framework combination is supported.

### 3. Check the environment

**GPU availability:**

```bash
python -c "import torch; [print(f'GPU {i}: {torch.cuda.get_device_name(i)}') for i in range(torch.cuda.device_count())] if torch.cuda.is_available() else print('no-gpu')"
```

**Framework installed?**

```bash
# vLLM
python -c "import vllm; print(f'vLLM {vllm.__version__}')" 2>/dev/null || echo "vLLM not installed"

# SGLang
python -c "import sglang; print(f'SGLang {sglang.__version__}')" 2>/dev/null || echo "SGLang not installed"

# TRT-LLM
python -c "import tensorrt_llm; print(f'TRT-LLM {tensorrt_llm.__version__}')" 2>/dev/null || echo "TRT-LLM not installed"
```

If the framework is not installed, consult `references/setup.md` for installation instructions.

**GPU memory estimate:**

- BF16 model: `num_params × 2 bytes` (e.g., 8B model ≈ 16 GB)
- FP8 model: `num_params × 1 byte` (e.g., 8B model ≈ 8 GB)
- FP4 model: `num_params × 0.5 bytes` (e.g., 8B model ≈ 4 GB)
- Add ~2-4 GB for KV cache and framework overhead

If the model exceeds single-GPU memory, use tensor parallelism (`--tp <N>`).

### 4. Deploy

Read the framework-specific reference for detailed instructions:

| Framework | Reference file |
|-----------|---------------|
| vLLM | `references/vllm.md` |
| SGLang | `references/sglang.md` |
| TRT-LLM | `references/trtllm.md` |

**Quick-start commands** (for common cases):

#### vLLM

```bash
# Serve as OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

#### SGLang

```bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <N> \
    --host 0.0.0.0 --port 8000
```

#### TRT-LLM (direct)

```python
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="<checkpoint_path>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
```

#### TRT-LLM AutoDeploy

For AutoQuant or mixed-precision checkpoints, see `references/trtllm.md`.

### 5. Verify the deployment

After the server starts, verify it's healthy:

```bash
# Health check
curl -s http://localhost:8000/health

# List models
curl -s http://localhost:8000/v1/models | python -m json.tool

# Test generation
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model_name>",
        "prompt": "The capital of France is",
        "max_tokens": 32
    }' | python -m json.tool
```

All checks must pass before reporting success to the user.

### 6. Benchmark (optional)

If the user wants throughput/latency numbers, run a quick benchmark:

```bash
# vLLM benchmark
python -m vllm.entrypoints.openai.api_server ... &  # if not already running

python -m vllm.benchmark_serving \
    --model <model_name> \
    --port 8000 \
    --num-prompts 100 \
    --request-rate 10
```

Report: throughput (tok/s), latency p50/p99, time to first token (TTFT).

### 7. Remote deployment (SSH/SLURM)

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

1. **Source remote utilities:**

   ```bash
   source .claude/skills/common/remote_exec.sh
   remote_load_cluster
   remote_check_ssh
   remote_detect_env
   ```

2. **Sync the checkpoint** (only if it was produced locally):

   If the checkpoint path is a remote/absolute path (e.g., from a prior PTQ run on the cluster), skip sync — it's already there. Verify with `remote_run "ls <remote_checkpoint_path>/config.json"`. Only sync if the checkpoint is local:

   ```bash
   remote_sync_to <local_checkpoint_dir> checkpoints/
   ```

3. **Deploy based on the remote environment:**

   - **SLURM** — write a job script that starts the server inside a container, then submit:

     ```bash
     srun --container-image="<image>" \
         --container-mounts="<host_dir>:<container_dir>" \
         python -m vllm.entrypoints.openai.api_server \
         --model <checkpoint_path> \
         --quantization modelopt \
         --host 0.0.0.0 --port 8000
     ```

     Use `remote_submit_job` and `remote_poll_job` to manage the job. The server runs on the allocated node — get its hostname from `squeue -j $JOBID -o %N`.

   - **Bare metal / Docker** — use `remote_run` to start the server directly:

     ```bash
     remote_run "nohup python -m vllm.entrypoints.openai.api_server --model <checkpoint_path> --port 8000 > deploy.log 2>&1 &"
     ```

4. **Verify remotely:**

   ```bash
   remote_run "curl -s http://localhost:8000/health"
   remote_run "curl -s http://localhost:8000/v1/models"
   ```

5. **Report the endpoint** — include the remote hostname and port so the user can connect (e.g., `http://<remote_host>:8000`). For SLURM, note that the port is only reachable from within the cluster network.

For NEL-managed deployment (evaluation with self-deployment), use the evaluation skill instead — NEL handles SLURM container deployment, health checks, and teardown automatically.
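The endpoint-reporting step above can be sketched as a small helper: given the output of `squeue -j $JOBID -o %N` and the server port, build the URL to hand back to the user. The function name is illustrative, not part of the skill's scripts.

```python
def slurm_endpoint(squeue_output: str, port: int = 8000) -> str:
    """Build the server URL from `squeue -j <jobid> -o %N` output.

    squeue prints a header line ("NODELIST") followed by the allocated
    node name; take the first non-header, non-empty line.
    """
    lines = [ln.strip() for ln in squeue_output.splitlines() if ln.strip()]
    nodes = [ln for ln in lines if ln.upper() != "NODELIST"]
    if not nodes:
        raise ValueError("no allocated node in squeue output")
    return f"http://{nodes[0]}:{port}"

print(slurm_endpoint("NODELIST\nnode-042\n"))  # http://node-042:8000
```

As noted above, this URL is only reachable from within the cluster network; an SSH tunnel is needed from outside.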
+ +## Error Handling + +| Error | Cause | Fix | +|-------|-------|-----| +| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model | +| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 | +| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag | +| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors | +| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` | + +## Success Criteria + +1. Server process is running and healthy (`/health` returns 200) +2. Model is listed at `/v1/models` +3. Test generation produces coherent output +4. Server URL and port are reported to the user +5. If benchmarking was requested, throughput/latency numbers are reported diff --git a/.claude/skills/deployment/references/setup.md b/.claude/skills/deployment/references/setup.md new file mode 100644 index 0000000000..4209f08647 --- /dev/null +++ b/.claude/skills/deployment/references/setup.md @@ -0,0 +1,85 @@ +# Deployment Environment Setup + +## Framework Installation + +### vLLM + +```bash +pip install vllm +``` + +Minimum version: 0.10.1 + +### SGLang + +```bash +pip install "sglang[all]" +``` + +Minimum version: 0.4.10 + +### TRT-LLM + +TRT-LLM is best installed via NVIDIA container: + +```bash +docker pull nvcr.io/nvidia/tensorrt-llm/release: +``` + +Or via pip (requires CUDA toolkit): + +```bash +pip install tensorrt-llm +``` + +Minimum version: 0.17.0 + +## SLURM Deployment + +For SLURM clusters, deploy inside a container. 
Container flags MUST be on the `srun` line:

```bash
#!/bin/bash
#SBATCH --job-name=deploy
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<N>
#SBATCH --time=04:00:00
#SBATCH --output=deploy_%j.log

srun \
    --container-image="<image>" \
    --container-mounts="<host_dir>:<container_dir>" \
    --container-workdir="<workdir>" \
    --no-container-mount-home \
    bash -c "python -m vllm.entrypoints.openai.api_server \
        --model <checkpoint_path> \
        --quantization modelopt \
        --tensor-parallel-size <N> \
        --host 0.0.0.0 --port 8000"
```

To access the server from outside the SLURM node, note the allocated hostname:

```bash
squeue -u $USER -o "%j %N %S"  # Get the node name
# Then SSH tunnel or use the node's hostname directly
```

## Docker Deployment

### vLLM with ModelOpt

A Dockerfile is available at `examples/vllm_serve/Dockerfile`:

```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 vllm-modelopt \
    python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```

diff --git a/.claude/skills/deployment/references/sglang.md b/.claude/skills/deployment/references/sglang.md
new file mode 100644
index 0000000000..62d5c57b59
--- /dev/null
+++ b/.claude/skills/deployment/references/sglang.md
@@ -0,0 +1,81 @@

# SGLang Deployment Reference

## Requirements

- SGLang >= 0.4.10
- `pip install "sglang[all]"`

## Server Deployment

### As OpenAI-compatible server

```bash
python -m sglang.launch_server \
    --model-path <checkpoint_path> \
    --quantization modelopt \
    --tp <N> \
    --host 0.0.0.0 --port 8000
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
+ +### As Python API + +```python +import sglang as sgl + +llm = sgl.Engine(model_path="", quantization="modelopt") +# For FP4: quantization="modelopt_fp4" + +sampling_params = {"temperature": 0.8, "top_p": 0.95} +outputs = llm.generate(["Hello, my name is"], sampling_params) + +for output in outputs: + print(f"Generated: {output['text']}") +``` + +### From HuggingFace Hub + +```python +import sglang as sgl + +llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt") +outputs = llm.generate(["What is AI?"], {"temperature": 0.8}) +``` + +## Speculative Decoding + +SGLang supports speculative decoding with EAGLE and EAGLE3 models: + +```bash +python -m sglang.launch_server \ + --model-path \ + --speculative-algorithm EAGLE \ + --speculative-draft-model-path \ + --speculative-num-steps 3 \ + --speculative-eagle-topk 4 \ + --tp \ + --host 0.0.0.0 --port 8000 +``` + +Reference: `examples/specdec_bench/specdec_bench/models/sglang.py` + +## Key SGLang Flags + +| Flag | Description | +|------|-------------| +| `--model-path` | Path to checkpoint or HF model ID | +| `--quantization` | `modelopt` (FP8) or `modelopt_fp4` (FP4) | +| `--tp` | Tensor parallelism size | +| `--ep` | Expert parallelism (for MoE models) | +| `--enable-torch-compile` | Enable torch.compile for better perf | +| `--cuda-graph-max-bs` | Max batch size for CUDA graphs | +| `--attention-backend` | `flashinfer` (default) or `triton` | + +## Common Issues + +| Issue | Fix | +|-------|-----| +| `quantization="modelopt"` not recognized | Upgrade SGLang to >= 0.4.10 | +| DeepSeek FP4 not working | Check support matrix — SGLang FP4 support varies by model | +| OOM on startup | Increase `--tp` or reduce `--max-total-tokens` | diff --git a/.claude/skills/deployment/references/support-matrix.md b/.claude/skills/deployment/references/support-matrix.md new file mode 100644 index 0000000000..8d0a671537 --- /dev/null +++ b/.claude/skills/deployment/references/support-matrix.md @@ -0,0 
+1,58 @@ +# Deployment Support Matrix + +## Unified HF Checkpoint — Framework Compatibility + +| Model | Quant Format | TRT-LLM | vLLM | SGLang | +|-------|-------------|---------|------|--------| +| Llama 3.x | FP8 | yes | yes | yes | +| Llama 3.x | FP4 | yes | yes | yes | +| Llama 4 | FP8 | yes | — | yes | +| Llama 4 | FP4 | yes | — | — | +| DeepSeek R1 | FP8 | yes | yes | yes | +| DeepSeek R1 | FP4 | yes | yes | yes | +| DeepSeek V3 | FP8 | yes | yes | yes | +| DeepSeek V3 | FP4 | yes | yes | yes | +| Qwen 3 | FP8 | yes | yes | yes | +| Qwen 3 | FP4 | yes | yes | — | +| Qwen 3 MoE | FP8 | yes | yes | yes | +| Qwen 3 MoE | FP4 | yes | — | — | +| Qwen 2.5 | FP8 | yes | yes | yes | +| Qwen 2.5 | FP4 | yes | yes | — | +| QwQ-32B | FP8 | yes | yes | yes | +| QwQ-32B | FP4 | yes | yes | — | +| Mixtral 8x7B | FP8 | yes | yes | yes | +| Mixtral 8x7B | FP4 | yes | — | — | + +## Supported Quantization Formats + +| Format | Description | +|--------|-------------| +| FP8 | 8-bit floating point (E4M3) | +| FP8_PB | 8-bit floating point with per-block scaling | +| NVFP4 | NVIDIA 4-bit floating point | +| NVFP4_AWQ | NVIDIA 4-bit floating point with AWQ optimization | +| INT4_AWQ | 4-bit integer with AWQ (TRT-LLM only) | +| W4A8_AWQ | 4-bit weights, 8-bit activations with AWQ (TRT-LLM only) | + +## Minimum Framework Versions + +| Framework | Minimum Version | +|-----------|----------------| +| TensorRT-LLM | v0.17.0 | +| vLLM | v0.10.1 | +| SGLang | v0.4.10 | + +## Quantization Flag by Framework + +| Framework | FP8 flag | FP4 flag | +|-----------|----------|----------| +| vLLM | `quantization="modelopt"` | `quantization="modelopt_fp4"` | +| SGLang | `quantization="modelopt"` | `quantization="modelopt_fp4"` | +| TRT-LLM | auto-detected from checkpoint | auto-detected from checkpoint | + +## Notes + +- **NVFP4 inference requires Blackwell GPUs** (B100, B200, GB200). Hopper can run FP4 calibration but not inference. 
- INT4_AWQ and W4A8_AWQ are only supported by TRT-LLM (not vLLM or SGLang).
- Other models/formats may work but are not officially validated.
- Source: `examples/llm_ptq/README.md` and `docs/source/deployment/3_unified_hf.rst`

diff --git a/.claude/skills/deployment/references/trtllm.md b/.claude/skills/deployment/references/trtllm.md
new file mode 100644
index 0000000000..5725bed3bf
--- /dev/null
+++ b/.claude/skills/deployment/references/trtllm.md
@@ -0,0 +1,109 @@

# TRT-LLM Deployment Reference

## Requirements

- TensorRT-LLM >= 0.17.0
- Typically installed via NVIDIA container: `nvcr.io/nvidia/tensorrt-llm/release:<version>`
- Or: `pip install tensorrt-llm`

## Direct LLM API (recommended for unified HF checkpoints)

### Python API

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>")
# Quantization format is auto-detected from hf_quant_config.json

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

### From HuggingFace Hub

```python
from tensorrt_llm import LLM

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
print(llm.generate(["What is AI?"]))
```

### With tensor parallelism

```python
from tensorrt_llm import LLM

llm = LLM(model="<checkpoint_path>", tensor_parallel_size=4)
```

## AutoDeploy (for AutoQuant / mixed-precision)

AutoDeploy automates graph transformations for optimized inference. Required for AutoQuant checkpoints.
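The `--effective_bits` target used below can be read as a weighted average of per-layer weight precisions. As a rough illustration only — not AutoQuant's actual layer-selection search — keeping a fraction `f` of weights in FP8 and quantizing the rest to NVFP4 yields `8f + 4(1 − f)` effective bits:

```python
def effective_bits(fp8_fraction: float) -> float:
    """Weighted-average weight precision when a fraction of weights stay
    in FP8 (8-bit) and the rest are quantized to NVFP4 (4-bit)."""
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return 8.0 * fp8_fraction + 4.0 * (1.0 - fp8_fraction)

# --effective_bits 4.5 corresponds to keeping ~12.5% of weights in FP8
print(effective_bits(0.125))  # 4.5
```

This is why raising `--effective_bits` gives sensitive layers more headroom: a higher target forces more layers to stay in the higher-precision format.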
### End-to-end script

```bash
# Quantize and deploy in one step
./examples/llm_autodeploy/scripts/run_auto_quant_and_deploy.sh \
    --hf_ckpt <hf_checkpoint_path> \
    --save_quantized_ckpt <output_path> \
    --quant fp8,nvfp4 \
    --effective_bits 4.5
```

Parameters:

- `--hf_ckpt`: Path to the unquantized HuggingFace checkpoint
- `--save_quantized_ckpt`: Output path for the quantized checkpoint
- `--quant`: Quantization formats (e.g., `fp8,nvfp4`)
- `--effective_bits`: Target precision (higher = more accuracy for sensitive layers)
- `--world_size`: Number of GPUs for tensor parallelism
- `--calib_batch_size`: Calibration batch size (reduce if OOM; default 8)

### AutoDeploy API server

```python
# examples/llm_autodeploy/api_server.py provides a FastAPI server
# with OpenAI-compatible endpoints using AutoDeploy
```

### Test AutoDeploy

```bash
python examples/llm_autodeploy/api_client.py --prompt "What is AI?" "What is golf?"
```

### Notes

- NVFP4 in AutoDeploy requires Blackwell GPUs
- For Hopper: remove `nvfp4` from `--quant` and set `--effective_bits` above 8.0
- AutoDeploy supports CUDA graphs, torch compile backends, and KV cache optimization

## Legacy TRT-LLM Checkpoint (deprecated)

The legacy export path using `export_tensorrt_llm_checkpoint()` is deprecated. Use the unified HF checkpoint format with `export_hf_checkpoint()` instead.

If you encounter a legacy checkpoint (no `hf_quant_config.json`, a `rank*.safetensors` file pattern), it needs the TRT-LLM build API to create an engine before deployment. See `docs/source/deployment/1_tensorrt_llm.rst`.
## Evaluation with TRT-LLM

```bash
# examples/llm_eval/lm_eval_tensorrt_llm.py
# Runs lm_evaluation_harness benchmarks with TRT-LLM
python examples/llm_eval/lm_eval_tensorrt_llm.py \
    --model_path <checkpoint_path> \
    --tasks gsm8k,mmlu
```

## Common Issues

| Issue | Fix |
|-------|-----|
| `No module named tensorrt_llm` | Install via container or pip |
| NVFP4 inference fails on Hopper | NVFP4 requires Blackwell GPUs for inference |
| Slow first inference | Engine compilation happens on the first run; subsequent runs are cached |
| OOM during engine build | Reduce `--max_batch_size` or increase TP |

diff --git a/.claude/skills/deployment/references/vllm.md b/.claude/skills/deployment/references/vllm.md
new file mode 100644
index 0000000000..89e06bde42
--- /dev/null
+++ b/.claude/skills/deployment/references/vllm.md
@@ -0,0 +1,91 @@

# vLLM Deployment Reference

## Requirements

- vLLM >= 0.10.1
- `pip install vllm`

## Realquant Deployment (recommended)

Realquant uses dedicated quantized kernels for maximum performance. This is the default path for ModelOpt-exported checkpoints.

### As OpenAI-compatible server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --tensor-parallel-size <N> \
    --host 0.0.0.0 --port 8000 \
    --served-model-name <model_name>
```

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
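Before launching, it helps to sanity-check GPU memory against the model size. A back-of-the-envelope sketch using the rule of thumb from the skill (weights at 2 bytes/param for BF16, 1 for FP8, 0.5 for FP4, plus a few GB of KV-cache/runtime overhead); the function name and the power-of-two TP assumption are illustrative:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def min_tp_size(num_params_b: float, fmt: str, gpu_mem_gb: float,
                overhead_gb: float = 4.0) -> int:
    """Smallest tensor-parallel size whose per-GPU share of the weights
    (plus a fixed overhead allowance) fits in one GPU's memory.

    num_params_b: model size in billions of parameters.
    """
    weights_gb = num_params_b * BYTES_PER_PARAM[fmt]
    tp = 1
    while weights_gb / tp + overhead_gb > gpu_mem_gb:
        tp *= 2  # TP sizes are typically powers of two
    return tp

# A 70B FP4 model is ~35 GB of weights, so it fits on a single 80 GB GPU
print(min_tp_size(70, "fp4", 80))
```

If the estimate calls for `tp > 1`, pass it via `--tensor-parallel-size`; treat the result as a starting point, since the KV cache grows with `--max-model-len` and batch size.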
### As Python API

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<checkpoint_path>", quantization="modelopt")
# For FP4: quantization="modelopt_fp4"

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```

### From HuggingFace Hub

```python
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
outputs = llm.generate(["What is AI?"], SamplingParams(temperature=0.8))
```

## Fakequant Deployment (research)

Fakequant is 2-5x slower than realquant but doesn't require dedicated kernel support. Useful for research and for testing new quantization schemes.

Reference: `examples/vllm_serve/`

```bash
# Environment variables for configuration
export QUANT_CFG=NVFP4_DEFAULT_CFG    # Quantization format
export QUANT_CALIB_SIZE=512           # Calibration samples
export QUANT_DATASET=cnn_dailymail    # Calibration dataset

python examples/vllm_serve/vllm_serve_fakequant.py \
    <model_path> -tp <N> --host 0.0.0.0 --port 8000
```

## Benchmarking

```bash
# Start the server first, then benchmark
python -m vllm.benchmark_serving \
    --model <model_name> \
    --port 8000 \
    --num-prompts 100 \
    --request-rate 10
```

Or use lm_eval for accuracy:

```bash
lm_eval --model local-completions \
    --tasks gsm8k \
    --model_args model=<model_name>,base_url=http://localhost:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=128
```

## Common Issues

| Issue | Fix |
|-------|-----|
| `quantization="modelopt"` not recognized | Upgrade vLLM to >= 0.10.1 |
| OOM on startup | Increase `--tensor-parallel-size` or reduce `--max-model-len` |
| AWQ checkpoints not loading | AWQ is not supported in vLLM via the modelopt path; use FP8 or NVFP4 |
| Mixed precision not working | Not supported for fakequant |

diff --git
a/.claude/skills/deployment/scripts/deploy.sh b/.claude/skills/deployment/scripts/deploy.sh new file mode 100755 index 0000000000..3147a73dfd --- /dev/null +++ b/.claude/skills/deployment/scripts/deploy.sh @@ -0,0 +1,581 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ModelOpt Deployment Script +# Deploy quantized or unquantized models via vLLM, SGLang, or TRT-LLM +# Supports ModelOpt FP8/FP4 checkpoints with automatic quantization flag detection + +set -eo pipefail + +# Default configuration +MODEL="" +PORT=8000 +HOST="0.0.0.0" +FRAMEWORK="vllm" +TP_SIZE=1 +VRAM=0.9 +MAX_WAIT=300 # 5 min for large models +QUANTIZATION="" # auto-detected from checkpoint + +# Paths +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +LOG_DIR="${LOG_DIR:-/tmp/modelopt-deploy}" +LOG_FILE="$LOG_DIR/server.log" +PID_FILE="$LOG_DIR/server.pid" +META_FILE="$LOG_DIR/server.meta" # persists model/framework/port for status + +# Colors +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +log_info() { printf "${BLUE}[INFO]${NC} %s\n" "$1" >&2; } +log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1" >&2; } +log_warn() { printf "${YELLOW}[WARN]${NC} %s\n" "$1" >&2; } +log_error() { printf "${RED}[ERROR]${NC} %s\n" "$1" >&2; } + +usage() { + cat < [OPTIONS] + +Commands: + start - Start the 
inference server + stop - Stop the inference server + test - Test the API endpoint + status - Show server status + restart - Restart the server + detect - Detect checkpoint format (without starting) + +Options: + --model PATH Model path or HF model ID (required for start) + --framework FRAMEWORK vllm, sglang, or trtllm (default: vllm) + --port PORT Server port (default: 8000) + --tp SIZE Tensor parallel size (default: 1) + --quantization QUANT Force quantization flag (modelopt, modelopt_fp4, or none) + --gpu-memory-utilization GPU memory utilization 0.0-1.0 (default: 0.9) + --log-dir DIR Log directory (default: /tmp/modelopt-deploy) + +Examples: + $0 start --model ./qwen3-0.6b-fp8 + $0 start --model ./llama-70b-nvfp4 --framework sglang --tp 4 + $0 start --model nvidia/Llama-3.1-8B-Instruct-FP8 --framework vllm + $0 test --port 8000 + $0 stop +EOF + exit 1 +} + +# ─── Checkpoint Detection ─────────────────────────────────────────── + +detect_quantization() { + local model_path="$1" + + # Skip detection for HF model IDs (no local path) + if [[ ! -d "$model_path" ]]; then + log_info "Model is a HF ID — using name-based heuristic for quantization flag" + # Best-effort: infer from model name. This is a fallback; local checkpoints + # use hf_quant_config.json which is reliable. + if echo "$model_path" | grep -qi "fp8"; then + log_info "HF model name contains 'fp8' — assuming modelopt quantization" + echo "modelopt" + elif echo "$model_path" | grep -qi "fp4\|nvfp4"; then + log_info "HF model name contains 'fp4/nvfp4' — assuming modelopt_fp4 quantization" + echo "modelopt_fp4" + else + log_info "No quantization format detected in model name — treating as unquantized" + echo "none" + fi + return + fi + + # Require python3 for JSON parsing + if ! 
command -v python3 &>/dev/null; then + log_error "python3 is required to detect quantization format but is not installed" + return 1 + fi + + # Local checkpoint: check hf_quant_config.json + local quant_config="$model_path/hf_quant_config.json" + if [[ -f "$quant_config" ]]; then + log_info "Found hf_quant_config.json" + + local quant_algo + quant_algo=$(python3 -c " +import json, sys +with open(sys.argv[1]) as f: + cfg = json.load(f) +quant_algo = cfg.get('quantization', {}).get('quant_algo', '') +print(quant_algo) +" "$quant_config" 2>&1) || { + log_error "Failed to parse hf_quant_config.json: $quant_algo" + return 1 + } + + if echo "$quant_algo" | grep -qi "fp4"; then + echo "modelopt_fp4" + else + echo "modelopt" + fi + elif [[ -f "$model_path/config.json" ]]; then + # Fallback: check config.json for quantization_config with quant_method=modelopt + local quant_method + quant_method=$(python3 -c " +import json, sys +with open(sys.argv[1]) as f: + cfg = json.load(f) +qc = cfg.get('quantization_config', {}) +if qc.get('quant_method') == 'modelopt': + print(qc.get('quant_algo', 'fp8')) +" "$model_path/config.json" 2>&1) || { + log_error "Failed to parse config.json: $quant_method" + return 1 + } + if [[ -n "$quant_method" ]]; then + log_info "Found quantization_config in config.json (quant_method=modelopt)" + if echo "$quant_method" | grep -qi "fp4"; then + echo "modelopt_fp4" + else + echo "modelopt" + fi + else + log_info "No quantization config found — treating as unquantized" + echo "none" + fi + else + log_info "No hf_quant_config.json or config.json found — treating as unquantized" + echo "none" + fi +} + +detect_gpu() { + if command -v nvidia-smi &>/dev/null; then + local gpu_count + gpu_count=$(nvidia-smi -L 2>/dev/null | wc -l) + local gpu_name + gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) + log_info "GPUs: ${gpu_count}x ${gpu_name}" + echo "$gpu_count" + else + log_error "No NVIDIA GPU detected (nvidia-smi not 
found)"
+        return 1
+    fi
+}
+
+# ─── Server Management ──────────────────────────────────────────────
+
+is_server_running() {
+    if [[ -f "$PID_FILE" ]]; then
+        local pid
+        pid=$(cat "$PID_FILE" 2>/dev/null)
+        if [[ ! "$pid" =~ ^[0-9]+$ ]]; then
+            rm -f "$PID_FILE"
+            return 1
+        fi
+        # Verify the PID is actually a Python/vLLM/SGLang process (not PID reuse)
+        local cmdline
+        cmdline=$(ps -p "$pid" -o args= 2>/dev/null) || { rm -f "$PID_FILE"; return 1; }
+        if echo "$cmdline" | grep -q "vllm\|sglang\|python"; then
+            return 0
+        fi
+        # PID exists but is not our server — stale PID file
+        rm -f "$PID_FILE"
+    fi
+    return 1
+}
+
+start_server() {
+    # Validate GPU availability and TP size
+    local gpu_count
+    gpu_count=$(detect_gpu) || exit 1
+    if [[ "$TP_SIZE" -gt "$gpu_count" ]]; then
+        log_error "Requested TP size ($TP_SIZE) exceeds available GPUs ($gpu_count)"
+        exit 1
+    fi
+
+    if [[ -z "$MODEL" ]]; then
+        log_error "--model is required"
+        usage
+    fi
+
+    if is_server_running; then
+        log_warn "Server already running (PID: $(cat "$PID_FILE"))"
+        return 0
+    fi
+
+    # Check if port is already in use
+    if ss -tlnp 2>/dev/null | grep -q ":${PORT} " || \
+       lsof -i ":${PORT}" -sTCP:LISTEN >/dev/null 2>&1; then
+        log_error "Port $PORT is already in use — stop the existing service or use --port <port>"
+        exit 1
+    fi
+
+    mkdir -p "$LOG_DIR"
+
+    # Auto-detect quantization if not forced
+    if [[ -z "$QUANTIZATION" ]]; then
+        if ! QUANTIZATION=$(detect_quantization "$MODEL"); then
+            log_error "Failed to detect quantization — fix the checkpoint or use --quantization to override"
+            exit 1
+        fi
+    fi
+    log_info "Quantization: $QUANTIZATION"
+
+    # Save metadata for status command (values single-quoted for safe reading)
+    cat >"$META_FILE" <<EOF
+FRAMEWORK='$FRAMEWORK'
+MODEL='$MODEL'
+PORT='$PORT'
+QUANTIZATION='$QUANTIZATION'
+TP_SIZE='$TP_SIZE'
+EOF
+
+    # Dispatch to the chosen framework, then wait for the health endpoint
+    case "$FRAMEWORK" in
+        vllm) start_vllm ;;
+        sglang) start_sglang ;;
+        trtllm) start_trtllm ;;
+        *) log_error "Unknown framework: $FRAMEWORK"; exit 1 ;;
+    esac
+
+    wait_for_server
+}
+
+start_vllm() {
+    log_info "Starting vLLM server..."
+
+    local -a cmd=(python3 -m vllm.entrypoints.openai.api_server
+        --model "$MODEL"
+        --host "$HOST" --port "$PORT"
+        --tensor-parallel-size "$TP_SIZE"
+        --gpu-memory-utilization "$VRAM")
+
+    if [[ "$QUANTIZATION" != "none" ]]; then
+        cmd+=(--quantization "$QUANTIZATION")
+    fi
+
+    log_info "Command: ${cmd[*]}"
+    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    echo $! >"$PID_FILE"
+
+    # Check for immediate crash (missing module, port conflict, CUDA error)
+    sleep 2
+    if ! ps -p "$(cat "$PID_FILE")" >/dev/null 2>&1; then
+        log_error "Server process exited immediately. Last log lines:"
+        tail -20 "$LOG_FILE" 2>/dev/null
+        rm -f "$PID_FILE"
+        exit 1
+    fi
+    log_success "vLLM started (PID: $(cat "$PID_FILE"))"
+}
+
+start_sglang() {
+    log_info "Starting SGLang server..."
+
+    local -a cmd=(python3 -m sglang.launch_server
+        --model-path "$MODEL"
+        --host "$HOST" --port "$PORT"
+        --tp "$TP_SIZE")
+
+    if [[ "$QUANTIZATION" != "none" ]]; then
+        cmd+=(--quantization "$QUANTIZATION")
+    fi
+
+    log_info "Command: ${cmd[*]}"
+    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    echo $! >"$PID_FILE"
+
+    # Check for immediate crash
+    sleep 2
+    if ! ps -p "$(cat "$PID_FILE")" >/dev/null 2>&1; then
+        log_error "Server process exited immediately. Last log lines:"
+        tail -20 "$LOG_FILE" 2>/dev/null
+        rm -f "$PID_FILE"
+        exit 1
+    fi
+    log_success "SGLang started (PID: $(cat "$PID_FILE"))"
+}
+
+start_trtllm() {
+    log_info "Starting TRT-LLM server..."
+    log_info "TRT-LLM serving is not automated by this script."
+    log_info "Options for TRT-LLM deployment:"
+
+    cat <<TRTEOF
+# Option 1: AutoDeploy (quantize + deploy in one step; see references/trtllm.md)
+<autodeploy command> \\
+    --quant fp8,nvfp4 \\
+    --effective_bits 4.5
+
+# Option 2: Python API
+python3 -c "
+from tensorrt_llm import LLM, SamplingParams
+llm = LLM(model='$MODEL')
+print(llm.generate(['Hello, my name is'], SamplingParams(temperature=0.8)))
+"
+TRTEOF
+
+    log_warn "TRT-LLM server mode not yet automated in this script."
+    log_warn "Use vLLM or SGLang for OpenAI-compatible serving of ModelOpt checkpoints."
+    return 1
+}
+
+wait_for_server() {
+    log_info "Waiting for server at http://localhost:$PORT ..."
+    local elapsed=0
+    while [[ $elapsed -lt $MAX_WAIT ]]; do
+        if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+            log_success "Server is ready! (${elapsed}s)"
+            return 0
+        fi
+
+        # Check if process died
+        if ! is_server_running; then
+            log_error "Server process died. Check logs: $LOG_FILE"
+            tail -20 "$LOG_FILE" 2>/dev/null
+            exit 1
+        fi
+
+        sleep 5
+        elapsed=$((elapsed + 5))
+        printf "."
+    done
+
+    echo ""
+    log_error "Server not ready after ${MAX_WAIT}s. Check logs: $LOG_FILE"
+    tail -20 "$LOG_FILE" 2>/dev/null
+    exit 1
+}
+
+stop_server() {
+    if ! is_server_running; then
+        log_warn "Server is not running"
+        return 0
+    fi
+
+    local pid
+    pid=$(cat "$PID_FILE")
+    log_info "Stopping server (PID: $pid)..."
+
+    # Kill the process group to catch child processes (vLLM/SGLang may fork)
+    kill -- -"$pid" 2>/dev/null || kill "$pid" 2>/dev/null || true
+
+    # Wait for graceful shutdown
+    for i in {1..15}; do
+        if ! ps -p "$pid" >/dev/null 2>&1; then
+            rm -f "$PID_FILE" "$META_FILE"
+            log_success "Server stopped"
+            return 0
+        fi
+        sleep 1
+    done
+
+    # Force kill
+    log_warn "Force killing..."
+    kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
+    sleep 1
+    if ps -p "$pid" >/dev/null 2>&1; then
+        log_error "Failed to kill server process $pid — manual intervention required"
+    fi
+    rm -f "$PID_FILE" "$META_FILE"
+
+    # Check for orphaned GPU worker processes (pgrep -f takes an ERE, so use an unescaped |)
+    local orphans
+    orphans=$(pgrep -f "vllm|sglang" 2>/dev/null | wc -l)
+    if [[ "$orphans" -gt 0 ]]; then
+        log_warn "Found $orphans potential orphaned server processes — run: pkill -f 'vllm|sglang'"
+    fi
+    log_success "Server stopped (forced)"
+}
+
+test_api() {
+    log_info "Testing API at http://localhost:$PORT ..."
+
+    # Health check
+    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+        log_error "Server not responding at port $PORT"
+        exit 1
+    fi
+    log_success "Health check passed"
+
+    # List models
+    log_info "Available models:"
+    curl -s "http://localhost:$PORT/v1/models" | python3 -m json.tool 2>/dev/null || true
+
+    # Test completion
+    log_info "Sending test request..."
+
+    local model_id
+    model_id=$(curl -s "http://localhost:$PORT/v1/models" | python3 -c "
+import sys, json
+data = json.load(sys.stdin)
+print(data['data'][0]['id'])
+" 2>/dev/null)
+
+    if [[ -z "$model_id" ]]; then
+        log_error "Could not determine model ID from /v1/models endpoint"
+        exit 1
+    fi
+
+    local payload
+    payload=$(python3 -c "
+import json, sys
+print(json.dumps({'model': sys.argv[1], 'prompt': 'The capital of France is', 'max_tokens': 32, 'temperature': 0.7}))
+" "$model_id")
+
+    local response
+    response=$(curl -s "http://localhost:$PORT/v1/completions" \
+        -H "Content-Type: application/json" \
+        -d "$payload")
+
+    echo "$response" | python3 -m json.tool 2>/dev/null || echo "$response"
+
+    local text
+    text=$(echo "$response" | python3 -c "
+import sys, json
+data = json.load(sys.stdin)
+print(data['choices'][0]['text'])
+" 2>/dev/null)
+
+    if [[ -n "$text" ]]; then
+        log_success "API test passed!"
+        printf "${GREEN}Response:${NC} %s\n" "$text"
+    else
+        log_error "No valid response from API"
+        exit 1
+    fi
+}
+
+show_status() {
+    echo "=== ModelOpt Deployment Status ==="
+    echo ""
+    if is_server_running; then
+        local pid
+        pid=$(cat "$PID_FILE")
+        log_success "Server running (PID: $pid)"
+
+        # Read saved metadata safely (no source — avoids shell injection)
+        if [[ -f "$META_FILE" ]]; then
+            FRAMEWORK=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            MODEL=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            PORT=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            QUANTIZATION=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            TP_SIZE=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+        fi
+
+        echo "  Framework: ${FRAMEWORK:-unknown}"
+        echo "  Model:     ${MODEL:-unknown}"
+        echo "  Endpoint:  http://localhost:${PORT:-8000}"
+        echo "  Logs:      $LOG_FILE"
+        echo ""
+        if [[ -f "$LOG_FILE" ]]; then
+            echo "Recent logs:"
+            tail -5 "$LOG_FILE"
+        fi
+    else
+        log_warn "Server is not running"
+        echo "  Start with: $0 start --model <path>"
+    fi
+}
+
+# ─── Argument Parsing ───────────────────────────────────────────────
+
+COMMAND=""
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --model|--framework|--port|--tp|--quantization|--gpu-memory-utilization|--log-dir)
+            if [[ -z "${2:-}" || "$2" == -* ]]; then
+                log_error "Option $1 requires a value"
+                usage
+            fi
+            ;;&
+        --model) MODEL="$2"; _CLI_MODEL=1; shift 2 ;;
+        --framework) FRAMEWORK="$2"; _CLI_FRAMEWORK=1; shift 2 ;;
+        --port) PORT="$2"; _CLI_PORT=1; shift 2 ;;
+        --tp) TP_SIZE="$2"; _CLI_TP=1; shift 2 ;;
+        --quantization) QUANTIZATION="$2"; _CLI_QUANT=1; shift 2 ;;
+        --gpu-memory-utilization) VRAM="$2"; shift 2 ;;
+        --log-dir) LOG_DIR="$2"; LOG_FILE="$LOG_DIR/server.log"; PID_FILE="$LOG_DIR/server.pid"; META_FILE="$LOG_DIR/server.meta"; shift 2 ;;
+        start|stop|test|status|restart|detect)
+            COMMAND="$1"; shift ;;
+        *)
+            log_error "Unknown option: $1"
+            usage ;;
+    esac
+done
+
+if [[ -z "$COMMAND" ]]; then
+    usage
+fi
+
+# Validate numeric arguments
+if [[ -n "$PORT" && ! "$PORT" =~ ^[0-9]+$ ]]; then
+    log_error "--port must be a number, got: $PORT"
+    exit 1
+fi
+if [[ -n "$TP_SIZE" && ! "$TP_SIZE" =~ ^[1-9][0-9]*$ ]]; then
+    log_error "--tp must be a positive integer, got: $TP_SIZE"
+    exit 1
+fi
+
+# Execute
+case "$COMMAND" in
+    start) start_server ;;
+    stop) stop_server ;;
+    test) test_api ;;
+    status) show_status ;;
+    restart)
+        # Load ALL fields from metadata, then let CLI args override
+        if [[ -f "$META_FILE" ]]; then
+            _saved_model=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_framework=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_port=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_quant=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            _saved_tp=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            # Apply saved values as defaults; CLI args (tracked via _CLI_*) win
+            [[ -z "${_CLI_MODEL:-}" ]] && MODEL="${_saved_model:-$MODEL}"
+            [[ -z "${_CLI_FRAMEWORK:-}" ]] && FRAMEWORK="${_saved_framework:-$FRAMEWORK}"
+            [[ -z "${_CLI_PORT:-}" ]] && PORT="${_saved_port:-$PORT}"
+            [[ -z "${_CLI_QUANT:-}" ]] && QUANTIZATION="${_saved_quant:-$QUANTIZATION}"
+            [[ -z "${_CLI_TP:-}" ]] && TP_SIZE="${_saved_tp:-$TP_SIZE}"
+        fi
+        stop_server; sleep 2; start_server ;;
+    detect)
+        if [[ -z "$MODEL" ]]; then
+            log_error "--model is required for detect"
+            exit 1
+        fi
+        if ! quant=$(detect_quantization "$MODEL"); then
+            exit 1
+        fi
+        echo "Detected quantization: $quant"
+        ;;
+    *) usage ;;
+esac
diff --git a/.claude/skills/deployment/tests/remote-slurm-deployment.json b/.claude/skills/deployment/tests/remote-slurm-deployment.json
new file mode 100644
index 0000000000..9a6aec688d
--- /dev/null
+++ b/.claude/skills/deployment/tests/remote-slurm-deployment.json
@@ -0,0 +1,17 @@
+{
+  "skills": ["deployment"],
+  "query": "deploy my quantized model on the SLURM cluster",
+  "files": [],
+  "expected_behavior": [
+    "Checks for cluster config at ~/.config/modelopt/clusters.yaml or .claude/clusters.yaml",
+    "Sources .claude/skills/common/remote_exec.sh",
+    "Calls remote_load_cluster, remote_check_ssh, remote_detect_env",
+    "Checks if checkpoint is already on remote (e.g., from prior PTQ run) before syncing; only syncs if local",
+    "For SLURM: writes a job script with srun --container-image and --container-mounts on srun line (not #SBATCH)",
+    "Starts vLLM/SGLang server inside the container via srun",
+    "Gets allocated node hostname from squeue -j $JOBID -o %N",
+    "Verifies remotely: remote_run 'curl -s http://localhost:8000/health'",
+    "Reports the remote endpoint (http://<node>:8000) and notes SLURM network restrictions",
+    "Reads framework-specific reference (references/vllm.md or references/sglang.md) for deployment flags"
+  ]
+}
diff --git a/.claude/skills/deployment/tests/vllm-fp8-local.json b/.claude/skills/deployment/tests/vllm-fp8-local.json
new file mode 100644
index 0000000000..42a636b304
--- /dev/null
+++ b/.claude/skills/deployment/tests/vllm-fp8-local.json
@@ -0,0 +1,20 @@
+{
+  "skills": ["deployment"],
+  "query": "deploy my quantized model at ./qwen3-0.6b-fp8 with vLLM",
+  "files": [],
+  "expected_behavior": [
+    "Identifies ./qwen3-0.6b-fp8 as a local quantized checkpoint",
+    "Reads hf_quant_config.json and detects FP8 quantization format",
+    "Confirms vLLM is the chosen framework",
+    "Checks vLLM is installed and version >= 0.10.1",
+    "Detects local GPU via nvidia-smi or torch.cuda",
+    "Estimates GPU memory: 0.6B params x 1 byte (FP8) = ~0.6 GB, fits single GPU",
+    "Reads references/vllm.md for deployment instructions",
+    "Uses deploy.sh or runs: python -m vllm.entrypoints.openai.api_server --model ./qwen3-0.6b-fp8 --quantization modelopt --host 0.0.0.0 --port 8000",
+    "Passes --quantization modelopt (not modelopt_fp4) since checkpoint is FP8",
+    "Waits for server health check at /health endpoint",
+    "Verifies /v1/models lists the model",
+    "Sends test generation request to /v1/completions and confirms coherent output",
+    "Reports server URL (http://localhost:8000) and port to user"
+  ]
+}
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
index 4c5a690145..45bd2eb683 100644
--- a/.markdownlint-cli2.yaml
+++ b/.markdownlint-cli2.yaml
@@ -5,3 +5,8 @@ config:
   MD033: false # no-inline-html
   MD041: false # first-line-heading
   MD059: false # no-hard-tabs
+  MD029: false # ol-prefix - allow 1. 2. 3. style numbered lists
+  MD032: false # blanks-around-lists - don't force blank lines around lists
+  MD036: false # no-emphasis-as-heading - allow **bold** as section markers
+  MD005: false # list-indent - allow flexible list item indentation
+  MD007: false # ul-indent - allow unindented sub-lists under numbered lists
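
Reviewer note: the `server.meta` round trip in `deploy.sh` (values written single-quoted via heredoc in `start_server`, read back with `grep`/`cut`/`tr` rather than `source` so a crafted value cannot execute code) can be exercised in isolation. A minimal sketch, using a temp file and illustrative values in place of the script's `$META_FILE` and CLI-supplied settings:

```shell
#!/usr/bin/env bash
set -eu

# Illustrative stand-in for the script's $META_FILE (temp file, not the real path)
META_FILE=$(mktemp)

# Example values; in deploy.sh these come from CLI flags and auto-detection
FRAMEWORK="vllm"
MODEL="./qwen3-0.6b-fp8"
PORT=8000
QUANTIZATION="modelopt"
TP_SIZE=1

# Write: one KEY='value' pair per line, values single-quoted (as start_server does)
cat >"$META_FILE" <<EOF
FRAMEWORK='$FRAMEWORK'
MODEL='$MODEL'
PORT='$PORT'
QUANTIZATION='$QUANTIZATION'
TP_SIZE='$TP_SIZE'
EOF

# Read: grep/cut/tr instead of sourcing the file, mirroring show_status
read_meta() { grep "^$1=" "$META_FILE" | cut -d= -f2- | tr -d "'"; }

echo "framework=$(read_meta FRAMEWORK)"              # prints framework=vllm
echo "endpoint=http://localhost:$(read_meta PORT)"   # prints endpoint=http://localhost:8000
```

The same `grep '^KEY=' | cut -d= -f2- | tr -d "'"` pattern is what `show_status` and the `restart` branch use, which is why the writer must keep the strict `KEY='value'` one-pair-per-line shape.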