Conversation
**Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough

Adds a new deployment skill with detailed references and tooling for serving quantized or unquantized LLM checkpoints via an OpenAI-compatible HTTP API, including framework guides (vLLM, SGLang, TRT-LLM), setup (SLURM/Docker), and a lifecycle Bash CLI (deploy.sh).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User (CLI)
    participant Script as Deploy Script
    participant Checkpoint as Checkpoint (Local / HF)
    participant Framework as Framework (vLLM / SGLang / TRT-LLM)
    participant Health as Health Endpoint (/health)
    User->>Script: run `deploy.sh start --model ... --framework ...`
    Script->>Checkpoint: detect_quantization()
    Checkpoint-->>Script: return quantization (modelopt / modelopt_fp4 / none)
    Script->>Framework: launch with chosen flags/command
    Framework-->>Script: PID / process started
    Script->>Health: poll /health until ready
    Health-->>Script: healthy
    Script-->>User: report endpoint and status
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/deployment/references/setup.md:
- Around line 77-84: The example run command launches
vllm.entrypoints.openai.api_server with --model <checkpoint_path> but does not
bind-mount the local checkpoint into the container; update the docker run
invocation to include a bind mount (e.g., -v /path/on/host:/model_dir) and
change the --model argument to the checkpoint path inside the container (e.g.,
/model_dir/<checkpoint_file>) so the process in the container can access the
checkpoint.
In @.claude/skills/deployment/scripts/deploy.sh:
- Line 67: The usage text and dispatcher currently advertise and accept "trtllm"
as a FRAMEWORK but start_trtllm() only prints manual instructions and exits
non-zero, causing start --framework trtllm to always fail; remove "trtllm" from
the supported framework list in usage() and the framework-dispatch paths (places
that parse --framework) or implement a real startup flow in start_trtllm() so it
behaves like start_vllm()/start_sglang(); specifically update the usage() string
(the line showing "--framework FRAMEWORK"), adjust any framework-dispatch logic
that treats "trtllm" as first-class, and either fully implement start_trtllm()
or stop advertising "trtllm" until it is implemented.
- Around line 103-119: The current check around quant_config (variable
quant_config) and the inline python snippet only tests for "fp4" and maps
everything else to "modelopt", which misclassifies other quant_algos (e.g.,
INT4_AWQ, W4A8_AWQ); modify the logic in the python snippet (or add a small
wrapper) to parse cfg.get('quantization', {}).get('quant_algo', '') into a
variable and explicitly branch: if quant_algo contains "fp4" echo
"modelopt_fp4", else if quant_algo matches known TRT-only/other formats like
"INT4_AWQ" or "W4A8_AWQ" echo a distinct value (or exit non‑zero) so they don’t
get treated as modelopt, otherwise echo "modelopt" (or fail fast for unknown
values); update the echo outputs and error handling accordingly so functions
relying on those outputs (the inline python call and the script logic that
consumes "modelopt_fp4"/"modelopt") can correctly select the backend.
- Around line 173-179: The server.meta write currently injects raw user values
into META_FILE (variable META_FILE/server.meta) and later is sourced, causing
command injection and loss of metadata on restart; change the META_FILE writer
(the block that writes FRAMEWORK, MODEL, PORT, QUANTIZATION, TP_SIZE) to store
safely escaped/quoted values or a non-executable format (e.g., JSON or
POSIX-safe key="value" lines with proper escaping) and stop using shell source
to read it; update the reader (where META_FILE is sourced) to parse the file
safely (e.g., read with IFS and parameter expansion or a JSON parser) instead of
sourcing, and modify stop_server so it does not delete META_FILE (or ensure
start_server reads/restores metadata from META_FILE before any deletion) so
restart (the restart command path that calls stop_server then start_server) can
recover required --model and other args.
- Around line 89-123: The detect_quantization function is emitting informational
log_info() lines to stdout which pollute the captured output used by callers
(e.g., QUANTIZATION=$(detect_quantization "$MODEL")); update the three
log_info() calls inside detect_quantization so they no longer write to
stdout—either remove them or make them write to stderr instead—so the function
prints only the quantization token (modelopt/modelopt_fp4/none) to stdout; after
the change ensure callers that parse detect_quantization (the QUANTIZATION
assignment and subsequent --quantization use) receive a single clean value.
- Around line 216-217: The startup sequence uses nohup "${cmd[@]}" >"$LOG_FILE"
2>&1 & and then writes echo $! >"$PID_FILE", which does not guarantee the saved
PID is a process group leader; change the invocation to use setsid so the
launched process becomes a proper process group leader (use nohup setsid
"${cmd[@]}" >"$LOG_FILE" 2>&1 &), and apply this change in both start routines
that reference the cmd array, LOG_FILE and PID_FILE (the vLLM and SGLang start
functions) so that subsequent kill -- -"$pid" targets the correct process group.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a63104f3-2e62-4c24-bd61-fe69b4b4fdf8
📒 Files selected for processing (8)
- .claude/skills/deployment/SKILL.md
- .claude/skills/deployment/references/setup.md
- .claude/skills/deployment/references/sglang.md
- .claude/skills/deployment/references/support-matrix.md
- .claude/skills/deployment/references/trtllm.md
- .claude/skills/deployment/references/vllm.md
- .claude/skills/deployment/scripts/deploy.sh
- .markdownlint-cli2.yaml
```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 vllm-modelopt \
    python -m vllm.entrypoints.openai.api_server \
    --model <checkpoint_path> \
    --quantization modelopt \
    --host 0.0.0.0 --port 8000
```
Mount the checkpoint into the container in this example.
This command tells vLLM to load <checkpoint_path>, but the container only sees its own filesystem. Without a bind mount here, the documented local-checkpoint flow fails unless the checkpoint is already baked into the image.
💡 Suggested doc fix

```diff
 docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

-docker run --gpus all -p 8000:8000 vllm-modelopt \
+docker run --gpus all -p 8000:8000 \
+  -v <checkpoint_dir>:<checkpoint_dir> \
+  vllm-modelopt \
   python -m vllm.entrypoints.openai.api_server \
   --model <checkpoint_path> \
   --quantization modelopt \
   --host 0.0.0.0 --port 8000
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/references/setup.md around lines 77 - 84, The
example run command launches vllm.entrypoints.openai.api_server with --model
<checkpoint_path> but does not bind-mount the local checkpoint into the
container; update the docker run invocation to include a bind mount (e.g., -v
/path/on/host:/model_dir) and change the --model argument to the checkpoint path
inside the container (e.g., /model_dir/<checkpoint_file>) so the process in the
container can access the checkpoint.
```text
Options:
  --model PATH           Model path or HF model ID (required for start)
  --framework FRAMEWORK  vllm, sglang, or trtllm (default: vllm)
```
Don't advertise trtllm as a supported --framework yet.
usage() and the dispatcher treat TRT-LLM as a first-class backend, but start_trtllm() only prints manual instructions and exits 1. As written, the documented start --framework trtllm flow always fails.
Also applies to: 182-190, 239-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/scripts/deploy.sh at line 67, The usage text and
dispatcher currently advertise and accept "trtllm" as a FRAMEWORK but
start_trtllm() only prints manual instructions and exits non-zero, causing start
--framework trtllm to always fail; remove "trtllm" from the supported framework
list in usage() and the framework-dispatch paths (places that parse --framework)
or implement a real startup flow in start_trtllm() so it behaves like
start_vllm()/start_sglang(); specifically update the usage() string (the line
showing "--framework FRAMEWORK"), adjust any framework-dispatch logic that
treats "trtllm" as first-class, and either fully implement start_trtllm() or
stop advertising "trtllm" until it is implemented.
```bash
nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
echo $! >"$PID_FILE"
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
sed -n '216,217p;234,235p;303,318p' .claude/skills/deployment/scripts/deploy.sh
echo
bash -lc '
sleep 30 &
pid=$!
ps -o pid=,pgid=,comm= -p $$ -p $pid
kill $pid
'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 867
🏁 Script executed:
```bash
cat -n .claude/skills/deployment/scripts/deploy.sh | head -50
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2001
🏁 Script executed:
```bash
# Get full context around the start and stop functions
sed -n '200,240p; 290,320p' .claude/skills/deployment/scripts/deploy.sh | cat -n
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2432
🏁 Script executed:
```bash
# Test process group behavior in non-interactive bash (as scripts run)
bash -c '
echo "=== Test 1: nohup with background ==="
nohup bash -c "sleep 30; echo done" >/tmp/test1.log 2>&1 &
pid=$!
echo "Parent shell PGID: $$"
echo "Child PID: $pid"
# Cannot use ps in this sandbox, so check what kill would do
echo "Would kill -0 \$pid work: kill -0 $pid && echo yes || echo no"
kill -0 $pid && echo "Process exists" || echo "Process does not exist"
kill $pid 2>/dev/null
echo
echo "=== Test 2: What group ID would kill -- -PID target? ==="
bash -c "sleep 30 &
pid=\$!
echo Attempted to save PID: \$pid
# Show what we are trying to kill
echo Would use: kill -- -\$pid" &
sleep 0.5
'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 315
🏁 Script executed:
```bash
# Check the actual code to see if there's any special handling of process groups
rg "setsid|setpgid|set -m" .claude/skills/deployment/scripts/deploy.sh
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
```bash
# Look at the server startup command - what does cmd contain?
sed -n '200,220p' .claude/skills/deployment/scripts/deploy.sh
```

Repository: NVIDIA/Model-Optimizer
Length of output: 595
Use setsid to make the started process a proper process group leader.
The current code with nohup ... & followed by echo $! >"$PID_FILE" does not guarantee the saved PID is a process group leader in non-interactive bash (job control is disabled by default in scripts). This means kill -- -"$pid" will target the wrong group or fail silently, leaving forked child processes orphaned even though the fallback prevents an error.
Wrap the startup command with setsid:
```bash
nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &
```
This applies to both vLLM (lines 216–217) and SGLang (lines 234–235) start functions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/scripts/deploy.sh around lines 216 - 217, The
startup sequence uses nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 & and then writes echo
$! >"$PID_FILE", which does not guarantee the saved PID is a process group
leader; change the invocation to use setsid so the launched process becomes a
proper process group leader (use nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &),
and apply this change in both start routines that reference the cmd array,
LOG_FILE and PID_FILE (the vLLM and SGLang start functions) so that subsequent
kill -- -"$pid" targets the correct process group.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #1133   +/-   ##
=======================================
  Coverage   70.18%   70.18%
=======================================
  Files         230      230
  Lines       26045    26045
=======================================
  Hits        18279    18279
  Misses       7766     7766
```

☔ View full report in Codecov by Sentry.
Add a Claude Code skill for serving quantized or unquantized LLM checkpoints as OpenAI-compatible API endpoints. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

Features:
- Auto-detect quantization format from hf_quant_config.json
- Framework recommendation based on model/format/use case
- GPU memory estimation and tensor parallelism guidance
- Server lifecycle management via deploy.sh script
- Health check verification and API testing
- Remote deployment via SSH/SLURM
- Optional throughput/latency benchmarking

Includes framework-specific references (vllm.md, sglang.md, trtllm.md), support matrix, and setup guide.

Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed c282546 to 5378a84
Actionable comments posted: 4
♻️ Duplicate comments (1)
.claude/skills/deployment/scripts/deploy.sh (1)
473-473: ⚠️ Potential issue | 🟠 Major

Restart command loses required `--model` argument.

When restart is invoked without arguments, `stop_server` deletes `META_FILE`, then `start_server` fails because `--model` is required but not recovered. Either preserve metadata across restart or require `--model` for restart.

🐛 Proposed fix — preserve metadata for restart

```diff
-    restart) stop_server; sleep 2; start_server ;;
+    restart)
+        # Preserve metadata before stop deletes it
+        if [[ -f "$META_FILE" && -z "$MODEL" ]]; then
+            MODEL=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            FRAMEWORK=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            PORT=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            QUANTIZATION=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            TP_SIZE=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+        fi
+        stop_server
+        sleep 2
+        start_server
+        ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh at line 473, The restart command currently calls stop_server then start_server which removes META_FILE in stop_server so start_server cannot recover the required --model argument; modify restart to preserve metadata by either (A) changing stop_server to not delete META_FILE when invoked from restart (add a flag or mode parameter like stop_server(restart_mode=true)) or (B) alter restart to read and pass the saved model value to start_server (read META_FILE before calling stop_server and forward the --model value into start_server). Update references in the restart handler and the stop_server/start_server implementations accordingly and ensure META_FILE and the --model handling logic remain consistent.
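The `META_FILE` handling discussed above (no shell `source`, metadata that survives a restart) can be sketched as a plain key=value round-trip. The helper names below mirror deploy.sh's variables, but the implementation is illustrative, not the script's actual code — values are never evaluated, so a malicious `MODEL` cannot inject commands:

```shell
# Write metadata as plain key=value lines (nothing here is ever executed).
write_meta() {
    local meta_file="$1"
    {
        printf '%s=%s\n' FRAMEWORK "$FRAMEWORK"
        printf '%s=%s\n' MODEL "$MODEL"
        printf '%s=%s\n' PORT "$PORT"
    } >"$meta_file"
}

# Read metadata back without sourcing: split each line on the first '='
# and assign only to a whitelist of known keys.
read_meta() {
    local meta_file="$1" key value
    while IFS='=' read -r key value; do
        case "$key" in
            FRAMEWORK|MODEL|PORT) printf -v "$key" '%s' "$value" ;;
        esac
    done <"$meta_file"
}
```

A restart path could then call `read_meta` before `stop_server` deletes the file, so `start_server` still knows which `--model` to relaunch.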
🧹 Nitpick comments (3)
.claude/skills/deployment/scripts/deploy.sh (2)
333-333: Unused loop variable `i` (shellcheck SC2034).

The variable `i` is declared but unused in the for loop. This is harmless but can be silenced.

🧹 Proposed fix

```diff
-    for i in {1..15}; do
+    for _ in {1..15}; do
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh at line 333, The for loop currently declares an unused variable with "for i in {1..15}; do" which triggers shellcheck SC2034; replace the unused loop variable with a throwaway name (e.g., "_" or "unused") or rewrite the loop as a while loop so no unused variable is created; update the loop header where "for i in {1..15}; do" appears (and any matching "done") so the body executes 15 times without referencing "i".
35-35: Unused variable `SCRIPT_DIR`.

Shellcheck flagged this as unused. Either remove it or add a comment if it's intended for future use.
🧹 Proposed fix
```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"  # Reserved for future use
```

Or simply remove the line if not needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh at line 35, The variable assignment SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" is unused; either remove that line from deploy.sh or, if you intend to keep it as a placeholder, add an explicit comment and silence ShellCheck (e.g., add a comment like "intentionally unused placeholder" and a ShellCheck disable for SC2034) so the unused-variable warning is resolved; refer to the SCRIPT_DIR assignment to locate the change.

.claude/skills/deployment/SKILL.md (1)
137-145: Add FP4 note for SGLang similar to vLLM.

Line 135 notes "For NVFP4 checkpoints, use `--quantization modelopt_fp4`" for vLLM, but the SGLang section doesn't include a similar note. Add consistency.

📝 Suggested addition after line 145

For NVFP4 checkpoints, use `--quantization modelopt_fp4`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/SKILL.md around lines 137 - 145, The SGLang launch_server docs are missing the FP4 note present in vLLM; update the SGLang section that shows the python -m sglang.launch_server usage (the block with --quantization modelopt) to add a line stating that for NVFP4 checkpoints you should use --quantization modelopt_fp4 so the CLI examples are consistent with the vLLM note.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 48-51: The log_* functions (log_info, log_success, log_warn,
log_error) currently print to stdout and therefore corrupt captured command
substitution outputs (e.g., calls in detect_quantization); change each function
to write to stderr by redirecting the printf output to >&2 so any use of $(...)
will only capture the command output and not log lines, preserving existing
messages and formats but routing them to stderr.
- Around line 127-134: The fallback JSON parsing that sets quant_method
incorrectly reads qc.get('quant_type', 'fp8'), causing ModelOpt-exported configs
to be missed; update the Python snippet that computes quant_method (the python3
-c block that writes to quant_method) to use qc.get('quant_algo', 'fp8') instead
of qc.get('quant_type', 'fp8') so ModelOpt's quantization field is recognized
when hf_quant_config.json is absent.
In @.claude/skills/deployment/SKILL.md:
- Around line 15-33: Update the example commands to use the
repository-root-correct path or add a working-directory note: replace
occurrences of scripts/deploy.sh with
.claude/skills/deployment/scripts/deploy.sh in the examples (e.g., the four
start/test/status/stop commands) or add a single-line note above the code block
stating the examples assume you run them from .claude/skills/deployment/ so
users know to adjust the path when invoking scripts/deploy.sh.
- Around line 254-261: Update the table row for the `quantization="modelopt" not
recognized` error to require SGLang >= 0.5 instead of the current `SGLang >=
0.4.10`; locate the row by the exact cell text `quantization="modelopt" not
recognized` and replace the `SGLang >= 0.4.10` string with `SGLang >= 0.5`
(leave the vLLM entry `vLLM >= 0.10.1` as-is or change to a “latest version
recommended” note if desired).
---
Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Line 473: The restart command currently calls stop_server then start_server
which removes META_FILE in stop_server so start_server cannot recover the
required --model argument; modify restart to preserve metadata by either (A)
changing stop_server to not delete META_FILE when invoked from restart (add a
flag or mode parameter like stop_server(restart_mode=true)) or (B) alter restart
to read and pass the saved model value to start_server (read META_FILE before
calling stop_server and forward the --model value into start_server). Update
references in the restart handler and the stop_server/start_server
implementations accordingly and ensure META_FILE and the --model handling logic
remain consistent.
---
Nitpick comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Line 333: The for loop currently declares an unused variable with "for i in
{1..15}; do" which triggers shellcheck SC2034; replace the unused loop variable
with a throwaway name (e.g., "_" or "unused") or rewrite the loop as a while
loop so no unused variable is created; update the loop header where "for i in
{1..15}; do" appears (and any matching "done") so the body executes 15 times
without referencing "i".
- Line 35: The variable assignment SCRIPT_DIR="$(cd "$(dirname
"${BASH_SOURCE[0]}")" && pwd)" is unused; either remove that line from deploy.sh
or, if you intend to keep it as a placeholder, add an explicit comment and
silence ShellCheck (e.g., add a comment like "intentionally unused placeholder"
and a ShellCheck disable for SC2034) so the unused-variable warning is resolved;
refer to the SCRIPT_DIR assignment to locate the change.
In @.claude/skills/deployment/SKILL.md:
- Around line 137-145: The SGLang launch_server docs are missing the FP4 note
present in vLLM; update the SGLang section that shows the python -m
sglang.launch_server usage (the block with --quantization modelopt) to add a
line stating that for NVFP4 checkpoints you should use --quantization
modelopt_fp4 so the CLI examples are consistent with the vLLM note.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b627486a-bec6-4491-9f41-909c529fd071
📒 Files selected for processing (8)
- .claude/skills/deployment/SKILL.md
- .claude/skills/deployment/references/setup.md
- .claude/skills/deployment/references/sglang.md
- .claude/skills/deployment/references/support-matrix.md
- .claude/skills/deployment/references/trtllm.md
- .claude/skills/deployment/references/vllm.md
- .claude/skills/deployment/scripts/deploy.sh
- .markdownlint-cli2.yaml
✅ Files skipped from review due to trivial changes (6)
- .markdownlint-cli2.yaml
- .claude/skills/deployment/references/trtllm.md
- .claude/skills/deployment/references/vllm.md
- .claude/skills/deployment/references/setup.md
- .claude/skills/deployment/references/sglang.md
- .claude/skills/deployment/references/support-matrix.md
```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```
Clarify script path relative to repository root.
The examples use scripts/deploy.sh but the script is at .claude/skills/deployment/scripts/deploy.sh. Users running from repo root would need the full path.
📝 Suggested clarification
Either update paths to be absolute from repo root:
.claude/skills/deployment/scripts/deploy.sh start --model ./qwen3-0.6b-fp8

Or add a note that examples assume the working directory is .claude/skills/deployment/.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/SKILL.md around lines 15 - 33, Update the example
commands to use the repository-root-correct path or add a working-directory
note: replace occurrences of scripts/deploy.sh with
.claude/skills/deployment/scripts/deploy.sh in the examples (e.g., the four
start/test/status/stop commands) or add a single-line note above the code block
stating the examples assume you run them from .claude/skills/deployment/ so
users know to adjust the path when invoking scripts/deploy.sh.
| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |
🧩 Analysis chain
🌐 Web query:
vLLM modelopt quantization minimum version requirement
💡 Result:
vLLM's ModelOpt quantization requires a minimum CUDA compute capability of 8.9 (Hopper GPUs like H100 and newer). There is no explicit minimum vLLM software version requirement stated in the official documentation; support is available in recent versions as evidenced by the current docs (published 2025-10-23). ModelOpt quantization was added before April 2025, with extensions like Mixtral support merged around then (PR #15961). Use the latest vLLM version for full compatibility.
Citations:
- 1: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/modelopt.html
- 2: https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/modelopt/
- 3: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/modelopt/
- 4: https://docs.vllm.ai/en/v0.9.0/api/vllm/model_executor/layers/quantization/base_config.html
- 5: https://docs.vllm.ai/en/v0.10.1/api/vllm/model_executor/layers/quantization/modelopt.html
- 6: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/index.html
- 7: [Bug]: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.') vllm-project/vllm#14040
- 8: Add support to modelopt quantization of Mixtral model vllm-project/vllm#15961
🌐 Web query:
SGLang modelopt quantization version requirement minimum
💡 Result:
SGLang's native ModelOpt quantization support requires SGLang version 0.5 or later. This feature was introduced via merged pull requests #7149 (merged October 6, 2025), #9991 (merged October 11, 2025), and #10154, with full integration announced in the LMSYS blog post on December 2, 2025. Official documentation lists --quantization modelopt, --quantization modelopt_fp8, --quantization modelopt_fp4, and --modelopt-quant (e.g., 'fp8', 'nvfp4') as supported options, requiring installation of nvidia-modelopt via pip install nvidia-modelopt. Earlier versions had partial ModelOpt support (e.g., PR #5145 merged April 8, 2025, for auto-detection), but native quantization workflow mandates post-October 2025 releases. Use the latest version (e.g., v0.5.10rc0 as of March 2026) for optimal compatibility with Hopper/Blackwell GPUs.
Citations:
- 1: https://lmsys.org/blog/2025-12-02-modelopt-quantization/
- 2: https://mintlify.com/sgl-project/sglang/advanced/quantization
- 3: https://sgl-project.github.io/advanced_features/quantization.html
- 4: https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/server_arguments.md
- 5: Enable native ModelOpt quantization support (2/3) sgl-project/sglang#9991
- 6: Enable native ModelOpt quantization support (1/3) sgl-project/sglang#7149
- 7: https://github.com/sgl-project/sglang
- 8: https://github.com/sgl-project/sglang/releases
🌐 Web query:
vLLM modelopt quantization support version history when added
💡 Result:
vLLM added support for ModelOpt (NVIDIA Model Optimizer) quantization in September 2024 via PR #6112, merged on 2024-09-11. This introduced the "modelopt" quantization method for loading ModelOpt FP8 static scaling checkpoints, usable as LLM(model="path", quantization="modelopt"). It requires CUDA compute capability 8.9+ (Hopper GPUs). Subsequent enhancements include KV cache scaling support (PR #11787, merged Jan 2025) and MXFP8/MXFP4 support in later versions (e.g., v0.14.0+ for some formats). ModelOpt supports FP8 variants like per-tensor, per-channel-per-token, FP8_PB_WO, and mixed precision with NVFP4, using torch._scaled_mm for static weight/activation scales.
Citations:
- 1: https://docs.vllm.ai/en/stable/features/quantization/modelopt/
- 2: https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/quantization/modelopt.html
- 3: https://docs.vllm.ai/en/stable/api/vllm/model_executor/layers/quantization/modelopt/
- 4: [Hardware][NV] Add support for ModelOpt static scaling checkpoints. vllm-project/vllm#6112
- 5: [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. vllm-project/vllm#11787
- 6: https://github.com/vllm-project/vllm/releases
- 7: https://docs.vllm.ai/en/v0.6.2/models/engine_args.html
- 8: https://newreleases.io/project/pypi/vllm/release/0.6.0
Correct SGLang version requirement for ModelOpt quantization support.
The error table states SGLang >= 0.4.10 for --quantization modelopt support, but documentation indicates SGLang >= 0.5 is required. Native ModelOpt quantization support was introduced in SGLang v0.5 (full integration December 2025). Update line 257 to reflect: SGLang >= 0.5.
For vLLM, while official documentation does not explicitly state a minimum version, v0.10.1 is a reasonable conservative estimate given ModelOpt support was added in September 2024; this can remain as-is or be updated to "latest version recommended" if you prefer to match vLLM's official guidance.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/SKILL.md around lines 254 - 261, Update the table
row for the `quantization="modelopt" not recognized` error to require SGLang >=
0.5 instead of the current `SGLang >= 0.4.10`; locate the row by the exact cell
text `quantization="modelopt" not recognized` and replace the `SGLang >= 0.4.10`
string with `SGLang >= 0.5` (leave the vLLM entry `vLLM >= 0.10.1` as-is or
change to a “latest version recommended” note if desired).
Force-pushed 5378a84 to 3b8a62e
Actionable comments posted: 2
♻️ Duplicate comments (3)
.claude/skills/deployment/scripts/deploy.sh (3)
48-51: ⚠️ Potential issue | 🔴 Critical

Send log output to stderr.

These helpers are still used inside `$(...)`, so `gpu_count=$(detect_gpu)` and `QUANTIZATION=$(detect_quantization ...)` capture the log lines along with the real value. That makes the TP comparison non-numeric and can pass a multi-line `--quantization` value downstream.

🐛 Proposed fix

```diff
-log_info() { printf "${BLUE}[INFO]${NC} %s\n" "$1"; }
-log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1"; }
-log_warn() { printf "${YELLOW}[WARN]${NC} %s\n" "$1"; }
-log_error() { printf "${RED}[ERROR]${NC} %s\n" "$1"; }
+log_info() { printf "${BLUE}[INFO]${NC} %s\n" "$1" >&2; }
+log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1" >&2; }
+log_warn() { printf "${YELLOW}[WARN]${NC} %s\n" "$1" >&2; }
+log_error() { printf "${RED}[ERROR]${NC} %s\n" "$1" >&2; }
```
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 48 - 51, The log helper functions log_info, log_success, log_warn, and log_error are writing to stdout and polluting command substitutions like gpu_count=$(detect_gpu); change each helper to write to stderr instead (so logs don't get captured in $(...)). Update the implementations of log_info/log_success/log_warn/log_error to direct their printf output to file descriptor 2 (stderr) so command substitutions only capture the actual function return/echoed values, not the log lines.
142-145: ⚠️ Potential issue | 🟠 Major
Read `quant_algo` in the `config.json` fallback.

This branch still looks up `quantization_config.quant_type`, but the exported ModelOpt metadata uses `quant_algo`. FP4 checkpoints that only have `config.json` will therefore fall back to the default and be launched as `modelopt` instead of `modelopt_fp4`.

🐛 Proposed fix

```diff
 if qc.get('quant_method') == 'modelopt':
-    print(qc.get('quant_type', 'fp8'))
+    print(qc.get('quant_algo', 'fp8'))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 142 - 145, The script currently checks qc.get('quant_type') when qc.get('quant_method') == 'modelopt' but the ModelOpt metadata uses the key 'quant_algo' in config.json, causing fp4 checkpoints to fall back incorrectly; update the fallback read to also read 'quant_algo' from the config JSON (e.g., parse model_path/config.json into a dict and check ['quant_algo'] if qc lacks 'quant_type') and map/translate that value into the expected qc['quant_type'] (e.g., treat 'fp4' as 'modelopt_fp4') before launching; adjust the branch that references qc, quant_method, and quant_type so it prefers qc['quant_type'] but falls back to config['quant_algo'] mapped appropriately.
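The lookup order described in this prompt can be sketched as a standalone function; the function name and the FP4 label mapping here are illustrative assumptions, not the script's exact code:

```python
def detect_quant_from_config(config: dict) -> str:
    """Map a checkpoint's config.json dict to a launcher quantization label.

    Sketch of the suggested fallback: prefer 'quant_type', fall back to
    'quant_algo', and default to fp8 when only the method is recorded.
    """
    qc = config.get("quantization_config") or {}
    if qc.get("quant_method") != "modelopt":
        return "none"
    algo = str(qc.get("quant_type") or qc.get("quant_algo") or "fp8").lower()
    # FP4 variants need the dedicated launcher value.
    if algo in ("fp4", "nvfp4"):
        return "modelopt_fp4"
    return "modelopt"
```

With this ordering, an FP4 export that only carries `quant_algo` resolves to `modelopt_fp4` instead of silently falling back to the FP8 default.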
382-383: ⚠️ Potential issue | 🟠 Major
Make `restart` reload the previous deployment config.

The normal stop path deletes `server.meta`, and the `restart` branch immediately calls `start_server` with empty globals. `./deploy.sh restart` therefore still falls into `--model is required` unless the caller repeats the full `start` command.

Also applies to: 542-542

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 382 - 383, The restart path currently removes "$META_FILE" (server.meta) before calling start_server, so restart loses the previous deployment config and triggers the "--model is required" error; modify the restart logic to preserve or reload the existing META_FILE when performing a restart (i.e., do not rm -f "$META_FILE" in the restart branch or copy/restore it into the expected globals before calling start_server) and ensure start_server reads globals from META_FILE if present (referencing PID_FILE, META_FILE, start_server, and log_success to locate the logic).
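One way to satisfy this, sketched under the assumption that `server.meta` holds simple `KEY=VALUE` lines, is to source the saved config before the stop path deletes it:

```shell
#!/usr/bin/env bash
# Hypothetical restart helper: reload the saved deployment settings before
# they are removed, so start_server sees them without repeated CLI flags.
set -euo pipefail

META_FILE=$(mktemp)

restart_server() {
  if [[ -f "$META_FILE" ]]; then
    # shellcheck disable=SC1090
    source "$META_FILE"   # restores MODEL, PORT, ... into globals
  fi
  # stop_server / start_server would run here in the real script.
  echo "restarting model=${MODEL:-unset} port=${PORT:-unset}"
}

printf 'MODEL=llama-fp8\nPORT=8000\n' > "$META_FILE"
restart_server
rm -f "$META_FILE"
```

The key ordering point is that the config is read into globals before anything unlinks the metadata file, so a bare `restart` no longer depends on the caller re-supplying `--model`.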
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 389-396: The force-kill path currently logs an error if ps -p
"$pid" still finds the process but then proceeds to rm -f "$PID_FILE"
"$META_FILE" and continue; change this so that after running kill -9 (both group
and pid attempts) you check ps -p "$pid" and if the process still exists call
log_error and exit non-zero (e.g. exit 1) without removing "$PID_FILE" or
"$META_FILE" or proceeding further; reference the existing kill -9 -- -"$pid" /
kill -9 "$pid", the ps -p "$pid" check, log_error, and the PID_FILE/META_FILE
removals to locate where to add the early exit on failure.
- Around line 343-344: The health-check curl invocations in deploy.sh (the if
that runs curl -s "http://localhost:$PORT/health") treat non-2xx HTTP responses
as success; modify those curl calls to use the -f/--fail flag (e.g., change curl
-s to curl -sf) so curl exits non-zero on HTTP 400+ responses, and apply the
same change to the other health check curl block referenced around lines 411-415
so both checks properly fail on non-2xx responses.
---
Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 48-51: The log helper functions log_info, log_success, log_warn,
and log_error are writing to stdout and polluting command substitutions like
gpu_count=$(detect_gpu); change each helper to write to stderr instead (so logs
don't get captured in $(...)). Update the implementations of
log_info/log_success/log_warn/log_error to direct their printf output to file
descriptor 2 (stderr) so command substitutions only capture the actual function
return/echoed values, not the log lines.
- Around line 142-145: The script currently checks qc.get('quant_type') when
qc.get('quant_method') == 'modelopt' but the ModelOpt metadata uses the key
'quant_algo' in config.json, causing fp4 checkpoints to fall back incorrectly;
update the fallback read to also read 'quant_algo' from the config JSON (e.g.,
parse model_path/config.json into a dict and check ['quant_algo'] if qc lacks
'quant_type') and map/translate that value into the expected qc['quant_type']
(e.g., treat 'fp4' as 'modelopt_fp4') before launching; adjust the branch that
references qc, quant_method, and quant_type so it prefers qc['quant_type'] but
falls back to config['quant_algo'] mapped appropriately.
- Around line 382-383: The restart path currently removes "$META_FILE"
(server.meta) before calling start_server, so restart loses the previous
deployment config and triggers the "--model is required" error; modify the
restart logic to preserve or reload the existing META_FILE when performing a
restart (i.e., do not rm -f "$META_FILE" in the restart branch or copy/restore
it into the expected globals before calling start_server) and ensure
start_server reads globals from META_FILE if present (referencing PID_FILE,
META_FILE, start_server, and log_success to locate the logic).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3d11a608-5df5-494b-b088-2a22e54bb58f
📒 Files selected for processing (3)
- .claude/skills/deployment/evals/remote-slurm-deployment.json
- .claude/skills/deployment/evals/vllm-fp8-local.json
- .claude/skills/deployment/scripts/deploy.sh
✅ Files skipped from review due to trivial changes (2)
- .claude/skills/deployment/evals/remote-slurm-deployment.json
- .claude/skills/deployment/evals/vllm-fp8-local.json
```bash
if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
    log_success "Server is ready! (${elapsed}s)"
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "deploy.sh" -type f | head -20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1895
🌐 Web query:
curl -s flag behavior non-2xx responses exit code 404 503
💡 Result:
The -s or --silent flag in curl makes it silent by suppressing the progress meter and error messages. It does not affect the exit code. By default, curl exits with code 0 for any HTTP response (including non-2xx like 404 or 503), as long as the transfer completes successfully, and outputs the response body to stdout. To make curl exit with a non-zero code (22) on HTTP 400+ responses (including 404 and 503): - Use -f or --fail: suppresses response body output on error. - Use --fail-with-body (curl 7.75.0+): outputs response body but still exits with 22. With -s alone, for 404/503 you get exit code 0, no progress meter, no error msg, but response body printed. Example: curl -s http://example.com/404 # exits 0, prints 404 page silently curl -s -f http://example.com/404 # exits 22, no output (silent due to -s) curl -s --fail-with-body http://example.com/404 # exits 22, prints 404 page (no error msg due to -s) To see errors with -s, combine with -S or --show-error: shows error messages on non-zero exit but keeps silent otherwise. Official sources confirm: curl considers HTTP response receipt a success unless --fail used; --silent only mutes progress/error display, not exit behavior.
Citations:
- 1: https://everything.curl.dev/cmdline/exitcode.html
- 2: https://curl.se/docs/manpage.html
- 3: http://curl.se/docs/manpage.html
- 4: https://everything.curl.dev/http/response.html
🌐 Web query:
curl -f -s -S flags meaning --fail --silent --show-error
💡 Result:
The curl flags -f -s -S correspond to --fail --silent --show-error. - -f or --fail: (HTTP) Fail silently (no output at all on stdout) on HTTP server errors (response codes >=400). curl exits with error code 22 instead of outputting the error page. Not fail-safe for auth-related codes like 401/407. - -s or --silent: Silent mode. Hides the progress meter and error messages. - -S or --show-error: When used with -s, shows error messages on stderr despite silence mode. This combination is commonly used in scripts: be quiet normally, but show errors on failure (e.g., to stderr) and exit non-zero on HTTP errors. Example usage detects failures without progress bars but displays error details.
Citations:
- 1: https://curl.se/docs/manpage.html
- 2: https://curl.se/docs/manpage.html?pubDate=20250220
- 3: https://man.archlinux.org/man/curl.1
- 4: https://stackoverflow.com/questions/6935006/making-curl-send-errors-to-stderr-and-everything-else-to-stdout
- 5: curl `--include --fail --show-error` does not show response headers (curl/curl#6911)
- 6: https://superuser.com/questions/590099/can-i-make-curl-fail-with-an-exitcode-different-than-0-if-the-http-status-code-i
🌐 Web query:
curl --max-time timeout health check endpoint reasonable value
💡 Result:
A reasonable value for --max-time in curl for health check endpoints is 10 seconds. This provides sufficient time for typical health checks to complete while preventing hangs on unresponsive services. Pair it with --connect-timeout 5 for connection establishment. Common examples from sources: - Docker HEALTHCHECK: --timeout=10s with curl --max-time effectively 10s (19,9) - General health checks: --max-time 10-15s, --connect-timeout 5s (5,11) - Load balancer contexts: Aligns with AWS ALB default HealthCheckTimeoutSeconds=5s (extend for safety), NGINX defaults ~5-10s intervals (14,12,16) Example command: curl --connect-timeout 5 --max-time 10 -f -s http://localhost/health || exit 1 Adjust based on endpoint response time: 2x expected max response time, minimum 5-10s.
Citations:
- 1: https://oxylabs.io/blog/curl-timeout
- 2: https://oneuptime.com/blog/post/2026-02-09-probe-timeout-slow-responses/view
- 3: https://webscraping.ai/faq/curl/how-do-i-use-curl-to-perform-health-checks-on-web-services
- 4: Add --max-time 100 to healthcheck plexinc/pms-docker#66
- 5: https://oneuptime.com/blog/post/2026-01-23-docker-health-checks-effectively/view
- 6: https://karandeepsingh.ca/posts/nodejs-error-handling-best-practices/
- 7: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
- 8: https://docs.aws.amazon.com/cli/latest/reference/elb/configure-health-check.html
- 9: https://oneuptime.com/blog/post/2026-01-06-docker-health-checks/view
Treat non-2xx /health responses as failures.
The curl -s commands exit with code 0 on HTTP error responses (404, 503), allowing the health checks to incorrectly report success. Use -f to make curl exit with non-zero on HTTP 400+ errors.
Proposed fix

```diff
-    if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
         log_success "Server is ready! (${elapsed}s)"
         return 0
     fi
@@
-    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if ! curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
         log_error "Server not responding at port $PORT"
         exit 1
     fi
```

Also applies to lines 411-415.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
         log_success "Server is ready! (${elapsed}s)"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/scripts/deploy.sh around lines 343 - 344, The
health-check curl invocations in deploy.sh (the if that runs curl -s
"http://localhost:$PORT/health") treat non-2xx HTTP responses as success; modify
those curl calls to use the -f/--fail flag (e.g., change curl -s to curl -sf) so
curl exits non-zero on HTTP 400+ responses, and apply the same change to the
other health check curl block referenced around lines 411-415 so both checks
properly fail on non-2xx responses.
```bash
        # Force kill
        log_warn "Force killing..."
        kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
        sleep 1
        if ps -p "$pid" >/dev/null 2>&1; then
            log_error "Failed to kill server process $pid — manual intervention required"
        fi
        rm -f "$PID_FILE" "$META_FILE"
```
Don't report success after a failed force-kill.
If the process is still alive after kill -9, this path logs an error but then continues as if the stop succeeded. That leaves a live server untracked and makes the next deploy fail later with a confusing port conflict.
🐛 Proposed fix
```diff
 sleep 1
 if ps -p "$pid" >/dev/null 2>&1; then
     log_error "Failed to kill server process $pid — manual intervention required"
+    return 1
 fi
 rm -f "$PID_FILE" "$META_FILE"
```

Also applies to: 404-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/scripts/deploy.sh around lines 389 - 396, The
force-kill path currently logs an error if ps -p "$pid" still finds the process
but then proceeds to rm -f "$PID_FILE" "$META_FILE" and continue; change this so
that after running kill -9 (both group and pid attempts) you check ps -p "$pid"
and if the process still exists call log_error and exit non-zero (e.g. exit 1)
without removing "$PID_FILE" or "$META_FILE" or proceeding further; reference
the existing kill -9 -- -"$pid" / kill -9 "$pid", the ps -p "$pid" check,
log_error, and the PID_FILE/META_FILE removals to locate where to add the early
exit on failure.
Actionable comments posted: 2
♻️ Duplicate comments (4)
.claude/skills/deployment/scripts/deploy.sh (4)
389-396: ⚠️ Potential issue | 🟠 Major
Don't clear state after a failed force-kill.

If the process is still alive after `kill -9`, this path logs an error and then immediately deletes `server.pid`/`server.meta` and reports success. That leaves a live server untracked and turns the next start or restart into a confusing port conflict.

Suggested fix

```diff
 sleep 1
 if ps -p "$pid" >/dev/null 2>&1; then
     log_error "Failed to kill server process $pid — manual intervention required"
+    return 1
 fi
 rm -f "$PID_FILE" "$META_FILE"
```

Also applies to: 404-405
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 389 - 396, The post-force-kill path currently removes the PID_FILE and META_FILE regardless of whether the process identified by $pid was actually killed; when ps -p "$pid" indicates the process is still alive you must not delete state. Change the logic around the kill/check in the shutdown sequence (the block with log_warn, kill -9 -- -"$pid", ps -p "$pid", log_error) so that rm -f "$PID_FILE" "$META_FILE" only runs when ps -p "$pid" confirms the process is gone (i.e., on successful kill), and leave the files intact and return a non-success exit (or propagate error) when log_error is triggered, to avoid orphaning a live server with removed state.
67-67: ⚠️ Potential issue | 🟠 Major
Stop advertising `trtllm` as a working `--framework`.

`start --framework trtllm` still bottoms out in `start_trtllm()` printing manual steps and returning `1`, so the documented CLI path is a guaranteed failure. That also conflicts with `.claude/skills/deployment/SKILL.md:112-158` and `.claude/skills/deployment/references/trtllm.md:1-56`, which present TRT-LLM as a supported deployment path.

Also applies to: 244-246, 312-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh at line 67, The help string and documentation advertise "--framework trtllm" as a supported option even though start_trtllm() merely prints manual steps and returns failure; either remove "trtllm" from the CLI option list and all help/docs (references: the help string in deploy.sh, SKILL.md sections, and references/trtllm.md) or implement start_trtllm() to perform a proper deployment and return success; update every occurrence of "trtllm" in the deploy.sh help (--framework) and the duplicated locations noted so help/docs and the start_trtllm() behavior remain consistent.
343-345: ⚠️ Potential issue | 🟠 Major
Make `/health` probes fail on HTTP errors.

`curl -s` exits `0` for HTTP `404`/`503`, so readiness and `test` can report success while the server is still unhealthy. Use `--fail` and a bounded timeout for both probes.

Suggested fix

```diff
-    if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
@@
-    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if ! curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
```

Also applies to: 411-415
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 343 - 345, The readiness probe currently uses curl -s which returns success on HTTP error codes; update the curl calls that check "http://localhost:$PORT/health" (the if curl -s ... then block and the similar probe around lines 411-415) to use --fail plus a bounded timeout (e.g. --max-time) and keep --silent/--show-error as desired so curl exits non‑zero on HTTP 4xx/5xx; leave the surrounding logic (log_success, return 0, PORT) intact and apply the same change to both probe locations.
271-272: ⚠️ Potential issue | 🟠 Major
Launch the server in its own process group.

`stop_server()` relies on `kill -- -"$pid"`, but `nohup ... &` does not guarantee that `$!` is a process-group leader in non-interactive Bash. If vLLM or SGLang fork workers, the fallback only kills the parent and can leave serving processes behind.

Suggested fix

```diff
-    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &
@@
-    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &
```

Also applies to: 298-299, 377-377, 391-391
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/deployment/scripts/deploy.sh around lines 271 - 272, Start the server so it becomes a process-group leader (so stop_server() can kill the whole group) by replacing the plain nohup invocation with a process-group-aware launch (e.g., use setsid or start-stop-daemon to start "${cmd[@]}" so the backgrounded $! is the group leader) and ensure you still redirect output to LOG_FILE and record the PID to PID_FILE; apply the same change for each occurrence that currently uses nohup (the blocks that build cmd and write to LOG_FILE/PID_FILE). Ensure stop_server() continues to read PID_FILE and use kill -- -"$pid" against the recorded PID (which must be the group leader).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 182-190: is_server_running() currently trusts any live PID read
from PID_FILE and may act on a reused PID; update it to verify the process
identity before returning success: after reading PID from PID_FILE, check that
the process command line or executable matches the expected server (e.g.,
compare ps -p "$pid" -o cmd= or /proc/"$pid"/cmdline against the known
$SERVER_CMD or $APP_NAME) or alternatively verify the PID is listening on the
expected port (ss/lsof). Do the same validation in stop_server() so it only
signals the verified server PID; apply the same change to the other similar
block referenced (around stop logic at lines 372-377) to avoid killing unrelated
processes.
- Around line 227-233: The server metadata writes currently strip single quotes
and lose exact values (see variables META_FILE, FRAMEWORK, MODEL, PORT,
QUANTIZATION, TP_SIZE) so round-tripping and explicit restart flags break;
change the metadata format to a non-lossy representation (e.g., JSON or
shell-quoted via printf %q) and write the exact variable values to META_FILE,
and update the restart/merge logic so it reads that metadata and merges it with
CLI args with CLI options taking precedence (i.e., only use stored values for
options not explicitly provided). Apply the same fix to the other metadata write
sites in this script (the other server.meta write occurrences) so all
reads/writes are lossless and merged consistently.
---
Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 389-396: The post-force-kill path currently removes the PID_FILE
and META_FILE regardless of whether the process identified by $pid was actually
killed; when ps -p "$pid" indicates the process is still alive you must not
delete state. Change the logic around the kill/check in the shutdown sequence
(the block with log_warn, kill -9 -- -"$pid", ps -p "$pid", log_error) so that
rm -f "$PID_FILE" "$META_FILE" only runs when ps -p "$pid" confirms the process
is gone (i.e., on successful kill), and leave the files intact and return a
non-success exit (or propagate error) when log_error is triggered, to avoid
orphaning a live server with removed state.
- Line 67: The help string and documentation advertise "--framework trtllm" as a
supported option even though start_trtllm() merely prints manual steps and
returns failure; either remove "trtllm" from the CLI option list and all
help/docs (references: the help string in deploy.sh, SKILL.md sections, and
references/trtllm.md) or implement start_trtllm() to perform a proper deployment
and return success; update every occurrence of "trtllm" in the deploy.sh help
(--framework) and the duplicated locations noted so help/docs and the
start_trtllm() behavior remain consistent.
- Around line 343-345: The readiness probe currently uses curl -s which returns
success on HTTP error codes; update the curl calls that check
"http://localhost:$PORT/health" (the if curl -s ... then block and the similar
probe around lines 411-415) to use --fail plus a bounded timeout (e.g.
--max-time) and keep --silent/--show-error as desired so curl exits non‑zero on
HTTP 4xx/5xx; leave the surrounding logic (log_success, return 0, PORT) intact
and apply the same change to both probe locations.
- Around line 271-272: Start the server so it becomes a process-group leader (so
stop_server() can kill the whole group) by replacing the plain nohup invocation
with a process-group-aware launch (e.g., use setsid or start-stop-daemon to
start "${cmd[@]}" so the backgrounded $! is the group leader) and ensure you
still redirect output to LOG_FILE and record the PID to PID_FILE; apply the same
change for each occurrence that currently uses nohup (the blocks that build cmd
and write to LOG_FILE/PID_FILE). Ensure stop_server() continues to read PID_FILE
and use kill -- -"$pid" against the recorded PID (which must be the group
leader).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 112bd8c6-264a-4a6d-b23e-67cea4e2ee22
📒 Files selected for processing (3)
- .claude/skills/deployment/evals/remote-slurm-deployment.json
- .claude/skills/deployment/evals/vllm-fp8-local.json
- .claude/skills/deployment/scripts/deploy.sh
✅ Files skipped from review due to trivial changes (2)
- .claude/skills/deployment/evals/remote-slurm-deployment.json
- .claude/skills/deployment/evals/vllm-fp8-local.json
```bash
is_server_running() {
    if [[ -f "$PID_FILE" ]]; then
        local pid
        pid=$(cat "$PID_FILE")
        if ps -p "$pid" >/dev/null 2>&1; then
            return 0
        fi
    fi
    return 1
```
Don't trust `server.pid` without verifying the process identity.

`is_server_running()` accepts any live PID, and `stop_server()` then signals it. If the original server died and the PID was reused, `start` will refuse to launch and `stop` can kill an unrelated process group. Validate something stable (expected command line, start time, or listening port) before treating the PID as this server.

Also applies to: 372-377
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/skills/deployment/scripts/deploy.sh around lines 182 - 190,
is_server_running() currently trusts any live PID read from PID_FILE and may act
on a reused PID; update it to verify the process identity before returning
success: after reading PID from PID_FILE, check that the process command line or
executable matches the expected server (e.g., compare ps -p "$pid" -o cmd= or
/proc/"$pid"/cmdline against the known $SERVER_CMD or $APP_NAME) or
alternatively verify the PID is listening on the expected port (ss/lsof). Do the
same validation in stop_server() so it only signals the verified server PID;
apply the same change to the other similar block referenced (around stop logic
at lines 372-377) to avoid killing unrelated processes.
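A minimal identity check along the lines the comment asks for might look like this; the helper name and the expected-entrypoint string are assumptions for illustration, demonstrated here against a throwaway `sleep` process:

```shell
#!/usr/bin/env bash
# Illustrative guard: a PID from a pid file only counts as "our" server
# if its current command line still mentions the expected entrypoint.
set -euo pipefail

pid_matches_server() {
  local pid="$1" expected="$2"
  ps -p "$pid" -o args= 2>/dev/null | grep -qF "$expected"
}

sleep 30 &
demo_pid=$!
if pid_matches_server "$demo_pid" "sleep"; then
  echo "pid verified"
fi
if ! pid_matches_server "$demo_pid" "vllm.entrypoints"; then
  echo "stale or foreign pid"
fi
kill "$demo_pid"
```

In the real script the `expected` string would be the launcher's entrypoint (e.g. the module name recorded at start time), so a reused PID whose command line no longer matches is treated as stale instead of being signaled.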
Signed-off-by: Kai Xu <kaix@nvidia.com>
What does this PR do?
Type of change: New feature
Add a Claude Code skill for serving quantized or unquantized LLM checkpoints as OpenAI-compatible API endpoints. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).
Skill structure
Features
- `hf_quant_config.json` with `config.json` fallback
- `deploy.sh` (start/stop/restart/status/detect)
- `remote_exec.sh`, with smart checkpoint sync

Depends on: #1107 (common files: `remote_exec.sh`, `workspace-management.md`, `environment-setup.md`)

Testing
Invoke in Claude Code:
Before your PR is "Ready for review"
CONTRIBUTING.md: ✅

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests
Chores