
Add Agent Deployment skill for model serving #1133

Open
kaix-nv wants to merge 2 commits into main from kaix/agent-deployment

Conversation


@kaix-nv kaix-nv commented Mar 28, 2026

What does this PR do?

Type of change: New feature

Add a Claude Code skill for serving quantized or unquantized LLM checkpoints as OpenAI-compatible API endpoints. Supports vLLM, SGLang, and TRT-LLM (including AutoDeploy).

Skill structure

deployment/
├── SKILL.md                    Decision workflow (identify → framework → env → deploy → verify)
├── references/
│   ├── vllm.md                 vLLM-specific deployment details
│   ├── sglang.md               SGLang-specific details
│   ├── trtllm.md               TRT-LLM / AutoDeploy details
│   ├── support-matrix.md       Model/format/framework compatibility
│   └── setup.md                Framework installation guides
├── scripts/
│   └── deploy.sh               Local server lifecycle (start/stop/restart/status/test/detect)
└── evals/
    ├── vllm-fp8-local.json     Happy path: local vLLM deployment
    └── remote-slurm-deployment.json  Remote SLURM deployment

Features

  • Auto-detect quantization format from hf_quant_config.json with config.json fallback
  • PTQ checkpoint discovery - find checkpoints from prior PTQ runs in common output directories
  • Framework recommendation based on model/format/use case (vLLM for general, SGLang for DeepSeek/Llama 4, TRT-LLM for max throughput)
  • GPU memory estimation and tensor parallelism guidance
  • Server lifecycle management via deploy.sh (start/stop/restart/status/detect)
  • Health check verification and API testing
  • Remote deployment via SSH/SLURM (using common remote_exec.sh, with smart checkpoint sync)
  • HF-format checkpoint assumption documented — TRT-LLM format has separate path
  • Optional benchmarking for throughput/latency metrics
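The auto-detection in the first bullet can be sketched in Python. This is a reconstruction from the PR description and the review thread (the shipped deploy.sh embeds similar logic in inline `python3 -c` snippets); the function name and exact key handling are assumptions, not the actual implementation:

```python
import json
from pathlib import Path


def detect_quantization(checkpoint_dir: str) -> str:
    """Best-effort guess of a checkpoint's quantization format.

    Returns "modelopt_fp4", "modelopt", or "none". Hypothetical sketch only;
    the real script may classify additional formats (e.g., INT4_AWQ) differently.
    """
    ckpt = Path(checkpoint_dir)

    # Preferred source: ModelOpt's export metadata.
    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        cfg = json.loads(hf_quant.read_text())
        algo = cfg.get("quantization", {}).get("quant_algo", "") or ""
        if "fp4" in algo.lower():
            return "modelopt_fp4"
        return "modelopt" if algo else "none"

    # Fallback: a quantization_config section in the plain HF config.json.
    config = ckpt / "config.json"
    if config.is_file():
        cfg = json.loads(config.read_text())
        qc = cfg.get("quantization_config", {})
        algo = qc.get("quant_algo", "") or ""
        if "fp4" in algo.lower():
            return "modelopt_fp4"
        return "modelopt" if qc else "none"

    return "none"
```

The key point is the precedence order: `hf_quant_config.json` is authoritative when present, and `config.json` is only consulted as a fallback.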

Depends on: #1107 (common files: remote_exec.sh, workspace-management.md, environment-setup.md)

Testing

Invoke in Claude Code:

```bash
claude -p "deploy outputs/Qwen3-0.6B-FP8 with vLLM"
```

Before your PR is "Ready for review"

  • Is this change backward compatible?: N/A (new feature)
  • If you copied code from any other sources, did you follow guidance in CONTRIBUTING.md: ✅
  • Did you write any new necessary tests?: N/A (skill evals provided separately)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a CLI to deploy, manage, and test OpenAI-compatible inference servers (start/stop/status/test/restart/detect) across vLLM, SGLang, and TensorRT-LLM, with automatic quantization detection, GPU checks, readiness polling, and basic orchestration.
  • Documentation

    • Comprehensive deployment guides: framework installation, quantization support matrix, AutoDeploy/AutoQuant notes, SLURM and Docker examples, benchmarking, verification, and troubleshooting.
  • Tests

    • New evaluation scenarios for local (FP8) and remote (SLURM) deployments.
  • Chores

    • Updated markdown linting configuration.


coderabbitai bot commented Mar 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new deployment skill with detailed references and tooling for serving quantized or unquantized LLM checkpoints via an OpenAI-compatible HTTP API, including framework guides (vLLM, SGLang, TRT-LLM), setup (SLURM/Docker), a lifecycle Bash CLI (deploy.sh), eval configs, and lint adjustments.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Skill + Core Guide<br>`.claude/skills/deployment/SKILL.md` | New end-to-end deployment skill describing workspace discovery, checkpoint detection, framework selection, environment validation, deployment commands, readiness checks, benchmarking, remote deployment, and error handling. |
| Framework References<br>`.claude/skills/deployment/references/vllm.md`, `.claude/skills/deployment/references/sglang.md`, `.claude/skills/deployment/references/trtllm.md` | New per-framework docs: install/version requirements, server & Python API examples, quantization flag mappings (modelopt / modelopt_fp4), TRT-LLM AutoDeploy/AutoQuant guidance, speculative decoding notes (SGLang), and troubleshooting tables. |
| Setup & Support Matrix<br>`.claude/skills/deployment/references/setup.md`, `.claude/skills/deployment/references/support-matrix.md` | Setup instructions for SLURM and Docker, model↔quantization↔framework compatibility matrix, supported quant formats, minimum framework versions, and hardware constraints (e.g., NVFP4/Blackwell). |
| Deployment Script<br>`.claude/skills/deployment/scripts/deploy.sh` | Adds an executable CLI (`start`/`stop`/`restart`/`status`/`test`/`detect`) for local server lifecycle management. |
| Evaluations<br>`.claude/skills/deployment/evals/*` | New eval configs for remote SLURM deployment and local vLLM FP8 flow validating detection, start, health, models listing, and completions. |
| Lint Config<br>`.markdownlint-cli2.yaml` | Disabled additional markdownlint rules (MD029, MD032, MD036, MD005, MD007) to relax doc formatting checks. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User (CLI)
    participant Script as Deploy Script
    participant Checkpoint as Checkpoint (Local / HF)
    participant Framework as Framework (vLLM / SGLang / TRT-LLM)
    participant Health as Health Endpoint (/health)

    User->>Script: run `deploy.sh start --model ... --framework ...`
    Script->>Checkpoint: detect_quantization()
    Checkpoint-->>Script: return quantization (modelopt / modelopt_fp4 / none)
    Script->>Framework: launch with chosen flags/command
    Framework-->>Script: PID / process started
    Script->>Health: poll /health until ready
    Health-->>Script: healthy
    Script-->>User: report endpoint and status
```
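The readiness-polling step shown in the sequence diagram can be approximated in a few lines of shell. The function name, default port, and timeout below are illustrative assumptions; deploy.sh's actual polling logic may differ:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll an OpenAI-compatible server's /health endpoint until it answers or a
# timeout expires. Returns 0 when healthy, 1 on timeout.
wait_for_health() {
    local port="${1:-8000}" timeout="${2:-120}" elapsed=0
    while (( elapsed < timeout )); do
        if curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; then
            echo "ready"
            return 0
        fi
        sleep 2
        (( elapsed += 2 ))
    done
    echo "server not ready after ${timeout}s" >&2
    return 1
}
```

A caller would typically invoke `wait_for_health "$PORT" 120` right after launching the server process and only report the endpoint once it returns successfully.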

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add Agent Deployment skill for model serving' directly and clearly describes the main change: introducing a new Claude skill for deploying LLM models as API endpoints. |
| Security Anti-Patterns | ✅ Passed | PR adds only documentation, shell scripts, and configuration files with no Python code. The shell script uses safe bash array construction and passes user input as command-line arguments, avoiding injection risks. No eval/exec on external input, torch.load(), numpy.load(), trust_remote_code=True patterns, or new dependencies present. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


github-actions bot commented Mar 28, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1133/

Built to branch gh-pages at 2026-03-30 04:03 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/deployment/references/setup.md:
- Around line 77-84: The example run command launches
vllm.entrypoints.openai.api_server with --model <checkpoint_path> but does not
bind-mount the local checkpoint into the container; update the docker run
invocation to include a bind mount (e.g., -v /path/on/host:/model_dir) and
change the --model argument to the checkpoint path inside the container (e.g.,
/model_dir/<checkpoint_file>) so the process in the container can access the
checkpoint.

In @.claude/skills/deployment/scripts/deploy.sh:
- Line 67: The usage text and dispatcher currently advertise and accept "trtllm"
as a FRAMEWORK but start_trtllm() only prints manual instructions and exits
non-zero, causing start --framework trtllm to always fail; remove "trtllm" from
the supported framework list in usage() and the framework-dispatch paths (places
that parse --framework) or implement a real startup flow in start_trtllm() so it
behaves like start_vllm()/start_sglang(); specifically update the usage() string
(the line showing "--framework FRAMEWORK"), adjust any framework-dispatch logic
that treats "trtllm" as first-class, and either fully implement start_trtllm()
or stop advertising "trtllm" until it is implemented.
- Around line 103-119: The current check around quant_config (variable
quant_config) and the inline python snippet only tests for "fp4" and maps
everything else to "modelopt", which misclassifies other quant_algos (e.g.,
INT4_AWQ, W4A8_AWQ); modify the logic in the python snippet (or add a small
wrapper) to parse cfg.get('quantization', {}).get('quant_algo', '') into a
variable and explicitly branch: if quant_algo contains "fp4" echo
"modelopt_fp4", else if quant_algo matches known TRT-only/other formats like
"INT4_AWQ" or "W4A8_AWQ" echo a distinct value (or exit non‑zero) so they don’t
get treated as modelopt, otherwise echo "modelopt" (or fail fast for unknown
values); update the echo outputs and error handling accordingly so functions
relying on those outputs (the inline python call and the script logic that
consumes "modelopt_fp4"/"modelopt") can correctly select the backend.
- Around line 173-179: The server.meta write currently injects raw user values
into META_FILE (variable META_FILE/server.meta) and later is sourced, causing
command injection and loss of metadata on restart; change the META_FILE writer
(the block that writes FRAMEWORK, MODEL, PORT, QUANTIZATION, TP_SIZE) to store
safely escaped/quoted values or a non-executable format (e.g., JSON or
POSIX-safe key="value" lines with proper escaping) and stop using shell source
to read it; update the reader (where META_FILE is sourced) to parse the file
safely (e.g., read with IFS and parameter expansion or a JSON parser) instead of
sourcing, and modify stop_server so it does not delete META_FILE (or ensure
start_server reads/restores metadata from META_FILE before any deletion) so
restart (the restart command path that calls stop_server then start_server) can
recover required --model and other args.
- Around line 89-123: The detect_quantization function is emitting informational
log_info() lines to stdout which pollute the captured output used by callers
(e.g., QUANTIZATION=$(detect_quantization "$MODEL")); update the three
log_info() calls inside detect_quantization so they no longer write to
stdout—either remove them or make them write to stderr instead—so the function
prints only the quantization token (modelopt/modelopt_fp4/none) to stdout; after
the change ensure callers that parse detect_quantization (the QUANTIZATION
assignment and subsequent --quantization use) receive a single clean value.
- Around line 216-217: The startup sequence uses nohup "${cmd[@]}" >"$LOG_FILE"
2>&1 & and then writes echo $! >"$PID_FILE", which does not guarantee the saved
PID is a process group leader; change the invocation to use setsid so the
launched process becomes a proper process group leader (use nohup setsid
"${cmd[@]}" >"$LOG_FILE" 2>&1 &), and apply this change in both start routines
that reference the cmd array, LOG_FILE and PID_FILE (the vLLM and SGLang start
functions) so that subsequent kill -- -"$pid" targets the correct process group.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a63104f3-2e62-4c24-bd61-fe69b4b4fdf8

📥 Commits

Reviewing files that changed from the base of the PR and between 24ceba6 and 4b39327.

📒 Files selected for processing (8)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/deployment/references/setup.md
  • .claude/skills/deployment/references/sglang.md
  • .claude/skills/deployment/references/support-matrix.md
  • .claude/skills/deployment/references/trtllm.md
  • .claude/skills/deployment/references/vllm.md
  • .claude/skills/deployment/scripts/deploy.sh
  • .markdownlint-cli2.yaml

Comment on lines +77 to +84
```bash
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

docker run --gpus all -p 8000:8000 vllm-modelopt \
python -m vllm.entrypoints.openai.api_server \
--model <checkpoint_path> \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```

⚠️ Potential issue | 🟠 Major

Mount the checkpoint into the container in this example.

This command tells vLLM to load <checkpoint_path>, but the container only sees its own filesystem. Without a bind mount here, the documented local-checkpoint flow fails unless the checkpoint is already baked into the image.

💡 Suggested doc fix

```diff
 docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .

-docker run --gpus all -p 8000:8000 vllm-modelopt \
+docker run --gpus all -p 8000:8000 \
+    -v <checkpoint_dir>:<checkpoint_dir> \
+    vllm-modelopt \
     python -m vllm.entrypoints.openai.api_server \
         --model <checkpoint_path> \
         --quantization modelopt \
         --host 0.0.0.0 --port 8000
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/references/setup.md around lines 77 - 84, The
example run command launches vllm.entrypoints.openai.api_server with --model
<checkpoint_path> but does not bind-mount the local checkpoint into the
container; update the docker run invocation to include a bind mount (e.g., -v
/path/on/host:/model_dir) and change the --model argument to the checkpoint path
inside the container (e.g., /model_dir/<checkpoint_file>) so the process in the
container can access the checkpoint.


```text
Options:
--model PATH           Model path or HF model ID (required for start)
--framework FRAMEWORK  vllm, sglang, or trtllm (default: vllm)
```

⚠️ Potential issue | 🟠 Major

Don't advertise trtllm as a supported --framework yet.

usage() and the dispatcher treat TRT-LLM as a first-class backend, but start_trtllm() only prints manual instructions and exits 1. As written, the documented start --framework trtllm flow always fails.

Also applies to: 182-190, 239-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh at line 67, The usage text and
dispatcher currently advertise and accept "trtllm" as a FRAMEWORK but
start_trtllm() only prints manual instructions and exits non-zero, causing start
--framework trtllm to always fail; remove "trtllm" from the supported framework
list in usage() and the framework-dispatch paths (places that parse --framework)
or implement a real startup flow in start_trtllm() so it behaves like
start_vllm()/start_sglang(); specifically update the usage() string (the line
showing "--framework FRAMEWORK"), adjust any framework-dispatch logic that
treats "trtllm" as first-class, and either fully implement start_trtllm() or
stop advertising "trtllm" until it is implemented.

Comment on lines +216 to +217
```bash
nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
echo $! >"$PID_FILE"
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

sed -n '216,217p;234,235p;303,318p' .claude/skills/deployment/scripts/deploy.sh

echo
bash -lc '
sleep 30 &
pid=$!
ps -o pid=,pgid=,comm= -p $$ -p $pid
kill $pid
'
```

Repository: NVIDIA/Model-Optimizer

Length of output: 867


🏁 Script executed:

```bash
cat -n .claude/skills/deployment/scripts/deploy.sh | head -50
```

Repository: NVIDIA/Model-Optimizer

Length of output: 2001


🏁 Script executed:

```bash
# Get full context around the start and stop functions
sed -n '200,240p; 290,320p' .claude/skills/deployment/scripts/deploy.sh | cat -n
```

Repository: NVIDIA/Model-Optimizer

Length of output: 2432


🏁 Script executed:

```bash
# Test process group behavior in non-interactive bash (as scripts run)
bash -c '
echo "=== Test 1: nohup with background ==="
nohup bash -c "sleep 30; echo done" >/tmp/test1.log 2>&1 &
pid=$!
echo "Parent shell PGID: $$"
echo "Child PID: $pid"
# Cannot use ps in this sandbox, so check what kill would do
echo "Would kill -0 \$pid work: kill -0 $pid && echo yes || echo no"
kill -0 $pid && echo "Process exists" || echo "Process does not exist"
kill $pid 2>/dev/null

echo
echo "=== Test 2: What group ID would kill -- -PID target? ==="
bash -c "sleep 30 &
pid=\$!
echo Attempted to save PID: \$pid
# Show what we are trying to kill
echo Would use: kill -- -\$pid" &
sleep 0.5
'
```

Repository: NVIDIA/Model-Optimizer

Length of output: 315


🏁 Script executed:

```bash
# Check the actual code to see if there's any special handling of process groups
rg "setsid|setpgid|set -m" .claude/skills/deployment/scripts/deploy.sh
```

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

```bash
# Look at the server startup command - what does cmd contain?
sed -n '200,220p' .claude/skills/deployment/scripts/deploy.sh
```

Repository: NVIDIA/Model-Optimizer

Length of output: 595


Use setsid to make the started process a proper process group leader.

The current code with nohup ... & followed by echo $! >"$PID_FILE" does not guarantee the saved PID is a process group leader in non-interactive bash (job control is disabled by default in scripts). This means kill -- -"$pid" will target the wrong group or fail silently, leaving forked child processes orphaned even though the fallback prevents an error.

Wrap the startup command with setsid:

```bash
nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &
```

This applies to both vLLM (lines 216–217) and SGLang (lines 234–235) start functions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 216 - 217, The
startup sequence uses nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 & and then writes echo
$! >"$PID_FILE", which does not guarantee the saved PID is a process group
leader; change the invocation to use setsid so the launched process becomes a
proper process group leader (use nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &),
and apply this change in both start routines that reference the cmd array,
LOG_FILE and PID_FILE (the vLLM and SGLang start functions) so that subsequent
kill -- -"$pid" targets the correct process group.


codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.18%. Comparing base (24ceba6) to head (c8d55ce).

Additional details and impacted files
```text
@@           Coverage Diff           @@
##             main    #1133   +/-   ##
=======================================
  Coverage   70.18%   70.18%
=======================================
  Files         230      230
  Lines       26045    26045
=======================================
  Hits        18279    18279
  Misses       7766     7766
```

☔ View full report in Codecov by Sentry.

Add a Claude Code skill for serving quantized or unquantized LLM
checkpoints as OpenAI-compatible API endpoints. Supports vLLM, SGLang,
and TRT-LLM (including AutoDeploy).

Features:
- Auto-detect quantization format from hf_quant_config.json
- Framework recommendation based on model/format/use case
- GPU memory estimation and tensor parallelism guidance
- Server lifecycle management via deploy.sh script
- Health check verification and API testing
- Remote deployment via SSH/SLURM
- Optional throughput/latency benchmarking

Includes framework-specific references (vllm.md, sglang.md, trtllm.md),
support matrix, and setup guide.

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv force-pushed the kaix/agent-deployment branch 2 times, most recently from c282546 to 5378a84, on March 29, 2026 at 19:20
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
.claude/skills/deployment/scripts/deploy.sh (1)

473-473: ⚠️ Potential issue | 🟠 Major

Restart command loses required --model argument.

When restart is invoked without arguments, stop_server deletes META_FILE, then start_server fails because --model is required but not recovered. Either preserve metadata across restart or require --model for restart.

🐛 Proposed fix — preserve metadata for restart

```diff
-    restart) stop_server; sleep 2; start_server ;;
+    restart)
+        # Preserve metadata before stop deletes it
+        if [[ -f "$META_FILE" && -z "$MODEL" ]]; then
+            MODEL=$(grep '^MODEL=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            FRAMEWORK=$(grep '^FRAMEWORK=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            PORT=$(grep '^PORT=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            QUANTIZATION=$(grep '^QUANTIZATION=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+            TP_SIZE=$(grep '^TP_SIZE=' "$META_FILE" | cut -d= -f2- | tr -d "'")
+        fi
+        stop_server
+        sleep 2
+        start_server
+        ;;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh at line 473, The restart command
currently calls stop_server then start_server which removes META_FILE in
stop_server so start_server cannot recover the required --model argument; modify
restart to preserve metadata by either (A) changing stop_server to not delete
META_FILE when invoked from restart (add a flag or mode parameter like
stop_server(restart_mode=true)) or (B) alter restart to read and pass the saved
model value to start_server (read META_FILE before calling stop_server and
forward the --model value into start_server). Update references in the restart
handler and the stop_server/start_server implementations accordingly and ensure
META_FILE and the --model handling logic remain consistent.
🧹 Nitpick comments (3)
.claude/skills/deployment/scripts/deploy.sh (2)

333-333: Unused loop variable i (shellcheck SC2034).

The variable i is declared but unused in the for loop. This is harmless but can be silenced.

🧹 Proposed fix

```diff
-    for i in {1..15}; do
+    for _ in {1..15}; do
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh at line 333, The for loop
currently declares an unused variable with "for i in {1..15}; do" which triggers
shellcheck SC2034; replace the unused loop variable with a throwaway name (e.g.,
"_" or "unused") or rewrite the loop as a while loop so no unused variable is
created; update the loop header where "for i in {1..15}; do" appears (and any
matching "done") so the body executes 15 times without referencing "i".

35-35: Unused variable SCRIPT_DIR.

Shellcheck flagged this as unused. Either remove it or add a comment if it's intended for future use.

🧹 Proposed fix

```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"  # Reserved for future use
```

Or simply remove the line if not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh at line 35, The variable
assignment SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" is unused;
either remove that line from deploy.sh or, if you intend to keep it as a
placeholder, add an explicit comment and silence ShellCheck (e.g., add a comment
like "intentionally unused placeholder" and a ShellCheck disable for SC2034) so
the unused-variable warning is resolved; refer to the SCRIPT_DIR assignment to
locate the change.
.claude/skills/deployment/SKILL.md (1)

137-145: Add FP4 note for SGLang similar to vLLM.

Line 135 notes "For NVFP4 checkpoints, use --quantization modelopt_fp4" for vLLM, but the SGLang section doesn't include a similar note. Add consistency.

📝 Suggested addition after line 145
For NVFP4 checkpoints, use `--quantization modelopt_fp4`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/SKILL.md around lines 137 - 145, The SGLang
launch_server docs are missing the FP4 note present in vLLM; update the SGLang
section that shows the python -m sglang.launch_server usage (the block with
--quantization modelopt) to add a line stating that for NVFP4 checkpoints you
should use --quantization modelopt_fp4 so the CLI examples are consistent with
the vLLM note.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 48-51: The log_* functions (log_info, log_success, log_warn,
log_error) currently print to stdout and therefore corrupt captured command
substitution outputs (e.g., calls in detect_quantization); change each function
to write to stderr by redirecting the printf output to >&2 so any use of $(...)
will only capture the command output and not log lines, preserving existing
messages and formats but routing them to stderr.
- Around line 127-134: The fallback JSON parsing that sets quant_method
incorrectly reads qc.get('quant_type', 'fp8'), causing ModelOpt-exported configs
to be missed; update the Python snippet that computes quant_method (the python3
-c block that writes to quant_method) to use qc.get('quant_algo', 'fp8') instead
of qc.get('quant_type', 'fp8') so ModelOpt's quantization field is recognized
when hf_quant_config.json is absent.

In @.claude/skills/deployment/SKILL.md:
- Around line 15-33: Update the example commands to use the
repository-root-correct path or add a working-directory note: replace
occurrences of scripts/deploy.sh with
.claude/skills/deployment/scripts/deploy.sh in the examples (e.g., the four
start/test/status/stop commands) or add a single-line note above the code block
stating the examples assume you run them from .claude/skills/deployment/ so
users know to adjust the path when invoking scripts/deploy.sh.
- Around line 254-261: Update the table row for the `quantization="modelopt" not
recognized` error to require SGLang >= 0.5 instead of the current `SGLang >=
0.4.10`; locate the row by the exact cell text `quantization="modelopt" not
recognized` and replace the `SGLang >= 0.4.10` string with `SGLang >= 0.5`
(leave the vLLM entry `vLLM >= 0.10.1` as-is or change to a “latest version
recommended” note if desired).

---

Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Line 473: The restart command currently calls stop_server then start_server
which removes META_FILE in stop_server so start_server cannot recover the
required --model argument; modify restart to preserve metadata by either (A)
changing stop_server to not delete META_FILE when invoked from restart (add a
flag or mode parameter like stop_server(restart_mode=true)) or (B) alter restart
to read and pass the saved model value to start_server (read META_FILE before
calling stop_server and forward the --model value into start_server). Update
references in the restart handler and the stop_server/start_server
implementations accordingly and ensure META_FILE and the --model handling logic
remain consistent.

---

Nitpick comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Line 333: The for loop currently declares an unused variable with "for i in
{1..15}; do" which triggers shellcheck SC2034; replace the unused loop variable
with a throwaway name (e.g., "_" or "unused") or rewrite the loop as a while
loop so no unused variable is created; update the loop header where "for i in
{1..15}; do" appears (and any matching "done") so the body executes 15 times
without referencing "i".
- Line 35: The variable assignment SCRIPT_DIR="$(cd "$(dirname
"${BASH_SOURCE[0]}")" && pwd)" is unused; either remove that line from deploy.sh
or, if you intend to keep it as a placeholder, add an explicit comment and
silence ShellCheck (e.g., add a comment like "intentionally unused placeholder"
and a ShellCheck disable for SC2034) so the unused-variable warning is resolved;
refer to the SCRIPT_DIR assignment to locate the change.

In @.claude/skills/deployment/SKILL.md:
- Around line 137-145: The SGLang launch_server docs are missing the FP4 note
present in vLLM; update the SGLang section that shows the python -m
sglang.launch_server usage (the block with --quantization modelopt) to add a
line stating that for NVFP4 checkpoints you should use --quantization
modelopt_fp4 so the CLI examples are consistent with the vLLM note.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b627486a-bec6-4491-9f41-909c529fd071

📥 Commits

Reviewing files that changed from the base of the PR and between 4b39327 and 7a5ce0b.

📒 Files selected for processing (8)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/deployment/references/setup.md
  • .claude/skills/deployment/references/sglang.md
  • .claude/skills/deployment/references/support-matrix.md
  • .claude/skills/deployment/references/trtllm.md
  • .claude/skills/deployment/references/vllm.md
  • .claude/skills/deployment/scripts/deploy.sh
  • .markdownlint-cli2.yaml
✅ Files skipped from review due to trivial changes (6)
  • .markdownlint-cli2.yaml
  • .claude/skills/deployment/references/trtllm.md
  • .claude/skills/deployment/references/vllm.md
  • .claude/skills/deployment/references/setup.md
  • .claude/skills/deployment/references/sglang.md
  • .claude/skills/deployment/references/support-matrix.md

Comment on lines +15 to +33
```bash
# Start vLLM server with a ModelOpt checkpoint
scripts/deploy.sh start --model ./qwen3-0.6b-fp8

# Start with SGLang and tensor parallelism
scripts/deploy.sh start --model ./llama-70b-nvfp4 --framework sglang --tp 4

# Start from HuggingFace hub
scripts/deploy.sh start --model nvidia/Llama-3.1-8B-Instruct-FP8

# Test the API
scripts/deploy.sh test

# Check status
scripts/deploy.sh status

# Stop
scripts/deploy.sh stop
```

⚠️ Potential issue | 🟡 Minor

Clarify script path relative to repository root.

The examples use scripts/deploy.sh but the script is at .claude/skills/deployment/scripts/deploy.sh. Users running from repo root would need the full path.

📝 Suggested clarification

Either update paths to be absolute from repo root:

.claude/skills/deployment/scripts/deploy.sh start --model ./qwen3-0.6b-fp8

Or add a note that examples assume the working directory is .claude/skills/deployment/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/SKILL.md around lines 15 - 33, Update the example
commands to use the repository-root-correct path or add a working-directory
note: replace occurrences of scripts/deploy.sh with
.claude/skills/deployment/scripts/deploy.sh in the examples (e.g., the four
start/test/status/stop commands) or add a single-line note above the code block
stating the examples assume you run them from .claude/skills/deployment/ so
users know to adjust the path when invoking scripts/deploy.sh.

Comment on lines +254 to +261
| Error | Cause | Fix |
|-------|-------|-----|
| `CUDA out of memory` | Model too large for GPU(s) | Increase `--tensor-parallel-size` or use a smaller model |
| `quantization="modelopt" not recognized` | vLLM/SGLang version too old | Upgrade: vLLM >= 0.10.1, SGLang >= 0.4.10 |
| `hf_quant_config.json not found` | Not a ModelOpt-exported checkpoint | Re-export with `export_hf_checkpoint()`, or remove `--quantization` flag |
| `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors |
| `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` |


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

vLLM modelopt quantization minimum version requirement

💡 Result:

vLLM's ModelOpt quantization requires a minimum CUDA compute capability of 8.9 (Hopper GPUs like H100 and newer). There is no explicit minimum vLLM software version requirement stated in the official documentation; support is available in recent versions as evidenced by the current docs (published 2025-10-23). ModelOpt quantization was added before April 2025, with extensions like Mixtral support merged around then (PR #15961). Use the latest vLLM version for full compatibility.


🌐 Web query:

SGLang modelopt quantization version requirement minimum

💡 Result:

SGLang's native ModelOpt quantization support requires SGLang version 0.5 or later. This feature was introduced via merged pull requests #7149 (merged October 6, 2025), #9991 (merged October 11, 2025), and #10154, with full integration announced in the LMSYS blog post on December 2, 2025. Official documentation lists --quantization modelopt, --quantization modelopt_fp8, --quantization modelopt_fp4, and --modelopt-quant (e.g., 'fp8', 'nvfp4') as supported options, requiring installation of nvidia-modelopt via pip install nvidia-modelopt. Earlier versions had partial ModelOpt support (e.g., PR #5145 merged April 8, 2025, for auto-detection), but native quantization workflow mandates post-October 2025 releases. Use the latest version (e.g., v0.5.10rc0 as of March 2026) for optimal compatibility with Hopper/Blackwell GPUs.


🌐 Web query:

vLLM modelopt quantization support version history when added

💡 Result:

vLLM added support for ModelOpt (NVIDIA Model Optimizer) quantization in September 2024 via PR #6112, merged on 2024-09-11. This introduced the "modelopt" quantization method for loading ModelOpt FP8 static scaling checkpoints, usable as LLM(model="path", quantization="modelopt"). It requires CUDA compute capability 8.9+ (Hopper GPUs). Subsequent enhancements include KV cache scaling support (PR #11787, merged Jan 2025) and MXFP8/MXFP4 support in later versions (e.g., v0.14.0+ for some formats). ModelOpt supports FP8 variants like per-tensor, per-channel-per-token, FP8_PB_WO, and mixed precision with NVFP4, using torch._scaled_mm for static weight/activation scales.


Correct SGLang version requirement for ModelOpt quantization support.

The error table states SGLang >= 0.4.10 for --quantization modelopt support, but documentation indicates SGLang >= 0.5 is required. Native ModelOpt quantization support was introduced in SGLang v0.5 (full integration December 2025). Update line 257 to reflect: SGLang >= 0.5.

For vLLM, while official documentation does not explicitly state a minimum version, v0.10.1 is a reasonable conservative estimate given ModelOpt support was added in September 2024; this can remain as-is or be updated to "latest version recommended" if you prefer to match vLLM's official guidance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/SKILL.md around lines 254 - 261, Update the table
row for the `quantization="modelopt" not recognized` error to require SGLang >=
0.5 instead of the current `SGLang >= 0.4.10`; locate the row by the exact cell
text `quantization="modelopt" not recognized` and replace the `SGLang >= 0.4.10`
string with `SGLang >= 0.5` (leave the vLLM entry `vLLM >= 0.10.1` as-is or
change to a “latest version recommended” note if desired).

@kaix-nv kaix-nv force-pushed the kaix/agent-deployment branch from 5378a84 to 3b8a62e Compare March 30, 2026 03:05

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
.claude/skills/deployment/scripts/deploy.sh (3)

48-51: ⚠️ Potential issue | 🔴 Critical

Send log output to stderr.

These helpers are still used inside $(...), so gpu_count=$(detect_gpu) and QUANTIZATION=$(detect_quantization ...) capture the log lines along with the real value. That makes the TP comparison non-numeric and can pass a multi-line --quantization value downstream.

🐛 Proposed fix
-log_info()    { printf "${BLUE}[INFO]${NC} %s\n" "$1"; }
-log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1"; }
-log_warn()    { printf "${YELLOW}[WARN]${NC} %s\n" "$1"; }
-log_error()   { printf "${RED}[ERROR]${NC} %s\n" "$1"; }
+log_info()    { printf "${BLUE}[INFO]${NC} %s\n" "$1" >&2; }
+log_success() { printf "${GREEN}[OK]${NC} %s\n" "$1" >&2; }
+log_warn()    { printf "${YELLOW}[WARN]${NC} %s\n" "$1" >&2; }
+log_error()   { printf "${RED}[ERROR]${NC} %s\n" "$1" >&2; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 48 - 51, The log
helper functions log_info, log_success, log_warn, and log_error are writing to
stdout and polluting command substitutions like gpu_count=$(detect_gpu); change
each helper to write to stderr instead (so logs don't get captured in $(...)).
Update the implementations of log_info/log_success/log_warn/log_error to direct
their printf output to file descriptor 2 (stderr) so command substitutions only
capture the actual function return/echoed values, not the log lines.
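A minimal runnable sketch of why the redirect matters (function and variable names here are illustrative, not deploy.sh's actual code): anything a helper writes to stdout is captured by `$(...)`, while stderr output is not.

```shell
# Illustrative helpers; only the redirect differs.
log_stdout() { printf '[INFO] %s\n' "$1"; }       # pollutes $(...) captures
log_stderr() { printf '[INFO] %s\n' "$1" >&2; }   # safe inside $(...)

detect_gpu_bad()  { log_stdout "Detecting GPUs..."; echo 2; }
detect_gpu_good() { log_stderr "Detecting GPUs..."; echo 2; }

bad=$(detect_gpu_bad)      # captures the log line AND the value (multi-line)
good=$(detect_gpu_good)    # captures only the value

echo "bad='$bad'"
echo "good='$good'"
```

With the stderr variant, a numeric comparison like `[[ $good -gt 1 ]]` works; the stdout variant yields a multi-line string that breaks it.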

142-145: ⚠️ Potential issue | 🟠 Major

Read quant_algo in the config.json fallback.

This branch still looks up quantization_config.quant_type, but the exported ModelOpt metadata uses quant_algo. FP4 checkpoints that only have config.json will therefore fall back to the default and be launched as modelopt instead of modelopt_fp4.

🐛 Proposed fix
 if qc.get('quant_method') == 'modelopt':
-    print(qc.get('quant_type', 'fp8'))
+    print(qc.get('quant_algo', 'fp8'))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 142 - 145, The
script currently checks qc.get('quant_type') when qc.get('quant_method') ==
'modelopt' but the ModelOpt metadata uses the key 'quant_algo' in config.json,
causing fp4 checkpoints to fall back incorrectly; update the fallback read to
also read 'quant_algo' from the config JSON (e.g., parse model_path/config.json
into a dict and check ['quant_algo'] if qc lacks 'quant_type') and map/translate
that value into the expected qc['quant_type'] (e.g., treat 'fp4' as
'modelopt_fp4') before launching; adjust the branch that references qc,
quant_method, and quant_type so it prefers qc['quant_type'] but falls back to
config['quant_algo'] mapped appropriately.

382-383: ⚠️ Potential issue | 🟠 Major

Make restart reload the previous deployment config.

The normal stop path deletes server.meta, and the restart branch immediately calls start_server with empty globals. ./deploy.sh restart therefore still falls into --model is required unless the caller repeats the full start command.

Also applies to: 542-542

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 382 - 383, The
restart path currently removes "$META_FILE" (server.meta) before calling
start_server, so restart loses the previous deployment config and triggers the
"--model is required" error; modify the restart logic to preserve or reload the
existing META_FILE when performing a restart (i.e., do not rm -f "$META_FILE" in
the restart branch or copy/restore it into the expected globals before calling
start_server) and ensure start_server reads globals from META_FILE if present
(referencing PID_FILE, META_FILE, start_server, and log_success to locate the
logic).
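A sketch of a restart path that reloads the saved config before calling the start logic; the `server.meta` layout and variable names are assumptions, not deploy.sh's exact format:

```shell
# Write metadata the way a prior start might have (printf %q keeps it sourceable).
META_FILE=server.meta
printf 'MODEL=%q\nFRAMEWORK=%q\nPORT=%q\n' ./qwen3-0.6b-fp8 vllm 8000 > "$META_FILE"

MODEL= FRAMEWORK= PORT=        # simulate the empty globals at restart time
if [[ -f "$META_FILE" ]]; then
    # shellcheck source=/dev/null
    source "$META_FILE"        # restore MODEL, FRAMEWORK, PORT before start_server
fi
echo "restarting $FRAMEWORK with $MODEL on port $PORT"
```

With this, `restart` can proceed even though `stop` would otherwise have discarded the flags the caller passed to the original `start`.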
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 389-396: The force-kill path currently logs an error if ps -p
"$pid" still finds the process but then proceeds to rm -f "$PID_FILE"
"$META_FILE" and continue; change this so that after running kill -9 (both group
and pid attempts) you check ps -p "$pid" and if the process still exists call
log_error and exit non-zero (e.g. exit 1) without removing "$PID_FILE" or
"$META_FILE" or proceeding further; reference the existing kill -9 -- -"$pid" /
kill -9 "$pid", the ps -p "$pid" check, log_error, and the PID_FILE/META_FILE
removals to locate where to add the early exit on failure.
- Around line 343-344: The health-check curl invocations in deploy.sh (the if
that runs curl -s "http://localhost:$PORT/health") treat non-2xx HTTP responses
as success; modify those curl calls to use the -f/--fail flag (e.g., change curl
-s to curl -sf) so curl exits non-zero on HTTP 400+ responses, and apply the
same change to the other health check curl block referenced around lines 411-415
so both checks properly fail on non-2xx responses.

---

Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 48-51: The log helper functions log_info, log_success, log_warn,
and log_error are writing to stdout and polluting command substitutions like
gpu_count=$(detect_gpu); change each helper to write to stderr instead (so logs
don't get captured in $(...)). Update the implementations of
log_info/log_success/log_warn/log_error to direct their printf output to file
descriptor 2 (stderr) so command substitutions only capture the actual function
return/echoed values, not the log lines.
- Around line 142-145: The script currently checks qc.get('quant_type') when
qc.get('quant_method') == 'modelopt' but the ModelOpt metadata uses the key
'quant_algo' in config.json, causing fp4 checkpoints to fall back incorrectly;
update the fallback read to also read 'quant_algo' from the config JSON (e.g.,
parse model_path/config.json into a dict and check ['quant_algo'] if qc lacks
'quant_type') and map/translate that value into the expected qc['quant_type']
(e.g., treat 'fp4' as 'modelopt_fp4') before launching; adjust the branch that
references qc, quant_method, and quant_type so it prefers qc['quant_type'] but
falls back to config['quant_algo'] mapped appropriately.
- Around line 382-383: The restart path currently removes "$META_FILE"
(server.meta) before calling start_server, so restart loses the previous
deployment config and triggers the "--model is required" error; modify the
restart logic to preserve or reload the existing META_FILE when performing a
restart (i.e., do not rm -f "$META_FILE" in the restart branch or copy/restore
it into the expected globals before calling start_server) and ensure
start_server reads globals from META_FILE if present (referencing PID_FILE,
META_FILE, start_server, and log_success to locate the logic).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3d11a608-5df5-494b-b088-2a22e54bb58f

📥 Commits

Reviewing files that changed from the base of the PR and between 5378a84 and 3b8a62e.

📒 Files selected for processing (3)
  • .claude/skills/deployment/evals/remote-slurm-deployment.json
  • .claude/skills/deployment/evals/vllm-fp8-local.json
  • .claude/skills/deployment/scripts/deploy.sh
✅ Files skipped from review due to trivial changes (2)
  • .claude/skills/deployment/evals/remote-slurm-deployment.json
  • .claude/skills/deployment/evals/vllm-fp8-local.json

Comment on lines +343 to +344
if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
log_success "Server is ready! (${elapsed}s)"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "deploy.sh" -type f | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1895


🌐 Web query:

curl -s flag behavior non-2xx responses exit code 404 503

💡 Result:

The -s or --silent flag in curl makes it silent by suppressing the progress meter and error messages. It does not affect the exit code. By default, curl exits with code 0 for any HTTP response (including non-2xx like 404 or 503), as long as the transfer completes successfully, and outputs the response body to stdout. To make curl exit with a non-zero code (22) on HTTP 400+ responses (including 404 and 503): - Use -f or --fail: suppresses response body output on error. - Use --fail-with-body (curl 7.75.0+): outputs response body but still exits with 22. With -s alone, for 404/503 you get exit code 0, no progress meter, no error msg, but response body printed. Example: curl -s http://example.com/404 # exits 0, prints 404 page silently curl -s -f http://example.com/404 # exits 22, no output (silent due to -s) curl -s --fail-with-body http://example.com/404 # exits 22, prints 404 page (no error msg due to -s) To see errors with -s, combine with -S or --show-error: shows error messages on non-zero exit but keeps silent otherwise. Official sources confirm: curl considers HTTP response receipt a success unless --fail used; --silent only mutes progress/error display, not exit behavior.


🌐 Web query:

curl -f -s -S flags meaning --fail --silent --show-error

💡 Result:

The curl flags -f -s -S correspond to --fail --silent --show-error. - -f or --fail: (HTTP) Fail silently (no output at all on stdout) on HTTP server errors (response codes >=400). curl exits with error code 22 instead of outputting the error page. Not fail-safe for auth-related codes like 401/407. - -s or --silent: Silent mode. Hides the progress meter and error messages. - -S or --show-error: When used with -s, shows error messages on stderr despite silence mode. This combination is commonly used in scripts: be quiet normally, but show errors on failure (e.g., to stderr) and exit non-zero on HTTP errors. Example usage detects failures without progress bars but displays error details.


🌐 Web query:

curl --max-time timeout health check endpoint reasonable value

💡 Result:

A reasonable value for --max-time in curl for health check endpoints is 10 seconds. This provides sufficient time for typical health checks to complete while preventing hangs on unresponsive services. Pair it with --connect-timeout 5 for connection establishment. Common examples from sources: Docker HEALTHCHECK uses --timeout=10s with curl --max-time effectively 10s; general health checks use --max-time 10-15s with --connect-timeout 5s; in load-balancer contexts this aligns with AWS ALB's default HealthCheckTimeoutSeconds=5s (extend for safety) and NGINX defaults of ~5-10s intervals. Example command: curl --connect-timeout 5 --max-time 10 -f -s http://localhost/health || exit 1. Adjust based on endpoint response time: 2x expected max response time, minimum 5-10s.


Treat non-2xx /health responses as failures.

The curl -s commands exit with code 0 on HTTP error responses (404, 503), allowing the health checks to incorrectly report success. Use -f to make curl exit with non-zero on HTTP 400+ errors.

Proposed fix
-        if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+        if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
             log_success "Server is ready! (${elapsed}s)"
             return 0
         fi
@@
-    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if ! curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
         log_error "Server not responding at port $PORT"
         exit 1
     fi

Also applies to lines 411-415.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
-            log_success "Server is ready! (${elapsed}s)"
+        if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
+            log_success "Server is ready! (${elapsed}s)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 343 - 344, The
health-check curl invocations in deploy.sh (the if that runs curl -s
"http://localhost:$PORT/health") treat non-2xx HTTP responses as success; modify
those curl calls to use the -f/--fail flag (e.g., change curl -s to curl -sf) so
curl exits non-zero on HTTP 400+ responses, and apply the same change to the
other health check curl block referenced around lines 411-415 so both checks
properly fail on non-2xx responses.
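The exit-code difference is easy to demonstrate: a local server that always returns 503 stands in for an unhealthy endpoint (port 8913 is an arbitrary choice for this sketch):

```shell
# Tiny always-503 server in the background.
python3 - <<'PY' &
from http.server import BaseHTTPRequestHandler, HTTPServer
class H(BaseHTTPRequestHandler):
    def do_GET(self): self.send_response(503); self.end_headers()
    def log_message(self, *a): pass
HTTPServer(("127.0.0.1", 8913), H).serve_forever()
PY
srv=$!
sleep 1

curl -s   --max-time 5 http://127.0.0.1:8913/health >/dev/null 2>&1; plain=$?
curl -fsS --max-time 5 http://127.0.0.1:8913/health >/dev/null 2>&1; failing=$?
echo "plain exit=$plain, with -f exit=$failing"   # -f turns the 503 into a non-zero exit

kill "$srv" 2>/dev/null
```

Without `-f`, the probe reports success on a 503; with it, the readiness loop keeps waiting as intended.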

Comment on lines +389 to +396
# Force kill
log_warn "Force killing..."
kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
sleep 1
if ps -p "$pid" >/dev/null 2>&1; then
log_error "Failed to kill server process $pid — manual intervention required"
fi
rm -f "$PID_FILE" "$META_FILE"

⚠️ Potential issue | 🟠 Major

Don't report success after a failed force-kill.

If the process is still alive after kill -9, this path logs an error but then continues as if the stop succeeded. That leaves a live server untracked and makes the next deploy fail later with a confusing port conflict.

🐛 Proposed fix
     sleep 1
     if ps -p "$pid" >/dev/null 2>&1; then
         log_error "Failed to kill server process $pid — manual intervention required"
+        return 1
     fi
     rm -f "$PID_FILE" "$META_FILE"

Also applies to: 404-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 389 - 396, The
force-kill path currently logs an error if ps -p "$pid" still finds the process
but then proceeds to rm -f "$PID_FILE" "$META_FILE" and continue; change this so
that after running kill -9 (both group and pid attempts) you check ps -p "$pid"
and if the process still exists call log_error and exit non-zero (e.g. exit 1)
without removing "$PID_FILE" or "$META_FILE" or proceeding further; reference
the existing kill -9 -- -"$pid" / kill -9 "$pid", the ps -p "$pid" check,
log_error, and the PID_FILE/META_FILE removals to locate where to add the early
exit on failure.
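A sketch of that guarded stop path; file names mirror the review, and the `sleep` stands in for the server process:

```shell
PID_FILE=server.pid; META_FILE=server.meta
sleep 300 & pid=$!
echo "$pid" > "$PID_FILE"; : > "$META_FILE"

force_stop() {
    kill -9 "$pid" 2>/dev/null || true
    sleep 1
    if ps -p "$pid" >/dev/null 2>&1; then
        echo "[ERROR] $pid survived kill -9; keeping state files" >&2
        return 1               # leave PID_FILE/META_FILE intact for a retry
    fi
    rm -f "$PID_FILE" "$META_FILE"   # only clear state once the process is gone
}

force_stop && echo "stopped cleanly" || echo "stop failed"
```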

@kaix-nv kaix-nv force-pushed the kaix/agent-deployment branch from 3b8a62e to 6be8e1f Compare March 30, 2026 03:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
.claude/skills/deployment/scripts/deploy.sh (4)

389-396: ⚠️ Potential issue | 🟠 Major

Don't clear state after a failed force-kill.

If the process is still alive after kill -9, this path logs an error and then immediately deletes server.pid/server.meta and reports success. That leaves a live server untracked and turns the next start or restart into a confusing port conflict.

Suggested fix
     sleep 1
     if ps -p "$pid" >/dev/null 2>&1; then
         log_error "Failed to kill server process $pid — manual intervention required"
+        return 1
     fi
     rm -f "$PID_FILE" "$META_FILE"

Also applies to: 404-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 389 - 396, The
post-force-kill path currently removes the PID_FILE and META_FILE regardless of
whether the process identified by $pid was actually killed; when ps -p "$pid"
indicates the process is still alive you must not delete state. Change the logic
around the kill/check in the shutdown sequence (the block with log_warn, kill -9
-- -"$pid", ps -p "$pid", log_error) so that rm -f "$PID_FILE" "$META_FILE" only
runs when ps -p "$pid" confirms the process is gone (i.e., on successful kill),
and leave the files intact and return a non-success exit (or propagate error)
when log_error is triggered, to avoid orphaning a live server with removed
state.

67-67: ⚠️ Potential issue | 🟠 Major

Stop advertising trtllm as a working --framework.

start --framework trtllm still bottoms out in start_trtllm() printing manual steps and returning 1, so the documented CLI path is a guaranteed failure. That also conflicts with .claude/skills/deployment/SKILL.md:112-158 and .claude/skills/deployment/references/trtllm.md:1-56, which present TRT-LLM as a supported deployment path.

Also applies to: 244-246, 312-336

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh at line 67, The help string and
documentation advertise "--framework trtllm" as a supported option even though
start_trtllm() merely prints manual steps and returns failure; either remove
"trtllm" from the CLI option list and all help/docs (references: the help string
in deploy.sh, SKILL.md sections, and references/trtllm.md) or implement
start_trtllm() to perform a proper deployment and return success; update every
occurrence of "trtllm" in the deploy.sh help (--framework) and the duplicated
locations noted so help/docs and the start_trtllm() behavior remain consistent.

343-345: ⚠️ Potential issue | 🟠 Major

Make /health probes fail on HTTP errors.

curl -s exits 0 for HTTP 404/503, so readiness and test can report success while the server is still unhealthy. Use --fail and a bounded timeout for both probes.

Suggested fix
-        if curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+        if curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then
@@
-    if ! curl -s "http://localhost:$PORT/health" >/dev/null 2>&1; then
+    if ! curl -fsS --max-time 5 "http://localhost:$PORT/health" >/dev/null; then

Also applies to: 411-415

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 343 - 345, The
readiness probe currently uses curl -s which returns success on HTTP error
codes; update the curl calls that check "http://localhost:$PORT/health" (the if
curl -s ... then block and the similar probe around lines 411-415) to use --fail
plus a bounded timeout (e.g. --max-time) and keep --silent/--show-error as
desired so curl exits non‑zero on HTTP 4xx/5xx; leave the surrounding logic
(log_success, return 0, PORT) intact and apply the same change to both probe
locations.

271-272: ⚠️ Potential issue | 🟠 Major

Launch the server in its own process group.

stop_server() relies on kill -- -"$pid", but nohup ... & does not guarantee that $! is a process-group leader in non-interactive Bash. If vLLM or SGLang fork workers, the fallback only kills the parent and can leave serving processes behind.

Suggested fix
-    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &
@@
-    nohup "${cmd[@]}" >"$LOG_FILE" 2>&1 &
+    nohup setsid "${cmd[@]}" >"$LOG_FILE" 2>&1 &

Also applies to: 298-299, 377-377, 391-391

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 271 - 272, Start
the server so it becomes a process-group leader (so stop_server() can kill the
whole group) by replacing the plain nohup invocation with a process-group-aware
launch (e.g., use setsid or start-stop-daemon to start "${cmd[@]}" so the
backgrounded $! is the group leader) and ensure you still redirect output to
LOG_FILE and record the PID to PID_FILE; apply the same change for each
occurrence that currently uses nohup (the blocks that build cmd and write to
LOG_FILE/PID_FILE). Ensure stop_server() continues to read PID_FILE and use kill
-- -"$pid" against the recorded PID (which must be the group leader).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 182-190: is_server_running() currently trusts any live PID read
from PID_FILE and may act on a reused PID; update it to verify the process
identity before returning success: after reading PID from PID_FILE, check that
the process command line or executable matches the expected server (e.g.,
compare ps -p "$pid" -o cmd= or /proc/"$pid"/cmdline against the known
$SERVER_CMD or $APP_NAME) or alternatively verify the PID is listening on the
expected port (ss/lsof). Do the same validation in stop_server() so it only
signals the verified server PID; apply the same change to the other similar
block referenced (around stop logic at lines 372-377) to avoid killing unrelated
processes.
- Around line 227-233: The server metadata writes currently strip single quotes
and lose exact values (see variables META_FILE, FRAMEWORK, MODEL, PORT,
QUANTIZATION, TP_SIZE) so round-tripping and explicit restart flags break;
change the metadata format to a non-lossy representation (e.g., JSON or
shell-quoted via printf %q) and write the exact variable values to META_FILE,
and update the restart/merge logic so it reads that metadata and merges it with
CLI args with CLI options taking precedence (i.e., only use stored values for
options not explicitly provided). Apply the same fix to the other metadata write
sites in this script (the other server.meta write occurrences) so all
reads/writes are lossless and merged consistently.
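A sketch of the JSON option for lossless metadata; the file name and field set follow the review's description, not deploy.sh's actual format, and on restart any explicit CLI flags would override these stored values:

```shell
META_FILE=server.meta.json
MODEL="./models/llama with 'quotes'"   # deliberately awkward value
TP_SIZE=4

# Write: json.dumps handles quoting, so the value round-trips exactly.
MODEL="$MODEL" TP_SIZE="$TP_SIZE" python3 - > "$META_FILE" <<'PY'
import json, os
print(json.dumps({"model": os.environ["MODEL"], "tp": int(os.environ["TP_SIZE"])}))
PY

# Read back (e.g., during restart) without losing characters.
restored_model=$(python3 -c "import json; print(json.load(open('$META_FILE'))['model'])")
echo "restored: $restored_model"
```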

---

Duplicate comments:
In @.claude/skills/deployment/scripts/deploy.sh:
- Around line 389-396: The post-force-kill path currently removes the PID_FILE
and META_FILE regardless of whether the process identified by $pid was actually
killed; when ps -p "$pid" indicates the process is still alive you must not
delete state. Change the logic around the kill/check in the shutdown sequence
(the block with log_warn, kill -9 -- -"$pid", ps -p "$pid", log_error) so that
rm -f "$PID_FILE" "$META_FILE" only runs when ps -p "$pid" confirms the process
is gone (i.e., on successful kill), and leave the files intact and return a
non-success exit (or propagate error) when log_error is triggered, to avoid
orphaning a live server with removed state.
- Line 67: The help string and documentation advertise "--framework trtllm" as a
supported option even though start_trtllm() merely prints manual steps and
returns failure; either remove "trtllm" from the CLI option list and all
help/docs (references: the help string in deploy.sh, SKILL.md sections, and
references/trtllm.md) or implement start_trtllm() to perform a proper deployment
and return success; update every occurrence of "trtllm" in the deploy.sh help
(--framework) and the duplicated locations noted so help/docs and the
start_trtllm() behavior remain consistent.
- Around line 343-345: The readiness probe currently uses curl -s which returns
success on HTTP error codes; update the curl calls that check
"http://localhost:$PORT/health" (the if curl -s ... then block and the similar
probe around lines 411-415) to use --fail plus a bounded timeout (e.g.
--max-time) and keep --silent/--show-error as desired so curl exits non‑zero on
HTTP 4xx/5xx; leave the surrounding logic (log_success, return 0, PORT) intact
and apply the same change to both probe locations.
- Around line 271-272: Start the server so it becomes a process-group leader (so
stop_server() can kill the whole group) by replacing the plain nohup invocation
with a process-group-aware launch (e.g., use setsid or start-stop-daemon to
start "${cmd[@]}" so the backgrounded $! is the group leader) and ensure you
still redirect output to LOG_FILE and record the PID to PID_FILE; apply the same
change for each occurrence that currently uses nohup (the blocks that build cmd
and write to LOG_FILE/PID_FILE). Ensure stop_server() continues to read PID_FILE
and use kill -- -"$pid" against the recorded PID (which must be the group
leader).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 112bd8c6-264a-4a6d-b23e-67cea4e2ee22

📥 Commits

Reviewing files that changed from the base of the PR and between 3b8a62e and 6be8e1f.

📒 Files selected for processing (3)
  • .claude/skills/deployment/evals/remote-slurm-deployment.json
  • .claude/skills/deployment/evals/vllm-fp8-local.json
  • .claude/skills/deployment/scripts/deploy.sh
✅ Files skipped from review due to trivial changes (2)
  • .claude/skills/deployment/evals/remote-slurm-deployment.json
  • .claude/skills/deployment/evals/vllm-fp8-local.json

Comment on lines +182 to +190
is_server_running() {
if [[ -f "$PID_FILE" ]]; then
local pid
pid=$(cat "$PID_FILE")
if ps -p "$pid" >/dev/null 2>&1; then
return 0
fi
fi
return 1

⚠️ Potential issue | 🟠 Major

Don't trust server.pid without verifying the process identity.

is_server_running() accepts any live PID, and stop_server() then signals it. If the original server died and the PID was reused, start will refuse to launch and stop can kill an unrelated process group. Validate something stable (expected command line, start time, or listening port) before treating the PID as this server.

Also applies to: 372-377

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/deployment/scripts/deploy.sh around lines 182 - 190,
is_server_running() currently trusts any live PID read from PID_FILE and may act
on a reused PID; update it to verify the process identity before returning
success: after reading PID from PID_FILE, check that the process command line or
executable matches the expected server (e.g., compare ps -p "$pid" -o cmd= or
/proc/"$pid"/cmdline against the known $SERVER_CMD or $APP_NAME) or
alternatively verify the PID is listening on the expected port (ss/lsof). Do the
same validation in stop_server() so it only signals the verified server PID;
apply the same change to the other similar block referenced (around stop logic
at lines 372-377) to avoid killing unrelated processes.
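A sketch of an identity-checked variant; the `vllm|sglang` pattern is an assumed way to recognize "our" server, not deploy.sh's actual check:

```shell
PID_FILE=server.pid

is_server_running() {
    [[ -f "$PID_FILE" ]] || return 1
    local pid; pid=$(cat "$PID_FILE")
    ps -p "$pid" >/dev/null 2>&1 || return 1
    # A live but reused PID fails here: its command line won't match the server.
    ps -p "$pid" -o args= | grep -Eq 'vllm|sglang'
}

sleep 60 & echo $! > "$PID_FILE"     # live PID, but not our server
is_server_running && echo "running" || echo "stale or foreign pid"
kill "$(cat "$PID_FILE")" 2>/dev/null
```

Because `stop_server()` would call the same predicate, a recycled PID is treated as stale metadata rather than a target to signal.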

Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv force-pushed the kaix/agent-deployment branch from 6be8e1f to c8d55ce Compare March 30, 2026 03:58