
Add Agent Evaluation skill for accuracy benchmarking#1132

Open
kaix-nv wants to merge 3 commits into `main` from `kaix/agent-evaluation`

Conversation


@kaix-nv (Contributor) commented Mar 28, 2026

What does this PR do?

Type of change: New feature

Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator Launcher (NEL). Based on the upstream nel-assistant skill with ModelOpt-specific additions:

  • Auto-detect ModelOpt quantization format from hf_quant_config.json (with config.json fallback) and set the correct vLLM/SGLang --quantization flag
  • Quantization-aware benchmark defaults — recommend MMLU, GSM8K, ARC-Challenge for quantized models (sensitive to precision loss)
  • Workspace management for multi-user environments (Step 0)
  • Progressive disclosure — model card research checklist and multi-node patterns extracted to references/ for on-demand loading
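The auto-detect step above can be sketched as a small helper (hypothetical code, not part of the skill; note that the review thread below settles on the unified `--quantization modelopt` flag for all ModelOpt formats, which is what this sketch returns):

```python
import json
import tempfile
from pathlib import Path


def detect_quant_flag(checkpoint_path):
    """Return a vLLM --quantization value for a ModelOpt checkpoint, or None.

    Checks hf_quant_config.json first, then falls back to config.json's
    quantization_config (quant_method == "modelopt").
    """
    ckpt = Path(checkpoint_path)
    hf_quant = ckpt / "hf_quant_config.json"
    if hf_quant.is_file():
        algo = json.loads(hf_quant.read_text()).get("quantization", {}).get("quant_algo")
        if algo:  # FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ, ... all use the unified flag
            return "modelopt"
    config = ckpt / "config.json"
    if config.is_file():
        qc = json.loads(config.read_text()).get("quantization_config", {})
        if qc.get("quant_method") == "modelopt":
            return "modelopt"
    return None  # unquantized checkpoint: no flag needed


# Quick demonstration with a fake quantized checkpoint directory:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "hf_quant_config.json").write_text(
        json.dumps({"quantization": {"quant_algo": "NVFP4"}})
    )
    print(detect_quant_flag(d))  # modelopt

with tempfile.TemporaryDirectory() as d:
    print(detect_quant_flag(d))  # None
```

The skill itself does this with shell checks rather than Python; the helper only illustrates the lookup order (hf_quant_config.json, then config.json).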

Skill structure

evaluation/
├── SKILL.md                              310 lines (core 8-step workflow)
├── references/
│   ├── model-card-research.md            Sampling params, reasoning config, ARM64, pre_cmd
│   └── multi-node.md                     HAProxy multi-instance, Ray TP/PP patterns
└── evals/
    └── nemotron3-nano-bf16-reasoning.json

The skill guides users through: NEL installation check → config generation via nel skills build-config → model card research → parameter tuning → task selection → multi-node setup → interceptors → execution with dry-run/test/full modes.

Depends on: #1107 (common files: remote_exec.sh, workspace-management.md, environment-setup.md)

Testing

Invoke in Claude Code:

claude -p "evaluate outputs/Qwen3-0.6B-FP8 on mmlu"

Before your PR is "Ready for review"

  • Is this change backward compatible?: N/A (new feature)
  • If you copied code from any other sources, did you follow guidance in CONTRIBUTING.md: ✅ (NEL skill attributed in frontmatter)
  • Did you write any new necessary tests?: N/A (skill evals provided separately)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Added comprehensive, step‑by‑step guidance for running model evaluations with NeMo Evaluator Launcher, model‑card research, multi‑node patterns, monitoring/SLURM tips, and environment/setup instructions.
  • New Features

    • Added many ready‑to‑use evaluation scenarios and configs: Nemotron, Qwen, reasoning/SGLang, quantized checkpoints (FP8/NVFP4), external deployments, interceptors/logging, workspace reuse, safety multilingual, code benchmarks with W&B export, and dry‑run/test flows.
  • Chores

    • Extended Markdown lint rules.

Add a Claude Code skill for evaluating LLM accuracy using NeMo Evaluator
Launcher (NEL). Based on the upstream nel-assistant skill (commit f1fa073)
with ModelOpt-specific additions:

- Auto-detect ModelOpt quantization format from hf_quant_config.json
  and set the correct vLLM/SGLang --quantization flag
- Quantization-aware benchmark defaults (recommend MMLU, GSM8K,
  ARC-Challenge for quantized models)
- Workspace management for multi-user environments (Step 0)
- Disable MD036/MD029 markdownlint rules for upstream NEL formatting

The skill guides users through NEL config generation, model card
research, and evaluation execution (local and SLURM).

Signed-off-by: Kai Xu <kaix@nvidia.com>

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough

Adds a new NeMo evaluation skill and many evaluation manifests providing interactive, stepwise workflows for running accuracy evaluations via NeMo Evaluator Launcher (NEL), plus reference docs for model-card extraction and multi-node patterns; also updates Markdown lint rules.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Skill doc & references: `.claude/skills/evaluation/SKILL.md`, `.claude/skills/evaluation/references/model-card-research.md`, `.claude/skills/evaluation/references/multi-node.md` | New primary skill documentation and reference guides covering the end-to-end NEL workflow, model-card extraction, workspace reuse, multi-node patterns, `pre_cmd` notes, and interceptor placement. |
| Evaluation manifests: `.claude/skills/evaluation/evals/...` (`nemotron3-nano-bf16-reasoning.json`, `base-model-local-execution.json`, `external-deployment-eval.json`, `interceptor-configuration.json`, `multi-node-evaluation.json`, `nel-not-installed.json`, `nvfp4-auto-detect-quantization.json`, `quantized-checkpoint-local-vllm.json`, `reasoning-model-sglang.json`, `safety-multilingual-benchmarks.json`, `wandb-export-code-benchmarks.json`, `workspace-reuse-from-ptq.json`) | Added evaluation JSON manifests for varied scenarios (local/external deployments, quantized checkpoints, NVFP4/FP8 auto-detection, reasoning setups, SGLang/SLURM, W&B export, interceptors, multi-node patterns, nel-not-installed checks, workspace reuse). Each manifest defines interactive expected behaviors and config-fill flows for `nel skills build-config` and `nel run`. |
| Lint configuration: `.markdownlint-cli2.yaml` | Disabled markdownlint rules MD029 and MD036 (ordered-list prefix format and emphasis-as-headings). |
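For reference, a `.markdownlint-cli2.yaml` fragment disabling those two rules might look like the following (sketch only; rule IDs are from this PR, file layout follows markdownlint-cli2's config schema):

```yaml
# .markdownlint-cli2.yaml (fragment)
config:
  MD029: false  # ordered-list prefix format, for upstream NEL numbering style
  MD036: false  # emphasis-as-heading, for bold lines used as section labels
```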

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant User as User
  participant Skill as Evaluation Skill
  participant NEL as NEL (CLI)
  participant Deploy as Runtime (vLLM / SGLang)
  participant Store as Model Storage / Workspace

  User->>Skill: invoke evaluation (provide model / choices)
  Skill->>NEL: run "nel skills build-config" (collect answers)
  NEL-->>Skill: generated config (with auto-detected quantization & overrides)
  Skill->>Store: read checkpoint & hf_quant_config.json (auto-detect)
  Skill->>Deploy: prepare deployment (pre_cmd, extra_args)
  Skill->>NEL: run "nel run" (dry-run -> test -> full)
  NEL->>Deploy: start/target evaluation jobs
  Deploy->>Store: load checkpoint
  Deploy-->>NEL: task results/logs
  NEL-->>Skill: status/info/logs
  Skill-->>User: present results / task list / monitoring pointers
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title 'Add Agent Evaluation skill for accuracy benchmarking' accurately describes the main change: introducing a new evaluation skill for benchmarking model accuracy using the NeMo Evaluator Launcher. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Security Anti-Patterns | ✅ Passed | This PR adds only documentation and JSON configuration files with no Python code changes, making security anti-patterns review not applicable. |



github-actions bot commented Mar 28, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1132/

Built to branch gh-pages at 2026-03-29 18:20 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 269-275: Update the SKILL.md snippet that documents
NEMO_EVALUATOR_TRUST_PRE_CMD to include a clear security warning: note that
setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 enables execution of pre_cmd and post_cmd
which run arbitrary shell commands with the evaluator's privileges, instruct
users to review pre_cmd content, only trust configs from known sources, and be
cautious when using nemo_skills.* self-deployment tasks; reference the
environment variable name (NEMO_EVALUATOR_TRUST_PRE_CMD) and the config keys
(pre_cmd, post_cmd, nemo_skills.*) so readers can find and audit them.
- Around line 100-116: Update the documentation in SKILL.md to remove the
incorrect per-algorithm vLLM flag mapping and instead state that if
hf_quant_config.json exists (read quantization.quant_algo), vLLM uses a single
unified flag --quantization modelopt which auto-detects NVFP4, W4A8_AWQ, FP8,
etc.; replace the table and related lines with a concise statement: "If
hf_quant_config.json exists, vLLM auto-detects the quantization format and you
should pass --quantization modelopt (no format-specific flags required)."

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 697ac2e8-f1a5-41d4-9870-adb1ff730091

📥 Commits

Reviewing files that changed from the base of the PR and between 24ceba6 and bf4941b.

📒 Files selected for processing (3)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
  • .markdownlint-cli2.yaml

Comment on lines +100 to +116
**Auto-detect ModelOpt quantization format** (checkpoint paths only):

Check for `hf_quant_config.json` in the checkpoint directory:

```bash
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
```

If found, read `quantization.quant_algo` and set the correct vLLM/SGLang quantization flag in `deployment.extra_args`:

| `quant_algo` | Flag to add |
|-------------|-------------|
| `FP8` | `--quantization modelopt` |
| `W4A8_AWQ` | `--quantization modelopt` |
| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |

If no `hf_quant_config.json`, the checkpoint is unquantized — no flag needed.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Check for existing ModelOpt quantization format references in the codebase

echo "=== Searching for hf_quant_config.json references ==="
rg -n 'hf_quant_config' -C3

echo -e "\n=== Searching for quantization.quant_algo references ==="
rg -n 'quant_algo' -C3

echo -e "\n=== Searching for FP8, W4A8_AWQ, NVFP4 algorithm references ==="
rg -n 'FP8|W4A8_AWQ|NVFP4' -C2
```

Repository: NVIDIA/Model-Optimizer

Length of output: 50379


🌐 Web query:

What are the supported quantization formats in vLLM for ModelOpt quantized models? Specifically, does vLLM support --quantization modelopt and --quantization modelopt_fp4 flags?

💡 Result:

vLLM supports ModelOpt quantized models via the --quantization modelopt flag. This includes formats like FP8 (ModelOpt FP8), NVFP4 (a 4-bit floating point format from NVIDIA Model Optimizer), MXFP8, and mixed precision (e.g., FP8 dense + NVFP4 MoE). There is no specific --quantization modelopt_fp4 flag; FP4 support (NVFP4) is handled under the unified modelopt quantization method, auto-detected from the model's quantization config (e.g., hf_quant_config.json or quantization_config in config.json). For CLI usage, specify --quantization modelopt when loading ModelOpt-quantized checkpoints from Hugging Face, as shown in examples like nvidia/Llama-3.1-8B-Instruct-FP8.



Incorrect vLLM flag mapping for ModelOpt quantization auto-detection.

The JSON path quantization.quant_algo is correct and the algorithm names (FP8, W4A8_AWQ, NVFP4, NVFP4_AWQ) are confirmed in the ModelOpt codebase. However, the proposed flag mapping is incorrect:

  1. --quantization modelopt_fp4 does not exist in vLLM. vLLM provides a single unified flag: --quantization modelopt, which auto-detects the quantization format from the model's quantization config (either hf_quant_config.json or config.json's quantization_config field).

  2. NVFP4 is auto-detected, not mapped to a separate flag. vLLM automatically recognizes NVFP4, W4A8_AWQ, FP8, and other formats when quant_algo is present in the quantization config.

Remove the flag mapping table and replace with: "If hf_quant_config.json exists, vLLM auto-detects the quantization format and applies --quantization modelopt automatically. No additional format-specific flags are needed."
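Concretely, the corrected guidance would yield a config fragment like this (illustrative values; `deployment.extra_args` is the config key the skill already uses):

```yaml
# NEL config fragment (illustrative): one unified flag for all ModelOpt
# formats; vLLM reads quant_algo from hf_quant_config.json itself.
deployment:
  extra_args: "--quantization modelopt"
```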

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 100 - 116, Update the
documentation in SKILL.md to remove the incorrect per-algorithm vLLM flag
mapping and instead state that if hf_quant_config.json exists (read
quantization.quant_algo), vLLM uses a single unified flag --quantization
modelopt which auto-detects NVFP4, W4A8_AWQ, FP8, etc.; replace the table and
related lines with a concise statement: "If hf_quant_config.json exists, vLLM
auto-detects the quantization format and you should pass --quantization modelopt
(no format-specific flags required)."

Comment on lines +269 to +275
```bash
# If using pre_cmd or post_cmd:
export NEMO_EVALUATOR_TRUST_PRE_CMD=1

# If using nemo_skills.* tasks with self-deployment:
export DUMMY_API_KEY=dummy
```

⚠️ Potential issue | 🟡 Minor

Consider security implications of NEMO_EVALUATOR_TRUST_PRE_CMD.

Line 271 sets NEMO_EVALUATOR_TRUST_PRE_CMD=1 to enable pre_cmd execution. Since pre_cmd can run arbitrary shell commands (including downloads via curl as shown in line 149), this environment variable effectively disables a security safeguard.

While necessary for the workflow, consider adding a security note warning users to:

  1. Review pre_cmd content in configs before running
  2. Only trust configs from known sources
  3. Understand that pre_cmd runs with the same privileges as NEL
🛡️ Proposed security notice
**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.

```diff
-# If using pre_cmd or post_cmd:
+# If using pre_cmd or post_cmd (review commands first - they execute with your privileges):
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
# If using pre_cmd or post_cmd:
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
# If using nemo_skills.* tasks with self-deployment:
export DUMMY_API_KEY=dummy
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 269 - 275, Update the
SKILL.md snippet that documents NEMO_EVALUATOR_TRUST_PRE_CMD to include a clear
security warning: note that setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 enables
execution of pre_cmd and post_cmd which run arbitrary shell commands with the
evaluator's privileges, instruct users to review pre_cmd content, only trust
configs from known sources, and be cautious when using nemo_skills.*
self-deployment tasks; reference the environment variable name
(NEMO_EVALUATOR_TRUST_PRE_CMD) and the config keys (pre_cmd, post_cmd,
nemo_skills.*) so readers can find and audit them.


codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.18%. Comparing base (24ceba6) to head (68fe89e).

Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main    #1132      +/-   ##
==========================================
+ Coverage   70.15%   70.18%   +0.02%
==========================================
  Files         230      230
  Lines       26045    26045
==========================================
+ Hits        18273    18279       +6
+ Misses       7772     7766       -6
```
☔ View full report in Codecov by Sentry.

Signed-off-by: Kai Xu <kaix@nvidia.com>
coderabbitai bot left a comment

♻️ Duplicate comments (2)
.claude/skills/evaluation/SKILL.md (2)

194-199: ⚠️ Potential issue | 🟠 Major

Add explicit security warning for NEMO_EVALUATOR_TRUST_PRE_CMD.

Line 195 enables trusted command execution, but the snippet lacks a caution that pre_cmd/post_cmd run arbitrary shell commands with evaluator privileges. Please add an explicit warning in this section.

Proposed doc fix
```diff
-# If using pre_cmd or post_cmd:
+# If using pre_cmd or post_cmd:
+# WARNING: NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows `pre_cmd`/`post_cmd` execution
+# from config with your current privileges. Review `pre_cmd`, `post_cmd`, and
+# `nemo_skills.*` task settings, and only run trusted configs.
 export NEMO_EVALUATOR_TRUST_PRE_CMD=1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 194 - 199, Add a clear
security warning near the NEMO_EVALUATOR_TRUST_PRE_CMD export: explain that
setting NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows evaluator to run arbitrary shell
commands via pre_cmd and post_cmd with evaluator privileges, warn against
enabling it in untrusted environments or on production hosts, and recommend
alternatives (avoid enabling, run in isolated sandbox/container, or validate
commands) while referencing the environment variable name
NEMO_EVALUATOR_TRUST_PRE_CMD and the pre_cmd/post_cmd hooks so readers know
exactly which settings are risky.

108-115: ⚠️ Potential issue | 🔴 Critical

Incorrect ModelOpt flag mapping for NVFP4 in vLLM docs path.

Line 114 maps NVFP4 / NVFP4_AWQ to --quantization modelopt_fp4, which is likely invalid; this should use the unified ModelOpt quantization flag flow instead. This can misconfigure deployments at runtime.

Proposed doc fix
```diff
-| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |
-| Other values | Try `--quantization modelopt`; consult vLLM/SGLang docs if unsure |
+| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt` |
+| Other values | Use `--quantization modelopt`; verify backend support in docs if unsure |
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 108 - 115, The doc
incorrectly maps NVFP4/NVFP4_AWQ to the unsupported flag `--quantization
modelopt_fp4`; update the mapping logic so that when reading
quantization.quant_algo you add the unified ModelOpt flag (e.g., `--quantization
modelopt`) into deployment.extra_args for NVFP4/NVFP4_AWQ instead of
`modelopt_fp4`, and add a note clarifying that vLLM/SGLang use `modelopt` for
NVIDIA ModelOpt formats and to consult vLLM docs for any newer flag names;
ensure the table rows referencing `NVFP4`, `NVFP4_AWQ`, `modelopt_fp4` are
replaced with `--quantization modelopt` and adjust any code/comments that
conditionally look for `modelopt_fp4`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 194-199: Add a clear security warning near the
NEMO_EVALUATOR_TRUST_PRE_CMD export: explain that setting
NEMO_EVALUATOR_TRUST_PRE_CMD=1 allows evaluator to run arbitrary shell commands
via pre_cmd and post_cmd with evaluator privileges, warn against enabling it in
untrusted environments or on production hosts, and recommend alternatives (avoid
enabling, run in isolated sandbox/container, or validate commands) while
referencing the environment variable name NEMO_EVALUATOR_TRUST_PRE_CMD and the
pre_cmd/post_cmd hooks so readers know exactly which settings are risky.
- Around line 108-115: The doc incorrectly maps NVFP4/NVFP4_AWQ to the
unsupported flag `--quantization modelopt_fp4`; update the mapping logic so that
when reading quantization.quant_algo you add the unified ModelOpt flag (e.g.,
`--quantization modelopt`) into deployment.extra_args for NVFP4/NVFP4_AWQ
instead of `modelopt_fp4`, and add a note clarifying that vLLM/SGLang use
`modelopt` for NVIDIA ModelOpt formats and to consult vLLM docs for any newer
flag names; ensure the table rows referencing `NVFP4`, `NVFP4_AWQ`,
`modelopt_fp4` are replaced with `--quantization modelopt` and adjust any
code/comments that conditionally look for `modelopt_fp4`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 626f8ed2-3b5d-4a8f-b618-c68007312945

📥 Commits

Reviewing files that changed from the base of the PR and between bf4941b and a5eb3a6.

📒 Files selected for processing (15)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/evals/base-model-local-execution.json
  • .claude/skills/evaluation/evals/external-deployment-eval.json
  • .claude/skills/evaluation/evals/interceptor-configuration.json
  • .claude/skills/evaluation/evals/multi-node-evaluation.json
  • .claude/skills/evaluation/evals/nel-not-installed.json
  • .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json
  • .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
  • .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
  • .claude/skills/evaluation/evals/reasoning-model-sglang.json
  • .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
  • .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
  • .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
  • .claude/skills/evaluation/references/model-card-research.md
  • .claude/skills/evaluation/references/multi-node.md
✅ Files skipped from review due to trivial changes (13)
  • .claude/skills/evaluation/evals/nel-not-installed.json
  • .claude/skills/evaluation/evals/interceptor-configuration.json
  • .claude/skills/evaluation/evals/external-deployment-eval.json
  • .claude/skills/evaluation/evals/reasoning-model-sglang.json
  • .claude/skills/evaluation/evals/wandb-export-code-benchmarks.json
  • .claude/skills/evaluation/evals/nvfp4-auto-detect-quantization.json
  • .claude/skills/evaluation/evals/workspace-reuse-from-ptq.json
  • .claude/skills/evaluation/evals/base-model-local-execution.json
  • .claude/skills/evaluation/evals/safety-multilingual-benchmarks.json
  • .claude/skills/evaluation/evals/multi-node-evaluation.json
  • .claude/skills/evaluation/evals/quantized-checkpoint-local-vllm.json
  • .claude/skills/evaluation/references/model-card-research.md
  • .claude/skills/evaluation/evals/nemotron3-nano-bf16-reasoning.json

Signed-off-by: Kai Xu <kaix@nvidia.com>
@Edwardf0t1 left a comment

Left a few comments - I think overall it's in a good shape and aligned well with the design we discussed 👍

@@ -0,0 +1,310 @@
---
name: evaluation
description: Evaluate accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Use when user says "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel", or needs to measure how quantization affects model quality. Handles model deployment, config generation, and evaluation execution.

Similar to my comment in the deployment skills PR, we can add some negative triggers as well.


After the dry-run, check the output from `nel` for any problems with the config. If there are no problems, propose to first execute the test run with limited samples and then execute the full evaluation. If there are problems, resolve them before executing the full evaluation.

**Monitoring Progress**

Again, echo to my comment here, how about we move the monitoring section to a standalone skills (run-and-monitor)? e.g., replace with: "After submission, use the run-and-monitor skill for progress tracking, log inspection, and failure diagnosis. See run-and-monitor/references/nel-execution.md."


If no `hf_quant_config.json`, also check `config.json` for a `quantization_config` section with `quant_method: "modelopt"`. If neither is found, the checkpoint is unquantized — no flag needed.

**Quantization-aware benchmark defaults:**

I think we can consider extracting quantization benchmarks including benchmark sensitivity ranking and recommended sets to a reference file, e.g., references/quantization-benchmarks.md, so it can be reused by the compare-results skills later.

The reason to have a compare-results skill is that evaluation is about configuring and running NEL, while compare-results is about interpreting and acting on results from multiple runs.


This is a nice reference - do you think it's better to move model-card-research.md to common/? Since deployment also needs model card research, if both skills reference the same patterns, it should be shared.

When you have all the answers, run the script to build the base config:

```bash
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
```

I think we need to verify benchmark categories to match NEL's build-config CLI - If NEL's categories changed, we need to update accordingly.

| `quant_algo` | Flag to add |
|-------------|-------------|
| `FP8` | `--quantization modelopt` |
| `W4A8_AWQ` | `--quantization modelopt` |
| `NVFP4`, `NVFP4_AWQ` | `--quantization modelopt_fp4` |

I think --quantization modelopt_fp4 is not needed for vllm - we can align with the deployment skill's support-matrix.md.


Similar to https://github.com/NVIDIA/Model-Optimizer/pull/1133/changes#r3013918883 we can align on a convention for the tests.

Also, later we can add more tests, such as SLURM execution, external endpoint (no self-deployment), quantized model where benchmark recommendation triggers, and a case where existing workspace from PTQ is reused.
