Skip to content

Conversation

@nv-nmailhot
Copy link
Contributor

@nv-nmailhot nv-nmailhot commented Dec 10, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Chores
    • Enhanced CI test diagnostics with verbose debugging information for improved troubleshooting and visibility
    • Streamlined nightly CI testing to focus on the primary framework, optimizing build and test pipeline efficiency

✏️ Tip: You can customize this high-level summary in your review settings.

@nv-nmailhot nv-nmailhot requested a review from a team as a code owner December 10, 2025 18:45
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 10, 2025

Walkthrough

These changes add extensive diagnostic debugging blocks to pytest execution for container environment insights and refocus the nightly CI workflow to test exclusively with trtllm framework, commenting out vllm and sglang matrix entries across build and test stages.

Changes

Cohort / File(s) Summary
Pytest Action Debugging
\.github/actions/pytest/action.yml``
Introduces verbose DEBUG echo statements for container diagnostics (environment checks, GPU detection, pytest command, image tag, runtime flags) before test execution and after results collection. Adds exit-code interpretation for crash scenarios (127, 139, 137).
Nightly CI Framework Focus
\.github/workflows/nightly-ci.yml``
Removes vllm and sglang framework entries from matrix definitions across build, unit test, integration test, and e2e test stages. Restricts execution to trtllm framework only for both AMD64 and ARM64 architectures. Comments out alternative framework example blocks.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • .github/workflows/nightly-ci.yml — Verify that all matrix entries are consistently modified across stages and that the workflow structure remains valid after framework narrowing
  • .github/actions/pytest/action.yml — Confirm exit-code interpretations (127, 139, 137) align with intended failure handling logic

Poem

🐰 A bunny hopped through CI lanes so bright,
With trtllm only, focused and tight!
Debug echoes whisper secrets of shells,
While vllm sleeps, and sglang bids farewell—
One framework to test, the nightly spark gleams! ✨

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description contains only placeholder template text with no substantive information filled in. All sections are empty or contain template examples like '#xxx'. Fill in all required sections: provide an Overview of the changes, describe the Details of modifications made, specify which files reviewers should start with, and reference any related GitHub issues with proper action keywords.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: adding verbose debugging to the nightly pipeline. It aligns with the workflow and action file modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
.github/actions/pytest/action.yml (2)

122-124: Consider gating extensive debug output behind an input flag.

These DEBUG statements provide useful diagnostics but will print on every test run. Consider adding an optional debug_mode input parameter to control their verbosity, so they can be enabled selectively during troubleshooting rather than always active.

Add to inputs section:

  debug_mode:
    description: 'Enable verbose debug output for diagnostics'
    required: false
    default: 'false'

Then conditionally print:

-        echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}"
-        echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}"
-        echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}"
+        if [[ "${{ inputs.debug_mode }}" == "true" ]]; then
+          echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}"
+          echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}"
+          echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}"
+        fi

138-145: Also gate exit code interpretation behind debug_mode flag.

For consistency with the DEBUG statements above, these exit code explanations should also be conditional. They're helpful for diagnostics but shouldn't spam logs on every run.

+        if [[ "${{ inputs.debug_mode }}" == "true" ]]; then
           if [[ ${TEST_EXIT_CODE} -eq 127 ]]; then
             echo "❌ DEBUG: Exit code 127 = Command not found (pytest or bash missing from container)"
           elif [[ ${TEST_EXIT_CODE} -eq 139 ]]; then
             echo "❌ DEBUG: Exit code 139 = Segmentation fault (SIGSEGV) - likely a crash in pytest or imported modules"
           elif [[ ${TEST_EXIT_CODE} -eq 137 ]]; then
             echo "❌ DEBUG: Exit code 137 = OOM killed (SIGKILL)"
           fi
+        fi
.github/workflows/nightly-ci.yml (3)

31-32: Document expected timeline for reverting temporary matrix restriction.

The matrix is intentionally narrowed to trtllm for exit code debugging. Consider adding a TODO or future milestone comment indicating when full framework matrix should be restored (e.g., after exit codes 127 and 139 are resolved).

-        # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64)
+        # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64)
+        # TODO: Restore vllm and sglang once exit codes 127/139 are resolved

238-239: Consistent matrix narrowing across all test jobs.

All test jobs (unit, integration, e2e-single-gpu, e2e-multi-gpu) consistently use the temporary framework reduction. This aligns with the build jobs and supports focused debugging of exit codes 127/139. Add TODO comments across all jobs to indicate when full framework matrix restoration is expected.

For maintainability, consider extracting the temporary restriction as a workflow env variable at the top of the file to make it easier to restore all matrices at once:

env:
  REGISTRY_IMAGE: ai-dynamo/dynamo
  NIGHTLY_IMAGE_PREFIX: nightly
  # TODO: Restore to [vllm, trtllm, sglang] once exit codes 127/139 are resolved
  ACTIVE_FRAMEWORKS: [trtllm]

Then use framework: ${{ fromJson(env.ACTIVE_FRAMEWORKS) }} in all matrix definitions. This makes the reversion a single-line change rather than updating multiple files/jobs.

Also applies to: 310-311, 376-377, 443-444


299-299: ARM64 dry-run mode is appropriate but document the intent.

All ARM64 test jobs use dry_run: true (collect-only mode) while AMD64 runs full tests. This is appropriate since ARM64 runners are CPU-only (cpu-arm-r8g-4xlarge) without GPU support. Consider adding an inline comment explaining this design choice for future maintainers.

-          dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }}
+          # Note: arm64 runners (cpu-arm-r8g-4xlarge) lack GPU; use dry-run to collect tests without execution
+          dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }}

Also applies to: 365-365, 432-432, 499-499

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf269db and b9515a4.

📒 Files selected for processing (2)
  • .github/actions/pytest/action.yml (3 hunks)
  • .github/workflows/nightly-ci.yml (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Learnt from: bhuvan002
Repo: ai-dynamo/dynamo PR: 2702
File: components/backends/trtllm/src/dynamo/trtllm/logits_processing/adapter.py:7-13
Timestamp: 2025-08-28T20:43:54.701Z
Learning: CI import failures for optional dependencies like tensorrt_llm can be resolved by removing __init__.py files and restructuring imports, rather than wrapping imports in try/except blocks.
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • .github/workflows/nightly-ci.yml
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4864/merge) by nv-nmailhot.
.github/actions/pytest/action.yml

[error] 1-1: Trailing whitespace hook failed. Files were modified by the hook during pre-commit run.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/nightly-ci.yml (1)

126-141: Consistent use of temporary framework narrowing.

Build-arm64 matrix correctly comments out vllm and sglang while keeping trtllm active, consistent with the amd64 build job. No concerns with this change—add the same TODO comment suggested above for consistency.

Comment on lines +61 to +96
# DEBUG: Run diagnostic commands inside the container first
echo "=========================================="
echo "🔍 DEBUG: Running container diagnostics"
echo "=========================================="
echo "📋 Checking container environment..."
docker run --rm ${{ inputs.image_tag }} bash -c "
echo '--- Container Info ---'
uname -a
echo ''
echo '--- Python Location ---'
which python || echo 'python not found'
which python3 || echo 'python3 not found'
echo ''
echo '--- Python Version ---'
python --version 2>&1 || python3 --version 2>&1 || echo 'Python not available'
echo ''
echo '--- Pytest Location ---'
which pytest || echo 'pytest not found in PATH'
echo ''
echo '--- Pytest Version ---'
pytest --version 2>&1 || echo 'pytest command failed'
echo ''
echo '--- PATH ---'
echo \$PATH
echo ''
echo '--- /workspace contents ---'
ls -la /workspace 2>/dev/null || echo '/workspace does not exist or is empty'
echo ''
echo '--- pip list (pytest related) ---'
pip list 2>/dev/null | grep -i pytest || echo 'pip list failed or no pytest packages'
" || echo "DEBUG: Container diagnostic command failed with exit code: $?"
echo "=========================================="
echo "🔍 DEBUG: Diagnostics complete"
echo "=========================================="
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix trailing whitespace on lines 93 and 97.

The pipeline failure indicates trailing whitespace in this section. Lines with empty diagnostic output headers need cleanup.

         echo "=========================================="
         echo "🔍 DEBUG: Running container diagnostics"
         echo "=========================================="
-        
+
         echo "📋 Checking container environment..."
         ...
         " || echo "DEBUG: Container diagnostic command failed with exit code: $?"
-        
+
         echo "=========================================="
🤖 Prompt for AI Agents
In .github/actions/pytest/action.yml around lines 61 to 96: there are trailing
spaces at the ends of the empty diagnostic header lines (reported as lines 93
and 97) — remove the trailing whitespace at those line ends so the blank header
lines contain no spaces/tabs, save the file, and re-run the pipeline (or run a
trim-eol/whitespace tidy) to ensure no other end-of-line trailing whitespace
remains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants