add verbose debugging to understand nightly pipeline #4864

nv-nmailhot · 2025-12-10T18:45:33Z

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Chores
- Enhanced CI test diagnostics with verbose debugging information for improved troubleshooting and visibility
- Streamlined nightly CI testing to focus on the primary framework, optimizing build and test pipeline efficiency

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-10T18:49:28Z

Walkthrough

These changes add extensive diagnostic debugging blocks to pytest execution for container environment insights and refocus the nightly CI workflow to test exclusively with trtllm framework, commenting out vllm and sglang matrix entries across build and test stages.

Changes

Cohort / File(s)	Summary
Pytest Action Debugging `\`.github/actions/pytest/action.yml``	Introduces verbose DEBUG echo statements for container diagnostics (environment checks, GPU detection, pytest command, image tag, runtime flags) before test execution and after results collection. Adds exit-code interpretation for crash scenarios (127, 139, 137).
Nightly CI Framework Focus `\`.github/workflows/nightly-ci.yml``	Removes vllm and sglang framework entries from matrix definitions across build, unit test, integration test, and e2e test stages. Restricts execution to trtllm framework only for both AMD64 and ARM64 architectures. Comments out alternative framework example blocks.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

.github/workflows/nightly-ci.yml — Verify that all matrix entries are consistently modified across stages and that the workflow structure remains valid after framework narrowing
.github/actions/pytest/action.yml — Confirm exit-code interpretations (127, 139, 137) align with intended failure handling logic

Poem

🐰 A bunny hopped through CI lanes so bright,
With trtllm only, focused and tight!
Debug echoes whisper secrets of shells,
While vllm sleeps, and sglang bids farewell—
One framework to test, the nightly spark gleams! ✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request description contains only placeholder template text with no substantive information filled in. All sections are empty or contain template examples like '#xxx'.	Fill in all required sections: provide an Overview of the changes, describe the Details of modifications made, specify which files reviewers should start with, and reference any related GitHub issues with proper action keywords.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main changes: adding verbose debugging to the nightly pipeline. It aligns with the workflow and action file modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

.github/actions/pytest/action.yml (2)
122-124: Consider gating extensive debug output behind an input flag.

These DEBUG statements provide useful diagnostics but will print on every test run. Consider adding an optional debug_mode input parameter to control their verbosity, so they can be enabled selectively during troubleshooting rather than always active.

Add to inputs section:
  debug_mode:
    description: 'Enable verbose debug output for diagnostics'
    required: false
    default: 'false'
Then conditionally print:
-        echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}"
-        echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}"
-        echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}"
+        if [[ "${{ inputs.debug_mode }}" == "true" ]]; then
+          echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}"
+          echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}"
+          echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}"
+        fi
138-145: Also gate exit code interpretation behind debug_mode flag.

For consistency with the DEBUG statements above, these exit code explanations should also be conditional. They're helpful for diagnostics but shouldn't spam logs on every run.
+        if [[ "${{ inputs.debug_mode }}" == "true" ]]; then
           if [[ ${TEST_EXIT_CODE} -eq 127 ]]; then
             echo "❌ DEBUG: Exit code 127 = Command not found (pytest or bash missing from container)"
           elif [[ ${TEST_EXIT_CODE} -eq 139 ]]; then
             echo "❌ DEBUG: Exit code 139 = Segmentation fault (SIGSEGV) - likely a crash in pytest or imported modules"
           elif [[ ${TEST_EXIT_CODE} -eq 137 ]]; then
             echo "❌ DEBUG: Exit code 137 = OOM killed (SIGKILL)"
           fi
+        fi
.github/workflows/nightly-ci.yml (3)
31-32: Document expected timeline for reverting temporary matrix restriction.

The matrix is intentionally narrowed to trtllm for exit code debugging. Consider adding a TODO or future milestone comment indicating when full framework matrix should be restored (e.g., after exit codes 127 and 139 are resolved).
-        # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64)
+        # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64)
+        # TODO: Restore vllm and sglang once exit codes 127/139 are resolved
238-239: Consistent matrix narrowing across all test jobs.

All test jobs (unit, integration, e2e-single-gpu, e2e-multi-gpu) consistently use the temporary framework reduction. This aligns with the build jobs and supports focused debugging of exit codes 127/139. Add TODO comments across all jobs to indicate when full framework matrix restoration is expected.

For maintainability, consider extracting the temporary restriction as a workflow env variable at the top of the file to make it easier to restore all matrices at once:
env:
  REGISTRY_IMAGE: ai-dynamo/dynamo
  NIGHTLY_IMAGE_PREFIX: nightly
  # TODO: Restore to [vllm, trtllm, sglang] once exit codes 127/139 are resolved
  ACTIVE_FRAMEWORKS: [trtllm]
Then use framework: ${{ fromJson(env.ACTIVE_FRAMEWORKS) }} in all matrix definitions. This makes the reversion a single-line change rather than updating multiple files/jobs.

Also applies to: 310-311, 376-377, 443-444

299-299: ARM64 dry-run mode is appropriate but document the intent.

All ARM64 test jobs use dry_run: true (collect-only mode) while AMD64 runs full tests. This is appropriate since ARM64 runners are CPU-only (cpu-arm-r8g-4xlarge) without GPU support. Consider adding an inline comment explaining this design choice for future maintainers.
-          dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }}
+          # Note: arm64 runners (cpu-arm-r8g-4xlarge) lack GPU; use dry-run to collect tests without execution
+          dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }}
Also applies to: 365-365, 432-432, 499-499

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf269db and b9515a4.

📒 Files selected for processing (2)

.github/actions/pytest/action.yml (3 hunks)
.github/workflows/nightly-ci.yml (6 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Learnt from: bhuvan002
Repo: ai-dynamo/dynamo PR: 2702
File: components/backends/trtllm/src/dynamo/trtllm/logits_processing/adapter.py:7-13
Timestamp: 2025-08-28T20:43:54.701Z
Learning: CI import failures for optional dependencies like tensorrt_llm can be resolved by removing __init__.py files and restructuring imports, rather than wrapping imports in try/except blocks.

📚 Learning: 2025-12-02T18:13:40.065Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

.github/workflows/nightly-ci.yml

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4864/merge) by nv-nmailhot.

.github/actions/pytest/action.yml

[error] 1-1: Trailing whitespace hook failed. Files were modified by the hook during pre-commit run.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: sglang (arm64)
GitHub Check: operator (arm64)
GitHub Check: vllm (arm64)
GitHub Check: trtllm (arm64)
GitHub Check: operator (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

.github/workflows/nightly-ci.yml (1)

126-141: Consistent use of temporary framework narrowing.

Build-arm64 matrix correctly comments out vllm and sglang while keeping trtllm active, consistent with the amd64 build job. No concerns with this change—add the same TODO comment suggested above for consistency.

coderabbitai · 2025-12-10T18:49:31Z

.github/actions/pytest/action.yml

+        # DEBUG: Run diagnostic commands inside the container first
+        echo "=========================================="
+        echo "🔍 DEBUG: Running container diagnostics"
+        echo "=========================================="
+        
+        echo "📋 Checking container environment..."
+        docker run --rm ${{ inputs.image_tag }} bash -c "
+          echo '--- Container Info ---'
+          uname -a
+          echo ''
+          echo '--- Python Location ---'
+          which python || echo 'python not found'
+          which python3 || echo 'python3 not found'
+          echo ''
+          echo '--- Python Version ---'
+          python --version 2>&1 || python3 --version 2>&1 || echo 'Python not available'
+          echo ''
+          echo '--- Pytest Location ---'
+          which pytest || echo 'pytest not found in PATH'
+          echo ''
+          echo '--- Pytest Version ---'
+          pytest --version 2>&1 || echo 'pytest command failed'
+          echo ''
+          echo '--- PATH ---'
+          echo \$PATH
+          echo ''
+          echo '--- /workspace contents ---'
+          ls -la /workspace 2>/dev/null || echo '/workspace does not exist or is empty'
+          echo ''
+          echo '--- pip list (pytest related) ---'
+          pip list 2>/dev/null | grep -i pytest || echo 'pip list failed or no pytest packages'
+        " || echo "DEBUG: Container diagnostic command failed with exit code: $?"
+        
+        echo "=========================================="
+        echo "🔍 DEBUG: Diagnostics complete"
+        echo "=========================================="


⚠️ Potential issue | 🔴 Critical

Fix trailing whitespace on lines 93 and 97.

The pipeline failure indicates trailing whitespace in this section. Lines with empty diagnostic output headers need cleanup.

echo "==========================================" echo "🔍 DEBUG: Running container diagnostics" echo "==========================================" - + echo "📋 Checking container environment..." ... " || echo "DEBUG: Container diagnostic command failed with exit code: $?" - + echo "=========================================="

🤖 Prompt for AI Agents

In .github/actions/pytest/action.yml around lines 61 to 96: there are trailing spaces at the ends of the empty diagnostic header lines (reported as lines 93 and 97) — remove the trailing whitespace at those line ends so the blank header lines contain no spaces/tabs, save the file, and re-run the pipeline (or run a trim-eol/whitespace tidy) to ensure no other end-of-line trailing whitespace remains.

add verbose debugging to understand nightly pipeline

b9515a4

nv-nmailhot requested a review from a team as a code owner December 10, 2025 18:45

pull-request-size bot added the size/M label Dec 10, 2025

coderabbitai bot reviewed Dec 10, 2025

View reviewed changes

trigger on pr

84a15e0

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 18:56 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 18:57 Inactive

add more verbose debugging

1fe6c95

pull-request-size bot added size/L and removed size/M labels Dec 10, 2025

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 22:32 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 22:36 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add verbose debugging to understand nightly pipeline #4864

add verbose debugging to understand nightly pipeline #4864

Uh oh!

nv-nmailhot commented Dec 10, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 10, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

add verbose debugging to understand nightly pipeline #4864

Are you sure you want to change the base?

add verbose debugging to understand nightly pipeline #4864

Uh oh!

Conversation

nv-nmailhot commented Dec 10, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 10, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nv-nmailhot commented Dec 10, 2025 •

edited by coderabbitai bot

Loading