-
Notifications
You must be signed in to change notification settings - Fork 739
add verbose debugging to understand nightly pipeline #4864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThese changes add extensive diagnostic debugging blocks to pytest execution for container environment insights and refocus the nightly CI workflow to test exclusively with trtllm framework, commenting out vllm and sglang matrix entries across build and test stages. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
.github/actions/pytest/action.yml (2)
122-124: Consider gating extensive debug output behind an input flag.These DEBUG statements provide useful diagnostics but will print on every test run. Consider adding an optional
debug_modeinput parameter to control their verbosity, so they can be enabled selectively during troubleshooting rather than always active.Add to inputs section:
debug_mode: description: 'Enable verbose debug output for diagnostics' required: false default: 'false'Then conditionally print:
- echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}" - echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}" - echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}" + if [[ "${{ inputs.debug_mode }}" == "true" ]]; then + echo "🔧 DEBUG: About to run pytest with command: ${PYTEST_CMD}" + echo "🔧 DEBUG: GPU_FLAGS: ${GPU_FLAGS}" + echo "🔧 DEBUG: Image tag: ${{ inputs.image_tag }}" + fi
138-145: Also gate exit code interpretation behind debug_mode flag.For consistency with the DEBUG statements above, these exit code explanations should also be conditional. They're helpful for diagnostics but shouldn't spam logs on every run.
+ if [[ "${{ inputs.debug_mode }}" == "true" ]]; then if [[ ${TEST_EXIT_CODE} -eq 127 ]]; then echo "❌ DEBUG: Exit code 127 = Command not found (pytest or bash missing from container)" elif [[ ${TEST_EXIT_CODE} -eq 139 ]]; then echo "❌ DEBUG: Exit code 139 = Segmentation fault (SIGSEGV) - likely a crash in pytest or imported modules" elif [[ ${TEST_EXIT_CODE} -eq 137 ]]; then echo "❌ DEBUG: Exit code 137 = OOM killed (SIGKILL)" fi + fi.github/workflows/nightly-ci.yml (3)
31-32: Document expected timeline for reverting temporary matrix restriction.The matrix is intentionally narrowed to trtllm for exit code debugging. Consider adding a TODO or future milestone comment indicating when full framework matrix should be restored (e.g., after exit codes 127 and 139 are resolved).
- # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64) + # TEMPORARILY REDUCED - focusing on trtllm exit code debugging (127 arm64, 139 amd64) + # TODO: Restore vllm and sglang once exit codes 127/139 are resolved
238-239: Consistent matrix narrowing across all test jobs.All test jobs (unit, integration, e2e-single-gpu, e2e-multi-gpu) consistently use the temporary framework reduction. This aligns with the build jobs and supports focused debugging of exit codes 127/139. Add TODO comments across all jobs to indicate when full framework matrix restoration is expected.
For maintainability, consider extracting the temporary restriction as a workflow env variable at the top of the file to make it easier to restore all matrices at once:
env: REGISTRY_IMAGE: ai-dynamo/dynamo NIGHTLY_IMAGE_PREFIX: nightly # TODO: Restore to [vllm, trtllm, sglang] once exit codes 127/139 are resolved ACTIVE_FRAMEWORKS: [trtllm]Then use
framework: ${{ fromJson(env.ACTIVE_FRAMEWORKS) }}in all matrix definitions. This makes the reversion a single-line change rather than updating multiple files/jobs.Also applies to: 310-311, 376-377, 443-444
299-299: ARM64 dry-run mode is appropriate but document the intent.All ARM64 test jobs use
dry_run: true(collect-only mode) while AMD64 runs full tests. This is appropriate since ARM64 runners are CPU-only (cpu-arm-r8g-4xlarge) without GPU support. Consider adding an inline comment explaining this design choice for future maintainers.- dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }} + # Note: arm64 runners (cpu-arm-r8g-4xlarge) lack GPU; use dry-run to collect tests without execution + dry_run: ${{ matrix.arch.arch == 'arm64' && 'true' || 'false' }}Also applies to: 365-365, 432-432, 499-499
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/actions/pytest/action.yml(3 hunks).github/workflows/nightly-ci.yml(6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Learnt from: bhuvan002
Repo: ai-dynamo/dynamo PR: 2702
File: components/backends/trtllm/src/dynamo/trtllm/logits_processing/adapter.py:7-13
Timestamp: 2025-08-28T20:43:54.701Z
Learning: CI import failures for optional dependencies like tensorrt_llm can be resolved by removing __init__.py files and restructuring imports, rather than wrapping imports in try/except blocks.
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
.github/workflows/nightly-ci.yml
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4864/merge) by nv-nmailhot.
.github/actions/pytest/action.yml
[error] 1-1: Trailing whitespace hook failed. Files were modified by the hook during pre-commit run.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/nightly-ci.yml (1)
126-141: Consistent use of temporary framework narrowing.Build-arm64 matrix correctly comments out vllm and sglang while keeping trtllm active, consistent with the amd64 build job. No concerns with this change—add the same TODO comment suggested above for consistency.
| # DEBUG: Run diagnostic commands inside the container first | ||
| echo "==========================================" | ||
| echo "🔍 DEBUG: Running container diagnostics" | ||
| echo "==========================================" | ||
| echo "📋 Checking container environment..." | ||
| docker run --rm ${{ inputs.image_tag }} bash -c " | ||
| echo '--- Container Info ---' | ||
| uname -a | ||
| echo '' | ||
| echo '--- Python Location ---' | ||
| which python || echo 'python not found' | ||
| which python3 || echo 'python3 not found' | ||
| echo '' | ||
| echo '--- Python Version ---' | ||
| python --version 2>&1 || python3 --version 2>&1 || echo 'Python not available' | ||
| echo '' | ||
| echo '--- Pytest Location ---' | ||
| which pytest || echo 'pytest not found in PATH' | ||
| echo '' | ||
| echo '--- Pytest Version ---' | ||
| pytest --version 2>&1 || echo 'pytest command failed' | ||
| echo '' | ||
| echo '--- PATH ---' | ||
| echo \$PATH | ||
| echo '' | ||
| echo '--- /workspace contents ---' | ||
| ls -la /workspace 2>/dev/null || echo '/workspace does not exist or is empty' | ||
| echo '' | ||
| echo '--- pip list (pytest related) ---' | ||
| pip list 2>/dev/null | grep -i pytest || echo 'pip list failed or no pytest packages' | ||
| " || echo "DEBUG: Container diagnostic command failed with exit code: $?" | ||
| echo "==========================================" | ||
| echo "🔍 DEBUG: Diagnostics complete" | ||
| echo "==========================================" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix trailing whitespace on lines 93 and 97.
The pipeline failure indicates trailing whitespace in this section. Lines with empty diagnostic output headers need cleanup.
echo "=========================================="
echo "🔍 DEBUG: Running container diagnostics"
echo "=========================================="
-
+
echo "📋 Checking container environment..."
...
" || echo "DEBUG: Container diagnostic command failed with exit code: $?"
-
+
echo "=========================================="🤖 Prompt for AI Agents
In .github/actions/pytest/action.yml around lines 61 to 96: there are trailing
spaces at the ends of the empty diagnostic header lines (reported as lines 93
and 97) — remove the trailing whitespace at those line ends so the blank header
lines contain no spaces/tabs, save the file, and re-run the pipeline (or run a
trim-eol/whitespace tidy) to ensure no other end-of-line trailing whitespace
remains.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.