feat: Move k8 fault tolerance to nightly #4819
Signed-off-by: [email protected] <[email protected]>
Walkthrough
A GitHub Actions workflow file was modified to add a daily schedule trigger (8 AM UTC) and update job execution conditions to run the `deploy-test-fault-tolerance` job on schedule or when explicitly requested via the `run_deploy_operator` flag.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/container-validation-backends.yml (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
.github/workflows/container-validation-backends.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: lychee
- GitHub Check: pre-commit
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)
19-20: Schedule trigger correctly configured for nightly runs. The cron expression `0 8 * * *` is valid and will trigger daily at 08:00 UTC as intended.
nv-anants left a comment
can you move the workflow to nightly-ci yaml file please. It would show up separately otherwise
Overview:
Move K8s fault tolerance tests from PR CI to nightly schedule to reduce resource contention and PR merge delays.
Details:
The `deploy-test-fault-tolerance` job runs 3 sequential K8s deployment tests (vllm, trtllm, sglang) that simulate pod failures in disaggregated prefill/decode scenarios. Running these tests on every PR causes significant delays.
See related discussion: https://nvidia.slack.com/archives/C08UQCG3RNV/p1764963329433749?thread_ts=1764946443.307469&cid=C08UQCG3RNV
Changes Made:
- Added a `schedule` trigger to run fault tolerance tests nightly at 08:00 UTC
- Updated the `deploy-test-fault-tolerance` job to only run on:
  - Schedule (`github.event_name == 'schedule'`)
  - Explicit request (`run_deploy_operator == true`)
Where should the reviewer start?
Review `.github/workflows/container-validation-backends.yml`:
- the `deploy-test-fault-tolerance` condition to only run on schedule or manual trigger
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
DIS-1156
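For orientation, the two changes described above could look roughly like the fragment below. This is a minimal sketch, not the actual diff: only the cron expression, the job name, the `github.event_name == 'schedule'` check, and the `run_deploy_operator` flag come from this PR. How `run_deploy_operator` is defined is not shown here, so the sketch assumes it is a boolean `workflow_dispatch` input; the runner and the step are placeholders.

```yaml
# Sketch only. Names taken from the PR: the cron schedule, the job name,
# and the two run conditions. Everything else is an assumption.
on:
  schedule:
    - cron: '0 8 * * *'          # daily at 08:00 UTC
  workflow_dispatch:
    inputs:
      run_deploy_operator:       # assumed to be a workflow_dispatch input
        type: boolean
        default: false

jobs:
  deploy-test-fault-tolerance:
    # Run on the nightly schedule, or when explicitly requested.
    if: github.event_name == 'schedule' || inputs.run_deploy_operator == true
    runs-on: ubuntu-latest       # placeholder runner
    steps:
      - run: echo "K8s fault tolerance tests would run here"   # placeholder step
```

With a condition of this shape, ordinary `pull_request` and `push` events skip the job, which is what takes the fault tolerance tests out of the PR critical path.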