Skip to content

Conversation

@tzulingk
Copy link
Contributor

@tzulingk tzulingk commented Dec 9, 2025

Overview:

Move K8s fault tolerance tests from PR CI to nightly schedule to reduce resource contention and PR merge delays.

Details:

The deploy-test-fault-tolerance job runs 3 sequential K8s deployment tests (vllm, trtllm, sglang) that simulate pod failures in disaggregated prefill/decode scenarios. Running these tests on every PR causes significant delays due to:

  1. Resource Contention: The K8s cluster has limited capacity, and these tests require significant resources (L40 GPUs, persistent storage, Istio ingress)
  2. Pod Scheduling Delays: Pods are very hard to get ready during peak CI hours, often taking hours just to schedule
  3. Sequential Execution: Tests run sequentially (max-parallel: 1) to avoid Helm ClusterRole conflicts, multiplying the wait time
  4. Total Impact: Combined with pod scheduling + test execution, PRs experience ~3 hour delays waiting for these tests

See related discussion: https://nvidia.slack.com/archives/C08UQCG3RNV/p1764963329433749?thread_ts=1764946443.307469&cid=C08UQCG3RNV

Changes Made:

  • Added schedule trigger to run fault tolerance tests nightly at 08:00 UTC
  • Modified deploy-test-fault-tolerance job to only run on:
    • Scheduled nightly runs (github.event_name == 'schedule')
    • Manual workflow dispatch (run_deploy_operator == true)
  • Regular PR CI still runs fast checks: builds, unit tests, integration tests, basic e2e tests

Where should the reviewer start?

Review .github/workflows/container-validation-backends.yml:

  • Line 19-20: Added nightly schedule trigger
  • Line 335-338: Modified deploy-test-fault-tolerance condition to only run on schedule or manual trigger

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-1156

Summary by CodeRabbit

  • Chores
    • Enhanced automated testing infrastructure with scheduled daily fault tolerance test execution.

✏️ Tip: You can customize this high-level summary in your review settings.

@tzulingk tzulingk requested a review from a team as a code owner December 9, 2025 04:22
@tzulingk tzulingk changed the title Move k8 fault tolerance to nightly feat: Move k8 fault tolerance to nightly Dec 9, 2025
@github-actions github-actions bot added the feat label Dec 9, 2025
@tzulingk tzulingk requested review from nnshah1 and rmccorm4 December 9, 2025 04:25
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

A GitHub Actions workflow file was modified to add a daily schedule trigger (8 AM UTC) and update job execution conditions to run the deploy-test-fault-tolerance job on schedule or when explicitly requested via run_deploy_operator flag.

Changes

Cohort / File(s) Summary
Workflow Scheduling & Conditions
.github/workflows/container-validation-backends.yml
Added daily cron schedule trigger (0 8 * * *). Updated deploy-test-fault-tolerance job condition to execute on schedule or explicit run_deploy_operator request.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify cron schedule syntax is correct for intended daily execution time
  • Confirm conditional logic properly combines schedule and manual trigger flags
  • Review any potential overlap with existing has_code_changes condition behavior

Poem

🐰 A schedule awakens each morning at eight,
Daily fault tolerance tests won't be late!
With flags and with conditions, the workflow runs free,
Automation and testing—how grand it will be! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The description fully addresses all required template sections with clear context, rationale, specific changes, and reviewer guidance.
Title check ✅ Passed The title 'feat: Move k8 fault tolerance to nightly' clearly and specifically describes the main change: moving Kubernetes fault tolerance tests from regular PR CI to a nightly schedule.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ca1679 and 5f60d44.

📒 Files selected for processing (1)
  • .github/workflows/container-validation-backends.yml (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • .github/workflows/container-validation-backends.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: lychee
  • GitHub Check: pre-commit
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)

19-20: Schedule trigger correctly configured for nightly runs.

The cron expression 0 8 * * * is valid and will trigger daily at 08:00 UTC as intended.

Copy link
Contributor

@nv-anants nv-anants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the workflow to nightly-ci yaml file please. It would show up separately otherwise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants