
Conversation

@byteZorvin
Member

Pull Request type

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build-related changes
  • Documentation content changes
  • Testing
  • Other (please describe):

What is the current behavior?

When the SNOS RPC endpoint (Pathfinder) is down, all SnosRun jobs fail one by one, sending alerts for each failure. This causes:

  1. Multiple unnecessary job failures when external infrastructure is temporarily unavailable
  2. Alert fatigue from continuous failure notifications
  3. Jobs being marked as failed when they should simply wait for the dependency to recover

Resolves: #NA

What is the new behavior?

  • Health check before processing: Added check_ready_to_process() method to JobHandlerTrait that allows handlers to verify dependencies are available before job processing begins
  • SNOS health check: SnosJobHandler implements this method to check whether the SNOS RPC is reachable via a chain_id() RPC call
  • Graceful requeue: When SNOS is unavailable, jobs are requeued with a 60-second delay instead of failing
  • Single alert per downtime: Uses AtomicBool to send only one alert when SNOS goes down, preventing alert fatigue
  • Auto-recovery: Alert flag resets when SNOS recovers, so the next downtime will trigger a new alert

Flow:

process_job() called
    │
    ├─► check_ready_to_process()
    │       │
    │       ├─► SNOS healthy → Ok() → continue processing
    │       │
    │       └─► SNOS down → Err(60s delay)
    │               │
    │               ├─► Send alert (once per downtime)
    │               └─► Requeue job with delay
    │
    └─► Job not marked as failed, status remains "Created"
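
Putting the pieces together, the handler-side check looks roughly like the sketch below. This is a condensed reconstruction from the diff excerpts quoted later in this thread; the provider construction (starknet-rs JsonRpcClient over HttpTransport), the impl scaffolding, and the imports are assumptions, not verbatim code from the PR. Project types (Config, SnosJobHandler, JobHandlerTrait) and async_trait are assumed to be in scope.

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

use starknet::providers::jsonrpc::{HttpTransport, JsonRpcClient};
use starknet::providers::Provider;

// One alert per downtime window, shared by all SnosRun jobs in this process.
static SNOS_UNAVAILABLE_ALERT_SENT: AtomicBool = AtomicBool::new(false);

#[async_trait]
impl JobHandlerTrait for SnosJobHandler {
    async fn check_ready_to_process(&self, config: Arc<Config>) -> Result<(), Duration> {
        let snos_url = config.snos_config().rpc_for_snos.to_string();
        let provider = JsonRpcClient::new(HttpTransport::new(config.snos_config().rpc_for_snos.clone()));

        match provider.chain_id().await {
            Ok(_) => {
                // SNOS answered: re-arm the alert flag for the next downtime window.
                SNOS_UNAVAILABLE_ALERT_SENT.store(false, Ordering::SeqCst);
                Ok(())
            }
            Err(_) => {
                // swap() returns the previous value, so only the first failing
                // check in a downtime window sends the alert.
                if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
                    let alert_msg = format!(
                        "SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.",
                        snos_url
                    );
                    if let Err(e) = config.alerts().send_message(alert_msg).await {
                        error!(error = ?e, "Failed to send SNOS unavailability alert");
                    }
                }
                // Tell the caller to requeue with a delay instead of failing the job.
                Err(Duration::from_secs(60))
            }
        }
    }
}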

Does this introduce a breaking change?

No - This is an additive change. The check_ready_to_process() method has a default implementation returning Ok(()), so existing job handlers continue to work without modification.
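
For reference, the non-breaking default looks roughly like this (a sketch; only check_ready_to_process is new, and its signature follows the diff excerpt quoted further down):

#[async_trait]
pub trait JobHandlerTrait: Send + Sync {
    /// Default: assume all dependencies are healthy, so handlers that do not
    /// override this keep their existing behavior.
    async fn check_ready_to_process(&self, _config: Arc<Config>) -> Result<(), Duration> {
        Ok(())
    }

    // ...existing methods (process_job, ...) are unchanged.
}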

Other information

Files changed:

  • orchestrator/src/worker/event_handler/jobs/mod.rs - Added check_ready_to_process to trait
  • orchestrator/src/worker/event_handler/jobs/snos.rs - Implemented health check for SNOS
  • orchestrator/src/worker/event_handler/service.rs - Call check before processing
  • orchestrator/src/worker/service.rs - Made add_job_to_queue pub(crate)
  • orchestrator/src/tests/jobs/snos_job/mod.rs - Added health check tests
  • orchestrator/src/tests/jobs/mod.rs - Added requeue behavior test

…of failing

- Add check_ready_to_process method to JobHandlerTrait for dependency checks
- Implement SNOS health check using chain_id RPC call
- Requeue jobs with delay when SNOS RPC is down instead of failing them
- Send single alert per downtime period to avoid alert fatigue
- Add tests for health check and requeue behavior

}

async fn check_ready_to_process(&self, config: Arc<Config>) -> Result<(), Duration> {
    let snos_url = config.snos_config().rpc_for_snos.to_string();
Contributor

rpc_for_snos is already a Url, any reason to re-parse it?
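
A small sketch of what this suggests (an assumption that the string is only re-parsed to build the RPC client; url::Url is cloneable and Display-formats directly, and the JsonRpcClient/HttpTransport names are carried over from the sketch above, not from the PR):

// rpc_for_snos is already a url::Url, so it can be used as-is:
let snos_url = config.snos_config().rpc_for_snos.clone();
let provider = JsonRpcClient::new(HttpTransport::new(snos_url.clone()));
// ...and formatted into the alert message without an intermediate String:
let alert_msg = format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);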

Comment on lines +240 to +246
if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    if let Err(e) = config.alerts().send_message(alert_msg).await {
        error!(error = ?e, "Failed to send SNOS unavailability alert");
    }
}
Contributor

If config.alerts().send_message(alert_msg) fails, the flag is already set to true, so the alert will never be attempted again (even if SNOS recovers and goes down again later)!
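
One hedged way around that (a sketch, not the PR's code): latch the flag only after the alert actually goes out, accepting that two concurrent checks could race and both send:

if !SNOS_UNAVAILABLE_ALERT_SENT.load(Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    match config.alerts().send_message(alert_msg).await {
        // Remember "alert sent" only once the send succeeded, so a transient
        // alerting failure gets retried on the next unhealthy check.
        Ok(_) => SNOS_UNAVAILABLE_ALERT_SENT.store(true, Ordering::SeqCst),
        Err(e) => error!(error = ?e, "Failed to send SNOS unavailability alert"),
    }
}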

Contributor

Also, this only solves alert aggregation properly when a single orchestrator pod is running.
With multiple pods, each pod keeps its own flag, so the deduplication degrades in proportion to the number of pods.
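
A hedged sketch of one way to dedupe across pods (everything here is hypothetical: try_set_flag and the config.database() accessor are invented for illustration and do not necessarily exist in the orchestrator; the idea is just to move the "alert sent" latch out of process memory into storage shared by all pods):

// Hypothetical cross-pod variant: only the pod that wins an atomic
// "set if unset" on a shared key sends the alert.
async fn send_snos_down_alert_once(config: &Config, snos_url: &str) {
    // try_set_flag() is a hypothetical compare-and-set against shared storage
    // (e.g. the jobs database); it returns true only for the first caller.
    match config.database().try_set_flag("snos_unavailable_alert_sent").await {
        Ok(true) => {
            let msg = format!("SNOS RPC {snos_url} is unavailable. SnosRun jobs will be requeued until it recovers.");
            if let Err(e) = config.alerts().send_message(msg).await {
                error!(error = ?e, "Failed to send SNOS unavailability alert");
            }
        }
        Ok(false) => {} // another pod already alerted for this downtime
        Err(e) => error!(error = ?e, "Failed to check shared alert flag"),
    }
}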

Contributor

Should we just handle aggregation outside of the orchestrator?

@heemankv heemankv marked this pull request as ready for review December 10, 2025 17:14
@heemankv heemankv marked this pull request as draft December 10, 2025 17:14