
Conversation

@byteZorvin
Member

Pull Request type

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build-related changes
  • Documentation content changes
  • Testing
  • Other (please describe):

What is the current behavior?

When the SNOS RPC endpoint (Pathfinder) is down, all SnosRun jobs fail one by one, sending alerts for each failure. This causes:

  1. Multiple unnecessary job failures when external infrastructure is temporarily unavailable
  2. Alert fatigue from continuous failure notifications
  3. Jobs being marked as failed when they should simply wait for the dependency to recover

Resolves: #NA

What is the new behavior?

  • Health check before processing: Added check_ready_to_process() method to JobHandlerTrait that allows handlers to verify dependencies are available before job processing begins
  • SNOS health check: SnosJobHandler implements this method to check whether the SNOS RPC is reachable via a chain_id() RPC call
  • Graceful requeue: When SNOS is unavailable, jobs are requeued with a 60-second delay instead of failing
  • Single alert per downtime: Uses AtomicBool to send only one alert when SNOS goes down, preventing alert fatigue
  • Auto-recovery: Alert flag resets when SNOS recovers, so the next downtime will trigger a new alert

Flow:

process_job() called
    │
    ├─► check_ready_to_process()
    │       │
    │       ├─► SNOS healthy → Ok() → continue processing
    │       │
    │       └─► SNOS down → Err(60s delay)
    │               │
    │               ├─► Send alert (once per downtime)
    │               └─► Requeue job with delay
    │
    └─► Job not marked as failed, status remains "Created"
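
Putting the pieces together, the handler-side check looks roughly like the sketch below. This is a condensed reconstruction from the diff excerpts quoted later in this thread; the provider construction (starknet-rs JsonRpcClient over HttpTransport), the impl scaffolding, and the imports are assumptions, not verbatim code from the PR. Project types (Config, SnosJobHandler, JobHandlerTrait) and async_trait are assumed to be in scope.

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

use starknet::providers::jsonrpc::{HttpTransport, JsonRpcClient};
use starknet::providers::Provider;

// One alert per downtime window, shared by all SnosRun jobs in this process.
static SNOS_UNAVAILABLE_ALERT_SENT: AtomicBool = AtomicBool::new(false);

#[async_trait]
impl JobHandlerTrait for SnosJobHandler {
    async fn check_ready_to_process(&self, config: Arc<Config>) -> Result<(), Duration> {
        let snos_url = config.snos_config().rpc_for_snos.to_string();
        let provider = JsonRpcClient::new(HttpTransport::new(config.snos_config().rpc_for_snos.clone()));

        match provider.chain_id().await {
            Ok(_) => {
                // SNOS answered: re-arm the alert flag for the next downtime window.
                SNOS_UNAVAILABLE_ALERT_SENT.store(false, Ordering::SeqCst);
                Ok(())
            }
            Err(_) => {
                // swap() returns the previous value, so only the first failing
                // check in a downtime window sends the alert.
                if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
                    let alert_msg = format!(
                        "SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.",
                        snos_url
                    );
                    if let Err(e) = config.alerts().send_message(alert_msg).await {
                        error!(error = ?e, "Failed to send SNOS unavailability alert");
                    }
                }
                // Tell the caller to requeue with a delay instead of failing the job.
                Err(Duration::from_secs(60))
            }
        }
    }
}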

Does this introduce a breaking change?

No - This is an additive change. The check_ready_to_process() method has a default implementation returning Ok(()), so existing job handlers continue to work without modification.
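
For reference, the non-breaking default looks roughly like this (a sketch; only check_ready_to_process is new, and its signature follows the diff excerpt quoted further down):

#[async_trait]
pub trait JobHandlerTrait: Send + Sync {
    /// Default: assume all dependencies are healthy, so handlers that do not
    /// override this keep their existing behavior.
    async fn check_ready_to_process(&self, _config: Arc<Config>) -> Result<(), Duration> {
        Ok(())
    }

    // ...existing methods (process_job, ...) are unchanged.
}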

Other information

Files changed:

  • orchestrator/src/worker/event_handler/jobs/mod.rs - Added check_ready_to_process to trait
  • orchestrator/src/worker/event_handler/jobs/snos.rs - Implemented health check for SNOS
  • orchestrator/src/worker/event_handler/service.rs - Call check before processing
  • orchestrator/src/worker/service.rs - Made add_job_to_queue pub(crate)
  • orchestrator/src/tests/jobs/snos_job/mod.rs - Added health check tests
  • orchestrator/src/tests/jobs/mod.rs - Added requeue behavior test

…of failing

- Add check_ready_to_process method to JobHandlerTrait for dependency checks
- Implement SNOS health check using chain_id RPC call
- Requeue jobs with delay when SNOS RPC is down instead of failing them
- Send single alert per downtime period to avoid alert fatigue
- Add tests for health check and requeue behavior

}

async fn check_ready_to_process(&self, config: Arc<Config>) -> Result<(), Duration> {
    let snos_url = config.snos_config().rpc_for_snos.to_string();
Contributor

rpc_for_snos is already a Url, any reason to re-parse it?
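
A small sketch of what this suggests (an assumption that the string is only re-parsed to build the RPC client; url::Url is cloneable and Display-formats directly, and the JsonRpcClient/HttpTransport names are carried over from the sketch above, not from the PR):

// rpc_for_snos is already a url::Url, so it can be used as-is:
let snos_url = config.snos_config().rpc_for_snos.clone();
let provider = JsonRpcClient::new(HttpTransport::new(snos_url.clone()));
// ...and formatted into the alert message without an intermediate String:
let alert_msg = format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);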

Comment on lines +240 to +246
if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    if let Err(e) = config.alerts().send_message(alert_msg).await {
        error!(error = ?e, "Failed to send SNOS unavailability alert");
    }
}
Contributor

If config.alerts().send_message(alert_msg) fails, the flag is already set to true, so the alert will never be attempted again (even if SNOS recovers and goes down again later)!
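
One hedged way around that (a sketch, not the PR's code): latch the flag only after the alert actually goes out, accepting that two concurrent checks could race and both send:

if !SNOS_UNAVAILABLE_ALERT_SENT.load(Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    match config.alerts().send_message(alert_msg).await {
        // Remember "alert sent" only once the send succeeded, so a transient
        // alerting failure gets retried on the next unhealthy check.
        Ok(_) => SNOS_UNAVAILABLE_ALERT_SENT.store(true, Ordering::SeqCst),
        Err(e) => error!(error = ?e, "Failed to send SNOS unavailability alert"),
    }
}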

Contributor

Also, this only solves alert aggregation properly when a single orchestrator pod is running.
With multiple pods, each pod keeps its own flag, so the deduplication degrades in proportion to the number of pods.
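
A hedged sketch of one way to dedupe across pods (everything here is hypothetical: try_set_flag and the config.database() accessor are invented for illustration and do not necessarily exist in the orchestrator; the idea is just to move the "alert sent" latch out of process memory into storage shared by all pods):

// Hypothetical cross-pod variant: only the pod that wins an atomic
// "set if unset" on a shared key sends the alert.
async fn send_snos_down_alert_once(config: &Config, snos_url: &str) {
    // try_set_flag() is a hypothetical compare-and-set against shared storage
    // (e.g. the jobs database); it returns true only for the first caller.
    match config.database().try_set_flag("snos_unavailable_alert_sent").await {
        Ok(true) => {
            let msg = format!("SNOS RPC {snos_url} is unavailable. SnosRun jobs will be requeued until it recovers.");
            if let Err(e) = config.alerts().send_message(msg).await {
                error!(error = ?e, "Failed to send SNOS unavailability alert");
            }
        }
        Ok(false) => {} // another pod already alerted for this downtime
        Err(e) => error!(error = ?e, "Failed to check shared alert flag"),
    }
}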

Contributor

Should we just handle aggregation outside of the orchestrator?

@heemankv heemankv marked this pull request as ready for review December 10, 2025 17:14
@heemankv heemankv marked this pull request as draft December 10, 2025 17:14