fix(orchestrator): requeue SNOS jobs when RPC is unavailable #897
Conversation
…of failing

- Add `check_ready_to_process` method to `JobHandlerTrait` for dependency checks
- Implement SNOS health check using `chain_id` RPC call (see the sketch below)
- Requeue jobs with a delay when SNOS RPC is down instead of failing them
- Send a single alert per downtime period to avoid alert fatigue
- Add tests for health check and requeue behavior
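For illustration, here is a minimal sketch of what a `chain_id`-based availability check could look like, assuming a raw JSON-RPC request sent with `reqwest` (with its `json` feature); the PR's actual implementation may use an existing Starknet RPC client instead, and the helper name `is_snos_rpc_healthy` is hypothetical:

```rust
use reqwest::Url;
use serde_json::json;

/// Hypothetical helper: returns true if the SNOS RPC answers a `starknet_chainId` call.
/// A sketch only; the PR may go through a Starknet client rather than raw JSON-RPC.
async fn is_snos_rpc_healthy(snos_url: &Url) -> bool {
    let body = json!({
        "jsonrpc": "2.0",
        "method": "starknet_chainId",
        "params": [],
        "id": 1
    });
    match reqwest::Client::new().post(snos_url.clone()).json(&body).send().await {
        // Any successful HTTP response means the endpoint is reachable and answering.
        Ok(resp) => resp.status().is_success(),
        Err(_) => false,
    }
}
```

The hunks quoted in the review thread below show the PR's actual implementation of this check.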
```rust
async fn check_ready_to_process(&self, config: Arc<Config>) -> Result<(), Duration> {
    let snos_url = config.snos_config().rpc_for_snos.to_string();
    // …
```
`rpc_for_snos` is already a `Url`, any reason to re-parse it?
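A sketch of that suggestion: keep the already-parsed `Url` instead of converting it to a `String`. `Url` implements `Display`, so it can be formatted directly; the helper `build_unavailable_alert` below is hypothetical and only illustrates the point.

```rust
use url::Url;

/// Hypothetical helper: builds the alert text straight from the parsed `Url`,
/// with no intermediate `.to_string()` / re-parse round trip.
fn build_unavailable_alert(snos_url: &Url) -> String {
    format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url)
}
```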
```rust
if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    if let Err(e) = config.alerts().send_message(alert_msg).await {
        error!(error = ?e, "Failed to send SNOS unavailability alert");
    }
}
```
If `config.alerts().send_message(alert_msg)` fails, the flag is already set to `true`, so the alert will never be attempted again (even if SNOS recovers and goes down again later)!
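One possible fix, shown as a sketch that reuses the identifiers from the quoted block above: reset the flag when the send fails, so the alert is retried on the next failed health check.

```rust
// Same guard as in the PR's snippet, with the flag reset on send failure
// (identifiers as in the quoted block; this is a sketch, not the PR's code).
if !SNOS_UNAVAILABLE_ALERT_SENT.swap(true, Ordering::SeqCst) {
    let alert_msg =
        format!("SNOS RPC {} is unavailable. SnosRun jobs will be requeued until it recovers.", snos_url);
    if let Err(e) = config.alerts().send_message(alert_msg).await {
        error!(error = ?e, "Failed to send SNOS unavailability alert");
        // Undo the swap: the alert was not actually delivered, so allow a retry.
        SNOS_UNAVAILABLE_ALERT_SENT.store(false, Ordering::SeqCst);
    }
}
```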
Also, this only solves alert aggregation properly when a single orchestrator pod is running. With multiple pods, the aggregation effectiveness drops inversely with the number of pods, since each pod keeps its own `AtomicBool` flag and can send its own alert.
Should we just handle aggregation outside of the orchestrator?
Pull Request type

Bugfix
What is the current behavior?
When the SNOS RPC endpoint (Pathfinder) is down, all SnosRun jobs fail one by one, sending alerts for each failure. This causes:

- alert fatigue, with one alert per failed job
- jobs being marked as failed for what is a transient infrastructure outage
Resolves: #NA
What is the new behavior?
- Added a `check_ready_to_process()` method to `JobHandlerTrait` that allows handlers to verify dependencies are available before job processing begins
- `SnosJobHandler` implements this method to check if the SNOS RPC is reachable using a `chain_id()` RPC call
- An `AtomicBool` ensures only one alert is sent when SNOS goes down, preventing alert fatigue

Flow: before processing a SnosRun job, the worker calls `check_ready_to_process()`; if the SNOS RPC is unreachable, the job is requeued with a delay (and a single alert is sent), otherwise processing continues as before. A sketch of this flow follows below.
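The sketch below shows roughly how the processing side can honor the check. All names here (the `Config` stub, `run_job`, the `requeue` closure) are stand-ins for illustration, not the orchestrator's actual `service.rs` code; it only demonstrates how an `Err(delay)` from `check_ready_to_process` turns into a delayed requeue instead of a failure. It assumes the `async-trait` crate.

```rust
use std::sync::Arc;
use std::time::Duration;

// Minimal stand-in for the orchestrator's config type, for illustration only.
struct Config;

#[async_trait::async_trait]
trait JobHandlerTrait: Send + Sync {
    // Default implementation keeps existing handlers working unchanged.
    async fn check_ready_to_process(&self, _config: Arc<Config>) -> Result<(), Duration> {
        Ok(())
    }
    async fn process(&self, config: Arc<Config>) -> Result<(), String>;
}

/// Hypothetical processing-side flow: requeue with a delay when the handler
/// reports its dependency is unavailable, otherwise process as before.
async fn run_job(
    handler: Arc<dyn JobHandlerTrait>,
    config: Arc<Config>,
    requeue: impl Fn(Duration),
) -> Result<(), String> {
    match handler.check_ready_to_process(config.clone()).await {
        Err(delay) => {
            // Dependency unavailable: push the job back on the queue with a delay
            // instead of marking it failed.
            requeue(delay);
            Ok(())
        }
        Ok(()) => handler.process(config).await,
    }
}
```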
Does this introduce a breaking change?
No. This is an additive change: the `check_ready_to_process()` method has a default implementation returning `Ok(())`, so existing job handlers continue to work without modification. A sketch of this is shown below.
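Concretely, an existing handler compiles unchanged because the trait supplies the default. A sketch using the stand-in `JobHandlerTrait` and `Config` from the previous snippet; the handler name here is hypothetical:

```rust
// Hypothetical pre-existing handler: it does not override `check_ready_to_process`,
// so the default Ok(()) applies and its behavior is unchanged by the PR.
struct ExistingJobHandler;

#[async_trait::async_trait]
impl JobHandlerTrait for ExistingJobHandler {
    async fn process(&self, _config: Arc<Config>) -> Result<(), String> {
        Ok(())
    }
}
```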
Other information

Files changed:

- `orchestrator/src/worker/event_handler/jobs/mod.rs` - Added `check_ready_to_process` to the trait
- `orchestrator/src/worker/event_handler/jobs/snos.rs` - Implemented the health check for SNOS
- `orchestrator/src/worker/event_handler/service.rs` - Call the check before processing
- `orchestrator/src/worker/service.rs` - Made `add_job_to_queue` `pub(crate)`
- `orchestrator/src/tests/jobs/snos_job/mod.rs` - Added health check tests
- `orchestrator/src/tests/jobs/mod.rs` - Added a requeue behavior test