-
Notifications
You must be signed in to change notification settings - Fork 76
Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring #882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…-alliance/madara into orchestrator/better-atlantic-logs
b4d3088 to
f2a47a0
Compare
heemankv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self review done, lgtm
Mohiiit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review comments pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good refactoring with clean layered architecture! A few suggestions:
| Err(last_error.expect("At least one attempt should have been made")) | ||
| // All retries exhausted | ||
| let total_duration_s = start_time.elapsed().as_secs_f64(); | ||
| let final_error = last_error.expect("At least one attempt should have been made"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a comment explaining why this .expect() is safe (e.g., "loop guarantees at least one attempt was made").
| return Ok(result); | ||
| } | ||
| Err(err) => { | ||
| let _attempt_duration = attempt_start.elapsed().as_millis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_attempt_duration is computed but never used. Consider removing or using it in per-attempt metrics/logging.
| .retry_request( | ||
| "submit_l2_query", | ||
| &context, | ||
| 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_size_bytes is 0 but proof_size is already computed. Consider passing proof_size as u64 for accurate request-size metrics.
|
|
||
| impl AtlanticError { | ||
| /// Check if this error is retryable | ||
| pub fn is_retryable(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a doc comment clarifying that only NetworkError is retryable, and ApiError (including 5xx from real API responses) is intentionally not retried.
| && response_text.contains("\"0\"")) | ||
| // 404 from infrastructure (not a real Atlantic API 404) | ||
| // Real Atlantic 404s would have proper JSON with "error" field | ||
| || (status == StatusCode::NOT_FOUND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 404 infrastructure heuristic checks for absence of keywords. A real API 404 without these substrings could be incorrectly retried. Consider also checking for valid JSON structure or a known error field.
| || response_text.contains("<!DOCTYPE") | ||
| // Weird JSON-encoded string format from stress testing | ||
| // e.g., {"0":"4","1":"0","2":"4",...} encoding "404 page not found" | ||
| || (response_text.starts_with("{\"0\":") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON-encoded string detection is a very specific edge case. Consider adding a comment explaining when/where this was observed, or removing it if no longer relevant.
| info!( | ||
| service_name = %config.service_name, | ||
| "Initializing OpenTelemetry exporters..." | ||
| ); | ||
|
|
||
| let meter_provider = Self::instrument_metric_provider(config, endpoint)?; | ||
| info!( | ||
| service_name = format!("{}_meter_service", config.service_name), | ||
| export_interval_secs = 5, | ||
| "OTEL metrics exporter initialized" | ||
| ); | ||
|
|
||
| let tracer = Self::instrument_tracer_provider(config, endpoint)?; | ||
| info!( | ||
| service_name = format!("{}_trace_service", config.service_name), | ||
| "OTEL tracer exporter initialized" | ||
| ); | ||
|
|
||
| let logger = Self::instrument_logger_provider(config, endpoint)?; | ||
| info!(service_name = format!("{}_logs_service", config.service_name), "OTEL logs exporter initialized"); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these could be debug logs? do we really need all of these info logs?
| } else if target.starts_with("orchestrator") { | ||
| "-" | ||
| } else { | ||
| "EXTERNAL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit confused about external, atlantic is external only?
| // Atlantic Service Metrics | ||
| pub atlantic_api_call_duration: Gauge<f64>, | ||
| pub atlantic_api_calls_total: Counter<f64>, | ||
| pub atlantic_api_errors_total: Counter<f64>, | ||
| pub atlantic_api_retries_total: Counter<f64>, | ||
| pub atlantic_data_transfer_bytes: Counter<f64>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we aim for very specific metrics tho? like how many proof generation jobs, proof registration on L1 jobs and their respective numbers? transfer_bytes I am not sure how useful that would be
rest of the metrics as well, I think it would be make more sense with respective calls?
| if !success { | ||
| let error_attrs = [ | ||
| KeyValue::new("operation", operation.to_string()), | ||
| KeyValue::new("error_type", error_type.unwrap_or("unknown").to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the operation was not a success, we should ensure that error_type is something atleast?
| } | ||
|
|
||
| /// Get response bytes if available (for metrics) | ||
| pub fn response_bytes(&self) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for what exactly?
Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring
Pull Request type
Please add the labels corresponding to the type of changes your PR introduces:
What is the current behavior?
Resolves: #NA
Problems we had:
Unreliable Atlantic API calls - Network errors like
hyper::Error(IncompleteMessage)caused immediate job failures without retry attempts, leading to unnecessary job failures on transient network issues.Code duplication - API key header injection was duplicated 4 times across different methods, making maintenance difficult and error-prone.
Intertwined concerns - HTTP transport logic, API-specific logic, and business logic were all mixed together in single methods, making the code hard to test and maintain.
Insufficient observability - Lack of comprehensive logging and metrics made it difficult to:
Poor separation of concerns - No clear architectural layering made it hard to understand code flow and add new features.
Security concerns - API keys transmitted via query parameters instead of headers (less secure).
What is the new behavior?
Solutions implemented:
1. Clean Four-Layer Architecture (
orchestrator/crates/prover-clients/atlantic-service/)Introduced a clean separation of concerns with four distinct layers:
Key improvements:
ApiKeyAuthcomponentHttpResponseClassifierused by all error handling2. Security Enhancement: Header-Based Authentication
.query_param("apiKey", &api_key)(exposed in logs, browser history).header("api-key", auth.header_value())(industry best practice)create_bucket,close_bucket,add_job,submit_l2_queryApiKeyAuth::new()3. Universal Retry Mechanism
HttpResponseClassifier:add_job,get_job_status,get_bucket,create_bucket,close_bucket,get_artifacts,get_proof_by_task_id,submit_l2_queryDoes this introduce a breaking change?
No
Migration required: None - all changes are internal refactoring