Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring #882

heemankv · 2025-12-03T15:53:20Z

Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring

Pull Request type

Please add the labels corresponding to the type of changes your PR introduces:

What is the current behavior?

Resolves: #NA

Problems we had:

Unreliable Atlantic API calls - Network errors like hyper::Error(IncompleteMessage) caused immediate job failures without retry attempts, leading to unnecessary job failures on transient network issues.
Code duplication - API key header injection was duplicated 4 times across different methods, making maintenance difficult and error-prone.
Intertwined concerns - HTTP transport logic, API-specific logic, and business logic were all mixed together in single methods, making the code hard to test and maintain.
Insufficient observability - Lack of comprehensive logging and metrics made it difficult to:
- Debug Atlantic API failures
- Monitor API performance and latency
- Track retry patterns and success rates
- Measure data transfer volumes
- Identify bottlenecks in the proving pipeline
Poor separation of concerns - No clear architectural layering made it hard to understand code flow and add new features.
Security concerns - API keys transmitted via query parameters instead of headers (less secure).

What is the new behavior?

Solutions implemented:

1. Clean Four-Layer Architecture (`orchestrator/crates/prover-clients/atlantic-service/`)

Introduced a clean separation of concerns with four distinct layers:

┌─────────────────────────────────────────────────────────────────┐
│                    CLIENT LAYER (client.rs)                      │
│  - Public API: create_bucket, add_job, get_bucket, etc.         │
│  - Retry orchestration via retry_request()                      │
│  - OTEL metrics recording                                        │
│  - INFO-level logging for significant events                     │
│  - 729 lines (reduced from ~1000 lines)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │ delegates to
┌────────────────────────────▼────────────────────────────────────┐
│                     API LAYER (api.rs) - NEW                     │
│  - AtlanticApiOperations with build_* and parse_* methods       │
│  - Request building (forms, JSON bodies, paths)                  │
│  - Response parsing (JSON deserialization)                       │
│  - DEBUG-level logging for request/response details              │
│  - 546 lines of pure business logic                             │
└────────────────────────────┬────────────────────────────────────┘
                             │ uses
┌────────────────────────────▼────────────────────────────────────┐
│              TRANSPORT LAYER (transport.rs) - NEW                │
│  - ApiKeyAuth: API key validation and header injection          │
│  - HttpResponseClassifier: Infrastructure vs API error detection │
│  - 285 lines with 15 unit tests                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│              PROVING LAYER (proving.rs) - NEW                    │
│  - ProvingLayer trait for network-specific configuration        │
│  - EthereumLayer: No additional params                          │
│  - StarknetLayer: Adds layout/result params                     │
│  - 112 lines of network abstraction                             │
└─────────────────────────────────────────────────────────────────┘

Key improvements:

✅ Eliminated code duplication: 4 identical API key blocks → 1 ApiKeyAuth component
✅ Consolidated error classification: Single HttpResponseClassifier used by all error handling
✅ Separated concerns: Transport, API, Proving, and Client layers with clear boundaries
✅ Improved testability: 15 new unit tests for transport layer (API key validation, error classification)
✅ Better maintainability: Changes to authentication or request structure now in one place

2. Security Enhancement: Header-Based Authentication

Migrated API key transmission from query parameters to HTTP headers:
- Before: .query_param("apiKey", &api_key) (exposed in logs, browser history)
- After: .header("api-key", auth.header_value()) (industry best practice)
Applied to 4 authenticated operations: create_bucket, close_bucket, add_job, submit_l2_query
Centralized validation: API key format validated once at construction via ApiKeyAuth::new()

3. Universal Retry Mechanism

Automatic retry for all Atlantic API calls (GET and POST operations)
Configurable retry policy: 3 attempts with 2-second delays between retries
Smart error detection via HttpResponseClassifier:
- ✅ Retries infrastructure errors: timeouts, incomplete messages, 502/503/504, HTML error pages
- ❌ No retry for API errors: 4xx/5xx responses with proper JSON error messages
Applied to all 8 API operations: add_job, get_job_status, get_bucket, create_bucket, close_bucket, get_artifacts, get_proof_by_task_id, submit_l2_query

Does this introduce a breaking change?

No

✅ All public API signatures unchanged (backward compatible)
✅ No changes to environment variables
✅ No changes to ProverClient interface
✅ Existing functionality preserved
✅ New layers are internal implementation details
✅ Metrics and logging behavior unchanged
✅ Integration tests still pass (with credentials)

Migration required: None - all changes are internal refactoring

…-alliance/madara into orchestrator/better-atlantic-logs

heemankv

Self review done, lgtm

Mohiiit

Review comments pending

Mohiiit

Good refactoring with clean layered architecture! A few suggestions:

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/client.rs

-        Err(last_error.expect("At least one attempt should have been made"))
+        // All retries exhausted
+        let total_duration_s = start_time.elapsed().as_secs_f64();
+        let final_error = last_error.expect("At least one attempt should have been made");


Consider adding a comment explaining why this .expect() is safe (e.g., "loop guarantees at least one attempt was made").

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/client.rs

+                    return Ok(result);
+                }
+                Err(err) => {
+                    let _attempt_duration = attempt_start.elapsed().as_millis();


_attempt_duration is computed but never used. Consider removing or using it in per-attempt metrics/logging.

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/client.rs

+            .retry_request(
+                "submit_l2_query",
+                &context,
+                0,


data_size_bytes is 0 but proof_size is already computed. Consider passing proof_size as u64 for accurate request-size metrics.

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/error.rs

+
+impl AtlanticError {
+    /// Check if this error is retryable
+    pub fn is_retryable(&self) -> bool {


Consider adding a doc comment clarifying that only NetworkError is retryable, and ApiError (including 5xx from real API responses) is intentionally not retried.

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/transport.rs

+                && response_text.contains("\"0\""))
+            // 404 from infrastructure (not a real Atlantic API 404)
+            // Real Atlantic 404s would have proper JSON with "error" field
+            || (status == StatusCode::NOT_FOUND


The 404 infrastructure heuristic checks for absence of keywords. A real API 404 without these substrings could be incorrectly retried. Consider also checking for valid JSON structure or a known error field.

Mohiiit · 2025-12-09T10:03:45Z

orchestrator/crates/prover-clients/atlantic-service/src/transport.rs

+            || response_text.contains("<!DOCTYPE")
+            // Weird JSON-encoded string format from stress testing
+            // e.g., {"0":"4","1":"0","2":"4",...} encoding "404 page not found"
+            || (response_text.starts_with("{\"0\":")


The JSON-encoded string detection is a very specific edge case. Consider adding a comment explaining when/where this was observed, or removing it if no longer relevant.

Mohiiit · 2025-12-09T10:07:39Z

orchestrator/src/utils/instrument.rs

+                info!(
+                    service_name = %config.service_name,
+                    "Initializing OpenTelemetry exporters..."
+                );
+
                let meter_provider = Self::instrument_metric_provider(config, endpoint)?;
+                info!(
+                    service_name = format!("{}_meter_service", config.service_name),
+                    export_interval_secs = 5,
+                    "OTEL metrics exporter initialized"
+                );
+
                let tracer = Self::instrument_tracer_provider(config, endpoint)?;
+                info!(
+                    service_name = format!("{}_trace_service", config.service_name),
+                    "OTEL tracer exporter initialized"
+                );
+
                let logger = Self::instrument_logger_provider(config, endpoint)?;
+                info!(service_name = format!("{}_logs_service", config.service_name), "OTEL logs exporter initialized");



nit: these could be debug logs? do we really need all of these info logs?

Mohiiit · 2025-12-09T10:09:56Z

orchestrator/src/utils/logging.rs

+    } else if target.starts_with("orchestrator") {
+        "-"
+    } else {
+        "EXTERNAL"


a bit confused about external, atlantic is external only?

Mohiiit · 2025-12-09T10:11:56Z

orchestrator/src/utils/metrics.rs

+    // Atlantic Service Metrics
+    pub atlantic_api_call_duration: Gauge<f64>,
+    pub atlantic_api_calls_total: Counter<f64>,
+    pub atlantic_api_errors_total: Counter<f64>,
+    pub atlantic_api_retries_total: Counter<f64>,
+    pub atlantic_data_transfer_bytes: Counter<f64>,
 }


shall we aim for very specific metrics tho? like how many proof generation jobs, proof registration on L1 jobs and their respective numbers? transfer_bytes I am not sure how useful that would be

rest of the metrics as well, I think it would be make more sense with respective calls?

Mohiiit · 2025-12-09T10:13:05Z

orchestrator/src/utils/metrics_recorder.rs

+        if !success {
+            let error_attrs = [
+                KeyValue::new("operation", operation.to_string()),
+                KeyValue::new("error_type", error_type.unwrap_or("unknown").to_string()),


if the operation was not a success, we should ensure that error_type is something atleast?

Mohiiit · 2025-12-09T10:16:52Z

orchestrator/crates/prover-clients/atlantic-service/src/error.rs

+    }
+
+    /// Get response bytes if available (for metrics)
+    pub fn response_bytes(&self) -> u64 {


this is for what exactly?

heemankv added 2 commits December 3, 2025 20:01

update: impl better retry + logging for orchestrator atlantic service

ede93a4

clippy fix

c6b61a9

github-project-automation bot added this to Madara Dec 3, 2025

heemankv added 3 commits December 3, 2025 21:44

update: add service name for logs

d747844

update: added query metrics

8a00e08

fix: clippy

a1e73b8

heemankv marked this pull request as ready for review December 3, 2025 17:24

heemankv changed the title ~~Orchestrator/better atlantic logs~~ Atlantic Service: Enhanced Logging, Metrics & Retry Mechanism Dec 3, 2025

update: otel initialisation logs

6340ee0

heemankv marked this pull request as draft December 4, 2025 04:07

heemankv added 2 commits December 4, 2025 10:40

update: added metrics

aba2d00

lint fixes

20c839e

heemankv self-assigned this Dec 4, 2025

heemankv added 5 commits December 8, 2025 23:47

updated: added infrastructure error handling

26cd3f1

Merge branch 'main' into orchestrator/better-atlantic-logs

5ff5fb9

update:self review resolution

ce5e7e6

Merge branch 'orchestrator/better-atlantic-logs' of github.com:madara…

184c080

…-alliance/madara into orchestrator/better-atlantic-logs

update: migrate query_param from query to header

127eea0

heemankv marked this pull request as ready for review December 8, 2025 19:07

heemankv added 3 commits December 9, 2025 00:55

update: improved code quality

8553706

update: refactored call layers

5fa433c

refactored atlantic crate

f8d440f

heemankv changed the title ~~Atlantic Service: Enhanced Logging, Metrics & Retry Mechanism~~ Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring Dec 8, 2025

update: self review fixes

f2a47a0

heemankv force-pushed the orchestrator/better-atlantic-logs branch from b4d3088 to f2a47a0 Compare December 8, 2025 20:42

heemankv commented Dec 9, 2025

View reviewed changes

Mohiiit self-requested a review December 9, 2025 08:13

Mohiiit reviewed Dec 9, 2025

View reviewed changes

heemankv changed the base branch from main to orchestrator/paradex_ready December 9, 2025 09:59

heemankv changed the base branch from orchestrator/paradex_ready to main December 9, 2025 09:59

Mohiiit reviewed Dec 9, 2025

View reviewed changes

Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring #882

Are you sure you want to change the base?

Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring #882

Conversation

heemankv commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring

Pull Request type

What is the current behavior?

What is the new behavior?

1. Clean Four-Layer Architecture (orchestrator/crates/prover-clients/atlantic-service/)

2. Security Enhancement: Header-Based Authentication

3. Universal Retry Mechanism

Does this introduce a breaking change?

Uh oh!

heemankv left a comment

Choose a reason for hiding this comment

Uh oh!

Mohiiit left a comment

Choose a reason for hiding this comment

Uh oh!

Mohiiit left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

heemankv commented Dec 3, 2025 •

edited

Loading

1. Clean Four-Layer Architecture (`orchestrator/crates/prover-clients/atlantic-service/`)

Mohiiit left a comment •

edited

Loading