Skip to content

Conversation

@heemankv
Copy link
Contributor

@heemankv heemankv commented Dec 3, 2025

Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring

Pull Request type

Please add the labels corresponding to the type of changes your PR introduces:

  • Feature
  • Refactoring (no functional changes, no API changes)
  • Bugfix
  • Code style update (formatting, renaming)
  • Build-related changes
  • Documentation content changes
  • Testing
  • Other (please describe):

What is the current behavior?

Resolves: #NA

Problems we had:

  1. Unreliable Atlantic API calls - Network errors like hyper::Error(IncompleteMessage) caused immediate job failures without retry attempts, leading to unnecessary job failures on transient network issues.

  2. Code duplication - API key header injection was duplicated 4 times across different methods, making maintenance difficult and error-prone.

  3. Intertwined concerns - HTTP transport logic, API-specific logic, and business logic were all mixed together in single methods, making the code hard to test and maintain.

  4. Insufficient observability - Lack of comprehensive logging and metrics made it difficult to:

    • Debug Atlantic API failures
    • Monitor API performance and latency
    • Track retry patterns and success rates
    • Measure data transfer volumes
    • Identify bottlenecks in the proving pipeline
  5. Poor separation of concerns - No clear architectural layering made it hard to understand code flow and add new features.

  6. Security concerns - API keys transmitted via query parameters instead of headers (less secure).

What is the new behavior?

Solutions implemented:

1. Clean Four-Layer Architecture (orchestrator/crates/prover-clients/atlantic-service/)

Introduced a clean separation of concerns with four distinct layers:

┌─────────────────────────────────────────────────────────────────┐
│                    CLIENT LAYER (client.rs)                      │
│  - Public API: create_bucket, add_job, get_bucket, etc.         │
│  - Retry orchestration via retry_request()                      │
│  - OTEL metrics recording                                        │
│  - INFO-level logging for significant events                     │
│  - 729 lines (reduced from ~1000 lines)                         │
└────────────────────────────┬────────────────────────────────────┘
                             │ delegates to
┌────────────────────────────▼────────────────────────────────────┐
│                     API LAYER (api.rs) - NEW                     │
│  - AtlanticApiOperations with build_* and parse_* methods       │
│  - Request building (forms, JSON bodies, paths)                  │
│  - Response parsing (JSON deserialization)                       │
│  - DEBUG-level logging for request/response details              │
│  - 546 lines of pure business logic                             │
└────────────────────────────┬────────────────────────────────────┘
                             │ uses
┌────────────────────────────▼────────────────────────────────────┐
│              TRANSPORT LAYER (transport.rs) - NEW                │
│  - ApiKeyAuth: API key validation and header injection          │
│  - HttpResponseClassifier: Infrastructure vs API error detection │
│  - 285 lines with 15 unit tests                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│              PROVING LAYER (proving.rs) - NEW                    │
│  - ProvingLayer trait for network-specific configuration        │
│  - EthereumLayer: No additional params                          │
│  - StarknetLayer: Adds layout/result params                     │
│  - 112 lines of network abstraction                             │
└─────────────────────────────────────────────────────────────────┘

Key improvements:

  • Eliminated code duplication: 4 identical API key blocks → 1 ApiKeyAuth component
  • Consolidated error classification: Single HttpResponseClassifier used by all error handling
  • Separated concerns: Transport, API, Proving, and Client layers with clear boundaries
  • Improved testability: 15 new unit tests for transport layer (API key validation, error classification)
  • Better maintainability: Changes to authentication or request structure now in one place

2. Security Enhancement: Header-Based Authentication

  • Migrated API key transmission from query parameters to HTTP headers:
    • Before: .query_param("apiKey", &api_key) (exposed in logs, browser history)
    • After: .header("api-key", auth.header_value()) (industry best practice)
  • Applied to 4 authenticated operations: create_bucket, close_bucket, add_job, submit_l2_query
  • Centralized validation: API key format validated once at construction via ApiKeyAuth::new()

3. Universal Retry Mechanism

  • Automatic retry for all Atlantic API calls (GET and POST operations)
  • Configurable retry policy: 3 attempts with 2-second delays between retries
  • Smart error detection via HttpResponseClassifier:
    • Retries infrastructure errors: timeouts, incomplete messages, 502/503/504, HTML error pages
    • No retry for API errors: 4xx/5xx responses with proper JSON error messages
  • Applied to all 8 API operations: add_job, get_job_status, get_bucket, create_bucket, close_bucket, get_artifacts, get_proof_by_task_id, submit_l2_query

Does this introduce a breaking change?

No

  • ✅ All public API signatures unchanged (backward compatible)
  • ✅ No changes to environment variables
  • ✅ No changes to ProverClient interface
  • ✅ Existing functionality preserved
  • ✅ New layers are internal implementation details
  • ✅ Metrics and logging behavior unchanged
  • ✅ Integration tests still pass (with credentials)

Migration required: None - all changes are internal refactoring

@heemankv heemankv marked this pull request as ready for review December 3, 2025 17:24
@heemankv heemankv changed the title Orchestrator/better atlantic logs Atlantic Service: Enhanced Logging, Metrics & Retry Mechanism Dec 3, 2025
@heemankv heemankv marked this pull request as draft December 4, 2025 04:07
@heemankv heemankv self-assigned this Dec 4, 2025
@heemankv heemankv marked this pull request as ready for review December 8, 2025 19:07
@heemankv heemankv changed the title Atlantic Service: Enhanced Logging, Metrics & Retry Mechanism Atlantic Service: Enhanced Logging, Metrics, Retry & Architectural Refactoring Dec 8, 2025
@heemankv heemankv force-pushed the orchestrator/better-atlantic-logs branch from b4d3088 to f2a47a0 Compare December 8, 2025 20:42
Copy link
Contributor Author

@heemankv heemankv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review done, lgtm

@Mohiiit Mohiiit self-requested a review December 9, 2025 08:13
Copy link
Member

@Mohiiit Mohiiit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments pending

@heemankv heemankv changed the base branch from main to orchestrator/paradex_ready December 9, 2025 09:59
@heemankv heemankv changed the base branch from orchestrator/paradex_ready to main December 9, 2025 09:59
Copy link
Member

@Mohiiit Mohiiit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good refactoring with clean layered architecture! A few suggestions:

Err(last_error.expect("At least one attempt should have been made"))
// All retries exhausted
let total_duration_s = start_time.elapsed().as_secs_f64();
let final_error = last_error.expect("At least one attempt should have been made");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a comment explaining why this .expect() is safe (e.g., "loop guarantees at least one attempt was made").

return Ok(result);
}
Err(err) => {
let _attempt_duration = attempt_start.elapsed().as_millis();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_attempt_duration is computed but never used. Consider removing or using it in per-attempt metrics/logging.

.retry_request(
"submit_l2_query",
&context,
0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_size_bytes is 0 but proof_size is already computed. Consider passing proof_size as u64 for accurate request-size metrics.


impl AtlanticError {
/// Check if this error is retryable
pub fn is_retryable(&self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a doc comment clarifying that only NetworkError is retryable, and ApiError (including 5xx from real API responses) is intentionally not retried.

&& response_text.contains("\"0\""))
// 404 from infrastructure (not a real Atlantic API 404)
// Real Atlantic 404s would have proper JSON with "error" field
|| (status == StatusCode::NOT_FOUND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 404 infrastructure heuristic checks for absence of keywords. A real API 404 without these substrings could be incorrectly retried. Consider also checking for valid JSON structure or a known error field.

|| response_text.contains("<!DOCTYPE")
// Weird JSON-encoded string format from stress testing
// e.g., {"0":"4","1":"0","2":"4",...} encoding "404 page not found"
|| (response_text.starts_with("{\"0\":")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON-encoded string detection is a very specific edge case. Consider adding a comment explaining when/where this was observed, or removing it if no longer relevant.

Comment on lines +44 to 64
info!(
service_name = %config.service_name,
"Initializing OpenTelemetry exporters..."
);

let meter_provider = Self::instrument_metric_provider(config, endpoint)?;
info!(
service_name = format!("{}_meter_service", config.service_name),
export_interval_secs = 5,
"OTEL metrics exporter initialized"
);

let tracer = Self::instrument_tracer_provider(config, endpoint)?;
info!(
service_name = format!("{}_trace_service", config.service_name),
"OTEL tracer exporter initialized"
);

let logger = Self::instrument_logger_provider(config, endpoint)?;
info!(service_name = format!("{}_logs_service", config.service_name), "OTEL logs exporter initialized");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these could be debug logs? do we really need all of these info logs?

} else if target.starts_with("orchestrator") {
"-"
} else {
"EXTERNAL"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit confused about external, atlantic is external only?

Comment on lines +49 to 55
// Atlantic Service Metrics
pub atlantic_api_call_duration: Gauge<f64>,
pub atlantic_api_calls_total: Counter<f64>,
pub atlantic_api_errors_total: Counter<f64>,
pub atlantic_api_retries_total: Counter<f64>,
pub atlantic_data_transfer_bytes: Counter<f64>,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we aim for very specific metrics tho? like how many proof generation jobs, proof registration on L1 jobs and their respective numbers? transfer_bytes I am not sure how useful that would be

rest of the metrics as well, I think it would be make more sense with respective calls?

if !success {
let error_attrs = [
KeyValue::new("operation", operation.to_string()),
KeyValue::new("error_type", error_type.unwrap_or("unknown").to_string()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the operation was not a success, we should ensure that error_type is something atleast?

}

/// Get response bytes if available (for metrics)
pub fn response_bytes(&self) -> u64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for what exactly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants