Skip to content

Redesign ipniVerifyMs to disambiguate verify outcomes and tighten histogram buckets #484

@SgtPooki

Description

@SgtPooki

Problem

ipniVerifyMs is a single histogram with no outcome dimension. It records duration regardless of result (verified, signal-aborted, monitoring-skipped, error). Dashboards cannot filter to success-only or compute success-rate from it. Bucket boundaries [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000] jump from 60s → 120s → 300s, so quantileInterpolatedWeighted linearly extrapolates a deep tail (e.g. p90 ≈ 264s) from a small number of >120s samples.

Two related call paths feed the metric:

  1. Verify path (apps/backend/src/ipni/ipni-verification.service.ts:25-142): polls IPNI indexer, bounded by IPNI_VERIFICATION_TIMEOUT_MS=60000 via AbortSignal.timeout. Observed at apps/backend/src/deal-addons/strategies/ipni.strategy.ts:406 and apps/backend/src/retrieval/retrieval.service.ts:427.
  2. Monitoring fallback (apps/backend/src/deal-addons/strategies/ipni.strategy.ts:303-332): when monitorPieceStatus (separate POLLING_TIMEOUT_MS=120000) times out, the verify call still runs after — but the wall-time framing differs and the metric cannot distinguish.

Proposed design (synthesis of cursor + gemini peer reviews)

A. Add outcome label to ipniVerifyMs

outcome ∈ { verified, timeout, error } — observed for every attempted verification. Lets dashboards filter {outcome=\"verified\"} for true latency SLI. Recording timeout durations under outcome=timeout exposes potential AbortSignal-leak bugs (durations exceeding the budget by a wide margin).

B. Add ipniVerifySkippedTotal counter

Incremented when monitoring failure prevents a verify attempt. Avoids polluting the histogram with synthetic-zero or "skipped" durations.

C. Tighten histogram buckets

Current: [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000]
Proposed: [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 90000, 120000, 180000, 240000, 300000]

Closes the 60s→120s→300s gap that drives misleading p90 interpolation.

D. (Optional) Separate ipniMonitorStatusMs histogram

For monitorPieceStatus Path B latency, if observability into the upstream gate is wanted.

Files

  • apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts:101-110 (metric definition + buckets)
  • apps/backend/src/metrics-prometheus/check-metrics.service.ts:166-215 (observation site)
  • apps/backend/src/deal-addons/strategies/ipni.strategy.ts:282-410 (Path A/B fork, observation call site for deal flow)
  • apps/backend/src/retrieval/retrieval.service.ts:418-427 (observation call site for retrieval flow)
  • apps/backend/src/ipni/ipni-verification.service.ts:25-142 (verify implementation)

Migration

  • Keep ipniVerifyMs name. Add label + buckets in one release.
  • Update Better Stack dashboards to filter outcome=\"verified\" once label is populated.
  • After one retention window (30d), simplify or remove old chart panels.

Acceptance criteria

  • ipniVerifyMs carries outcome label {verified|timeout|error}.
  • ipniVerifySkippedTotal counter exists and increments on monitoring-skipped verifications.
  • Histogram buckets include 90s, 120s, 180s, 240s.
  • Dashboard chart filters by outcome=\"verified\" for SLI.
  • No new high-cardinality label combinations introduced (3 outcomes × existing labels only).

Related

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestready-for-workTriaged: scope, plan, and DoD are clear; contributor can pick up

Type

No type
No fields configured for issues without a type.

Projects

Status
🎉 Done

Relationships

None yet

Development

No branches or pull requests

Issue actions