Problem
ipniVerifyMs is a single histogram with no outcome dimension. It records duration regardless of result (verified, signal-aborted, monitoring-skipped, error). Dashboards cannot filter to success-only or compute success-rate from it. Bucket boundaries [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000] jump from 60s → 120s → 300s, so quantileInterpolatedWeighted linearly extrapolates a deep tail (e.g. p90 ≈ 264s) from a small number of >120s samples.
Two related call paths feed the metric:
- Verify path (
apps/backend/src/ipni/ipni-verification.service.ts:25-142): polls IPNI indexer, bounded by IPNI_VERIFICATION_TIMEOUT_MS=60000 via AbortSignal.timeout. Observed at apps/backend/src/deal-addons/strategies/ipni.strategy.ts:406 and apps/backend/src/retrieval/retrieval.service.ts:427.
- Monitoring fallback (
apps/backend/src/deal-addons/strategies/ipni.strategy.ts:303-332): when monitorPieceStatus (separate POLLING_TIMEOUT_MS=120000) times out, the verify call still runs after — but the wall-time framing differs and the metric cannot distinguish.
Proposed design (synthesis of cursor + gemini peer reviews)
A. Add outcome label to ipniVerifyMs
outcome ∈ { verified, timeout, error } — observed for every attempted verification. Lets dashboards filter {outcome=\"verified\"} for true latency SLI. Recording timeout durations under outcome=timeout exposes potential AbortSignal-leak bugs (durations exceeding the budget by a wide margin).
B. Add ipniVerifySkippedTotal counter
Incremented when monitoring failure prevents a verify attempt. Avoids polluting the histogram with synthetic-zero or "skipped" durations.
C. Tighten histogram buckets
Current: [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000]
Proposed: [10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 90000, 120000, 180000, 240000, 300000]
Closes the 60s→120s→300s gap that drives misleading p90 interpolation.
D. (Optional) Separate ipniMonitorStatusMs histogram
For monitorPieceStatus Path B latency, if observability into the upstream gate is wanted.
Files
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts:101-110 (metric definition + buckets)
apps/backend/src/metrics-prometheus/check-metrics.service.ts:166-215 (observation site)
apps/backend/src/deal-addons/strategies/ipni.strategy.ts:282-410 (Path A/B fork, observation call site for deal flow)
apps/backend/src/retrieval/retrieval.service.ts:418-427 (observation call site for retrieval flow)
apps/backend/src/ipni/ipni-verification.service.ts:25-142 (verify implementation)
Migration
- Keep
ipniVerifyMs name. Add label + buckets in one release.
- Update Better Stack dashboards to filter
outcome=\"verified\" once label is populated.
- After one retention window (30d), simplify or remove old chart panels.
Acceptance criteria
Related
Problem
ipniVerifyMsis a single histogram with no outcome dimension. It records duration regardless of result (verified, signal-aborted, monitoring-skipped, error). Dashboards cannot filter to success-only or compute success-rate from it. Bucket boundaries[10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000]jump from 60s → 120s → 300s, soquantileInterpolatedWeightedlinearly extrapolates a deep tail (e.g. p90 ≈ 264s) from a small number of >120s samples.Two related call paths feed the metric:
apps/backend/src/ipni/ipni-verification.service.ts:25-142): polls IPNI indexer, bounded byIPNI_VERIFICATION_TIMEOUT_MS=60000viaAbortSignal.timeout. Observed atapps/backend/src/deal-addons/strategies/ipni.strategy.ts:406andapps/backend/src/retrieval/retrieval.service.ts:427.apps/backend/src/deal-addons/strategies/ipni.strategy.ts:303-332): whenmonitorPieceStatus(separatePOLLING_TIMEOUT_MS=120000) times out, the verify call still runs after — but the wall-time framing differs and the metric cannot distinguish.Proposed design (synthesis of cursor + gemini peer reviews)
A. Add
outcomelabel toipniVerifyMsoutcome ∈ { verified, timeout, error }— observed for every attempted verification. Lets dashboards filter{outcome=\"verified\"}for true latency SLI. Recording timeout durations underoutcome=timeoutexposes potential AbortSignal-leak bugs (durations exceeding the budget by a wide margin).B. Add
ipniVerifySkippedTotalcounterIncremented when monitoring failure prevents a verify attempt. Avoids polluting the histogram with synthetic-zero or "skipped" durations.
C. Tighten histogram buckets
Current:
[10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 120000, 300000]Proposed:
[10, 50, 100, 500, 1000, 2000, 5000, 10000, 30000, 60000, 90000, 120000, 180000, 240000, 300000]Closes the 60s→120s→300s gap that drives misleading p90 interpolation.
D. (Optional) Separate
ipniMonitorStatusMshistogramFor
monitorPieceStatusPath B latency, if observability into the upstream gate is wanted.Files
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts:101-110(metric definition + buckets)apps/backend/src/metrics-prometheus/check-metrics.service.ts:166-215(observation site)apps/backend/src/deal-addons/strategies/ipni.strategy.ts:282-410(Path A/B fork, observation call site for deal flow)apps/backend/src/retrieval/retrieval.service.ts:418-427(observation call site for retrieval flow)apps/backend/src/ipni/ipni-verification.service.ts:25-142(verify implementation)Migration
ipniVerifyMsname. Add label + buckets in one release.outcome=\"verified\"once label is populated.Acceptance criteria
ipniVerifyMscarriesoutcomelabel{verified|timeout|error}.ipniVerifySkippedTotalcounter exists and increments on monitoring-skipped verifications.outcome=\"verified\"for SLI.Related