fix(deal): combine FWSS, PDPVerifier, and SP-HTTP probes for dataset liveness #537
SgtPooki wants to merge 3 commits
…liveness
`isDataSetLive` returns true only when all three signals agree:
WarmStorage.validateDataSet, PDPVerifier.dataSetLive, and an unauthenticated
POST to the SP's `/pdp/data-sets/{id}/pieces` endpoint. Curio returns HTTP 409
on that endpoint when `unrecoverable_proving_failure_epoch` is set, which is
the only signal observable when a dataset is dead on the SP but chain still
reports it as live.
Chain probes rethrow on transient errors so transient outages don't get
misclassified as termination. The SP HTTP probe treats any non-409 response
(including auth failures and network errors) as live.
Threads `providerAddress` through `isDataSetLive` so the SP probe can resolve
the serviceURL from the provider registry.
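
The SP HTTP probe described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `deal.service.ts`: `FetchLike`, `isLiveStatus`, `piecesUrl`, and `probeSpDataSetLive` are hypothetical names, the fetch function is injected so the example runs without a network, and `AbortSignal` handling is left out for brevity.

```typescript
// Illustrative sketch only; every name here is hypothetical, not from deal.service.ts.

// Minimal shape of the fetch call the probe needs, so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; body: string },
) => Promise<{ status: number }>;

/** Only HTTP 409 is read as "terminated"; any other status counts as live. */
function isLiveStatus(status: number): boolean {
  return status !== 409;
}

/** Builds the addPieces URL, tolerating trailing slashes on the service URL. */
function piecesUrl(serviceUrl: string, dataSetId: bigint | number): string {
  return `${serviceUrl.replace(/\/+$/, "")}/pdp/data-sets/${dataSetId}/pieces`;
}

/** Unauthenticated POST with an empty body; resolves to false only on HTTP 409. */
async function probeSpDataSetLive(
  serviceUrl: string,
  dataSetId: bigint | number,
  fetchImpl: FetchLike,
): Promise<boolean> {
  try {
    const res = await fetchImpl(piecesUrl(serviceUrl, dataSetId), {
      method: "POST",
      body: "",
    });
    return isLiveStatus(res.status);
  } catch {
    // Network errors are inconclusive for this probe, so fall back to "live".
    return true;
  }
}
```

Treating errors as live keeps this probe from ever being the cause of a destructive repair during an SP outage; it can only add a termination signal, never remove one.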
Pull request overview
Updates backend dataset liveness classification to be more robust by composing multiple independent termination signals (on-chain FWSS, on-chain PDPVerifier, and an SP HTTP probe) so that repair can trigger for datasets Curio has already marked unrecoverably terminated even when chain state hasn’t propagated yet.
Changes:
- Expands `DealService.isDataSetLive(...)` into a composite check that runs FWSS `validateDataSet`, PDPVerifier `dataSetLive`, and an SP HTTP `POST /pdp/data-sets/{id}/pieces` probe (409 => terminated).
- Threads `providerAddress` through call sites that need SP registry info for the HTTP probe.
- Adds unit tests covering the three probes, including SP HTTP 409 behavior, request shape, and transient-error handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| apps/backend/src/deal/deal.service.ts | Implements composite liveness probing (FWSS + PDPVerifier + SP HTTP) and updates provisioning status logic to use it. |
| apps/backend/src/deal/deal.service.spec.ts | Adds mocks/stubs and new test cases validating the combined probe behavior and SP HTTP request details. |
…errors

`Promise.all` previously let a transient chain-probe rejection mask a conclusive SP HTTP 409. Switch to `Promise.allSettled` and return false when any probe positively reports terminated; only rethrow when no probe reported termination.

Also require the SP HTTP probe to match `unrecoverable proving failure` in the response body in addition to HTTP 409. This defends against a future Curio reusing 409 for a non-terminal conflict, which would otherwise trigger destructive `terminateDataSet` + deal cleanup.

Tests cover both the SP-409-rescues-transient-chain-error path and the 409-with-different-body-treated-as-live path.
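
A minimal sketch of the combination rule this commit describes, assuming each probe is an async function that resolves to a boolean. `Probe`, `combineProbes`, and `isTerminalConflict` are hypothetical names, and the case-insensitive substring match on the body is an assumption; the actual matching rule in the PR may differ.

```typescript
// Illustrative sketch only; names and the exact body-matching rule are assumptions.

type Probe = () => Promise<boolean>;

/** A 409 counts as terminal only when the body also names the proving failure. */
function isTerminalConflict(status: number, body: string): boolean {
  return (
    status === 409 &&
    body.toLowerCase().includes("unrecoverable proving failure")
  );
}

/**
 * Runs every probe to completion. A conclusive "terminated" (false) from any
 * probe wins over errors from the others; errors are rethrown only when no
 * probe reported termination.
 */
async function combineProbes(probes: Probe[]): Promise<boolean> {
  const results = await Promise.allSettled(probes.map((probe) => probe()));

  let firstRejection: PromiseRejectedResult | undefined;
  for (const result of results) {
    if (result.status === "rejected") {
      if (!firstRejection) firstRejection = result;
    } else if (result.value === false) {
      return false; // at least one probe positively reported termination
    }
  }

  // No probe said "terminated": surface a transient failure instead of guessing.
  if (firstRejection) throw firstRejection.reason;
  return true;
}
```

With `Promise.all`, the first rejection would short-circuit and hide a `false` from another probe; `Promise.allSettled` waits for every outcome so the decision can be made over the full set.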
SgtPooki left a comment
self review.. a little heavy on the tests here, but at least our expectations are documented.
silent-cipher left a comment
Overall looks good to me. Just one comment.
`protected async probePdpVerifierDataSetLive(dataSetId: bigint, signal?: AbortSignal): Promise<boolean> {`
Why do we need a separate `dataSetLive` call to the PDPVerifier? `warmStorageService.validateDataSet` already checks whether the dataset is live on the PDPVerifier, so this seems like a redundant RPC call.
What changed
`DealService.isDataSetLive(providerAddress, dataSetId, signal)` now runs three independent liveness probes and returns true only when all agree:
- `WarmStorage.validateDataSet` (chain) - catches FWSS-side termination. Preserves the behaviour of #518 (fix: handle PDP-terminated datasets via data_set_creation repair): rethrows on non-terminal errors so a transient RPC outage cannot misclassify a healthy dataset as terminated.
- `PDPVerifier.dataSetLive` (chain, via `@filoz/synapse-core/pdp-verifier`) - catches PDPVerifier-side termination.
- `POST /pdp/data-sets/{id}/pieces` with an empty body (off-chain, unauthenticated) - catches Curio's `unrecoverable_proving_failure_epoch` state, where the SP refuses `addPieces` with HTTP 409 while both chain signals still report the dataset live.

The SP HTTP probe is the only signal that detects PDP-terminated datasets on `sp-playground` on calibration today, where 197/197 provider-24 datasets return 409 from the addPieces endpoint while both chain probes report them live. Curio's handler returns 409 exclusively for the terminated check on this endpoint (curio/pdp/handlers_add.go#L302-L305), so the probe matches on status code rather than body text.

The SP HTTP probe treats any non-409 response (including 401, 404, 5xx, and network errors) as live to avoid triggering destructive repair on transient SP outages or future auth changes.

`getDataSetProvisioningStatus` and the existing repair flow (`data-set-creation.handler.ts` -> `repairTerminatedDataSet`) are untouched; the stronger probe automatically widens what they classify as terminated.

How to verify
`pnpm --filter dealbot-backend test src/deal/deal.service.spec.ts`

New cases cover each probe returning false in isolation, transient errors, and the request shape sent to the SP.

Notes
- … `AbortSignal`.
- `DATASET_CREATIONS_PER_SP_PER_HOUR=0.25` on staging means the existing 197 terminated `sp-playground` datasets will repair at ~1 per 4 hours per provider once deployed. No backfill script needed; the scheduled `data_set_creation` job picks them up via the wider `isDataSetLive`.
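
As a rough back-of-the-envelope check on the repair rate in the note above, the sketch below estimates how long the backlog would take to drain. It assumes all 197 datasets belong to a single provider (the description mentions provider 24) and that the per-hour limit is the only bottleneck; `hoursToDrain` is a hypothetical helper, not part of the PR.

```typescript
// Illustrative arithmetic only; assumes the rate limit is the sole bottleneck.

/** Hours needed to repair `backlog` datasets at `creationsPerHour` per provider. */
function hoursToDrain(backlog: number, creationsPerHour: number): number {
  if (creationsPerHour <= 0) {
    throw new RangeError("creationsPerHour must be positive");
  }
  return backlog / creationsPerHour;
}

const hours = hoursToDrain(197, 0.25); // 788 hours at one repair every 4 hours
const days = hours / 24; // a little under 33 days
console.log(`${hours} hours (~${days.toFixed(1)} days)`);
```

Under those assumptions a single provider's backlog clears in roughly a month, which is worth keeping in mind when judging whether a backfill script really is unnecessary.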