fix(deal): combine FWSS, PDPVerifier, and SP-HTTP probes for dataset liveness #537

Open
SgtPooki wants to merge 3 commits into main from fix/datasetlive-multi-probe

@SgtPooki
Collaborator

What changed

DealService.isDataSetLive(providerAddress, dataSetId, signal) now runs three independent liveness probes and returns true only when all agree:

  1. WarmStorage.validateDataSet (chain) - catches FWSS-side termination. Preserves the behaviour from PR #518 (fix: handle PDP-terminated datasets via data_set_creation repair): it rethrows on non-terminal errors so a transient RPC outage cannot misclassify a healthy dataset as terminated.
  2. PDPVerifier.dataSetLive (chain, via @filoz/synapse-core/pdp-verifier) - catches PDPVerifier-side termination.
  3. POST /pdp/data-sets/{id}/pieces with an empty body (off-chain, unauthenticated) - catches Curio's unrecoverable_proving_failure_epoch state, where the SP refuses addPieces with HTTP 409 while both chain signals still report the dataset live.

The SP HTTP probe is the only signal that detects PDP-terminated datasets on sp-playground on calibration today, where 197/197 provider-24 datasets return 409 from the addPieces endpoint while both chain probes report them live. Curio's handler returns 409 exclusively for the terminated check on this endpoint (curio/pdp/handlers_add.go#L302-L305), so the probe matches on status code rather than body text.

The SP HTTP probe treats any non-409 response (including 401, 404, 5xx, and network errors) as live to avoid triggering destructive repair on transient SP outages or future auth changes.
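The status-code rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FetchLike`, `probeSpHttpLive`, and the `serviceUrl` parameter are hypothetical names, and the fetch implementation is injected only so the sketch can be exercised without a real SP.

```typescript
// Hypothetical, minimal fetch shape; the global fetch should also satisfy it.
type FetchLike = (
  url: string,
  init?: { method?: string; signal?: AbortSignal },
) => Promise<{ status: number }>;

// Hypothetical helper: returns false only on HTTP 409, true for everything else.
async function probeSpHttpLive(
  serviceUrl: string,
  dataSetId: bigint,
  fetchImpl: FetchLike,
  signal?: AbortSignal,
): Promise<boolean> {
  // Strip trailing slashes so the path is not doubled up.
  const base = serviceUrl.replace(/\/+$/, "");
  const url = `${base}/pdp/data-sets/${dataSetId}/pieces`;
  try {
    // Unauthenticated POST with no body, as described in the PR text.
    const res = await fetchImpl(url, { method: "POST", signal });
    // Only 409 is treated as "terminated"; 401, 404, 5xx and so on count as live.
    return res.status !== 409;
  } catch {
    // Network errors (and, in this sketch, aborts) are treated as live
    // so that a transient SP outage never triggers a destructive repair.
    return true;
  }
}
```

Defaulting to live on every inconclusive outcome keeps the probe from ever being the sole cause of a repair unless the SP explicitly answers 409.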

getDataSetProvisioningStatus and the existing repair flow (data-set-creation.handler.ts -> repairTerminatedDataSet) are untouched; the stronger probe automatically widens what they classify as terminated.

How to verify

pnpm --filter dealbot-backend test src/deal/deal.service.spec.ts

New cases cover each probe returning false in isolation, transient errors, and the request shape sent to the SP.

Notes

  • Adds one fetch per dataset liveness check. The request timeout is capped at 10s and combined with the caller's AbortSignal.
  • DATASET_CREATIONS_PER_SP_PER_HOUR=0.25 on staging means the existing 197 terminated sp-playground datasets will repair at ~1 per 4 hours per provider once deployed. No backfill script is needed; the scheduled data_set_creation job picks them up via the wider isDataSetLive.
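One way to combine a 10s cap with the caller's AbortSignal is sketched below. It assumes a runtime that provides `AbortSignal.timeout` and `AbortSignal.any` (Node 20.3 or later, for example); `probeSignal` and `SP_PROBE_TIMEOUT_MS` are hypothetical names, and the PR may wire this up differently.

```typescript
// Hypothetical constant mirroring the 10s cap mentioned in the notes.
const SP_PROBE_TIMEOUT_MS = 10_000;

// Hypothetical helper: returns a signal that aborts when either the caller
// aborts or the timeout elapses, whichever happens first.
function probeSignal(
  caller?: AbortSignal,
  timeoutMs: number = SP_PROBE_TIMEOUT_MS,
): AbortSignal {
  const timeout = AbortSignal.timeout(timeoutMs);
  // Without a caller signal, the timeout alone bounds the request.
  return caller ? AbortSignal.any([caller, timeout]) : timeout;
}
```

The returned signal would then be passed as the `signal` option of the fetch call, so a cancelled job stops the probe immediately while a hung SP can delay it by at most the timeout.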

…liveness

`isDataSetLive` returns true only when all three signals agree:
WarmStorage.validateDataSet, PDPVerifier.dataSetLive, and an unauthenticated
POST to the SP's `/pdp/data-sets/{id}/pieces` endpoint. Curio returns HTTP 409
on that endpoint when `unrecoverable_proving_failure_epoch` is set, which is
the only signal observable when a dataset is dead on the SP but chain still
reports it as live.

Chain probes rethrow on transient errors so transient outages don't get
misclassified as termination. The SP HTTP probe treats any non-409 response
(including auth failures and network errors) as live.

Threads `providerAddress` through `isDataSetLive` so the SP probe can resolve
the serviceURL from the provider registry.
Copilot AI review requested due to automatic review settings May 14, 2026 16:54
@FilOzzy FilOzzy added this to FOC May 14, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 14, 2026
Contributor

Copilot AI left a comment

Pull request overview

Updates backend dataset liveness classification to be more robust by composing multiple independent termination signals (on-chain FWSS, on-chain PDPVerifier, and an SP HTTP probe) so that repair can trigger for datasets Curio has already marked unrecoverably terminated even when chain state hasn’t propagated yet.

Changes:

  • Expands DealService.isDataSetLive(...) into a composite check that runs FWSS validateDataSet, PDPVerifier dataSetLive, and an SP HTTP POST /pdp/data-sets/{id}/pieces probe (409 => terminated).
  • Threads providerAddress through call sites that need SP registry info for the HTTP probe.
  • Adds unit tests covering the three probes, including SP HTTP 409 behavior, request shape, and transient-error handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| apps/backend/src/deal/deal.service.ts | Implements composite liveness probing (FWSS + PDPVerifier + SP HTTP) and updates provisioning status logic to use it. |
| apps/backend/src/deal/deal.service.spec.ts | Adds mocks/stubs and new test cases validating the combined probe behavior and SP HTTP request details. |

Comment thread apps/backend/src/deal/deal.service.ts Outdated
Comment thread apps/backend/src/deal/deal.service.ts
Comment thread apps/backend/src/deal/deal.service.ts
SgtPooki added 2 commits May 14, 2026 13:27
…errors

`Promise.all` previously let a transient chain-probe rejection mask a
conclusive SP HTTP 409. Switch to `Promise.allSettled` and return false when
any probe positively reports terminated; only rethrow when no probe reported
termination.
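The settle-then-decide rule in the paragraph above could look roughly like this. It is a sketch under assumptions: `combineProbeResults` and `isLiveFromProbes` are hypothetical names, and `Settled` is a local stand-in for the `PromiseSettledResult<boolean>` type that `Promise.allSettled` returns.

```typescript
// Structural stand-in for PromiseSettledResult<boolean>.
type Settled =
  | { status: "fulfilled"; value: boolean }
  | { status: "rejected"; reason: unknown };

// Hypothetical helper: decides liveness from already-settled probe results.
function combineProbeResults(results: Settled[]): boolean {
  // A probe that resolved to false is a positive "terminated" report,
  // so it wins even when another probe failed with a transient error.
  if (results.some((r) => r.status === "fulfilled" && r.value === false)) {
    return false;
  }
  // No probe reported termination: rethrow the first error so the caller
  // does not mistake an outage for a confirmed "live" answer.
  for (const r of results) {
    if (r.status === "rejected") {
      throw r.reason;
    }
  }
  return true;
}

// Hypothetical wrapper: runs every probe to completion before deciding.
async function isLiveFromProbes(
  probes: Array<() => Promise<boolean>>,
): Promise<boolean> {
  const results = await Promise.allSettled(probes.map((probe) => probe()));
  return combineProbeResults(results);
}
```

Keeping the decision in a pure function makes each combination of outcomes (terminated plus error, error only, all live) easy to cover in unit tests.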

Also require the SP HTTP probe to match `unrecoverable proving failure` in
the response body in addition to HTTP 409. Defends against a future Curio
reusing 409 for a non-terminal conflict, which would otherwise trigger
destructive `terminateDataSet` + deal cleanup.

Tests cover both the SP-409-rescues-transient-chain-error path and the
409-with-different-body-treated-as-live path.
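The stricter 409 check from this commit message might be expressed as a small predicate like the one below. The marker string is copied from the commit message; the case-insensitive comparison and the name `isSpTerminatedResponse` are assumptions of this sketch, and Curio's exact wording may differ.

```typescript
// Marker text as quoted in the commit message (assumed to appear in Curio's 409 body).
const TERMINATED_MARKER = "unrecoverable proving failure";

// Hypothetical predicate: terminated only when BOTH the status and the body match.
function isSpTerminatedResponse(status: number, body: string): boolean {
  // Lower-casing the body is an assumption made here to tolerate casing changes.
  return status === 409 && body.toLowerCase().includes(TERMINATED_MARKER);
}
```

With this rule, a 409 carrying any other conflict message is classified as live, which is the conservative outcome when the alternative is terminating a dataset.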
Collaborator Author

@SgtPooki SgtPooki left a comment

Self review: a little heavy on the tests here, but at least our expectations are documented.

@SgtPooki SgtPooki self-assigned this May 14, 2026
@SgtPooki SgtPooki moved this from 📌 Triage to 🔎 Awaiting review in FOC May 14, 2026
Collaborator

@silent-cipher silent-cipher left a comment

Overall looks good to me. Just one comment.

}
}

protected async probePdpVerifierDataSetLive(dataSetId: bigint, signal?: AbortSignal): Promise<boolean> {
Collaborator

Why do we need a separate dataSetLive call to the PDP verifier? warmStorageService.validateDataSet already checks whether the dataset is live on the PDP verifier, so this seems like a redundant RPC call.
