fix(retrieval): surface silent IPNI timeouts in logs#539
Open
SgtPooki wants to merge 1 commit into
Open
Conversation
Two adjacent observability bugs left IPNI failures invisible whenever the outer pg-boss retrieval-job timeout fired: 1. IpniVerificationService.verify checked signal?.aborted before verificationSignal.aborted. On a race where both signals were asserted by the time the catch handler ran, signal.throwIfAborted() re-threw before the ipni_verification_timed_out log could fire. Inner check now runs first so the inner timeout log fires whenever the inner timeout signal aborted, regardless of outer state. 2. RetrievalService.verifyIpniForRetrieval's catch returned a silent failure.timedout when signal?.aborted, with no log. Added retrieval_ipni_verification_timed_out so the retrieval-side caller records the abort. Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
a9c9f5a to
249fcc1
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Two small observability fixes for IPNI verification timeouts in the retrieval check. The first reorders the catch handler in IpniVerificationService.verify so the inner-timeout log fires even when the outer job signal aborts in the same tick. The second adds a retrieval_ipni_verification_timed_out warn log in RetrievalService.verifyIpniForRetrieval's outer-abort branch, replacing a silent failure.timedout return. Both changes have accompanying unit tests.
Changes:
- Swap catch-order in
ipni-verification.service.tssotimeoutSignal.abortedis checked before the externalsignal?.abortedrethrow. - Emit
retrieval_ipni_verification_timed_outlog inretrieval.service.tswhen the outer signal aborts during IPNI verify. - Add unit tests for both code paths.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| apps/backend/src/ipni/ipni-verification.service.ts | Reorder catch checks so inner-timeout log wins same-tick races. |
| apps/backend/src/ipni/ipni-verification.service.spec.ts | Test that ipni_verification_timed_out fires when both signals abort in the same tick. |
| apps/backend/src/retrieval/retrieval.service.ts | Add structured warn log on outer-signal abort in verifyIpniForRetrieval. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Test that retrieval_ipni_verification_timed_out fires when outer signal aborts and IPNI verify throws. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed
Two adjacent observability bugs left IPNI failures invisible whenever the outer pg-boss retrieval-job timeout fired:
`IpniVerificationService.verify` check-order. Catch checked `signal?.aborted` before `verificationSignal.aborted`. On a race where both signals had asserted by the time the catch handler ran, `signal.throwIfAborted()` re-threw before the `ipni_verification_timed_out` log could fire. The inner-timeout log path was effectively dead code whenever outer ≤ inner. Inner check now runs first so the log fires whenever the inner timer aborted.
`RetrievalService.verifyIpniForRetrieval` silent abort. The catch's signal-aborted branch returned a silent `failure.timedout` with no event log. Added `retrieval_ipni_verification_timed_out` so the retrieval-side caller records the abort.
Implements option C from #524. Stacks on #538.
Files
How to verify
```
pnpm --filter dealbot-backend typecheck
pnpm --filter dealbot-backend lint
pnpm --filter dealbot-backend test
```
After deploy, staging logs should show `ipni_verification_timed_out` events 1:1 with `retrievalStatus=failure.timedout` deltas for SPs without IPNI advertisements (currently SP 5 Mongo2Stor, SP 17 superusey-calib in staging).
Notes
Base branch is `feat/retrieval-parallel-transport-counter` (#538). Re-target to `main` after that PR merges.