feat(sync): quorum/sync subsystem hardening (proof spot-check, bad-tfile cache, chain re-selection)#150
Open
adequatelimited wants to merge 5 commits intomasterfrom
Open
feat(sync): quorum/sync subsystem hardening (proof spot-check, bad-tfile cache, chain re-selection)#150adequatelimited wants to merge 5 commits intomasterfrom
adequatelimited wants to merge 5 commits intomasterfrom
Conversation
Introduces the sync-subsystem hardening module (src/syncguard.{h,c}),
used by later phases to add defense-in-depth to the node sync flow.
Provides session-local (in-memory) caches:
- Bad-chain exclusion list: (weight, hash) pairs that failed sync
- Bad-tfile hash cache: SHA-256 hashes of tfiles that failed validation
- Per-peer proof segment cache: last NTFTX trailers advertised by each
quorum candidate, for byte-exact matching against the tfile they
subsequently serve
All caches are cleared at the start of each resync() attempt via
sg_session_reset().
Also exposes:
- sg_hash_file(): SHA-256 of a file on disk, for bad-tfile cache keys
- sg_proof_match_tfile_tail(): byte-exact tail comparison
- sg_validate_proof_chain(): structural chain validation (no PoW)
No behavioral change in this commit — the caches and helpers are
defined but not yet called by any existing code path. That wiring
happens in Phases 2-4.
Adds get_tf_proof() helper to fetch a peer's last NTFTX block trailers
via OP_TF, stored in a per-IP temp file on disk for parallel safety
across the scan_quorum() OMP loop.
scan_quorum() now, for each candidate peer:
1. Captures the peer's advertised (weight, hash, bnum) in thread-local
storage after the existing OP_GET_IPL handshake.
2. Checks the bad-chain exclusion list (Phase 4 infrastructure) before
any expensive proof work.
3. Requests the peer's last NTFTX trailers via get_tf_proof().
4. Validates the proof segment structurally with
sg_validate_proof_chain():
- bnum increments by 1 across consecutive trailers
- phash of trailer[i+1] equals bhash of trailer[i]
- tip bhash equals advertised hash
- tip bnum equals advertised bnum
PoW validation is intentionally skipped here; it is cost-prohibitive
per-peer and is fully validated downstream in validate_tfile_pow().
5. Only peers that pass the proof spot-check AND match the current
high chain hash are admitted to the quorum. Their validated proof
is stored via sg_proof_store() for Phase 3's tfile tail check.
Effect: a peer advertising a fabricated (weight, hash, bnum) triple
without also being able to produce a self-consistent NTFTX proof
segment is rejected from the quorum immediately, without the node
wasting bandwidth downloading their tfile.
Augments the tfile acquisition step in resync() with defense-in-depth:
1. sg_session_reset() at resync() entry clears the per-session
exclusion and proof caches established in Phases 1 and 2.
2. After downloading a candidate tfile (tfile.tmp), hash it via
SHA-256 and consult the bad-tfile cache. If the same content
has already failed validation from another quorum member, skip
this peer immediately without re-running validation.
3. Tail spot-check: the last NTFTX trailers of the downloaded tfile
are compared byte-exactly against the proof segment that this
peer served during scan_quorum(). A mismatch means the peer
served a different chain than it advertised — add the tfile
hash to the bad-tfile cache and drop the peer from the quorum.
4. If the existing validate_tfile() or validate_tfile_pow() checks
fail, or the advertised (weight, bnum) is not met, add the
tfile hash to the bad-tfile cache and continue to the next
quorum member instead of aborting the entire sync.
Previously, any tfile failure immediately returned VERROR from
resync(); now the loop iterates through quorum members, giving the
node a chance to find a good tfile without exiting to restart().
Previously, resync() called restart() when it exhausted all quorum
members for a given chain. restart() exit()s the process, which:
1. Loses the session-local bad-chain and bad-tfile caches
2. Forces the bootstrap process to start from scratch — potentially
picking the same bad chain again if the malicious advertisements
persist across the restart
This commit:
- Adds a highhash parameter to resync() so the function knows which
chain the quorum is targeting
- On quorum exhaustion during gettfile OR getneo, marks the
(weight, hash) pair bad via sg_bad_chain_add() and returns VERROR
instead of calling restart()
- scan_quorum() already consults sg_bad_chain_check() (wired in
Phase 2), so a subsequent re-scan will skip this chain
The result: when a chain's quorum is exhausted, the bootstrap loop
in mochimo.c iterates to the next-best chain automatically, within
the same process (preserving the session caches from Phases 1-3).
Two bugs uncovered during end-to-end WSL smoke testing on mainnet: 1. sg_proof_match_tfile_tail() required the proof segment to sit byte-exactly at the TAIL of the downloaded tfile. If the peer's chain advanced between scan_quorum() (proof fetch) and resync() (full tfile download) — even by a single block — the proof would no longer be at the tail and the check would fire false positives. Since tfile trailers are stored in bnum-continuous order starting from genesis (trailer at byte offset N*sizeof(BTRAILER) has bnum==N), we can instead seek directly to proof[0].bnum's offset and compare the proof's historical trailers in place. The tfile may have advanced past the proof's tip, but the proof's own range is historical and cannot change without a reorg. Renamed to sg_proof_match_tfile() to reflect the new semantics. 2. resync() called sg_session_reset() at entry, which also cleared the proof cache that scan_quorum() had just populated. The proofs must persist across the scan→resync boundary. Session reset is no longer automatic on resync(); the session caches now persist for the lifetime of the process (zero-initialized at startup via static storage, overwritten naturally on re-scan). sg_session_reset() remains available as an exported helper. Also downgraded diagnostic logs in sg_proof_match_tfile() from plog() to pdebug() so they only appear at debug log level.
628b45e to
5eff1b0
Compare
Collaborator
Author
|
This subsystem is a must-have for v3.1, so will need to be included in that code, but is not required for the current audit remediation cycle. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Subsystem overhaul hardening the node sync flow against fabricated or corrupt peer advertisements. Builds on F-09 (#148): where that change ensured quorum members share the same chain, this change ensures the chain they advertise is actually the chain they serve, and that the node degrades gracefully when it is not.
Scope is intentionally narrow: only
src/network.c,src/sync.c,src/bin/mochimo.c, and two new files (src/syncguard.{h,c}). Consensus rules, transaction validation, PoW, ledger format, and all other subsystems are unchanged.Commits
bc9b1cb— Phase 1: data structures and session caches (syncguard.{h,c})1125f13— Phase 2: proof spot-check inscan_quorum()ca39658— Phase 3: tfile spot-check and bad-tfile caching inresync()44d3549— Phase 4: chain re-selection on quorum exhaustion628b45e— fix(syncguard): spot-check proof by bnum offset, not tfile tail(Plus the F-09 commit
0411f05at the base, already merged to master via #149.)Why this overhaul
The F-09 narrow fix solved the quorum assembly determinism problem but left several gaps that the audit identified during review:
restart()(exit)What changed
Phase 1:
src/syncguard.{h,c}— data structures and session cachesSession-local in-memory caches:
(weight, hash)pairs that failed sync this sessionNTFTXtrailers validated for each quorum candidate, keyed by IPAll caches zero-init at process start. They persist across
resync()retries within the same process so the node does not fall into repeated failures against the same bad actors.Also provides:
sg_hash_file()— SHA-256 of a file on disk, for bad-tfile cache keyssg_proof_match_tfile()— byte-exact comparison of cached proof against the corresponding bnum range in a downloaded tfilesg_validate_proof_chain()— structural chain validation (bnum increments by 1, phash links, tip matches advertised). No PoW check; that is too expensive per-peer and is fully validated downstream invalidate_tfile_pow().Phase 2:
src/network.c— proof spot-check inscan_quorum()Adds
get_tf_proof()helper to requestNTFTXtrailers via the existingOP_TFopcode (no new wire protocol). Receives to a per-IP temp file on disk for safety under the OMP parallel scan loop.In
scan_quorum(), each candidate peer is now:NTFTXtrailerssg_validate_proof_chain()sg_proof_store()for the Phase 3 tfile tail check.Effect: a peer advertising a fabricated
(weight, hash, bnum)triple that can't produce a self-consistentNTFTXproof segment is rejected immediately.Phase 3:
src/sync.c— tfile spot-check and bad-tfile caching inresync()The tfile acquisition loop now:
bnum * sizeof(BTRAILER)offset in the downloaded tfile and byte-compare against the cached proof from Phase 2. Mismatch = the peer served a different chain than it advertised. Add the tfile hash to the bad cache and drop the peer.validate_tfile()+validate_tfile_pow()checks. On any failure, add the tfile hash to the bad cache and drop the peer.(bnum, weight)is actually met by the tfile. On mismatch, cache and drop.Previously, any tfile failure immediately returned
VERRORfromresync(). Now the loop iterates through quorum members, giving the node a chance to find a good tfile without exiting the process.Phase 4:
src/sync.c,src/sync.h,src/bin/mochimo.c— chain re-selectionReplaces
restart()calls inresync()'sgettfileandgetneophases withsg_bad_chain_add()+return VERROR. The bootstrap loop inmochimo.cthen re-scans, andscan_quorum()(which already consultssg_bad_chain_check()from Phase 2) skips the excluded chain. Adds ahighhashparameter toresync()so it knows which chain to mark bad.restart()exit(1)'d the process and lost all session caches. The new behavior keeps the caches alive across chain-selection retries.Design fix: proof matching by bnum offset
Initially Phase 3 compared the proof to the tfile's tail, but the chain may advance between
scan_quorum()andresync(). Since the tfile stores trailers in bnum-continuous order (trailer at byte offsetN * sizeof(BTRAILER)hasbnum == N), the proof should be sought atproof[0].bnum * sizeof(BTRAILER)and matched in place. The tfile may have advanced past the proof's tip; we only care that the proof's own historical range matches byte-exactly.Threat model
validate_tfile_pow()if PoW is realb_update()rejects bad blocks but wastes bandwidthDeferred follow-ups
This PR is intentionally conservative on the following, which can be handled separately:
Verification
make NO_CUDA=1passes on gcc-13 / Ubuntu WSL (all warnings-as-errors).qualified (proof verified)log messagestfile.dat is validandmatches advertised bnum and weightcatchup()progressed through blocks applyingUpdate-block/Pseudo-blocknormallyTest plan
make testpasses