feat(sync): quorum/sync subsystem hardening (proof spot-check, bad-tfile cache, chain re-selection) #150

Open: adequatelimited wants to merge 5 commits into master from feature/quorum-sync-hardening

@adequatelimited (Collaborator) commented Apr 13, 2026

Summary

Subsystem overhaul hardening the node sync flow against fabricated or corrupt peer advertisements. Builds on F-09 (#148): where that change ensured quorum members share the same chain, this change ensures the chain they advertise is actually the chain they serve, and that the node degrades gracefully when it is not.

Scope is intentionally narrow: only src/network.c, src/sync.c, src/bin/mochimo.c, and two new files (src/syncguard.{h,c}). Consensus rules, transaction validation, PoW, ledger format, and all other subsystems are unchanged.

Commits

  • bc9b1cb — Phase 1: data structures and session caches (syncguard.{h,c})
  • 1125f13 — Phase 2: proof spot-check in scan_quorum()
  • ca39658 — Phase 3: tfile spot-check and bad-tfile caching in resync()
  • 44d3549 — Phase 4: chain re-selection on quorum exhaustion
  • 628b45e — fix(syncguard): spot-check proof by bnum offset, not tfile tail

(Plus the F-09 commit 0411f05 at the base, already merged to master via #149.)

Why this overhaul

The F-09 narrow fix solved the quorum assembly determinism problem but left several gaps that the audit identified during review:

| Gap | State before this PR | Impact |
|---|---|---|
| Pre-download proof spot-check | None; peers self-report weight/hash/bnum, trusted up to tfile validation | Malicious peer can waste node's bandwidth serving invalid tfiles |
| Known-bad tfile caching | None; a failed tfile is discarded, the next quorum member's tfile is re-downloaded and re-validated from scratch | Repeated validation cost for the same bad tfile distributed across multiple coordinated attackers |
| Chain re-selection on failure | None; quorum members are removed one-by-one until empty, then restart() (exit) | Infinite sync loop if all quorum members share a corrupt chain |
| Fabricated-weight defense | None; scan trusts peer-reported weight until tfile download fails | Sybil attack can stall node sync indefinitely |

What changed

Phase 1: src/syncguard.{h,c} — data structures and session caches

Session-local in-memory caches:

  • Bad-chain exclusion list: (weight, hash) pairs that failed sync this session
  • Bad-tfile hash cache: SHA-256 hashes of tfiles that failed any validation step
  • Per-peer proof segment cache: last NTFTX trailers validated for each quorum candidate, keyed by IP

All caches are zero-initialized at process start. They persist across resync() retries within the same process, so the node does not repeat failures against the same bad actors.
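
A minimal sketch of how a session-local bad-chain exclusion list could look. The names (demo_bad_chain_add, demo_bad_chain_check), the 32-byte field widths, and the fixed capacity are assumptions for illustration only; they are not the signatures or layout used in src/syncguard.c.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_HASHLEN 32   /* assumed width of weight and hash, in bytes */
#define DEMO_MAX_BAD 64   /* assumed capacity of the exclusion list */

/* One excluded chain, identified by its advertised (weight, hash) pair. */
typedef struct {
    unsigned char weight[DEMO_HASHLEN];
    unsigned char hash[DEMO_HASHLEN];
} demo_bad_chain;

/* Static storage is zero-initialized at process start and lives until exit. */
static demo_bad_chain Demo_bad[DEMO_MAX_BAD];
static size_t Demo_nbad;

/* Return 1 if (weight, hash) is already excluded, else 0. */
int demo_bad_chain_check(const unsigned char *weight, const unsigned char *hash)
{
    size_t i;

    for (i = 0; i < Demo_nbad; i++) {
        if (memcmp(Demo_bad[i].weight, weight, DEMO_HASHLEN) == 0 &&
            memcmp(Demo_bad[i].hash, hash, DEMO_HASHLEN) == 0) return 1;
    }
    return 0;
}

/* Record a failed chain. Return 0 on success, -1 if the list is full. */
int demo_bad_chain_add(const unsigned char *weight, const unsigned char *hash)
{
    if (demo_bad_chain_check(weight, hash)) return 0;   /* already listed */
    if (Demo_nbad >= DEMO_MAX_BAD) return -1;
    memcpy(Demo_bad[Demo_nbad].weight, weight, DEMO_HASHLEN);
    memcpy(Demo_bad[Demo_nbad].hash, hash, DEMO_HASHLEN);
    Demo_nbad++;
    return 0;
}
```

Because the list lives in static storage, entries survive any number of retries inside one process and disappear only when the process exits.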

Also provides:

  • sg_hash_file() — SHA-256 of a file on disk, for bad-tfile cache keys
  • sg_proof_match_tfile() — byte-exact comparison of cached proof against the corresponding bnum range in a downloaded tfile
  • sg_validate_proof_chain() — structural chain validation (bnum increments by 1, phash links, tip matches advertised). No PoW check; that is too expensive per-peer and is fully validated downstream in validate_tfile_pow().
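
The structural checks listed for sg_validate_proof_chain() can be sketched as below. This is an illustration under stated assumptions: demo_trailer is a simplified stand-in that carries only the three fields the checks need, not the real BTRAILER layout, and demo_validate_proof_chain() is a hypothetical name and signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASHLEN 32

/* Simplified, hypothetical trailer: only the fields the check needs. */
typedef struct {
    unsigned char phash[HASHLEN];   /* hash of the previous block */
    uint64_t bnum;                  /* block number */
    unsigned char bhash[HASHLEN];   /* hash of this block */
} demo_trailer;

/* Return 0 if the segment is self-consistent and its tip matches the
 * advertised (hash, bnum); return -1 otherwise. No PoW is checked. */
int demo_validate_proof_chain(const demo_trailer *tr, size_t count,
    const unsigned char *adv_hash, uint64_t adv_bnum)
{
    size_t i;

    if (tr == NULL || adv_hash == NULL || count == 0) return -1;
    for (i = 1; i < count; i++) {
        /* bnum must increase by exactly 1 between neighbours */
        if (tr[i].bnum != tr[i - 1].bnum + 1) return -1;
        /* each trailer must link back to the previous block hash */
        if (memcmp(tr[i].phash, tr[i - 1].bhash, HASHLEN) != 0) return -1;
    }
    /* the tip must be the block the peer advertised */
    if (tr[count - 1].bnum != adv_bnum) return -1;
    if (memcmp(tr[count - 1].bhash, adv_hash, HASHLEN) != 0) return -1;
    return 0;
}
```

A check like this is linear in the number of trailers and touches no hashing code, which is consistent with the stated goal of keeping the per-peer scan cheap.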

Phase 2: src/network.c — proof spot-check in scan_quorum()

Adds a get_tf_proof() helper that requests NTFTX trailers via the existing OP_TF opcode (no new wire protocol). The response is received into a per-IP temp file on disk, which keeps the fetch safe under the OMP parallel scan loop.

In scan_quorum(), each candidate peer is now:

  1. Checked against the bad-chain exclusion list (skip if known bad)
  2. Asked for its last NTFTX trailers
  3. Structurally validated via sg_validate_proof_chain()
  4. Admitted to the quorum only if proof-valid AND matching the current high-chain hash. Their validated proof is stored via sg_proof_store() for the Phase 3 tfile spot-check.

Effect: a peer advertising a fabricated (weight, hash, bnum) triple that can't produce a self-consistent NTFTX proof segment is rejected immediately.

Phase 3: src/sync.c — tfile spot-check and bad-tfile caching in resync()

The tfile acquisition loop now:

  1. Downloads tfile from a quorum member
  2. Hashes the downloaded file (SHA-256) and consults the bad-tfile cache — skip if a previous peer already served identical bad content
  3. Spot-check: seek to byte offset proof[0].bnum * sizeof(BTRAILER) in the downloaded tfile and byte-compare against the cached proof from Phase 2. A mismatch means the peer served a different chain than it advertised. Add the tfile hash to the bad cache and drop the peer.
  4. Existing validate_tfile() + validate_tfile_pow() checks. On any failure, add the tfile hash to the bad cache and drop the peer.
  5. Verify advertised (bnum, weight) is actually met by the tfile. On mismatch, cache and drop.

Previously, any tfile failure immediately returned VERROR from resync(). Now the loop iterates through quorum members, giving the node a chance to find a good tfile without exiting the process.
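
The control flow of the loop above can be sketched as a small simulation. Everything here is an assumption for illustration: demo_candidate carries precomputed outcomes in place of the real download, SHA-256, spot-check, and validate_tfile() calls, and demo_acquire_tfile() is a hypothetical name, not a function in src/sync.c.

```c
#include <stddef.h>
#include <string.h>

#define HASHLEN 32
#define MAX_BAD_TFILES 16   /* assumed cache capacity */

/* One quorum member's download, with check outcomes precomputed for the demo. */
typedef struct {
    unsigned char tfile_hash[HASHLEN];   /* SHA-256 of the downloaded tfile */
    int proof_ok;                        /* 1 if the proof spot-check matches */
    int valid;                           /* 1 if full tfile validation passes */
} demo_candidate;

static unsigned char Bad_tfile[MAX_BAD_TFILES][HASHLEN];
static size_t Nbad_tfile;

static int bad_tfile_check(const unsigned char *h)
{
    size_t i;

    for (i = 0; i < Nbad_tfile; i++) {
        if (memcmp(Bad_tfile[i], h, HASHLEN) == 0) return 1;
    }
    return 0;
}

static void bad_tfile_add(const unsigned char *h)
{
    if (Nbad_tfile < MAX_BAD_TFILES && !bad_tfile_check(h)) {
        memcpy(Bad_tfile[Nbad_tfile++], h, HASHLEN);
    }
}

/* Try each quorum member in order. Return the index of the first member
 * whose tfile passes every check, or -1 if the quorum is exhausted.
 * *nvalidated counts how many full validations were actually run. */
int demo_acquire_tfile(const demo_candidate *q, size_t n, size_t *nvalidated)
{
    size_t i;

    *nvalidated = 0;
    for (i = 0; i < n; i++) {
        /* identical content already failed: skip without re-validating */
        if (bad_tfile_check(q[i].tfile_hash)) continue;
        /* cheap spot-check first, before the expensive validation */
        if (!q[i].proof_ok) { bad_tfile_add(q[i].tfile_hash); continue; }
        (*nvalidated)++;
        if (!q[i].valid) { bad_tfile_add(q[i].tfile_hash); continue; }
        return (int) i;
    }
    return -1;
}
```

With two members serving the same invalid tfile followed by one honest member, this sketch runs the full validation twice rather than three times: once for the first bad copy and once for the good tfile.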

Phase 4: src/sync.c, src/sync.h, src/bin/mochimo.c — chain re-selection

Replaces restart() calls in resync()'s gettfile and getneo phases with sg_bad_chain_add() + return VERROR. The bootstrap loop in mochimo.c then re-scans, and scan_quorum() (which already consults sg_bad_chain_check() from Phase 2) skips the excluded chain. Adds a highhash parameter to resync() so it knows which chain to mark bad.

Previously, restart() called exit(1), which terminated the process and discarded all session caches. The new behavior keeps the caches alive across chain-selection retries.
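
A simplified sketch of the re-selection loop described above. The types and names are hypothetical: demo_scan() stands in for scan_quorum(), the syncable flag stands in for the outcome of resync(), and chain weight is reduced to an integer. The actual bootstrap loop in src/bin/mochimo.c will differ.

```c
#include <stddef.h>

typedef struct {
    unsigned weight;   /* advertised chain weight (simplified to an integer) */
    int syncable;      /* 1 if a resync against this chain would succeed */
    int bad;           /* session-local exclusion flag */
} demo_chain;

/* Stand-in for scan_quorum(): pick the heaviest chain not yet excluded.
 * Returns the chain index, or -1 if every chain is excluded. */
static int demo_scan(const demo_chain *c, size_t n)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (c[i].bad) continue;
        if (best < 0 || c[i].weight > c[best].weight) best = (int) i;
    }
    return best;
}

/* Stand-in for the bootstrap loop: re-scan after each failed resync,
 * staying in the same process so the exclusion flags are preserved. */
int demo_bootstrap(demo_chain *c, size_t n)
{
    int pick;

    while ((pick = demo_scan(c, n)) >= 0) {
        if (c[pick].syncable) return pick;   /* resync succeeded */
        c[pick].bad = 1;                     /* mark bad instead of exiting */
    }
    return -1;   /* no candidate chains left */
}
```

The loop terminates because each failed attempt excludes one more chain, which is the property the earlier restart()-based flow lacked.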

Design fix: proof matching by bnum offset

Initially Phase 3 compared the proof to the tfile's tail, but the chain may advance between scan_quorum() and resync(). Since the tfile stores trailers in bnum-continuous order (trailer at byte offset N * sizeof(BTRAILER) has bnum == N), the proof should be sought at proof[0].bnum * sizeof(BTRAILER) and matched in place. The tfile may have advanced past the proof's tip; we only care that the proof's own historical range matches byte-exactly.
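
The offset-based match can be sketched as follows. This is an illustration, not the code in src/syncguard.c: demo_btrailer is a simplified stand-in for BTRAILER (byte arrays only, so the struct has no padding), bnum is assumed to be stored little-endian, and demo_proof_match_tfile() is a hypothetical name and signature. A production version would likely need a 64-bit seek (for example fseeko) rather than fseek with a long offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HASHLEN 32

/* Simplified, hypothetical on-disk trailer (byte arrays only, no padding). */
typedef struct {
    unsigned char phash[HASHLEN];
    unsigned char bnum[8];          /* block number, assumed little-endian */
    unsigned char bhash[HASHLEN];
} demo_btrailer;

/* Decode the little-endian block number of a trailer. */
static uint64_t demo_bnum(const demo_btrailer *t)
{
    uint64_t n = 0;
    int i;

    for (i = 7; i >= 0; i--) n = (n << 8) | t->bnum[i];
    return n;
}

/* Return 0 if proof[0..count) matches the tfile byte-exactly at the
 * proof's own bnum offset; -1 on mismatch, short file, or I/O error. */
int demo_proof_match_tfile(FILE *fp, const demo_btrailer *proof, size_t count)
{
    demo_btrailer cur;
    size_t i;
    long off;

    if (fp == NULL || proof == NULL || count == 0) return -1;
    /* trailer N sits at byte offset N * sizeof(trailer) */
    off = (long) (demo_bnum(&proof[0]) * sizeof(demo_btrailer));
    if (fseek(fp, off, SEEK_SET) != 0) return -1;
    for (i = 0; i < count; i++) {
        if (fread(&cur, sizeof(cur), 1, fp) != 1) return -1;
        if (memcmp(&cur, &proof[i], sizeof(cur)) != 0) return -1;
    }
    return 0;
}
```

Trailers that follow the proof's tip are never read, so a tfile that has advanced by a few blocks since the scan still passes.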

Threat model

| Attack | Current defense | Defense after this PR |
|---|---|---|
| Single malicious peer advertising fake weight | Tfile validation eventually rejects | Proof spot-check in Phase 2 rejects before tfile download |
| Sybil group at fabricated weight | Tfile validation eventually rejects but node may loop | Proof spot-check + bad-chain caching + chain re-selection |
| Coordinated bad tfile distribution | Repeated full validation of each bad copy | Bad-tfile hash cache short-circuits after first validation |
| Partial tfile with crafted tail | Passes validate_tfile_pow() if PoW is real | Proof segment byte-match catches tail-tampering |
| Bandwidth amplification via malicious catchup peer | b_update() rejects bad blocks but wastes bandwidth | Quorum membership now predicated on proof validation; malicious peers are filtered before catchup |

Deferred follow-ups

This PR is intentionally conservative on the following, which can be handled separately:

  • Plurality-based chain selection (design doc Stage 5): two-pass scan to pick the chain with the most peers at max weight. Strictly stronger than F-09 numerically-highest-hash tiebreaking during legitimate splits. Not implemented because (a) it adds scan latency and memory, (b) the other defenses in this PR already counter Sybil-inflated quorums, and (c) splits resolve naturally once block propagation continues. Left as future work if operators see split-related issues in practice.
  • Persistent (across-restart) bad-chain / bad-tfile caches with staleness controls.
  • Full PoW validation on proof segments (currently structural only to keep the scan fast).

Verification

  • Clean build: make NO_CUDA=1 passes on gcc-13 / Ubuntu WSL (all warnings-as-errors).
  • End-to-end smoke test on mainnet (with PoW bypass for speed; reverted before commit):
    • Multiple peers qualified, each logging a "qualified (proof verified)" message
    • Proof fetch and structural validation work
    • Tfile spot-check correctly rejected peers advertising a chain whose tfile didn't match
    • After the offset fix, a legitimate peer's tfile passed the spot-check
    • tfile.dat is valid and matches advertised bnum and weight
    • catchup() progressed through blocks applying Update-block/Pseudo-block normally

Test plan

  • Clean build on Ubuntu x64, macOS arm64, Ubuntu arm64 CI
  • make test passes
  • Live mainnet sync from a fresh working directory reaches steady-state
  • Behavior against a mock malicious peer advertising an unbuildable chain: rejected before tfile download (requires test harness)
  • Behavior against a mock malicious peer serving a tfile whose tail doesn't match its proof: tfile cached as bad, peer dropped, next peer tried

Commit bc9b1cb (Phase 1: data structures and session caches)

Introduces the sync-subsystem hardening module (src/syncguard.{h,c}),
used by later phases to add defense-in-depth to the node sync flow.

Provides session-local (in-memory) caches:
- Bad-chain exclusion list: (weight, hash) pairs that failed sync
- Bad-tfile hash cache: SHA-256 hashes of tfiles that failed validation
- Per-peer proof segment cache: last NTFTX trailers advertised by each
  quorum candidate, for byte-exact matching against the tfile they
  subsequently serve

All caches are cleared at the start of each resync() attempt via
sg_session_reset().

Also exposes:
- sg_hash_file(): SHA-256 of a file on disk, for bad-tfile cache keys
- sg_proof_match_tfile_tail(): byte-exact tail comparison
- sg_validate_proof_chain(): structural chain validation (no PoW)

No behavioral change in this commit — the caches and helpers are
defined but not yet called by any existing code path. That wiring
happens in Phases 2-4.

Commit 1125f13 (Phase 2: proof spot-check in scan_quorum())

Adds get_tf_proof() helper to fetch a peer's last NTFTX block trailers
via OP_TF, stored in a per-IP temp file on disk for parallel safety
across the scan_quorum() OMP loop.

scan_quorum() now, for each candidate peer:
  1. Captures the peer's advertised (weight, hash, bnum) in thread-local
     storage after the existing OP_GET_IPL handshake.
  2. Checks the bad-chain exclusion list (Phase 4 infrastructure) before
     any expensive proof work.
  3. Requests the peer's last NTFTX trailers via get_tf_proof().
  4. Validates the proof segment structurally with
     sg_validate_proof_chain():
        - bnum increments by 1 across consecutive trailers
        - phash of trailer[i+1] equals bhash of trailer[i]
        - tip bhash equals advertised hash
        - tip bnum equals advertised bnum
     PoW validation is intentionally skipped here; it is cost-prohibitive
     per-peer and is fully validated downstream in validate_tfile_pow().
  5. Only peers that pass the proof spot-check AND match the current
     high chain hash are admitted to the quorum. Their validated proof
     is stored via sg_proof_store() for Phase 3's tfile tail check.

Effect: a peer advertising a fabricated (weight, hash, bnum) triple
without also being able to produce a self-consistent NTFTX proof
segment is rejected from the quorum immediately, without the node
wasting bandwidth downloading their tfile.

Commit ca39658 (Phase 3: tfile spot-check and bad-tfile caching in resync())

Augments the tfile acquisition step in resync() with defense-in-depth:

  1. sg_session_reset() at resync() entry clears the per-session
     exclusion and proof caches established in Phases 1 and 2.

  2. After downloading a candidate tfile (tfile.tmp), hash it via
     SHA-256 and consult the bad-tfile cache. If the same content
     has already failed validation from another quorum member, skip
     this peer immediately without re-running validation.

  3. Tail spot-check: the last NTFTX trailers of the downloaded tfile
     are compared byte-exactly against the proof segment that this
     peer served during scan_quorum(). A mismatch means the peer
     served a different chain than it advertised — add the tfile
     hash to the bad-tfile cache and drop the peer from the quorum.

  4. If the existing validate_tfile() or validate_tfile_pow() checks
     fail, or the advertised (weight, bnum) is not met, add the
     tfile hash to the bad-tfile cache and continue to the next
     quorum member instead of aborting the entire sync.

Previously, any tfile failure immediately returned VERROR from
resync(); now the loop iterates through quorum members, giving the
node a chance to find a good tfile without exiting to restart().

Commit 44d3549 (Phase 4: chain re-selection on quorum exhaustion)

Previously, resync() called restart() when it exhausted all quorum
members for a given chain. restart() exit()s the process, which:
  1. Loses the session-local bad-chain and bad-tfile caches
  2. Forces the bootstrap process to start from scratch — potentially
     picking the same bad chain again if the malicious advertisements
     persist across the restart

This commit:
  - Adds a highhash parameter to resync() so the function knows which
    chain the quorum is targeting
  - On quorum exhaustion during gettfile OR getneo, marks the
    (weight, hash) pair bad via sg_bad_chain_add() and returns VERROR
    instead of calling restart()
  - scan_quorum() already consults sg_bad_chain_check() (wired in
    Phase 2), so a subsequent re-scan will skip this chain

The result: when a chain's quorum is exhausted, the bootstrap loop
in mochimo.c iterates to the next-best chain automatically, within
the same process (preserving the session caches from Phases 1-3).

Commit 628b45e (fix(syncguard): spot-check proof by bnum offset, not tfile tail)

Two bugs uncovered during end-to-end WSL smoke testing on mainnet:

1. sg_proof_match_tfile_tail() required the proof segment to sit
   byte-exactly at the TAIL of the downloaded tfile. If the peer's
   chain advanced between scan_quorum() (proof fetch) and resync()
   (full tfile download) — even by a single block — the proof would
   no longer be at the tail and the check would fire false positives.
   Since tfile trailers are stored in bnum-continuous order starting
   from genesis (trailer at byte offset N*sizeof(BTRAILER) has
   bnum==N), we can instead seek directly to proof[0].bnum's offset
   and compare the proof's historical trailers in place. The tfile
   may have advanced past the proof's tip, but the proof's own range
   is historical and cannot change without a reorg. Renamed to
   sg_proof_match_tfile() to reflect the new semantics.

2. resync() called sg_session_reset() at entry, which also cleared
   the proof cache that scan_quorum() had just populated. The proofs
   must persist across the scan→resync boundary. Session reset is
   no longer automatic on resync(); the session caches now persist
   for the lifetime of the process (zero-initialized at startup via
   static storage, overwritten naturally on re-scan). sg_session_reset()
   remains available as an exported helper.

Also downgraded diagnostic logs in sg_proof_match_tfile() from plog()
to pdebug() so they only appear at debug log level.

@adequatelimited (Collaborator, Author) commented:

This subsystem is a must-have for v3.1, so will need to be included in that code, but is not required for the current audit remediation cycle.
