feat(sync): quorum/sync subsystem hardening (proof spot-check, bad-tfile cache, chain re-selection) by adequatelimited · Pull Request #150 · mochimodev/mochimo

adequatelimited · 2026-04-13T02:41:23Z

Summary

Subsystem overhaul hardening the node sync flow against fabricated or corrupt peer advertisements. Builds on F-09 (#148): where that change ensured quorum members share the same chain, this change ensures the chain they advertise is actually the chain they serve, and that the node degrades gracefully when it is not.

Scope is intentionally narrow: only src/network.c, src/sync.c, src/bin/mochimo.c, and two new files (src/syncguard.{h,c}). Consensus rules, transaction validation, PoW, ledger format, and all other subsystems are unchanged.

Commits

bc9b1cb — Phase 1: data structures and session caches (syncguard.{h,c})
1125f13 — Phase 2: proof spot-check in scan_quorum()
ca39658 — Phase 3: tfile spot-check and bad-tfile caching in resync()
44d3549 — Phase 4: chain re-selection on quorum exhaustion
628b45e — fix(syncguard): spot-check proof by bnum offset, not tfile tail

(Plus the F-09 commit 0411f05 at the base, already merged to master via #149.)

Why this overhaul

The F-09 narrow fix solved the quorum assembly determinism problem but left several gaps that the audit identified during review:

| Gap | State before this PR | Impact |
| --- | --- | --- |
| Pre-download proof spot-check | None: peers self-report weight/hash/bnum, trusted up to tfile validation | Malicious peer can waste node's bandwidth serving invalid tfiles |
| Known-bad tfile caching | None: a failed tfile is discarded, the next quorum member's tfile is re-downloaded and re-validated from scratch | Repeated validation cost for the same bad tfile distributed across multiple coordinated attackers |
| Chain re-selection on failure | None: quorum members are removed one-by-one until empty, then `restart()` (exit) | Infinite sync loop if all quorum members share a corrupt chain |
| Fabricated-weight defense | None: scan trusts peer-reported weight until tfile download fails | Sybil attack can stall node sync indefinitely |

What changed

Phase 1: `src/syncguard.{h,c}` — data structures and session caches

Session-local in-memory caches:

- Bad-chain exclusion list: (weight, hash) pairs that failed sync this session
- Bad-tfile hash cache: SHA-256 hashes of tfiles that failed any validation step
- Per-peer proof segment cache: last NTFTX trailers validated for each quorum candidate, keyed by IP

All caches zero-init at process start. They persist across resync() retries within the same process so the node does not fall into repeated failures against the same bad actors.

Also provides:

- sg_hash_file(): SHA-256 of a file on disk, for bad-tfile cache keys
- sg_proof_match_tfile(): byte-exact comparison of cached proof against the corresponding bnum range in a downloaded tfile
- sg_validate_proof_chain(): structural chain validation (bnum increments by 1, phash links, tip matches advertised). No PoW check; that is too expensive per-peer and is fully validated downstream in validate_tfile_pow().
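The structural checks listed for sg_validate_proof_chain() can be sketched roughly as follows. This is an illustration only, not the code in src/syncguard.c: the `pv_trailer` struct, the `PV_HASHLEN` constant, and the `pv_validate_proof_chain()` signature are all hypothetical, and the trailer is reduced to the three fields the check needs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PV_HASHLEN 32  /* assumed hash width for this sketch */

/* Hypothetical, simplified trailer: only the fields the structural check needs. */
typedef struct {
    unsigned char phash[PV_HASHLEN];  /* hash of the previous block */
    uint64_t bnum;                    /* block number */
    unsigned char bhash[PV_HASHLEN];  /* hash of this block */
} pv_trailer;

/* Returns 0 if the segment is self-consistent and its tip matches the
 * advertised (hash, bnum); returns -1 otherwise. No PoW is checked. */
static int pv_validate_proof_chain(const pv_trailer *proof, size_t count,
                                   const unsigned char *adv_hash,
                                   uint64_t adv_bnum)
{
    size_t i;

    if (proof == NULL || adv_hash == NULL || count == 0) return -1;
    for (i = 1; i < count; i++) {
        /* bnum must increase by exactly 1 between consecutive trailers */
        if (proof[i].bnum != proof[i - 1].bnum + 1) return -1;
        /* each trailer must link to the hash of the one before it */
        if (memcmp(proof[i].phash, proof[i - 1].bhash, PV_HASHLEN) != 0) return -1;
    }
    /* the tip of the segment must be the block the peer advertised */
    if (proof[count - 1].bnum != adv_bnum) return -1;
    if (memcmp(proof[count - 1].bhash, adv_hash, PV_HASHLEN) != 0) return -1;
    return 0;
}
```

A check like this costs only a few comparisons per trailer, which is what makes it practical to run per peer during the scan while leaving PoW to the later tfile validation.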

Phase 2: `src/network.c` — proof spot-check in `scan_quorum()`

Adds get_tf_proof() helper to request NTFTX trailers via the existing OP_TF opcode (no new wire protocol). Receives to a per-IP temp file on disk for safety under the OMP parallel scan loop.

In scan_quorum(), each candidate peer is now:

1. Checked against the bad-chain exclusion list (skip if known bad)
2. Asked for its last NTFTX trailers
3. Structurally validated via sg_validate_proof_chain()
4. Admitted to the quorum only if proof-valid AND matching the current high-chain hash. The validated proof is stored via sg_proof_store() for the Phase 3 tfile spot-check.

Effect: a peer advertising a fabricated (weight, hash, bnum) triple that can't produce a self-consistent NTFTX proof segment is rejected immediately.

Phase 3: `src/sync.c` — tfile spot-check and bad-tfile caching in `resync()`

The tfile acquisition loop now:

1. Downloads tfile from a quorum member
2. Hashes the downloaded file (SHA-256) and consults the bad-tfile cache; skip if a previous peer already served identical bad content
3. Spot-check: seek to the proof's bnum * sizeof(BTRAILER) offset in the downloaded tfile and byte-compare against the cached proof from Phase 2. Mismatch = the peer served a different chain than it advertised. Add the tfile hash to the bad cache and drop the peer.
4. Existing validate_tfile() + validate_tfile_pow() checks. On any failure, add the tfile hash to the bad cache and drop the peer.
5. Verify advertised (bnum, weight) is actually met by the tfile. On mismatch, cache and drop.

Previously, any tfile failure immediately returned VERROR from resync(). Now the loop iterates through quorum members, giving the node a chance to find a good tfile without exiting the process.
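The control flow of that loop, and the effect of the bad-tfile cache on it, can be modeled with a small self-contained sketch. Everything here is a stand-in: `tf_peer`, `tf_acquire()`, and the fixed-size cache are hypothetical names, the digest is assumed to have been computed already (as sg_hash_file() would), and the `valid` flag collapses the spot-check, validate_tfile(), validate_tfile_pow(), and the (bnum, weight) check into a single outcome.

```c
#include <stddef.h>
#include <string.h>

#define TF_DIGEST_LEN 32  /* SHA-256 digest size */
#define TF_MAX_BAD 8      /* assumed cache capacity, for illustration */

/* Hypothetical model of one quorum member's served tfile. */
typedef struct {
    unsigned char digest[TF_DIGEST_LEN]; /* digest of the downloaded tfile */
    int valid;                           /* combined outcome of all validation steps */
} tf_peer;

/* Static storage: zero-initialized at process start. */
static unsigned char tf_bad[TF_MAX_BAD][TF_DIGEST_LEN];
static size_t tf_bad_count;

static int tf_is_bad(const unsigned char *digest)
{
    size_t i;
    for (i = 0; i < tf_bad_count; i++) {
        if (memcmp(tf_bad[i], digest, TF_DIGEST_LEN) == 0) return 1;
    }
    return 0;
}

/* Records a digest; silently drops it if the cache is full. */
static void tf_mark_bad(const unsigned char *digest)
{
    if (tf_bad_count < TF_MAX_BAD) {
        memcpy(tf_bad[tf_bad_count++], digest, TF_DIGEST_LEN);
    }
}

/* Returns the index of the first peer whose tfile validates, or -1.
 * *validations counts how many times the expensive validation ran. */
static int tf_acquire(const tf_peer *peers, size_t n, int *validations)
{
    size_t i;

    *validations = 0;
    for (i = 0; i < n; i++) {
        if (tf_is_bad(peers[i].digest)) continue; /* identical bad content: skip */
        (*validations)++;
        if (peers[i].valid) return (int)i;
        tf_mark_bad(peers[i].digest);             /* remember the failure, try next peer */
    }
    return -1;
}
```

With two peers serving the same bad tfile followed by one honest peer, this model runs validation twice rather than three times, which is the saving the cache is meant to provide.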

Phase 4: `src/sync.c`, `src/sync.h`, `src/bin/mochimo.c` — chain re-selection

Replaces restart() calls in resync()'s gettfile and getneo phases with sg_bad_chain_add() + return VERROR. The bootstrap loop in mochimo.c then re-scans, and scan_quorum() (which already consults sg_bad_chain_check() from Phase 2) skips the excluded chain. Adds a highhash parameter to resync() so it knows which chain to mark bad.

restart() called exit(1), which ended the process and discarded all session caches. The new behavior keeps the caches alive across chain-selection retries.
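A simplified model of the re-selection loop is sketched below, under several assumptions: chains are reduced to a 64-bit weight and a small integer id (the real weight and hash are wider), and `rs_scan()`, `rs_bootstrap()`, and `rs_demo_sync()` are hypothetical stand-ins for scan_quorum(), the bootstrap loop in mochimo.c, and resync() respectively.

```c
#include <stddef.h>
#include <stdint.h>

#define RS_MAX_BAD 8  /* assumed exclusion-list capacity */

/* Hypothetical, simplified chain identity. */
typedef struct {
    uint64_t weight;
    uint32_t hash_id;
} rs_chain;

/* Static storage: survives every retry within the same process. */
static rs_chain rs_bad[RS_MAX_BAD];
static size_t rs_bad_count;

static int rs_is_bad(const rs_chain *c)
{
    size_t i;
    for (i = 0; i < rs_bad_count; i++) {
        if (rs_bad[i].weight == c->weight && rs_bad[i].hash_id == c->hash_id) return 1;
    }
    return 0;
}

/* Returns 0 on success, -1 if the list is full. */
static int rs_mark_bad(const rs_chain *c)
{
    if (rs_bad_count >= RS_MAX_BAD) return -1;
    rs_bad[rs_bad_count++] = *c;
    return 0;
}

/* Index of the heaviest advertised chain not on the exclusion list, or -1. */
static int rs_scan(const rs_chain *adv, size_t n)
{
    int best = -1;
    size_t i;
    for (i = 0; i < n; i++) {
        if (rs_is_bad(&adv[i])) continue;
        if (best < 0 || adv[i].weight > adv[best].weight) best = (int)i;
    }
    return best;
}

/* Re-scan after each failed sync instead of exiting the process. */
static int rs_bootstrap(const rs_chain *adv, size_t n,
                        int (*try_sync)(const rs_chain *))
{
    int pick;
    while ((pick = rs_scan(adv, n)) >= 0) {
        if (try_sync(&adv[pick]) == 0) return pick; /* synced */
        if (rs_mark_bad(&adv[pick]) != 0) return -1; /* cannot exclude: give up */
    }
    return -1; /* every advertised chain has been excluded */
}

/* Demo stand-in for a sync attempt: weights above 100 are treated as fabricated. */
static int rs_demo_sync(const rs_chain *c)
{
    return c->weight <= 100 ? 0 : -1;
}
```

Because the exclusion list lives in static storage, each pass through the loop sees the failures recorded by earlier passes, so the loop terminates once every advertised chain has either synced or been excluded.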

Design fix: proof matching by bnum offset

Initially Phase 3 compared the proof to the tfile's tail, but the chain may advance between scan_quorum() and resync(). Since the tfile stores trailers in bnum-continuous order (trailer at byte offset N * sizeof(BTRAILER) has bnum == N), the proof should be sought at proof[0].bnum * sizeof(BTRAILER) and matched in place. The tfile may have advanced past the proof's tip; we only care that the proof's own historical range matches byte-exactly.
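The offset-based comparison can be sketched as follows. This is a minimal illustration rather than the actual sg_proof_match_tfile(): the `om_trailer` struct is a hypothetical stand-in for BTRAILER, the function takes an already-open `FILE *`, and the offset is limited to what fits in a `long` (real code handling large tfiles would likely need a 64-bit seek).

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified trailer; laid out so it contains no padding bytes. */
typedef struct {
    unsigned char phash[32];
    uint64_t bnum;
    unsigned char bhash[32];
} om_trailer;

/* Returns 0 if the proof's trailers appear byte-exactly at their own bnum
 * offsets in the tfile, -1 on mismatch, short file, or I/O error. */
static int om_proof_match_tfile(FILE *tf, const om_trailer *proof, size_t count)
{
    om_trailer got;
    size_t i;

    if (tf == NULL || proof == NULL || count == 0) return -1;
    /* Trailer N sits at byte offset N * sizeof(trailer); guard the multiply. */
    if (proof[0].bnum > (uint64_t)LONG_MAX / sizeof(om_trailer)) return -1;
    if (fseek(tf, (long)(proof[0].bnum * sizeof(om_trailer)), SEEK_SET) != 0) return -1;

    for (i = 0; i < count; i++) {
        /* A short read means the tfile does not cover the proof's range. */
        if (fread(&got, sizeof got, 1, tf) != 1) return -1;
        if (memcmp(&got, &proof[i], sizeof got) != 0) return -1;
    }
    /* Trailers after the proof's tip are not examined: the chain may have advanced. */
    return 0;
}
```

Matching in place means a tfile that has grown by a few blocks since the scan still passes, while any change inside the proof's historical range fails.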

Threat model

| Attack | Current defense | Defense after this PR |
| --- | --- | --- |
| Single malicious peer advertising fake weight | Tfile validation eventually rejects | Proof spot-check in Phase 2 rejects before tfile download |
| Sybil group at fabricated weight | Tfile validation eventually rejects but node may loop | Proof spot-check + bad-chain caching + chain re-selection |
| Coordinated bad tfile distribution | Repeated full validation of each bad copy | Bad-tfile hash cache short-circuits after first validation |
| Partial tfile with crafted tail | Passes `validate_tfile_pow()` if PoW is real | Proof segment byte-match catches tail-tampering |
| Bandwidth amplification via malicious catchup peer | `b_update()` rejects bad blocks but wastes bandwidth | Quorum membership now predicated on proof validation; malicious peers are filtered before catchup |

Deferred follow-ups

This PR is intentionally conservative on the following, which can be handled separately:

- Plurality-based chain selection (design doc Stage 5): two-pass scan to pick the chain with the most peers at max weight. Strictly stronger than F-09 numerically-highest-hash tiebreaking during legitimate splits. Not implemented because (a) it adds scan latency and memory, (b) the other defenses in this PR already counter Sybil-inflated quorums, and (c) splits resolve naturally once block propagation continues. Left as future work if operators see split-related issues in practice.
- Persistent (across-restart) bad-chain / bad-tfile caches with staleness controls.
- Full PoW validation on proof segments (currently structural only to keep the scan fast).

Verification

- Clean build: make NO_CUDA=1 passes on gcc-13 / Ubuntu WSL (warnings treated as errors).
- End-to-end smoke test on mainnet (with PoW bypass for speed; reverted before commit):
  - Multiple peers qualified, each logging a "qualified (proof verified)" message
  - Proof fetch and structural validation work
  - Tfile spot-check correctly rejected peers advertising a chain whose tfile didn't match
  - After the offset fix, a legitimate peer's tfile passed the spot-check
  - tfile.dat is valid and matches advertised bnum and weight
  - catchup() progressed through blocks applying Update-block/Pseudo-block normally

Test plan

- Clean build on Ubuntu x64, macOS arm64, Ubuntu arm64 CI
- make test passes
- Live mainnet sync from a fresh working directory reaches steady-state
- Behavior against a mock malicious peer advertising an unbuildable chain: rejected before tfile download (requires test harness)
- Behavior against a mock malicious peer serving a tfile whose tail doesn't match its proof: tfile cached as bad, peer dropped, next peer tried

Introduces the sync-subsystem hardening module (src/syncguard.{h,c}), used by later phases to add defense-in-depth to the node sync flow.

Provides session-local (in-memory) caches:
- Bad-chain exclusion list: (weight, hash) pairs that failed sync
- Bad-tfile hash cache: SHA-256 hashes of tfiles that failed validation
- Per-peer proof segment cache: last NTFTX trailers advertised by each quorum candidate, for byte-exact matching against the tfile they subsequently serve

All caches are cleared at the start of each resync() attempt via sg_session_reset().

Also exposes:
- sg_hash_file(): SHA-256 of a file on disk, for bad-tfile cache keys
- sg_proof_match_tfile_tail(): byte-exact tail comparison
- sg_validate_proof_chain(): structural chain validation (no PoW)

No behavioral change in this commit: the caches and helpers are defined but not yet called by any existing code path. That wiring happens in Phases 2-4.

Adds get_tf_proof() helper to fetch a peer's last NTFTX block trailers via OP_TF, stored in a per-IP temp file on disk for parallel safety across the scan_quorum() OMP loop.

scan_quorum() now, for each candidate peer:
1. Captures the peer's advertised (weight, hash, bnum) in thread-local storage after the existing OP_GET_IPL handshake.
2. Checks the bad-chain exclusion list (Phase 4 infrastructure) before any expensive proof work.
3. Requests the peer's last NTFTX trailers via get_tf_proof().
4. Validates the proof segment structurally with sg_validate_proof_chain():
   - bnum increments by 1 across consecutive trailers
   - phash of trailer[i+1] equals bhash of trailer[i]
   - tip bhash equals advertised hash
   - tip bnum equals advertised bnum

   PoW validation is intentionally skipped here; it is cost-prohibitive per-peer and is fully validated downstream in validate_tfile_pow().
5. Only peers that pass the proof spot-check AND match the current high chain hash are admitted to the quorum. Their validated proof is stored via sg_proof_store() for Phase 3's tfile tail check.

Effect: a peer advertising a fabricated (weight, hash, bnum) triple without also being able to produce a self-consistent NTFTX proof segment is rejected from the quorum immediately, without the node wasting bandwidth downloading their tfile.

Augments the tfile acquisition step in resync() with defense-in-depth:
1. sg_session_reset() at resync() entry clears the per-session exclusion and proof caches established in Phases 1 and 2.
2. After downloading a candidate tfile (tfile.tmp), hash it via SHA-256 and consult the bad-tfile cache. If the same content has already failed validation from another quorum member, skip this peer immediately without re-running validation.
3. Tail spot-check: the last NTFTX trailers of the downloaded tfile are compared byte-exactly against the proof segment that this peer served during scan_quorum(). A mismatch means the peer served a different chain than it advertised; add the tfile hash to the bad-tfile cache and drop the peer from the quorum.
4. If the existing validate_tfile() or validate_tfile_pow() checks fail, or the advertised (weight, bnum) is not met, add the tfile hash to the bad-tfile cache and continue to the next quorum member instead of aborting the entire sync.

Previously, any tfile failure immediately returned VERROR from resync(); now the loop iterates through quorum members, giving the node a chance to find a good tfile without exiting to restart().

Previously, resync() called restart() when it exhausted all quorum members for a given chain. restart() exit()s the process, which:
1. Loses the session-local bad-chain and bad-tfile caches
2. Forces the bootstrap process to start from scratch, potentially picking the same bad chain again if the malicious advertisements persist across the restart

This commit:
- Adds a highhash parameter to resync() so the function knows which chain the quorum is targeting
- On quorum exhaustion during gettfile OR getneo, marks the (weight, hash) pair bad via sg_bad_chain_add() and returns VERROR instead of calling restart()
- scan_quorum() already consults sg_bad_chain_check() (wired in Phase 2), so a subsequent re-scan will skip this chain

The result: when a chain's quorum is exhausted, the bootstrap loop in mochimo.c iterates to the next-best chain automatically, within the same process (preserving the session caches from Phases 1-3).

Two bugs uncovered during end-to-end WSL smoke testing on mainnet:

1. sg_proof_match_tfile_tail() required the proof segment to sit byte-exactly at the TAIL of the downloaded tfile. If the peer's chain advanced between scan_quorum() (proof fetch) and resync() (full tfile download), even by a single block, the proof would no longer be at the tail and the check would fire false positives. Since tfile trailers are stored in bnum-continuous order starting from genesis (trailer at byte offset N*sizeof(BTRAILER) has bnum==N), we can instead seek directly to proof[0].bnum's offset and compare the proof's historical trailers in place. The tfile may have advanced past the proof's tip, but the proof's own range is historical and cannot change without a reorg. Renamed to sg_proof_match_tfile() to reflect the new semantics.
2. resync() called sg_session_reset() at entry, which also cleared the proof cache that scan_quorum() had just populated. The proofs must persist across the scan→resync boundary. Session reset is no longer automatic on resync(); the session caches now persist for the lifetime of the process (zero-initialized at startup via static storage, overwritten naturally on re-scan). sg_session_reset() remains available as an exported helper.

Also downgraded diagnostic logs in sg_proof_match_tfile() from plog() to pdebug() so they only appear at debug log level.

adequatelimited · 2026-04-13T04:30:53Z

This subsystem is a must-have for v3.1, so will need to be included in that code, but is not required for the current audit remediation cycle.

adequatelimited added 5 commits April 12, 2026 21:28

adequatelimited force-pushed the master branch from ba86e0b to 45ec896 Compare April 13, 2026 03:14

adequatelimited force-pushed the feature/quorum-sync-hardening branch from 628b45e to 5eff1b0 Compare April 13, 2026 03:17

adequatelimited wants to merge 5 commits into master from feature/quorum-sync-hardening
