Skip to content

sync: harden SQLite durability and stop blaming tokens for network blips#326

Merged
wesm merged 1 commit into
kenn-io:mainfrom
robelkin:worktree-sqlite-durability-oauth-errors
May 24, 2026
Merged

sync: harden SQLite durability and stop blaming tokens for network blips#326
wesm merged 1 commit into
kenn-io:mainfrom
robelkin:worktree-sqlite-durability-oauth-errors

Conversation

@robelkin
Copy link
Copy Markdown
Contributor

What changed

  • store: SQLite DSN switched from synchronous=NORMAL to synchronous=FULL + fullfsync=true. NORMAL only protects against process crashes; FULL protects against OS/power crashes too. fullfsync is the macOS-only fcntl that forces drive cache flush; no-op elsewhere.
  • serve: runScheduledSync now distinguishes transient network errors (DNS lookup timeout, dial timeout, no route) from real auth errors when refreshing tokens. Old code suggested re-authorization on every failure.
  • scheduler: Transient-network sync failures are logged at WARN instead of ERROR.
  • syncerr: New shared package with IsTransientNetwork(error) bool. Checks typed *net.DNSError / *net.OpError / net.Error (Timeout()) and falls back to substring matching for libraries that wrap with %v instead of %w (notably golang.org/x/oauth2).

Why

Log analysis of a 24/7 laptop msgvault serve showed ~1,200 sync failures attributed to "token may be expired" — but only 14 of those were actual oauth2: invalid_grant. The other ~98% were DNS lookup timeouts after sleep/wake or Wi-Fi flaps. The misleading error message sent the user to re-authorize valid tokens.

The same laptop was hitting recurring database disk image is malformed errors in sync_runs (the most write-heavy table). NORMAL + WAL is durable against process crashes but not OS-level crashes / power loss, which a laptop daemon experiences regularly.

How to use

No flags. Behavior changes automatically:

  • New SQLite connections write more durably.
  • Scheduler logs and daemon error messages classify network errors separately from auth errors.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 20, 2026

roborev: Combined Review (18b10b7)

No Medium, High, or Critical issues found.

All reviewers reported the change as clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 20, 2026

roborev: Combined Review (d5c0ae3)

No Medium, High, or Critical findings were reported.

All reviewed issues are below the reporting threshold. The PR appears clean for Medium+ severity findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 21, 2026

roborev: Combined Review (d5c0ae3)

No Medium, High, or Critical findings were reported.

All agents that provided substantive feedback found the code clean; no PR comment findings to include.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

… blips

- store: switch SQLite DSN to synchronous=FULL + fullfsync=true. NORMAL only
  protects against process crashes; FULL protects against OS/power crashes.
  Matters for `msgvault serve` on laptops (sleep/wake, forced reboots, OOM)
  where torn writes have been corrupting sync_runs — the most write-heavy
  table. Write volume is tiny, so the durability cost is negligible.
- store: add OpenForTest with synchronous=OFF for fast test DBs.
  synchronous=FULL adds an fsync per commit, which on Windows CI runners
  pushed bulk-import tests like TestImportDYI_TimingTripwire past their
  30-second timing tripwires (~50s on the Windows job). Tests use ephemeral
  t.TempDir() databases that get discarded at exit, so OS-crash durability
  is irrelevant. openSQLite now takes a params string so the test path can
  opt into a fast DSN; testutil.NewTestStore calls store.OpenForTest while
  production callers continue through store.Open with the durable DSN.
- serve: distinguish transient network errors (DNS lookup timeout, dial
  timeout, no route) from real auth errors when refreshing tokens. The old
  message blamed the token on every failure, telling users to re-authorize
  perfectly valid tokens after a Wi-Fi flap.
- scheduler: downgrade transient-network sync failures from ERROR to WARN
  so a laptop wake-up doesn't paint dashboards red.
- syncerr: new shared package classifying transient network errors (typed
  net.DNSError/net.OpError plus substring fallback for libraries that wrap
  with %v instead of %w).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wesm wesm force-pushed the worktree-sqlite-durability-oauth-errors branch from d5c0ae3 to ad3a5a8 Compare May 24, 2026 22:18
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 24, 2026

roborev: Combined Review (ad3a5a8)

No Medium, High, or Critical findings; the reviewed changes look clean.

All reported issues were Low severity, so they are omitted per the requested threshold.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm wesm merged commit 1d53b4b into kenn-io:main May 24, 2026
6 checks passed
@wesm
Copy link
Copy Markdown
Member

wesm commented May 24, 2026

thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants