Skip to content

Add Facebook Messenger DYI import#281

Merged
wesm merged 2 commits into
wesm:mainfrom
jesserobbins:jesse/fbmessenger
May 6, 2026
Merged

Add Facebook Messenger DYI import#281
wesm merged 2 commits into
wesm:mainfrom
jesserobbins:jesse/fbmessenger

Conversation

@jesserobbins
Copy link
Copy Markdown
Contributor

Summary

  • Adds msgvault import-messenger command for Facebook "Download Your Information" exports
  • Supports JSON, HTML, and E2EE flat-file formats with auto-detection per thread
  • Handles both old messages/ and newer your_activity_across_facebook/messages/ DYI layouts
  • Decodes Facebook's Latin-1-over-UTF-8 mojibake in JSON exports transparently
  • Ingests attachments (photos, videos, stickers, audio) into content-addressed storage
  • Preserves reactions as relational rows and in message body for FTS
  • Resumable checkpoints every 50 threads
  • Removes hard-coded message_type = 'email' filter from query engines so all archived message types participate in search and aggregation

~4,700 lines across 44 files. Core importer is ~2,100 lines with ~1,600 lines of tests.

Closes #280. Related: #136, #192, #278.

🤖 Generated with Claude Code

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (791264b)

Summary verdict: one Medium issue remains; no High or Critical findings were reported.

Medium

  • internal/fbmessenger/slug.go:64 - DecodeMojibake can corrupt already-valid Latin-1-range Unicode strings. For example, valid UTF-8 text like café may be converted into invalid UTF-8 bytes instead of being preserved, affecting message bodies, participant names, titles, and E2EE exports.
    • Fix: only apply the Latin-1-to-UTF-8 repair when the converted result is valid UTF-8 and appears to actually fix mojibake; otherwise return the original string.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

I don't love the roborev loops but here we are.

Comment thread CLAUDE.md
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (630a87f)

Medium issue found: Facebook Messenger JSON attachment resolution misses the your_facebook_activity/messages export layout.

Medium

  • internal/fbmessenger/json_parser.go around resolveAttachmentURI
    • JSON attachment resolution does not handle the your_facebook_activity/messages layout that Discover explicitly supports.
    • For exports in that layout, URIs like messages/inbox/.../photos/x.jpg are only tried under absRoot/your_activity_across_facebook/... and absRoot/..., so existing attachments under absRoot/your_facebook_activity/messages/... are recorded as missing and not copied.
    • Fix: add filepath.Join(absRoot, "your_facebook_activity", cleaned) to the candidate paths and cover it with a fixture/test for the alternate layout plus media.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins jesserobbins marked this pull request as ready for review April 21, 2026 20:04
@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented Apr 21, 2026

Addressed @trel feedback & collapsed reviews into features and fixes.

Also tested against my facebook archives.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (14a00f6)

I’ll consolidate the reviews into a PR-ready markdown comment, keeping only medium-or-higher findings and merging duplicates.
Medium-risk issues found in the Messenger import path; security review found no additional concerns.

Medium

  • internal/fbmessenger/importer.go:139
    Gracefully interrupted imports are marked failed, but resume only reads GetActiveSync, so the checkpoint saved before FailSync is ignored on the next run despite the CLI saying to rerun to continue.
    Fix: Resume from the latest checkpointed Messenger sync run, including failed interrupted runs, or avoid marking user-canceled imports failed in a way that hides their checkpoint.

  • internal/fbmessenger/html_parser.go:331
    Remaining HTML images are assigned to the first message without an attachment, regardless of where the <img> appeared, so photos in later messages can be stored on the wrong message.
    Fix: Track image positions while parsing message blocks, or parse per-message DOM containers so attachments stay associated with their containing message.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (f446440)

Findings: one medium issue needs fixing before merge.

Medium

  • internal/fbmessenger/importer.go:118 - ImportDYI lowercases opts.Format but does not validate it. An invalid CLI value such as --format jsno falls through the format switch, imports no normal JSON/HTML threads, and still returns success with zero messages.
    • Fix: Validate format against auto, json, html, and both before discovery/import, and return an error for unknown values.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (b39f436)

Medium risk: one issue found that can cause Facebook Messenger thread/message ID collisions.

Medium

  • internal/fbmessenger/importer.go:498
    source_message_id is based only on the thread directory name and per-thread message index. If a Facebook export contains the same thread directory basename in multiple sections, or across multiple discovered layout roots, those messages can collide under the same source and conversation ID, causing one thread’s messages to overwrite or dedupe the other.

    Fix: Include a stable section/layout discriminator in the conversation ID and source_message_id, or otherwise ensure discovered thread identities are globally unique within the source.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 22, 2026

roborev: Combined Review (de0afdc)

PR adds Facebook Messenger import support, but one resume-checkpoint bug can cause missed messages on repeat imports.

Medium

  • Location: internal/store/sync.go:240
    • Problem: GetLatestCheckpointedSync resumes from any sync run with cursor_before, including completed runs. Because successful imports save per-thread checkpoints, a later import of the same export root can treat the previous completed run as resumable and skip already-known threads, missing new messages added to existing threads.
    • Fix: Only resume from interrupted/running or failed sync runs, or clear cursor_before on successful completion. Add a regression test that imports a root, adds a message to an existing thread, then re-imports and verifies the new message is imported.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented Apr 23, 2026

Interesting scenario. Hard to design a test for. I don't think people are going to merge in these very often. Will see what a fix looks like.

@jesserobbins
Copy link
Copy Markdown
Contributor Author

I will add this is already useful for me.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (b8f4ded)

Verdict: Changes introduce two compile-breaking regressions and one medium-severity parsing bug; not ready to merge.

High

  • cmd/msgvault/cmd/import_messenger.go (around lines 50, 106-108, 141): The new command calls MustBeLocal(...) and rebuildCacheAfterWrite(...), but those helpers are not defined in the patch. This is a compile-time failure. Add the helpers in this branch or replace them with existing local-mode/cache rebuild logic already used elsewhere.

  • internal/fbmessenger/importer.go (around lines 455-720), internal/store/sources.go (around lines 137, 152): The importer references store APIs/fields that do not exist in the current store layer, including EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, LinkMessageLabel, UpsertReaction, RecomputeConversationStats, and s.dialect. This is a second compile-time break. Either implement the missing store APIs/field in the same PR or switch the importer to existing store methods.

Medium

  • internal/fbmessenger/html_parser.go (around lines 314-338): parseHTMLLines treats any body line that exactly matches a participant name as the start of a new sender block before confirming a timestamp. A legitimate message whose content is exactly a participant name can therefore be misparsed, causing the real message to be skipped and an empty message to be attributed instead. The parser should avoid using participant-name matches as an early delimiter and continue until a timestamp is found, or parse message containers directly from the DOM.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

The two High findings in the combined review are false positives — every cited symbol exists in the tree. go build ./... and go test ./... both pass on b8f4ded.

MustBeLocal / rebuildCacheAfterWrite:

  • cmd/msgvault/cmd/store_resolver.go:81func MustBeLocal(cmdName string) error
  • cmd/msgvault/cmd/build_cache.go:796func rebuildCacheAfterWrite(dbPath string)

Store APIs the importer supposedly references that don't exist:

  • internal/store/messages.go:1065EnsureConversationWithType
  • internal/store/messages.go:1190EnsureConversationParticipant
  • internal/store/messages.go:1205UpsertMessageRawWithFormat
  • internal/store/messages.go:756LinkMessageLabel
  • internal/store/messages.go:1197UpsertReaction
  • internal/store/messages.go:1032RecomputeConversationStats

s.dialect:

  • internal/store/store.go:32Store has a dialect Dialect field set in both New and Open constructors

All of these also exist on upstream/main (confirmed via git show upstream/main:<path>), so this isn't a case of a diff being reviewed against a stale base. The reviewer is hallucinating API absence — possibly by scoping to recent commits and seeing call sites without their definitions.

The Medium parseHTMLLines finding about participant-name-as-body-line is a known tradeoff that also came up on iter 1 of the local refine cycle (roborev job 206). The fix I landed there advances i to the next sender candidate when no timestamp is found, rather than the proposed tighter-window approach — the suggested window-tighten doesn't actually solve the "body line equals participant name" case (a real timestamp often still lies within 64 lines), and moving to full per-message DOM container parsing is a bigger rewrite than this PR scopes. Worth tracking but leaving in this PR.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (6a28ac0)

Overall verdict: Changes are not ready to merge due to one build-breaking issue and several medium-severity correctness issues.

High

  • Build break from missing store APIs
    Location: internal/fbmessenger/importer.go (writeThreadToStore / end-of-import stats update)
    Problem: The importer calls store methods that are not implemented in the current repo/change set: EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, LinkMessageLabel, UpsertReaction, RecomputeConversationStats. As written, this change does not build.
    Fix: Add the missing store APIs in the same change, or rewrite the importer to use existing store methods and supported data paths.

Medium

  • First-thread checkpoints are not resumable
    Location: internal/fbmessenger/importer.go (resume restore block around if prior.ThreadIndex > 0)
    Problem: Resume is only enabled when ThreadIndex > 0. A checkpoint saved while still processing the first thread has ThreadIndex == 0, so it is ignored and the rerun restarts that thread from scratch.
    Fix: Treat any valid checkpoint as resumable, including ThreadIndex == 0, and add a test covering interruption during the first thread.

  • Failed threads can be skipped on resume
    Location: internal/fbmessenger/importer.go (main thread loop after importThread(...))
    Problem: The checkpoint advances to threadIdx+1 even after a hard per-thread error. If a thread partially writes and then fails, a resumed run will skip that thread instead of retrying it, leaving the import incomplete.
    Fix: Only advance the checkpoint after a thread finishes successfully or is safely skipped; on hard failures, keep the checkpoint on the current thread or abort.

  • Cross-dialect SQL binding bug
    Location: internal/store/sync.go in GetLatestCheckpointedSync
    Problem: The query uses a positional ? placeholder but is not wrapped in s.dialect.Rebind(...), unlike other store queries. This can break on dialects such as PostgreSQL that require $1-style placeholders.
    Fix: Wrap the query string in s.dialect.Rebind(...).


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Triaged the combined review; fixes for the two real Mediums pushed as 1816893.

High (missing store APIs) — false positive, same as the previous pass. Every symbol cited is defined and builds clean locally and in CI (test + nix-build both SUCCESS). The roborev status check on this PR is also SUCCESS, contradicting the comment body. Call-site/definition references:

  • MustBeLocalcmd/msgvault/cmd/store_resolver.go:81
  • rebuildCacheAfterWritecmd/msgvault/cmd/build_cache.go:796
  • EnsureConversationWithTypeinternal/store/messages.go:1065
  • EnsureConversationParticipantinternal/store/messages.go:1190
  • UpsertMessageRawWithFormatinternal/store/messages.go:1205
  • LinkMessageLabelinternal/store/messages.go:756
  • UpsertReactioninternal/store/messages.go:1197
  • RecomputeConversationStatsinternal/store/messages.go:1032
  • s.dialectinternal/store/store.go:32 (populated in New and Open constructors)

Medium: first-thread checkpoints not resumable — fixed. The prior guard if prior.ThreadIndex > 0 ignored mid-first-thread checkpoints. Functionally no data loss (source_message_id dedup covers retry, outer loop already starts at threadIdx=0), but WasResumed stayed false, the "resuming" log didn't fire, and cumulative counters didn't carry forward. Guard removed; any well-formed checkpoint with matching RootDir is resumable. Added TestImportDYI_ResumeFromFirstThreadCheckpoint.

Medium: failed threads skipped on resume — fixed. Outer thread loop no longer advances the per-thread checkpoint on hard error. It logs, sets HardErrors=true, and continues without advancing — so a transient failure retries on the next run (source_message_id dedup keeps it idempotent), and a persistent failure surfaces cross-run via HardErrors=true. The user can rerun with --no-resume to force past a stuck thread if needed.

Medium: GetLatestCheckpointedSync cross-dialect binding — false positive. s.db is a *loggedDB wrapper (internal/store/db_logger.go:78) whose QueryRow/Query/Exec methods all call d.rebind(query) before delegating. The store's New constructor installs dialect.Rebind as that rebinder (store.go:104). So every ? placeholder in sync.go, messages.go, etc. is rewritten to $1-style at runtime on Postgres. The explicit s.dialect.Rebind(...) calls in sources.go:137,152 are redundant-but-harmless; the rest of the store relies on the wrapper, and that's intentional. GetLatestCheckpointedSync works fine on Postgres.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 23, 2026

roborev: Combined Review (1816893)

Verdict: Changes are not ready to merge due to one compile-blocking issue and several correctness/regression risks.

High

  • internal/fbmessenger/importer.go (writeThreadToStore)
    The importer calls store methods that are not present in this diff: EnsureConversationWithType, EnsureConversationParticipant, UpsertMessageRawWithFormat, and UpsertReaction. As written, this will not compile.
    Fix: Add these methods in internal/store or switch the importer to existing store APIs.

Medium

  • internal/fbmessenger/importer.go (checkpoint save after successful thread)
    After a hard thread error, the loop continues, and a later successful thread can advance the checkpoint past the failed thread. If the run stops after that, resume will start after the checkpoint and permanently skip the failed thread.
    Fix: Only checkpoint the highest contiguous successfully processed thread, or abort on hard thread errors.

  • cmd/msgvault/cmd/build_cache_messenger_test.go:74
    The test queries message_type from message Parquet files, but buildCache does not export m.message_type in the messages Parquet schema. This will fail with a missing-column error, and cached analytics cannot distinguish Messenger rows by type.
    Fix: Export m.message_type in the messages Parquet output and add the corresponding DuckDB Parquet CTE cast if it will be queried.

  • internal/fbmessenger/discover.go:222
    dec.Decode(&skip) with skip as json.RawMessage buffers the entire skipped subtree in memory. If a very large "messages" array appears before "participants", this lightweight streaming probe can still cause a major memory spike.
    Fix: Skip the subtree with a token-based depth counter using dec.Token() instead of decoding into json.RawMessage.

  • internal/query/duckdb.go (lines 647, 850) and internal/query/sqlite.go (lines 188, 258)
    The hardcoded message_type = 'email' filter was removed. If no replacement filtering was introduced, existing email-scoped query contexts may now return all message types, changing behavior and potentially breaking UI expectations.
    Fix: Confirm whether unified inbox behavior is intended; otherwise add an explicit message-type filter in AggregateOptions / MessageFilter.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Took a look at this against the PR head (1816893) and the commit roborev actually reviewed (6a28ac0). Two findings are stale, two are incorrect analysis.

High — missing store APIs: Incorrect at 6a28ac0. All six methods existed in internal/store/messages.go at that commit (LinkMessageLabel, RecomputeConversationStats, EnsureConversationWithType, EnsureConversationParticipant, UpsertReaction, UpsertMessageRawWithFormat). go build ./... passes on both commits.

Medium — Rebind in GetLatestCheckpointedSync: Incorrect at 6a28ac0. loggedDB auto-rebinds every query — see internal/store/db_logger.go:100,119 (query = d.rebind(query)), wired up at internal/store/store.go:104,146 via newLoggedDB(db, dialect.Rebind). Bare ? is the convention throughout the store; the suggested fix would double-rebind.

Medium — first-thread checkpoint not resumable: Was real at 6a28ac0, fixed in 1816893 (startThreadIdx = prior.ThreadIndex unconditionally).

Medium — failed thread skipped on resume: Was real at 6a28ac0, fixed in 1816893 (hard error path continues without advancing the checkpoint; source_message_id dedup keeps retry safe).

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Heads up — I've rebased the fbmessenger work onto #304 so it composes with the new account-identity / collections / dedup model. The post-rebase branch is up at jesserobbins/msgvault:jesse/fbmessenger-post-304 and includes a small follow-up to wire import-messenger into the auto-confirm-default-identity path the rest of the post-#304 importers use. If you want to take it for a spin alongside #304, that branch is the one to grab.

@wesm
Copy link
Copy Markdown
Owner

wesm commented May 6, 2026

i'm rebasing this

Import Facebook Messenger data from DYI exports (JSON and HTML formats),
E2EE flat-file exports, and multiple directory layouts including
your_activity_across_facebook, your_facebook_activity, and legacy
messages/ roots. Includes CLI command, MIME parsing, attachment
resolution, Discover/Parse pipeline, and query engine support for
messenger message types.

- fix: DecodeMojibake preserves valid UTF-8 with Latin-1-range chars
- test: use synthetic names in test fixtures
- fix: resolve attachments under your_facebook_activity export layout
- fix: resume interrupted Messenger imports and position HTML images correctly
- fix: reject unknown --format values in import-messenger
- merge upstream/main into jesse/fbmessenger
- fix: section-qualify thread IDs to prevent cross-section collisions
- Address review findings (job 114)
- fix: only resume from incomplete sync runs, not completed ones
- fbmessenger: orphan sender linkage, mid-thread limit, from-guard, HTML scan
- fbmessenger: stream-probe E2EE discovery and classify non-thread shapes
- fbmessenger: preserve sender metadata on re-import
- Merge remote-tracking branch 'upstream/main' into jesse/fbmessenger
- fbmessenger: resume from first-thread checkpoint; don't advance on failure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@wesm wesm force-pushed the jesse/fbmessenger branch from 1816893 to b317c8b Compare May 6, 2026 01:04
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 6, 2026

roborev: Combined Review (b317c8b)

High-risk issues remain: the PR has compile/cache/resume bugs plus medium import safety/discovery problems.

High

  • [internal/store/sources.go:165] s.dialect.Rebind(...) references a dialect field that does not exist on Store, so this will not compile. Use the existing s.Rebind(...) method here and at the later delete query, or introduce and initialize a real dialect field consistently.

  • [internal/store/sync.go:249] GetLatestCheckpointedSync filters to failed/running syncs before ordering, so an old failed checkpoint can remain resumable after a later successful import. Future imports may resume from that stale checkpoint and silently skip earlier threads/messages. Choose the latest sync first and resume only if that latest run is failed/running, or ignore failed checkpoints older than the latest completed sync for the source.

  • [cmd/msgvault/cmd/build_cache_messenger_test.go:72] The new cache test expects message_type in the messages Parquet output, but buildCache does not appear to export that column. The query will fail, and cache consumers cannot distinguish Messenger rows. Add message_type to both build-cache export paths and the DuckDB Parquet CTE/casts.

Medium

  • [internal/fbmessenger/importer.go:801] Attachment import can follow symlinks out of the export root. A crafted export can point an attachment path at a symlink to sensitive local files, and os.Open(att.AbsPath) will copy the target into the archive. Resolve symlinks before opening and require the real path to remain under the real export root, or reject symlinks with Lstat; also verify the opened file is regular.

  • [internal/fbmessenger/discover.go:285] detectFormat treats any message_*.json as JSON, including unnumbered siblings like message_final.json. In auto mode, a thread with valid HTML plus only an unnumbered JSON sibling can be classified as JSON/both, causing JSON parsing to fail and the valid HTML to be skipped. Match ParseJSONThread’s numbered message_<N>.json validation or fall back to HTML when no valid numbered JSON files are found.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Three valid findings from the combined review:

- GetLatestCheckpointedSync returned the most recent failed run even
  when a later completed run existed for the same source, because the
  status filter ('running','failed') excluded the completed run.
  A subsequent re-import would silently resume from the older failed
  checkpoint and skip threads already covered by the completed run.
  Restrict the query to id = MAX(id) so we only resume when the
  *latest* run is failed/running and has a checkpoint. Regression
  test: TestImportDYI_StaleFailedCheckpointIgnoredAfterCompletion.

- detectFormat treated any message_*.json (including unnumbered
  siblings like message_final.json) as JSON. ParseJSONThread accepts
  only numbered message_<N>.json, so a thread containing valid HTML
  plus only an unnumbered JSON sibling was classified as "both" or
  "json" in auto mode, ParseJSONThread returned "no message_*.json
  files in ...", and the valid HTML was skipped. Match
  ParseJSONThread's regex (reMessageFile) so the unnumbered sibling
  no longer promotes the thread off the HTML path. Regression test:
  TestDiscover_UnnumberedJSONSiblingNotMisclassified.

- handleAttachment opened att.AbsPath without checking for symlinks
  or non-regular files. A malicious DYI export that planted a symlink
  inside its photos/ tree pointing at a sensitive local file (e.g.
  ~/.ssh/id_rsa) would otherwise have its target copied into the
  content-addressed attachment store. Lstat first, reject anything
  that isn't a regular file, and re-check after Open via f.Stat().
  Regression test: TestImportDYI_AttachmentSymlinkRejected (also
  greps the attachment store for the secret bytes as defense in
  depth).

Two findings were not valid and require no change:
- s.dialect.Rebind() does compile: dialect is a real field on
  *Store (internal/store/store.go) and is accessed directly by
  several existing call sites (store.go, sync.go, migrate_*.go).
- buildCache already projects message_type into the Parquet output
  via both export paths (build_cache.go:304 and :691). The
  existing TestBuildCache_AfterMessengerImport, which selects
  DISTINCT message_type from the Parquet partitions, passes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 6, 2026

roborev: Combined Review (6e3fc59)

Summary verdict: No Medium, High, or Critical findings were reported.

All reviewed agents either reported no issues or provided no findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm wesm merged commit 896a446 into wesm:main May 6, 2026
6 checks passed
@jesserobbins
Copy link
Copy Markdown
Contributor Author

Woah, thanks @wesm! Hopefully not too much work to kick to fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Facebook Messenger DYI import

3 participants