Add Facebook Messenger DYI import#281
Conversation
roborev: Combined Review (
|
|
I don't love the roborev loops but here we are. |
roborev: Combined Review (
|
630a87f to
14a00f6
Compare
|
Addressed @trel feedback & collapsed reviews into features and fixes. Also tested against my facebook archives. |
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
roborev: Combined Review (
|
|
Interesting scenario. Hard to design a test for. I don't think people are going to merge in these very often. Will see what a fix looks like. |
|
I will add this is already useful for me. |
roborev: Combined Review (
|
|
The two High findings in the combined review are false positives — every cited symbol exists in the tree.
Store APIs the importer supposedly references that don't exist:
All of these also exist on The Medium |
roborev: Combined Review (
|
|
Triaged the combined review; fixes for the two real Mediums pushed as High (missing store APIs) — false positive, same as the previous pass. Every symbol cited is defined and builds clean locally and in CI (
Medium: first-thread checkpoints not resumable — fixed. The prior guard Medium: failed threads skipped on resume — fixed. Outer thread loop no longer advances the per-thread checkpoint on hard error. It logs, sets Medium: |
roborev: Combined Review (
|
|
Took a look at this against the PR head ( High — missing store APIs: Incorrect at Medium — Rebind in Medium — first-thread checkpoint not resumable: Was real at Medium — failed thread skipped on resume: Was real at |
|
Heads up — I've rebased the fbmessenger work onto #304 so it composes with the new account-identity / collections / dedup model. The post-rebase branch is up at |
|
i'm rebasing this |
Import Facebook Messenger data from DYI exports (JSON and HTML formats), E2EE flat-file exports, and multiple directory layouts including your_activity_across_facebook, your_facebook_activity, and legacy messages/ roots. Includes CLI command, MIME parsing, attachment resolution, Discover/Parse pipeline, and query engine support for messenger message types. - fix: DecodeMojibake preserves valid UTF-8 with Latin-1-range chars - test: use synthetic names in test fixtures - fix: resolve attachments under your_facebook_activity export layout - fix: resume interrupted Messenger imports and position HTML images correctly - fix: reject unknown --format values in import-messenger - merge upstream/main into jesse/fbmessenger - fix: section-qualify thread IDs to prevent cross-section collisions - Address review findings (job 114) - fix: only resume from incomplete sync runs, not completed ones - fbmessenger: orphan sender linkage, mid-thread limit, from-guard, HTML scan - fbmessenger: stream-probe E2EE discovery and classify non-thread shapes - fbmessenger: preserve sender metadata on re-import - Merge remote-tracking branch 'upstream/main' into jesse/fbmessenger - fbmessenger: resume from first-thread checkpoint; don't advance on failure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
roborev: Combined Review (
|
Three valid findings from the combined review:
- GetLatestCheckpointedSync returned the most recent failed run even
when a later completed run existed for the same source, because the
status filter ('running','failed') excluded the completed run.
A subsequent re-import would silently resume from the older failed
checkpoint and skip threads already covered by the completed run.
Restrict the query to id = MAX(id) so we only resume when the
*latest* run is failed/running and has a checkpoint. Regression
test: TestImportDYI_StaleFailedCheckpointIgnoredAfterCompletion.
- detectFormat treated any message_*.json (including unnumbered
siblings like message_final.json) as JSON. ParseJSONThread accepts
only numbered message_<N>.json, so a thread containing valid HTML
plus only an unnumbered JSON sibling was classified as "both" or
"json" in auto mode, ParseJSONThread returned "no message_*.json
files in ...", and the valid HTML was skipped. Match
ParseJSONThread's regex (reMessageFile) so the unnumbered sibling
no longer promotes the thread off the HTML path. Regression test:
TestDiscover_UnnumberedJSONSiblingNotMisclassified.
- handleAttachment opened att.AbsPath without checking for symlinks
or non-regular files. A malicious DYI export that planted a symlink
inside its photos/ tree pointing at a sensitive local file (e.g.
~/.ssh/id_rsa) would otherwise have its target copied into the
content-addressed attachment store. Lstat first, reject anything
that isn't a regular file, and re-check after Open via f.Stat().
Regression test: TestImportDYI_AttachmentSymlinkRejected (also
greps the attachment store for the secret bytes as defense in
depth).
Two findings were not valid and require no change:
- s.dialect.Rebind() does compile: dialect is a real field on
*Store (internal/store/store.go) and is accessed directly by
several existing call sites (store.go, sync.go, migrate_*.go).
- buildCache already projects message_type into the Parquet output
via both export paths (build_cache.go:304 and :691). The
existing TestBuildCache_AfterMessengerImport, which selects
DISTINCT message_type from the Parquet partitions, passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
roborev: Combined Review (
|
|
Woah, thanks @wesm! Hopefully not too much work to kick to fit. |
Summary
msgvault import-messengercommand for Facebook "Download Your Information" exportsmessages/and neweryour_activity_across_facebook/messages/DYI layoutsmessage_type = 'email'filter from query engines so all archived message types participate in search and aggregation~4,700 lines across 44 files. Core importer is ~2,100 lines with ~1,600 lines of tests.
Closes #280. Related: #136, #192, #278.
🤖 Generated with Claude Code