feat: import-pst — import Microsoft Outlook PST archives #284

Merged
wesm merged 2 commits into wesm:main from YourEconProf:import-pst
May 6, 2026

Conversation

@YourEconProf
Contributor

Adds import-pst, a new CLI command to import Microsoft Outlook PST files into msgvault. Complements the existing MBOX, EMLX, and IMAP importers.

What's new

  • msgvault import-pst <identifier> <file.pst> — imports all email messages from a PST archive; calendar items, contacts, tasks, and notes are skipped automatically
  • PST folder structure is preserved as labels (e.g. Inbox, Sent Items)
  • Resumable: interrupt with Ctrl+C and rerun to continue from the last checkpoint; --no-resume to start fresh
  • --skip-folder flag to exclude folders (e.g. --skip-folder "Deleted Items")
  • --no-attachments flag to skip attachment import
  • Content-hash deduplication and cross-folder label merging consistent with other importers
  • MIME reconstruction from PST: uses TransportMessageHeaders verbatim when present (~80% of messages); synthesizes RFC 5322 headers from MAPI properties for drafts and Exchange-native sends

Security fixes included

  • CRLF injection prevention on synthesized headers (addresses MAPI properties written directly into RFC 5322 headers)
  • Path traversal sanitization on attachment filenames
  • ContentID sanitization to prevent MIME structure breakage
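
A minimal sketch of the first two fixes above, under stated assumptions: the helper names `sanitizeHeaderValue` and `sanitizeAttachmentName` and the `"attachment"` placeholder are hypothetical and are not taken from the PR's code. The idea is only that CR/LF must never reach a synthesized header and that an attachment name must be reduced to a single path element.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sanitizeHeaderValue strips CR and LF so a value copied from a MAPI
// property cannot terminate its header line and start a new one.
// Hypothetical helper; the PR's actual function may differ.
func sanitizeHeaderValue(v string) string {
	v = strings.NewReplacer("\r", "", "\n", "").Replace(v)
	return strings.TrimSpace(v)
}

// sanitizeAttachmentName keeps only the final path element of an
// untrusted filename and substitutes a placeholder for unusable names.
func sanitizeAttachmentName(name string) string {
	// PST data often originates on Windows, so treat "\" as a separator too.
	name = path.Base(strings.ReplaceAll(name, "\\", "/"))
	if name == "." || name == ".." || name == "/" {
		return "attachment" // assumed placeholder name
	}
	return name
}

func main() {
	fmt.Printf("%q\n", sanitizeHeaderValue("Quarterly report\r\nBcc: attacker@example.com"))
	fmt.Printf("%q\n", sanitizeAttachmentName(`..\..\Windows\system.ini`))
}
```

Removing the line breaks keeps the injected text inside the original header value, where it is harmless, rather than dropping the message.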

Dependencies

Adds github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go).

Usage

msgvault import-pst you@company.com /path/to/archive.pst
msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"
msgvault import-pst you@outlook.com backup.pst --no-resume

YourEconProf requested a review from wesm as a code owner on April 21, 2026 18:13
@roborev-ci

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (0eaebb3)

High confidence this PR still has one High and four Medium issues to fix before merge.

High

  • internal/importer/pst_import.go:377 and internal/importer/pst_import.go:318
    Checkpoint cursor tracking uses the outer loop variables (currentFolder, currentMsgIdx) instead of the message currently being ingested inside flushPending. If cancellation or a checkpoint interval happens during a flush, the saved checkpoint can point at the last queued message, or the first message of the next folder, causing earlier unflushed messages to be skipped permanently on resume.
    Fix: Store FolderIndex, FolderPath, and MsgIndex on each pending PST message and use those values when calling saveCp() during the flush loop.
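
The suggested fix could look roughly like the sketch below. The types `checkpoint` and `pendingMsg`, the `errIngest` sentinel, and the callback signatures are assumptions made for illustration; only the field names FolderIndex, FolderPath, and MsgIndex come from the review. For simplicity the sketch saves a checkpoint after every message rather than at an interval.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint is a hypothetical resume cursor.
type checkpoint struct {
	FolderIndex int
	FolderPath  string
	MsgIndex    int
}

// pendingMsg carries its own position, so the flush loop never has to
// consult the outer traversal variables.
type pendingMsg struct {
	FolderIndex int
	FolderPath  string
	MsgIndex    int
	Raw         []byte
}

var errIngest = errors.New("ingest failed")

// flushPending ingests queued messages in order and records the position
// of each message only after it has been ingested successfully.
func flushPending(pending []pendingMsg, ingest func(pendingMsg) error, saveCp func(checkpoint)) error {
	for _, p := range pending {
		if err := ingest(p); err != nil {
			return err
		}
		saveCp(checkpoint{FolderIndex: p.FolderIndex, FolderPath: p.FolderPath, MsgIndex: p.MsgIndex})
	}
	return nil
}

func main() {
	pending := []pendingMsg{
		{FolderIndex: 0, FolderPath: "Inbox", MsgIndex: 41},
		{FolderIndex: 0, FolderPath: "Inbox", MsgIndex: 42},
		{FolderIndex: 1, FolderPath: "Sent Items", MsgIndex: 0},
	}
	var last checkpoint
	err := flushPending(pending,
		func(p pendingMsg) error {
			if p.FolderIndex == 1 {
				return errIngest // simulate a failure in the next folder
			}
			return nil
		},
		func(cp checkpoint) { last = cp },
	)
	fmt.Println(last, err)
}
```

Because the saved position comes from the message itself, a failure or cancellation partway through a flush leaves the checkpoint at the last ingested message, not at wherever the outer loop has already advanced to.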

Medium

  • internal/importer/pst_import.go:209
    ImportPst saves an initial checkpoint at folder/message 0 even when resuming an active sync. That overwrites the existing checkpoint before resumed work completes, so a crash before the next checkpoint can make the next run restart from the beginning.
    Fix: Only write the initial checkpoint for a new sync, or preserve the loaded resume position when summary.WasResumed is true.

  • internal/importer/pst_import.go:244
    Resume folder validation is skipped when resume.FolderIndex == 0, so a checkpoint inside the first folder can resume against the wrong folder if folder ordering or paths change. The import may skip the first MsgIndex messages from an unrelated folder.
    Fix: Validate whenever resuming with a non-empty FolderPath, including folder index 0.

  • internal/pst/mime.go:79 and internal/pst/mime.go:85
    att.MIMEType comes from PST data and is written into MIME part headers. If it contains CR/LF, it can inject additional MIME headers or corrupt the message structure.
    Fix: Validate or normalize attachment MIME types before use. Reject control characters, parse with mime.ParseMediaType, and fall back to application/octet-stream on invalid input.

  • internal/pst/reader.go:247
    The attachment memory limit can be bypassed when a corrupted or malicious PST reports size 0. att.WriteTo(&buf) may buffer the full payload into memory before the post-read size check runs, risking OOM.
    Fix: Wrap the buffer in a bounded io.Writer that returns an error as soon as the configured byte limit is exceeded.
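
The last two fixes above can be sketched as follows. Everything named here (`normalizeMIMEType`, `limitedWriter`, `readBounded`, `errTooLarge`, and the `fakeAttachment` stand-in for a PST attachment) is hypothetical; only `mime.ParseMediaType`, the `application/octet-stream` fallback, and the bounded `io.Writer` idea come from the review text.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"mime"
	"strings"
)

const fallbackMIMEType = "application/octet-stream"

// normalizeMIMEType returns a media type that is safe to write into a
// MIME part header, falling back when the input is malformed.
func normalizeMIMEType(raw string) string {
	for _, r := range raw {
		if r < 0x20 || r == 0x7f { // control characters, including CR and LF
			return fallbackMIMEType
		}
	}
	mt, _, err := mime.ParseMediaType(raw)
	if err != nil || !strings.Contains(mt, "/") {
		return fallbackMIMEType
	}
	return mt
}

var errTooLarge = errors.New("attachment exceeds size limit")

// limitedWriter fails as soon as a write would exceed the remaining budget.
type limitedWriter struct {
	w         io.Writer
	remaining int64
}

func (l *limitedWriter) Write(p []byte) (int, error) {
	if int64(len(p)) > l.remaining {
		return 0, errTooLarge
	}
	n, err := l.w.Write(p)
	l.remaining -= int64(n)
	return n, err
}

// readBounded buffers src in memory but never holds more than limit bytes.
func readBounded(src io.WriterTo, limit int64) ([]byte, error) {
	var buf bytes.Buffer
	if _, err := src.WriteTo(&limitedWriter{w: &buf, remaining: limit}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fakeAttachment stands in for a PST attachment that streams via WriteTo.
type fakeAttachment []byte

func (a fakeAttachment) WriteTo(w io.Writer) (int64, error) {
	n, err := w.Write(a)
	return int64(n), err
}

func main() {
	fmt.Println(normalizeMIMEType("image/png\r\nX-Injected: 1"))
	_, err := readBounded(fakeAttachment("far too large"), 4)
	fmt.Println(err)
}
```

The limit is enforced inside the writer, so it holds even when the size reported by the PST is 0 or otherwise wrong.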


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci

roborev-ci Bot commented Apr 21, 2026

roborev: Combined Review (7d23b2c)

PST import support looks mostly solid, but there is one Medium issue that can silently ingest incomplete messages.

Medium

  • internal/importer/pst_import.go:478: When ReadAttachments fails, the importer logs the error but still imports the message without attachments. Size-limit truncation from ReadAttachments is also surfaced as success with only the attachments read before the limit, so oversized attachment sets can be silently imported as incomplete messages.
    • Fix: Treat attachment read/limit failures as a skipped message or hard error, or preserve a clear marker that attachments were intentionally omitted. Do not ingest reconstructed MIME that silently drops attachments.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Adds the ability to import Microsoft Outlook PST archives into msgvault,
complementing the existing MBOX, EMLX, and IMAP sources.

New files:
- internal/pst/reader.go: thin wrapper around mooijtech/go-pst v6 with
  folder traversal, message extraction, attachment reading, FILETIME→time.Time
  conversion, and Exchange DN resolution
- internal/pst/mime.go: reconstructs RFC 5322 MIME from PST messages —
  uses TransportMessageHeaders verbatim when present (~80% of messages),
  falls back to synthesizing headers from MAPI properties for drafts and
  Exchange-native sends
- internal/importer/pst_import.go: import orchestration following the MBOX
  importer pattern — batching (200 msg / 32 MiB), checkpoint/resume,
  content-hash dedup, cross-folder label merging
- cmd/msgvault/cmd/import_pst.go: CLI command with --skip-folder,
  --no-resume, --no-attachments flags and graceful Ctrl+C handling
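
For reference, the FILETIME→time.Time conversion mentioned for reader.go can be sketched as below. A Windows FILETIME counts 100-nanosecond ticks since 1601-01-01 00:00:00 UTC. The function name, the int64 input type, and the choice to map non-positive values to the zero time are assumptions, not details taken from this change.

```go
package main

import (
	"fmt"
	"time"
)

// filetimeUnixOffset is the number of 100 ns ticks between
// 1601-01-01 and 1970-01-01 (both UTC).
const filetimeUnixOffset = 116444736000000000

// filetimeToTime converts a Windows FILETIME value to time.Time in UTC.
// Non-positive values are treated as "not set" (an assumption).
func filetimeToTime(ft int64) time.Time {
	if ft <= 0 {
		return time.Time{}
	}
	ticks := ft - filetimeUnixOffset
	// 1e7 ticks per second; each remaining tick is 100 ns.
	return time.Unix(ticks/1e7, (ticks%1e7)*100).UTC()
}

func main() {
	fmt.Println(filetimeToTime(filetimeUnixOffset)) // the Unix epoch
	fmt.Println(filetimeToTime(0).IsZero())
}
```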

Usage:
  msgvault import-pst you@company.com /path/to/archive.pst
  msgvault import-pst you@outlook.com backup.pst --skip-folder "Deleted Items"

Dependency: github.com/mooijtech/go-pst/v6 (Apache 2.0, pure Go)

Additional changes squashed in:
- test: add PST integration tests using go-pst sample files
- fix: set PST source display_name to filename on import
- fix: address 11 code review issues in PST import
- fix: address 5 automated review issues in PST import
- remove plan document, fix errcheck lint failures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@roborev-ci

roborev-ci Bot commented May 6, 2026

roborev: Combined Review (d6d2f19)

Verdict: Changes need fixes for resumability and PST message identity before merge.

High

PST message IDs can collide across archives

Location: internal/importer/pst_import.go:506

PST message IDs are built only from the PST entry ID, which is local to a single PST file. Importing multiple PST archives for the same identifier can collide on IDs like pst-123, causing unrelated messages from later archives to be skipped or updated as duplicates.

Fix: Include a stable per-file component in sourceMsgID, such as a PST file hash/fingerprint plus the entry ID, or make the source identity include the archive identity.
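
One way to apply this fix, as a sketch only: derive a fingerprint from the archive contents and prefix it to the entry ID. The function names, the 16-character truncation, the int64 entry ID type, and the `pst-<fingerprint>-<entryID>` layout are all assumptions for illustration. Hashing a whole archive may be slow for large PST files, so a real implementation might hash only part of the file.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// fingerprint returns a short, stable identifier for an archive based on
// its contents, so renaming or moving the file does not change it.
func fingerprint(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil))[:16], nil
}

// sourceMsgID combines the archive fingerprint with the PST-local entry ID.
func sourceMsgID(archiveFP string, entryID int64) string {
	return fmt.Sprintf("pst-%s-%d", archiveFP, entryID)
}

func main() {
	// Two in-memory readers stand in for two different PST files.
	a, _ := fingerprint(strings.NewReader("archive A bytes"))
	b, _ := fingerprint(strings.NewReader("archive B bytes"))
	fmt.Println(sourceMsgID(a, 123))
	fmt.Println(sourceMsgID(b, 123))
}
```

With this layout, entry 123 in one archive and entry 123 in another produce different source IDs, while re-importing the same archive reproduces the same IDs.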

Cancellation checkpoint can skip an unprocessed message

Location: internal/importer/pst_import.go:326

On cancellation before processing the current pending message, the checkpoint is saved with p.MsgIndex, but resume skips messages with currentMsgIdx <= resume.MsgIndex. That drops the unprocessed message after an interrupted import.

Fix: Save the last successfully processed message index, not the next pending message index, or change resume semantics so the checkpoint points to the next message to process.
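
The second option in this fix (the checkpoint points at the next message to process) is illustrated by the toy loop below. The `cursor` type, the `process` function, and the `cancelAt` parameter that simulates cancellation are hypothetical and greatly simplified relative to a real importer.

```go
package main

import "fmt"

// cursor records the index of the NEXT message to process.
type cursor struct{ Next int }

// process handles msgs starting at cur.Next. A cancellation is simulated
// just before the message at index cancelAt (use -1 for no cancellation).
// It returns the messages handled in this run and the updated cursor.
func process(msgs []string, cur cursor, cancelAt int) ([]string, cursor) {
	var done []string
	for i := cur.Next; i < len(msgs); i++ {
		if i == cancelAt {
			// msgs[i] has not been handled, so the cursor must still point at it.
			return done, cur
		}
		done = append(done, msgs[i])
		cur.Next = i + 1 // advance only after the message is handled
	}
	return done, cur
}

func main() {
	msgs := []string{"a", "b", "c", "d"}
	first, cur := process(msgs, cursor{}, 2) // interrupted before "c"
	rest, cur := process(msgs, cur, -1)      // resumed run
	fmt.Println(first, rest, cur.Next)
}
```

Resuming with `i := cur.Next` instead of skipping every index `<=` the saved value means the interrupted message is the first one handled on the next run.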

Interrupted import can be marked complete at an empty batch boundary

Location: internal/importer/pst_import.go:560; also reported near end of ImportPst, around internal/importer/pst_import.go:460-475

If the context is canceled when there are no pending messages to flush, flushPending() returns false and the importer falls through to CompleteSync(). This marks the interrupted sync as fully complete, discards the active checkpoint, and prevents reruns from resuming.

Fix: Check ctx.Err() before completing the sync, save or preserve the interruption checkpoint, and return without calling CompleteSync().
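
A minimal sketch of the guard described in this fix. The `syncState` type, `finishSync`, and the `simulate` helper are hypothetical stand-ins; the real importer's CompleteSync() and checkpoint storage are not reproduced here.

```go
package main

import (
	"context"
	"fmt"
)

// syncState is a stand-in for the persisted sync record.
type syncState struct {
	Completed  bool
	Checkpoint int // index of the next message to process
}

// finishSync marks the sync complete only if the context is still live.
// On cancellation it preserves the checkpoint and returns the context error.
func finishSync(ctx context.Context, s *syncState, next int) error {
	if err := ctx.Err(); err != nil {
		s.Checkpoint = next // keep the sync resumable
		return err
	}
	s.Completed = true
	return nil
}

// simulate runs finishSync with a context that is optionally canceled first.
func simulate(cancelFirst bool, next int) (syncState, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	var s syncState
	err := finishSync(ctx, &s, next)
	return s, err
}

func main() {
	fmt.Println(simulate(true, 7))
	fmt.Println(simulate(false, 7))
}
```

The check does not depend on whether any messages were pending, so an interruption at an empty batch boundary is handled the same way as any other.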


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Adding github.com/mooijtech/go-pst/v6 to go.mod changed the vendor tree
hash, breaking the nix-build CI job.
@roborev-ci

roborev-ci Bot commented May 6, 2026

roborev: Combined Review (992e4b2)

No Medium, High, or Critical findings were reported.

All review agents found the code clean or reported no actionable findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm merged commit e375f52 into wesm:main on May 6, 2026
6 checks passed