feat: add windsurf-trajectory-extractor #123
A Python CLI tool for extracting Windsurf Cascade conversation trajectories with deep protobuf decoding.

Features:
- Thinking content extraction (internal reasoning, not visible in UI)
- Microsecond-precision timestamps from protobuf
- Complete tool call parameters
- Provider information
- Cross-platform support (macOS, Linux, Windows)

Technical highlights:
- Pure Python standard library (no external dependencies)
- Reverse-engineered protobuf structure for deep extraction
- Supports both 'Windsurf' and 'Windsurf - Next' installations

Differentiator from existing tools:
- Unlike JSON-based extraction, this performs protobuf decoding
- Extracts thinking content that JSON methods cannot access
- Provides microsecond-precision timestamps
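To illustrate what "protobuf decoding with only the standard library" can look like, here is a minimal sketch of a wire-format walker. It is not the extractor's actual code: the names `read_varint` and `iter_fields` are hypothetical, and it only splits a buffer into top-level fields without interpreting the Windsurf-specific structure.

```python
"""Minimal protobuf wire-format walker using only the standard library.

Illustrative sketch; `read_varint` and `iter_fields` are hypothetical names,
not functions taken from windsurf-trajectory-extractor.
"""


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # a clear high bit marks the last byte
            return value, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


def iter_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: bytes, string, or nested message
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type in (1, 5):  # fixed 64-bit / fixed 32-bit, returned as raw bytes
            width = 8 if wire_type == 1 else 4
            if pos + width > len(buf):
                raise ValueError("truncated fixed-width field")
            value, pos = buf[pos:pos + width], pos + width
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


if __name__ == "__main__":
    # Field 1 as a varint, then field 2 as a length-delimited string.
    for field in iter_fields(b"\x08\x96\x01\x12\x07testing"):
        print(field)
```

Nested messages travel as length-delimited fields, so calling `iter_fields` again on a wire-type-2 payload walks one level deeper.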
📝 Walkthrough
The pull request adds two new git submodules.
🧹 Nitpick comments (9)

windsurf-trajectory-extractor/.gitignore (1)

45-46: Consider narrowing the global `*.jsonl` ignore rule.
`*.jsonl` at repo scope may hide legitimate fixtures/docs added outside `examples/` later. If the intent is only generated outputs, consider a more targeted pattern (or a clearer naming convention for generated files).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/.gitignore` around lines 45 - 46, The global ignore entry "*.jsonl" is too broad and may hide non-generated fixtures; replace it with a narrower pattern that targets generated outputs (for example a directory-specific pattern or a naming convention) and keep the current exception "!examples/*.jsonl" if examples should remain tracked; update the .gitignore by removing the top-level "*.jsonl" and adding a more specific rule such as a generated/ or outputs/ directory pattern (or a suffix like "*.generated.jsonl") so only intended files are ignored while legitimate JSONL assets outside examples remain visible.

windsurf-trajectory-extractor/README.md (2)
104-119: Add language specifier to protobuf structure block.
The protobuf structure documentation block lacks a language identifier.
📝 Proposed fix
````diff
-```
+```text
 Top-level:
   f1 (string): Trajectory UUID
   ...
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/README.md` around lines 104 - 119, The fenced protobuf structure block starting with "Top-level:" (showing fields like f1, f2, repeated Step, f20 etc.) lacks a language specifier; update the opening triple-backtick to include a language tag (for example `` ```text ``) so the block becomes a proper code block with a language identifier, leaving the block contents unchanged and keeping the closing triple-backtick as-is.
96-98: Add language specifier to fenced code block.
Per markdownlint, fenced code blocks should have a language specified for proper syntax highlighting and accessibility.
📝 Proposed fix
````diff
-```
+```text
 ~/Library/Application Support/Windsurf - Next/User/globalStorage/state.vscdb
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/README.md` around lines 96 - 98, The fenced code block in README.md containing the path `~/Library/Application Support/Windsurf - Next/User/globalStorage/state.vscdb` lacks a language specifier; update that code fence to include a language tag (e.g., use "text") so the block starts with `` ```text `` to satisfy markdownlint and enable proper highlighting/accessibility.

windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py (5)

479-480: Consider logging or narrowing exception handling.
The `except Exception: pass` handler silently swallows all errors during keyword search. While this provides resilience, it could hide issues like malformed data. Consider catching only the expected exception types, or logging errors in debug mode.
♻️ Proposed fix
```diff
-        except Exception:
-            pass
+        except (ValueError, UnicodeDecodeError, KeyError):
+            continue
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py` around lines 479 - 480, Replace the silent "except Exception: pass" in the keyword-search block with a handler that captures the exception as e and logs it instead of silencing it; specifically, in the keyword search logic inside windsurf_trajectory/extractor.py (the try/except around the keyword matching code) change to "except Exception as e:" and call the module logger (e.g., logger.debug(...) or logger.exception(...)) with context about the record and the exception, or re-raise for non-recoverable errors if appropriate, so malformed data/errors aren't silently dropped.

253-259: Size calculation is inconsistent with other functions.
`len(str(v))` gives the character count of the JSON-encoded value, while `extract_trajectory` uses `len(blob)` (actual decoded bytes). This could lead to confusing size discrepancies between `--list` and extract output.
♻️ Proposed fix for consistent sizing
```diff
         if "cachedActiveTrajectory" in k:
             ws_id = k.split(":")[-1]
+            try:
+                blob = base64.b64decode(v)
+                size = len(blob)
+            except Exception:
+                size = len(str(v))
             workspaces.append(
                 {
                     "id": ws_id,
-                    "size": len(str(v)),
+                    "size": size,
                     "path": workspace_name(ws_storage, ws_id),
                 }
             )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py` around lines 253 - 259, The listed workspace "size" is computed with len(str(v)) which measures JSON text length and mismatches extract_trajectory's len(blob); change the computation in the workspaces.append block to compute the actual decoded blob byte length the same way extract_trajectory does (decode/deserialize v to the binary blob and use len(blob)) so sizes reported by the listing match sizes produced by extract_trajectory; look at the variables v, workspaces.append and the extract_trajectory sizing logic to mirror its decoding and length call.

219-223: Timestamp range may become a "time bomb" in 2040.
The code comment describes the intended range as 2020-2100, but the upper bound of 2208988800 (Jan 2040) means this tool will reject valid timestamps after that date. Consider either removing the upper bound or extending it significantly.
♻️ Proposed fix
```diff
     # Sanity check: timestamp should be reasonable (2020-2100)
-    if 1577836800 < seconds < 2208988800:
+    if 1577836800 < seconds < 4102444800:  # 2020-2100
         return datetime.fromtimestamp(seconds + nanos / 1e9, tz=tz)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py` around lines 219 - 223, The timestamp sanity check in extractor.py that returns None for seconds outside (1577836800, 2208988800) will reject valid post-2040 timestamps; update the logic in the block handling seconds/nanos before datetime.fromtimestamp (the variables named seconds and nanos and the call to datetime.fromtimestamp(..., tz=tz)) to either remove the upper bound entirely and only guard against obviously invalid values (e.g., negative seconds) or extend the upper bound to a far-future value (e.g., year 2100+), and ensure the function's return None behavior remains consistent when timestamps are truly invalid.

36-37: Hardcoded timezone may not suit all users.
`DEFAULT_TZ` is set to CST (UTC+8). Consider either documenting this prominently or using the system's local timezone as default.
♻️ Option: Use local timezone
```diff
-DEFAULT_TZ = timezone(timedelta(hours=8))  # CST
+# Use local timezone by default
+DEFAULT_TZ = datetime.now().astimezone().tzinfo
```
Or simply document the CST default more prominently so users know to pass a custom `tz` parameter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py` around lines 36 - 37, DEFAULT_TZ is hardcoded to CST (UTC+8); change it to use the system local timezone and ensure callers can still pass a custom tz: replace the constant assignment with a runtime lookup (e.g., derive tz via datetime.now().astimezone().tzinfo) and update any functions that reference DEFAULT_TZ (look for references to DEFAULT_TZ and any function parameters named tz) to default tz=None and resolve to the system tz when None; also update the module docstring/comments to clearly state that the default is the system local timezone and how to pass a custom tz.

201-204: Consider narrowing exception handling.
The `except Exception:` handler catches broadly. For defensive protobuf parsing this is often acceptable, and unlike a bare `except:` it does not swallow `KeyboardInterrupt` or `SystemExit`. Narrowing it to the expected parse errors would still make real bugs easier to spot.
♻️ Proposed fix
```diff
-        except Exception:
+        except (ValueError, struct.error, IndexError):
             break
```
Alternatively, if broad catching is intentional for unknown malformed data, the current approach is acceptable given the parsing context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py` around lines 201 - 204, Narrow the broad "except Exception:" handler around the protobuf parsing loop (the except block shown after the loop in windsurf_trajectory/extractor.py) to the expected parsing errors, or capture the exception into a variable (e) and either log it or handle it appropriately; do not replace it with a bare except:, so that KeyboardInterrupt and SystemExit are not swallowed and only subclass-of-Exception errors are caught.

windsurf-trajectory-extractor/src/windsurf_trajectory/__init__.py (1)

1-7: Consider re-exporting public API for better ergonomics.
The package root only exposes `__version__`. For library consumers, it would be more convenient to import directly from `windsurf_trajectory` rather than `windsurf_trajectory.extractor`.
♻️ Optional: Re-export public API
```diff
 """Windsurf Trajectory Extractor - Deep extraction of Cascade conversation history.

 This tool extracts complete trajectory data from Windsurf's internal storage,
 including thinking content, tool calls, and microsecond-precision timestamps.
 """

 __version__ = "0.1.0"
+
+from .extractor import (
+    DEFAULT_TZ,
+    extract_trajectory,
+    find_by_keywords,
+    find_windsurf_paths,
+    list_summaries,
+    list_workspaces,
+    load_codeium_state,
+    workspace_name,
+)
+
+__all__ = [
+    "__version__",
+    "DEFAULT_TZ",
+    "extract_trajectory",
+    "find_by_keywords",
+    "find_windsurf_paths",
+    "list_summaries",
+    "list_workspaces",
+    "load_codeium_state",
+    "workspace_name",
+]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@windsurf-trajectory-extractor/src/windsurf_trajectory/__init__.py` around lines 1 - 7, The package root currently only exposes `__version__`; to make imports ergonomic, re-export the public API from windsurf_trajectory.extractor by importing the extractor's public symbols (whatever public classes/functions are defined in windsurf_trajectory.extractor) into `windsurf_trajectory.__init__` and adding them to `__all__` (or expose the module as extractor via "from . import extractor as extractor") so consumers can "from windsurf_trajectory import <PublicName>" instead of importing from windsurf_trajectory.extractor; keep `__version__` and ensure types and docstrings are preserved.

ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7aa7df9d-a8a7-4332-9c1a-c744523c84f8
📥 Commits
Reviewing files that changed from the base of the PR and between 58b78ef0ec134795b98a17033d2e5283f5d5a472 and a0f9cab1228a4fbc491644093a5cac47f1f372d5.
📒 Files selected for processing (8)
- windsurf-trajectory-extractor/.gitignore
- windsurf-trajectory-extractor/LICENSE
- windsurf-trajectory-extractor/README.md
- windsurf-trajectory-extractor/examples/sample_output.jsonl
- windsurf-trajectory-extractor/pyproject.toml
- windsurf-trajectory-extractor/src/windsurf_trajectory/__init__.py
- windsurf-trajectory-extractor/src/windsurf_trajectory/cli.py
- windsurf-trajectory-extractor/src/windsurf_trajectory/extractor.py
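For reference, the two timestamp-related nitpicks on `extractor.py` (the 2040 upper bound and the hardcoded UTC+8 default) could be addressed together roughly as follows. This is a sketch under assumptions, not the project's code: `to_datetime`, `MIN_SECONDS`, and `MAX_SECONDS` are hypothetical names, and it adds the sub-second part as integer microseconds rather than using the float addition quoted in the review.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

# Hypothetical constants mirroring the bounds discussed in the review.
MIN_SECONDS = 1577836800  # 2020-01-01T00:00:00Z
MAX_SECONDS = 4102444800  # 2100-01-01T00:00:00Z


def to_datetime(
    seconds: int, nanos: int = 0, tz: Optional[tzinfo] = None
) -> Optional[datetime]:
    """Convert protobuf-style (seconds, nanos) to an aware datetime.

    Returns None when `seconds` falls outside the 2020-2100 sanity window.
    """
    if not MIN_SECONDS < seconds < MAX_SECONDS:
        return None
    if tz is None:
        # Resolve the system's local timezone at call time instead of
        # hardcoding a fixed offset.
        tz = datetime.now().astimezone().tzinfo
    # Integer microseconds avoid float rounding; nanoseconds are truncated.
    return datetime.fromtimestamp(seconds, tz=tz) + timedelta(
        microseconds=nanos // 1000
    )


if __name__ == "__main__":
    print(to_datetime(1700000000, 500_000_000, timezone.utc))
    print(to_datetime(2300000000, 0, timezone.utc))  # after 2040, still accepted
    print(to_datetime(0))  # outside the window
```

Resolving the timezone inside the function (rather than in a module-level constant) means the default follows the machine the tool runs on, while callers can still pass an explicit `tz`.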
Thanks a lot for the amazing contribution!
@jijiamoer could you please edit this change to make the two repos submodules in the community repo?
Replace the in-tree Windsurf sources with dedicated Antigravity and Windsurf submodules so the community repo follows the maintainer-requested integration pattern and points README entries at the standalone repos.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitmodules:
- Around line 4-9: Replace the SSH URLs for the two submodules
"windsurf-trajectory-extractor" and "antigravity-trajectory-extractor" with
their HTTPS equivalents in the .gitmodules entry (change
git@github.com:owner/repo.git to https://github.com/owner/repo.git), then sync
the change so existing clones pick up the new URL (e.g., run git submodule sync
and git submodule update --init --recursive) to avoid SSH auth failures for
contributors/CI.
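The URL rewrite requested above is mechanical, so a small sketch may help. It assumes GitHub-style SSH URLs of the form `git@github.com:owner/repo.git` (the placeholder used in the prompt, not the real repository paths), and `ssh_to_https` is a hypothetical helper name.

```python
import re

# Matches GitHub SSH remotes such as git@github.com:owner/repo.git
_SSH_URL = re.compile(r"git@github\.com:(\S+)")


def ssh_to_https(gitmodules_text: str) -> str:
    """Rewrite GitHub SSH submodule URLs in .gitmodules text to HTTPS."""
    return _SSH_URL.sub(r"https://github.com/\1", gitmodules_text)


if __name__ == "__main__":
    # "owner/repo" is a placeholder, as in the review prompt.
    sample = (
        '[submodule "windsurf-trajectory-extractor"]\n'
        "\tpath = windsurf-trajectory-extractor\n"
        "\turl = git@github.com:owner/repo.git\n"
    )
    print(ssh_to_https(sample))
```

After writing the result back to `.gitmodules`, existing clones still need `git submodule sync` followed by `git submodule update --init --recursive`, as the prompt notes.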
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 86d20eeb-7ef2-4596-8a34-690172c2ecb6
📒 Files selected for processing (4)
- .gitmodules
- README.md
- antigravity-trajectory-extractor
- windsurf-trajectory-extractor
✅ Files skipped from review due to trivial changes (2)
- antigravity-trajectory-extractor
- windsurf-trajectory-extractor
THANKS so much @jijiamoer !!
Summary
A Python CLI tool for extracting Windsurf Cascade conversation trajectories with deep protobuf decoding.
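Features
- Thinking content extraction (internal reasoning, not visible in UI)
- Microsecond-precision timestamps from protobuf
- Complete tool call parameters
- Provider information
- Cross-platform support (macOS, Linux, Windows)
Technical Highlights
- Pure Python standard library (no external dependencies)
- Reverse-engineered protobuf structure for deep extraction
- Supports both `Windsurf` and `Windsurf - Next` installations
Differentiator
Unlike JSON-based extraction tools (e.g., `ai-data-extraction`), this performs protobuf decoding to access:
- Thinking content that JSON methods cannot access
- Microsecond-precision timestamps
Usage
Files
- `src/windsurf_trajectory/extractor.py` - Core extraction logic (~480 lines)
- `src/windsurf_trajectory/cli.py` - CLI interface
- `examples/sample_output.jsonl` - Sample output format
Checklist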