[security] fix(core): guard RSS transcript fetches #238

Closed
Hinotoi-agent wants to merge 2 commits into steipete:main from Hinotoi-agent:fix/guard-rss-transcript-fetches

Conversation

@Hinotoi-agent
Contributor

@Hinotoi-agent Hinotoi-agent commented Jun 5, 2026

Summary

This PR hardens the RSS podcast transcript trust boundary so feed-controlled <podcast:transcript> URLs cannot drive host-side transcript fetching into local or private network services.

  • Fixes a server-side request forgery risk in the RSS transcript provider path.
  • Validates transcript URLs before any fetch, including rejected schemes and local/private address ranges.
  • Resolves DNS for transcript hostnames before fetching, rejects any private/link-local/loopback result, and pins the fetch dispatcher to the already validated address set.
  • Revalidates and pins every redirect target instead of relying on automatic redirect following.
  • Adds regression coverage for direct loopback URLs, redirect-to-loopback URLs, hostnames resolving to private addresses, redirect hostnames resolving to private addresses, and pinned-fetch behavior.
  • Adds redacted after-fix runtime proof showing a feed-controlled loopback transcript URL is blocked before the local service is contacted.

Security issues covered

| Issue | Impact | Severity |
| --- | --- | --- |
| RSS podcast transcript URLs can trigger SSRF to local/private network services | A malicious or compromised feed can cause the host-side transcript extraction path to request loopback, link-local, RFC1918/private, or otherwise internal URL literals/hostnames and treat the response as transcript content. | High under deployments that summarize attacker-controlled podcast/RSS URLs; Medium hardening if all feed inputs are trusted/operator-controlled. |

Before this PR

  • tryFetchTranscriptFromFeedXml() decoded the feed-controlled transcript url attribute and passed it directly to the transcript fetch implementation.
  • Transcript fetches used automatic redirects, so a public transcript URL could redirect into a local/private/internal host after the initial request.
  • Hostname URLs were not DNS-resolved by the transcript guard, so a hostname resolving or rebinding to loopback/private/link-local space was handed to the fetch implementation's own runtime DNS lookup and could still be fetched.
  • The RSS transcript tests did not lock in SSRF protection for direct loopback transcript URLs, DNS-resolved private hostnames, redirect-based bypasses, or DNS-pinned fetch behavior.

After this PR

  • RSS transcript URLs are parsed and validated before fetching.
  • Non-HTTP(S) schemes and local/private/internal hostname literals are rejected before the fetch implementation is called.
  • Transcript hostnames are resolved with dns.lookup(..., { all: true, verbatim: true }); empty results or any blocked address reject the transcript URL before fetch.
  • For allowed DNS hostnames, transcript fetches use an Undici dispatcher whose lookup returns only the already validated address set, preventing a second runtime DNS lookup from rebinding the request to a private address.
  • Transcript fetches use manual redirect handling, and every Location target repeats the same URL, DNS, private-address, and pinned-dispatcher handling before the next request.
  • Regression tests cover direct loopback transcript URLs, public-to-loopback redirects, hostnames resolving to private addresses, redirect hostnames resolving to private addresses, and pinned dispatcher behavior.
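
The pre-fetch checks listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `resolveAllowedAddresses`, `Resolver`, and the injected `isBlocked` predicate are hypothetical names, and the resolver is passed in so the sketch runs without network access. IP-literal hostnames (including bracketed IPv6) are out of scope here.

```typescript
// Hypothetical names throughout; a sketch of "validate, resolve, reject" before any fetch.
type ResolvedAddress = { address: string; family: number };
type Resolver = (hostname: string) => Promise<ResolvedAddress[]>;

async function resolveAllowedAddresses(
  rawUrl: string,
  resolve: Resolver,
  isBlocked: (address: string) => boolean,
): Promise<ResolvedAddress[]> {
  const url = new URL(rawUrl); // throws on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported transcript URL scheme: ${url.protocol}`);
  }
  const answers = await resolve(url.hostname);
  if (answers.length === 0) {
    throw new Error("Transcript hostname did not resolve to any address");
  }
  // One blocked answer is enough to reject the whole URL (fail closed).
  if (answers.some((answer) => isBlocked(answer.address))) {
    throw new Error("Transcript URL resolves to a blocked local network address");
  }
  return answers; // the caller pins the fetch to exactly this set
}
```

In a Node runtime the resolver could be `(host) => lookup(host, { all: true, verbatim: true })` using `lookup` from `node:dns/promises`, which matches the options named in the PR body.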

Why this matters

Podcast/RSS feed XML is content controlled by the feed publisher, and feed URLs may be user-supplied in deployments that summarize external podcasts. Without a URL and DNS boundary at the nested transcript-fetch layer, feed content can move the host process from normal remote content retrieval into requests against services that were only intended to be reachable from the local machine or private network.

Attack flow

attacker controls or influences a podcast/RSS feed
    -> feed includes <podcast:transcript url="http://127.0.0.1:...">
       or <podcast:transcript url="https://attacker-hostname.example/...">
    -> RSS transcript provider decodes the nested URL
    -> host-side fetch requests the local/private target or follows a redirect/rebinding hostname
    -> response is accepted as transcript text

Affected code

| Issue | Files |
| --- | --- |
| RSS transcript SSRF via feed-controlled transcript URLs, DNS-resolved hostnames, and redirects | packages/core/src/content/transcript/providers/podcast/rss-transcript.ts, tests/security.rss-transcript-ssrf.test.ts, packages/core/package.json, pnpm-lock.yaml |

Root cause

Issue: RSS podcast transcript URLs can trigger SSRF to local/private network services

  • The nested transcript URL was trusted after XML extraction, even though it came from feed-controlled RSS content.
  • Automatic redirect following let the final fetch target differ from the initially selected URL without another safety check.
  • Hostnames were not resolved and pinned by the transcript guard before runtime fetch, leaving DNS rebinding/private-resolution bypasses.
  • The provider lacked regression tests for local/private transcript URL denial, DNS resolution denial, and pinned-fetch behavior.

CVSS assessment

| Issue | CVSS v3.1 | Vector |
| --- | --- | --- |
| RSS podcast transcript SSRF | 8.2 High | CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:L |

Rationale:

  • The High assessment assumes a deployment where unauthenticated or low-friction users can submit attacker-controlled podcast/RSS URLs for summarization, and where the daemon or extension runs with host/network access to private services.
  • Confidentiality is the primary concern because internal service responses can be fetched and treated as transcript text; availability impact is bounded to possible request pressure or redirect behavior.
  • If maintainers consider all feed URLs trusted/operator-controlled, this is still useful Medium hardening for a brittle network trust boundary.
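
As a worked check of the 8.2 figure, the arithmetic for this vector can be reproduced with the scope-unchanged base-score formula and metric weights from the CVSS v3.1 specification. The variable names are illustrative only.

```typescript
// Metric weights from the CVSS v3.1 specification for
// AV:N / AC:L / PR:N / UI:N / S:U / C:H / I:N / A:L.
const av = 0.85; // Attack Vector: Network
const ac = 0.77; // Attack Complexity: Low
const pr = 0.85; // Privileges Required: None
const ui = 0.85; // User Interaction: None
const c = 0.56; // Confidentiality: High
const i = 0; // Integrity: None
const a = 0.22; // Availability: Low

// Spec-defined "round up to one decimal place", written to avoid float drift.
function roundUp(value: number): number {
  const scaled = Math.round(value * 100000);
  return scaled % 10000 === 0 ? scaled / 100000 : (Math.floor(scaled / 10000) + 1) / 10;
}

const iss = 1 - (1 - c) * (1 - i) * (1 - a); // 0.6568
const impact = 6.42 * iss; // about 4.22 (scope unchanged)
const exploitability = 8.22 * av * ac * pr * ui; // about 3.89
const baseScore = impact <= 0 ? 0 : roundUp(Math.min(impact + exploitability, 10));

console.log(baseScore); // 8.2
```

The sum of impact and exploitability is roughly 8.10, which the specification's round-up rule lifts to 8.2, inside the High band (7.0 to 8.9).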

Safe reproduction / after-fix proof

Direct loopback transcript URL

  1. Use RSS XML containing a <podcast:transcript> element whose url points at a loopback literal such as http://127.0.0.1:8080/transcript.txt.
  2. Run the RSS transcript extraction path.
  3. After this PR, the transcript URL is rejected before the fetch implementation contacts the local service.

Redacted terminal output from an after-fix runtime scenario with a real local HTTP listener:

$ pnpm -s tsx /tmp/summarize-rss-transcript-ssrf-proof.ts
RSS transcript SSRF after-fix runtime proof
scenario: feed-controlled transcript URL -> http://127.0.0.1:58843/admin/metadata?token=[REDACTED]
result: blocked/no transcript returned
local_service_hits: 0
notes: RSS <podcast:transcript> fetch failed: RSS transcript URL resolves to a blocked local network address

Public URL redirecting or rebinding to private space

  1. Use RSS XML containing a <podcast:transcript> URL that initially points at a public-looking HTTPS URL.
  2. Have the first response redirect to a loopback/private target, or have a redirect hostname resolve to loopback/private space.
  3. After this PR, the redirect target is manually revalidated; private address results are rejected before the redirected request is fetched.
  4. For allowed public DNS results, the fetch dispatcher is pinned to the validated address set to avoid DNS rebinding between validation and connect.
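
The redirect steps above can be sketched as a manual-redirect loop. This is a minimal illustration under stated assumptions: `fetchWithRevalidatedRedirects`, `FetchLike`, and `Validate` are hypothetical names, the fetch function is injected so it can be mocked, and the cap of 5 hops is an assumed value because the PR body does not state its limit.

```typescript
// Hypothetical names; a sketch of "revalidate every Location before following it".
type FetchLike = (url: string, init: { redirect: "manual" }) => Promise<Response>;
type Validate = (url: URL) => Promise<void>; // throws when the target is unsafe

const MAX_REDIRECTS = 5; // assumed cap, not taken from the PR

async function fetchWithRevalidatedRedirects(
  startUrl: string,
  fetchImpl: FetchLike,
  validate: Validate,
): Promise<Response> {
  let current = new URL(startUrl);
  for (let hop = 0; hop <= MAX_REDIRECTS; hop += 1) {
    await validate(current); // runs before every request, including the first
    const response = await fetchImpl(current.toString(), { redirect: "manual" });
    const location = response.headers.get("location");
    const isRedirect = response.status >= 300 && response.status < 400;
    if (!isRedirect || !location) {
      return response;
    }
    current = new URL(location, current); // resolves relative Location values
  }
  throw new Error("Too many redirects while fetching transcript");
}
```

Because validation sits at the top of the loop, a public URL that answers with `Location: http://127.0.0.1/...` is rejected before the second request is issued, which is the bypass the numbered steps describe.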

The regression tests added in this PR use mocked fetches/DNS where appropriate and do not contact production services or real private-network endpoints.

Expected vulnerable behavior

  • A feed-controlled transcript URL pointing at localhost, loopback/private address literals, or hostnames resolving to private space can be sent to the host-side fetch implementation.
  • A public transcript URL can redirect to a local/private target without redirect-target validation.
  • Runtime DNS can differ from pre-check assumptions when a hostname rebinds between validation and connect.
  • The returned response can be accepted as transcript content by the summarization path.

Changes in this PR

  • Adds transcript URL validation before fetch.
  • Rejects non-HTTP(S) transcript URL schemes.
  • Blocks localhost and local/private/internal IPv4 literal ranges, including loopback, link-local, RFC1918/private, shared address space, benchmarking, multicast, and reserved ranges.
  • Blocks local/private/internal IPv6 literal ranges, including unspecified, loopback, unique-local, link-local, multicast, documentation, and IPv4-mapped/compatible addresses that resolve to blocked IPv4 ranges.
  • Resolves transcript hostnames before fetch and rejects empty DNS results or any blocked address.
  • Pins allowed hostname fetches to the validated address set using an Undici dispatcher.
  • Switches transcript fetches to redirect: "manual".
  • Revalidates each redirect Location before following it and applies the same DNS/private-address/pinned-dispatcher behavior to redirect targets.
  • Caps transcript redirect handling to avoid unbounded redirect loops.
  • Adds regression tests for direct loopback transcript URLs, public-to-loopback redirects, private DNS resolution, redirect-private DNS resolution, and pinned dispatcher behavior.
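
To illustrate the IPv4 literal blocking in the list above, here is a minimal sketch. The helper names and the exact CIDR mapping of each named range are assumptions for illustration; the PR's actual table may differ, and IPv6 handling is omitted.

```typescript
// Hypothetical helpers; CIDR values are an assumed mapping of the named ranges.
type Cidr = readonly [base: string, prefix: number];

const BLOCKED_IPV4: readonly Cidr[] = [
  ["0.0.0.0", 8], // "this network" / unspecified
  ["10.0.0.0", 8], // RFC1918 private
  ["100.64.0.0", 10], // shared address space
  ["127.0.0.0", 8], // loopback
  ["169.254.0.0", 16], // link-local
  ["172.16.0.0", 12], // RFC1918 private
  ["192.168.0.0", 16], // RFC1918 private
  ["198.18.0.0", 15], // benchmarking
  ["224.0.0.0", 4], // multicast
  ["240.0.0.0", 4], // reserved
];

// Parses strict dotted-decimal IPv4; returns null for anything else.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function isBlockedIpv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed on unparseable input
  return BLOCKED_IPV4.some(([base, prefix]) => {
    const baseInt = ipv4ToInt(base) as number;
    const size = 2 ** (32 - prefix); // number of addresses in the block
    return Math.floor(addr / size) === Math.floor(baseInt / size);
  });
}

console.log(isBlockedIpv4("127.0.0.1")); // true
console.log(isBlockedIpv4("8.8.8.8")); // false
```

Plain arithmetic is used instead of bitwise operators because JavaScript bitwise operations work on signed 32-bit integers, which makes addresses above 127.255.255.255 awkward to compare.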

Files changed

| Category | Files | What changed |
| --- | --- | --- |
| RSS transcript provider hardening | packages/core/src/content/transcript/providers/podcast/rss-transcript.ts | Adds URL parsing, blocked host detection, DNS resolution, private-address rejection, pinned dispatcher fetches, manual redirect handling, redirect revalidation, and redirect-depth limiting. |
| Core dependency metadata | packages/core/package.json, pnpm-lock.yaml | Declares the existing Undici runtime dependency for the core transcript fetch guard's pinned dispatcher. |
| Security regression tests | tests/security.rss-transcript-ssrf.test.ts | Covers direct loopback rejection, redirect-to-loopback rejection, hostname-to-private DNS rejection, redirect hostname-to-private DNS rejection, and pinned fetch behavior. |

Maintainer impact

  • The patch is scoped to RSS podcast transcript fetching.
  • Normal HTTP(S) transcript URLs on public hosts remain supported.
  • Redirects still work when every redirect target remains within the allowed public HTTP(S) URL and public DNS boundary.
  • The main behavior change is that transcript URLs are skipped instead of fetched when they use an unsupported scheme, point at a local/private/internal literal host, resolve to a private DNS answer, or redirect to an unsafe target.

Fix rationale

  • The trust boundary belongs at the nested transcript URL fetch because the URL is declared by RSS feed content, not by local application code.
  • DNS validation is required because URL literal checks do not cover hostnames resolving or rebinding to private space.
  • Address pinning is required because checking DNS before fetch is not sufficient if the runtime fetch performs a second lookup that can rebind.
  • Manual redirect handling is required because validating only the initial URL leaves a public-to-private redirect bypass.
  • The tests exercise the security boundary with mocked fetches/DNS, so the regression signal is deterministic and does not depend on real internal services.
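
The address-pinning point can be sketched as a lookup function that never consults DNS again. `createPinnedLookup` and its types are hypothetical; the callback shape follows Node's `dns.lookup` convention (a single address, or an array when `all` is set). How the PR wires this into its Undici dispatcher is not shown in the PR body, so passing it through a dispatcher's `connect` options is an assumption.

```typescript
// Hypothetical names; a sketch of a lookup pinned to pre-validated addresses.
type PinnedAddress = { address: string; family: 4 | 6 };
type LookupCallback = (
  err: Error | null,
  address: string | PinnedAddress[],
  family?: number,
) => void;

function createPinnedLookup(validated: readonly PinnedAddress[]) {
  return (
    _hostname: string, // ignored on purpose: no second DNS query is made
    options: { all?: boolean } | undefined,
    callback: LookupCallback,
  ): void => {
    if (validated.length === 0) {
      callback(new Error("No validated addresses to pin"), []);
      return;
    }
    if (options?.all) {
      // Callers that request every address get only the validated set.
      callback(null, validated.map((entry) => ({ ...entry })));
      return;
    }
    const first = validated[0];
    callback(null, first.address, first.family);
  };
}
```

Since the hostname argument is ignored, an attacker who changes the DNS record between validation and connect cannot influence which address the socket dials.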

Type of change

  • Security fix
  • Tests
  • Documentation update
  • Refactor with no behavior change

Test plan

  • pnpm -s test tests/security.rss-transcript-ssrf.test.ts: 5 tests passed.
  • git diff --check && pnpm -s typecheck: passed.
  • pnpm -s format:check: passed; all matched files used the correct format.
  • pnpm -s lint: passed with 0 warnings and 0 errors.
  • pnpm -s tsx /tmp/summarize-rss-transcript-ssrf-proof.ts: runtime proof passed; local listener received 0 requests.

Executed with:

  • pnpm -s test tests/security.rss-transcript-ssrf.test.ts
  • git diff --check && pnpm -s typecheck
  • pnpm -s format:check
  • pnpm -s lint
  • pnpm -s tsx /tmp/summarize-rss-transcript-ssrf-proof.ts

Note: local validation ran under Node v22.22.2 even though the repo declares node >=24; the same targeted tests, typecheck, format check, and lint completed successfully in this workspace.

Disclosure notes

  • This PR is bounded to RSS podcast transcript URL fetching and redirect handling.
  • The reproduction coverage is safe; no production services or real private-network endpoints were contacted.
  • Runtime proof used a local loopback listener and redacted sensitive-looking query/body values.
  • No unrelated files are changed.

@clawsweeper

clawsweeper Bot commented Jun 5, 2026

Codex review: needs maintainer review before merge. Reviewed June 4, 2026, 10:29 PM ET / 02:29 UTC.

Summary
The PR hardens core RSS <podcast:transcript> fetching by validating URL schemes, hostnames, DNS answers, redirects, and pinned Undici dispatch, while adding core dependency metadata and SSRF regression tests.

Reproducibility: yes, from source inspection. Current main decodes the RSS transcript URL and fetches it with automatic redirect following. The PR body also includes redacted terminal proof showing a loopback transcript URL blocked with local_service_hits: 0.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 4 files, +405/-6. The patch is scoped, but it touches core runtime fetching, dependency metadata, and security tests.
  • Security coverage: 1 new test file with 5 cases. The new tests cover the main SSRF bypass classes named in the PR body.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] Existing operator-controlled feeds that intentionally publish localhost, private-network, or internal transcript URLs will now skip RSS transcript fetches instead of fetching them.
  • [P1] The new core guard is copied from the daemon URL guard, so future changes should keep both network boundaries aligned or factor shared logic.
  • [P1] This read-only review did not run the test suite; validation relies on source inspection plus the contributor's PR-body terminal proof and reported test output.

Maintainer options:

  1. Accept the hardened transcript boundary (recommended)
    Merge after a maintainer accepts fail-closed behavior for local/private/internal RSS transcript URLs as the intended security posture.
  2. Document the compatibility behavior first
    Before merge, add maintainer-visible PR or release-note context explaining that unsafe RSS transcript URLs are skipped rather than fetched.
  3. Pause for shared guard design
    Pause if maintainers want the RSS transcript guard factored through a shared core/daemon URL-guard abstraction before this lands.

Next step before merge

  • [P2] The remaining action is maintainer approval of a security-sensitive compatibility change, not an automated code repair.

Security
Cleared: The diff is security hardening for an SSRF boundary and I did not find a concrete new security or supply-chain regression in the changed files.

Review details

Best possible solution:

Land a maintainer-approved security hardening that treats feed-declared transcript URLs as untrusted and keeps core RSS transcript behavior aligned with the daemon URL guard semantics.

Do we have a high-confidence way to reproduce the issue?

Yes, from source inspection: current main decodes the RSS transcript URL and fetches it with automatic redirect following. The PR body also includes redacted terminal proof showing a loopback transcript URL blocked with local_service_hits: 0.

Is this the best way to solve the issue?

Yes, with maintainer acceptance of the compatibility change: validating before fetch, pinning DNS, and revalidating redirects is the narrow maintainable fix for this trust boundary. The safer long-term refinement would be sharing this guard with the existing daemon URL guard to avoid drift.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 821e76613ded.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted after-fix terminal output from a real local listener scenario showing the loopback transcript URL was blocked before contact.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes redacted after-fix terminal output from a real local listener scenario showing the loopback transcript URL was blocked before contact.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P1: This is a security-sensitive SSRF hardening in a user-facing transcript fetch path, with limited but real blast radius.
  • merge-risk: 🚨 compatibility: Merging intentionally makes local/private/internal RSS transcript URLs fail closed for existing feeds.
  • merge-risk: 🚨 security-boundary: The diff changes the network trust boundary for feed-controlled transcript URLs, DNS resolution, and redirects.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes redacted after-fix terminal output from a real local listener scenario showing the loopback transcript URL was blocked before contact.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted after-fix terminal output from a real local listener scenario showing the loopback transcript URL was blocked before contact.
Evidence reviewed

What I checked:

Likely related people:

  • Peter Steinberger: Git history and blame show the current RSS transcript provider, podcast RSS parsing split, and daemon URL fetch guard all coming through recent commits by this author. (role: recent area contributor; confidence: high; commits: fcf8c8e5e98d, 0ec12acc15c4, d1dbf0c396f8; files: packages/core/src/content/transcript/providers/podcast/rss-transcript.ts, packages/core/src/content/transcript/providers/podcast/provider-flow.ts, src/daemon/url-fetch-guard.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 Urgent regression or broken agent/channel workflow affecting real users now. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. labels Jun 5, 2026
@Hinotoi-agent Hinotoi-agent changed the title fix: guard rss transcript fetches [security] fix(core): guard RSS transcript fetches Jun 5, 2026
@clawsweeper clawsweeper Bot added the merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. label Jun 5, 2026
@Hinotoi-agent
Contributor Author

@clawsweeper re-review

I pushed a follow-up at b86cd337db28e71e4629ab4b6382ea44a0c5ddee that addresses the P1 rank-up items:

  • transcript hostnames now resolve with dns.lookup(..., { all: true, verbatim: true }), reject blocked/private results, and fetch through a pinned Undici dispatcher;
  • redirect targets repeat the same DNS/private-address validation and pinned-dispatcher path;
  • regression coverage now includes private DNS answers, redirect hostnames resolving to private addresses, and pinned-fetch behavior;
  • the PR body now includes redacted after-fix runtime proof showing a feed-controlled loopback transcript URL was blocked before the local listener was contacted (local_service_hits: 0).

Local validation and GitHub CI are both green:

  • pnpm -s test tests/security.rss-transcript-ssrf.test.ts
  • git diff --check && pnpm -s typecheck
  • pnpm -s format:check
  • pnpm -s lint
  • pnpm -s tsx /tmp/summarize-rss-transcript-ssrf-proof.ts

@clawsweeper

clawsweeper Bot commented Jun 5, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 5, 2026
@Hinotoi-agent
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 5, 2026

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels Jun 5, 2026
@Hinotoi-agent
Contributor Author

Closing this as superseded by #239, which fixes the same RSS podcast:transcript SSRF boundary in the same core path and has the latest runtime proof/body updates, including blocked local/private proof, public transcript success proof, and the core-vs-daemon guard boundary note.

Keeping #239 open as the active PR to avoid duplicate open fixes for the same issue.


Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P1 Urgent regression or broken agent/channel workflow affecting real users now. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.
