Extend `feedparser.parse()` with `archive_url_data:bool` and `request_hooks` #533

jvanasco · 2025-11-06T23:12:14Z

We needed a way to archive the data that feedparser uses when processing a URL, for troubleshooting, running tests, and regression analysis.

There were two options to achieve that:

1. Download the URL ourselves, then parse the result with feedparser.
2. Extend feedparser to save the "raw" data.
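For comparison, option 1 could look roughly like the sketch below. It is not part of this PR. The helper names `fetch_raw` and `parse_archived` are hypothetical, and the sketch assumes the standard library's `urllib.request` for the download and feedparser's existing `response_headers` argument for the parse step.

```python
import urllib.request
from typing import Any, Dict


def fetch_raw(url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """Download `url` ourselves and keep the raw body and headers.

    Hypothetical helper: the returned dict mirrors the "content" and
    "headers" keys described for `.raw`, but it is not feedparser's API.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return {
            "content": resp.read(),  # raw bytes, exactly as received
            "headers": dict(resp.headers.items()),
        }


def parse_archived(raw: Dict[str, Any]):
    """Parse a previously archived download with feedparser."""
    # Imported lazily so fetch_raw() can be used without feedparser installed.
    import feedparser

    return feedparser.parse(raw["content"], response_headers=raw["headers"])
```

The drawback of this route is that the archived bytes come from our own request, not from the request feedparser would have made, so differences in headers, redirects, or throttling behaviour can go unnoticed.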

This PR is a quick attempt at the latter, because access to this data is widely useful when troubleshooting. It:

- introduces `archive_url_data: bool` to `feedparser.parse`.
- if set, a `.raw` attribute on the resulting `FeedParserDict` will contain the "content" and "headers".
- headers are copied to `.raw` BEFORE they are updated by kwargs.
- extends `feedparser.api._open_resource` to return the "type" of data accessed, in addition to the data.
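As a rough illustration of how the archived data might be kept for regression tests, the sketch below writes the "content" and "headers" entries to disk. The helper `save_raw` and the file naming are hypothetical. It assumes only that `.raw` behaves like a mapping with those two keys, and it handles both `bytes` and `str` content because the exact type is not stated here.

```python
import json
from pathlib import Path
from typing import Any, Mapping


def save_raw(raw: Mapping[str, Any], directory: str, name: str) -> Path:
    """Write archived feed data to `directory` and return the body path.

    Hypothetical helper, not part of feedparser. `raw` is assumed to hold
    "content" (bytes or str) and "headers" (a mapping of header names).
    """
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)

    content = raw["content"]
    if isinstance(content, str):
        # Store text as UTF-8 so the file is always written as bytes.
        content = content.encode("utf-8")
    body_path = out / f"{name}.body"
    body_path.write_bytes(content)

    # Stringify keys and values so the headers are always JSON-serializable.
    headers = {str(k): str(v) for k, v in dict(raw["headers"]).items()}
    headers_path = out / f"{name}.headers.json"
    headers_path.write_text(
        json.dumps(headers, indent=2, sort_keys=True), encoding="utf-8"
    )
    return body_path
```

With the API proposed in this PR, a call such as `save_raw(feed.raw, "fixtures", "nyt-homepage")` would then capture exactly what feedparser processed.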
Additionally, `request_hooks` is added to `parse`. This is a dict of "hooks" to pass on to `requests.get` for customization. It also supports a `"response.postprocess"` hook, which is not passed on to requests and can be used to operate on the response before it is lost. This allows for capturing the actual IP address of the remote server, as shown below. (The `response_peername__hook` needs to execute before content is read from the connection.)

I'm happy to achieve this in other ways and work towards an acceptable PR; I'd just like to ensure there is a way to access and operate on the raw data that feedparser natively pulls. We've had issues due to networking, round-robin DNS, and throttling that are best identified, and can only be solved, by examining this info.

import typing

import feedparser
from feedparser.http import RequestHooks

from metadata_parser.requests_extensions import response_peername__hook

if typing.TYPE_CHECKING:
    from requests import Response

    from feedparser.util import FeedParserDict

def process_result(response: "Response", result: "FeedParserDict") -> None:
    result.raw["peername"] = response._mp_peername


request_hooks: RequestHooks = {
    "response": response_peername__hook,
    "response.postprocess": process_result,
}

feed = feedparser.parse(
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    archive_url_data=True,
    request_hooks=request_hooks,
)

print("Feed was downloaded from:", feed.raw["peername"])

Fixes: #289


jvanasco and others added 5 commits November 6, 2025 16:48

introduce archive_url_data to feedparser.parse

406a02e

supporting requests hooks

9d7bf4e

[pre-commit.ci] auto fixes from pre-commit.com hooks

034c5e1

for more information, see https://pre-commit.ci

fix changes from pre-commit.ci -- how did those even happen?!?

b5b6d11

NotRequired is not available on python 3.10

a7a03d6
