@jvanasco jvanasco commented Nov 6, 2025

We needed a way to archive the data that feedparser uses when processing a URL, for troubleshooting, running tests, and regression analysis.

There were two options to achieve that:

1- Download the URL ourselves, then parse that with feedparser.
2- Extend feedparser to save the "raw" data
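For comparison, option 1 could look roughly like the sketch below. This is only an illustration: `download_for_archive` and `parse_archived` are hypothetical helper names, not part of feedparser, and the fetch uses the standard library rather than whatever HTTP client feedparser uses internally. It assumes `feedparser.parse` accepts raw bytes and a `response_headers` argument.

```python
import urllib.request
from typing import Any, Callable, Dict


def download_for_archive(
    url: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Dict[str, Any]:
    """Fetch `url` ourselves and keep the body bytes and response headers."""
    with opener(url) as response:
        return {
            "content": response.read(),
            "headers": dict(response.headers.items()),
        }


def parse_archived(archive: Dict[str, Any]) -> Any:
    """Hand previously archived bytes and headers to feedparser."""
    # Imported lazily so the download step has no hard dependency on feedparser.
    import feedparser

    return feedparser.parse(
        archive["content"],
        response_headers=archive["headers"],
    )
```

The drawback is that the bytes parsed this way are not guaranteed to match what feedparser would have fetched on its own (different client, headers, and redirects), which is part of the case for option 2.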

This PR is a quick attempt at the latter, since being able to inspect this data while troubleshooting is widely useful:

  1. Introduces an archive_url_data: bool argument to feedparser.parse.
    If set, a .raw attribute on the resulting FeedParserDict will contain the "content" and "headers".
    The headers are copied into .raw BEFORE they are updated by kwargs.

  2. Extends feedparser.api._open_resource to return the "type" of data accessed, in addition to the data itself.

  3. Adds request_hooks to parse. This is a dict of "hooks" to pass on to "requests.get" for customization. It also supports a "response.postprocess" hook, which is not passed on to requests and can be used to operate on the response before it is lost. This allows capturing the actual IP address of the remote server, as shown below. (The response_peername__hook needs to execute before content is read from the connection.)
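The behaviour described in points 1 and 3 can be sketched in isolation. This is not the code in the PR: `split_hooks` and `fetch_with_archive` are hypothetical names, the result is a plain dict rather than a FeedParserDict, and `get` is injected so the sketch runs without a network (passing `requests.get` should work, since it accepts a `hooks` keyword).

```python
from typing import Any, Callable, Dict, Optional, Tuple

POSTPROCESS_KEY = "response.postprocess"


def split_hooks(
    request_hooks: Optional[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Optional[Callable[..., None]]]:
    """Separate hooks meant for requests from the local postprocess hook."""
    hooks = dict(request_hooks or {})
    postprocess = hooks.pop(POSTPROCESS_KEY, None)
    return hooks, postprocess


def fetch_with_archive(
    url: str,
    get: Callable[..., Any],
    request_hooks: Optional[Dict[str, Any]] = None,
    response_headers: Optional[Dict[str, str]] = None,
    archive_url_data: bool = False,
) -> Dict[str, Any]:
    """Hypothetical stand-in for the fetch step inside parse()."""
    hooks, postprocess = split_hooks(request_hooks)
    # Only the remaining hooks are forwarded to the HTTP client.
    response = get(url, hooks=hooks)

    result: Dict[str, Any] = {"headers": dict(response.headers)}
    if archive_url_data:
        # Snapshot taken BEFORE caller-supplied headers are merged in.
        result["raw"] = {
            "content": response.content,
            "headers": dict(response.headers),
        }
    result["headers"].update(response_headers or {})

    if postprocess is not None:
        # Runs while the response object is still available.
        postprocess(response, result)
    return result
```

Keeping the postprocess hook out of the dict handed to the HTTP client means the client never sees an event name it does not know about, while the caller still gets one last chance to copy data off the response.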

I'm happy to achieve this in other ways and to work towards an acceptable PR. I'd just like to ensure there is a way to access and operate on the raw data that feedparser natively pulls down. We've had issues caused by networking, round-robin DNS, and throttling that are best identified, and can only be solved, by examining this info.

```python
import typing

import feedparser
from feedparser.http import RequestHooks

from metadata_parser.requests_extensions import response_peername__hook

if typing.TYPE_CHECKING:
    from requests import Response

    from feedparser.util import FeedParserDict


def process_result(response: "Response", result: "FeedParserDict") -> None:
    # Copy the peername recorded on the response into the archived data.
    # `_mp_peername` is expected to have been set by `response_peername__hook`.
    result.raw["peername"] = response._mp_peername


request_hooks: RequestHooks = {
    # Passed on to requests; must run before content is read from the connection.
    "response": response_peername__hook,
    # Not passed on to requests; called with the response and the parse result.
    "response.postprocess": process_result,
}

feed = feedparser.parse(
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    archive_url_data=True,
    requests_hooks=request_hooks,
)

print("Feed was downloaded from:", feed.raw["peername"])
```
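Since the goal is to archive this data for tests and regression analysis, one way the captured data might be persisted is sketched below. `save_raw` and `load_raw` are hypothetical helpers, and the sketch assumes the archived data is a mapping whose "content" is bytes and whose "headers" is a mapping of strings, with an optional "peername" tuple as in the example above.

```python
import base64
import json
from pathlib import Path
from typing import Any, Dict, Union


def save_raw(raw: Dict[str, Any], path: Union[str, Path]) -> None:
    """Write archived content and headers to a JSON file."""
    payload: Dict[str, Any] = {
        # Bytes are base64-encoded so the file stays valid JSON.
        "content_b64": base64.b64encode(raw["content"]).decode("ascii"),
        "headers": dict(raw["headers"]),
    }
    if "peername" in raw:
        payload["peername"] = list(raw["peername"])
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_raw(path: Union[str, Path]) -> Dict[str, Any]:
    """Read back a file written by save_raw()."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    raw: Dict[str, Any] = {
        "content": base64.b64decode(payload["content_b64"]),
        "headers": payload["headers"],
    }
    if "peername" in payload:
        raw["peername"] = tuple(payload["peername"])
    return raw
```

A regression test could then reload the file and feed the stored bytes back through the parser without touching the network.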

Fixes: #289

Development

Successfully merging this pull request may close these issues.

Export feed (FeedParserDict element) to rss xml?
