Extend feedparser.parse() with archive_url_data:bool and request_hooks
#533
We needed a way to archive the data that feedparser uses when processing a URL, for troubleshooting, running tests, and regression analysis.
There were two options to achieve that:
1- Download the URL ourselves, then parse that with feedparser.
2- Extend feedparser to save the "raw" data.
This PR is a quick attempt at the latter, since being able to do this while troubleshooting is widely useful:
- Introduces `archive_url_data:bool` to `feedparser.parse`. If set, a `.raw` attribute on the result `FeedParserDict` will contain the "content" and "headers".
  - Headers are copied to this BEFORE they are updated by kwargs.
- Extends `feedparser.api._open_resource` to return the "type" of data accessed, in addition to the data.

Additionally, `request_hooks` is added to `parse`. This is a dict containing "hooks" to pass on to `requests.get` for customization. It also supports a "response.postprocess" hook, which is not passed on to requests and can be used to operate on the response before it is lost. This allows for capturing the actual IP address of the remote server. (The `response_peername__hook` needs to execute before content is read from the connection.)

I'm happy to achieve this other ways and work towards an acceptable PR. I'd just like to ensure there is a way to access and operate on the raw data feedparser natively pulls out. We've had issues due to networking, round-robin DNS, and throttling that are best identified, and only solved, by examining this info.
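To make the two behaviours described above more concrete, here is a minimal standalone sketch. It is not the code in this PR: the helper names `split_request_hooks` and `build_raw_archive` are hypothetical, and the sketch assumes only what the description states, namely that "response.postprocess" is held back from requests and that headers are copied before kwargs overrides are merged in.

```python
from typing import Any, Callable, Dict, Mapping, Optional, Tuple

# Key named in the description above; it is handled locally and never
# forwarded to requests.
POSTPROCESS_KEY = "response.postprocess"


def split_request_hooks(
    request_hooks: Optional[Mapping[str, Any]],
) -> Tuple[Dict[str, Any], Optional[Callable[..., Any]]]:
    """Hypothetical helper: separate hooks meant for requests from the
    local post-processing hook."""
    hooks = dict(request_hooks or {})
    # Remove the local-only hook so it is not passed on to requests.
    postprocess = hooks.pop(POSTPROCESS_KEY, None)
    return hooks, postprocess


def build_raw_archive(
    content: bytes,
    headers: Mapping[str, str],
    override_headers: Optional[Mapping[str, str]] = None,
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Hypothetical helper: snapshot content and headers BEFORE any
    caller-supplied header overrides are merged in."""
    # The archive keeps the headers exactly as they were received.
    raw = {"content": content, "headers": dict(headers)}
    # The merged copy is what later processing would see.
    merged = dict(headers)
    merged.update(override_headers or {})
    return raw, merged
```

Keeping the snapshot separate from the merged headers means the archived copy still reflects what the server actually sent, which is the point when debugging throttling or DNS-related problems.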
Fixes: #289