-
Notifications
You must be signed in to change notification settings - Fork 540
fix: Do not share state between different crawlers unless requested #1669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
6ff0b3f to
31e16d2
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1669 +/- ##
==========================================
- Coverage 92.41% 92.40% -0.02%
==========================================
Files 157 157
Lines 10478 10488 +10
==========================================
+ Hits 9683 9691 +8
- Misses 795 797 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description) @janbuchar , @barjin, @B4nan. Could you please share your point of view? You can see the usage in code in the updated and new test in this PR. |
The Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs. Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit). |
B4nan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.
What about having an optional argument This will be an easy and clear default without the need for extra arguments and maximum flexibility for custom use cases. |
I have a hard time imagining that, could you sketch out some code samples? |
Only for discussion, types ignored for now.
Please check the latest commit. I added an example of how this could be done. (Please do not focus on that specific example; it is just to demonstrate the idea. The question is whether the |
|
Regarding this comment about Now the reasoning - we also have If we don't do this, it feels like a nasty BC that might surprise many people. Imagine you init the state before you run the crawler via |
This reverts commit 1bbb651.
632c844 to
a674038
Compare
a674038 to
01335f0
Compare
Description
idforBasicCrawler. This argument controls the shared state.BasicCrawlergets an automatically incrementedidto avoid unintentional sharing of state between crawlers.idto the crawler__init__.Issues
Closes: #1627
Testing