Skip to content

Conversation

@Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented Jan 12, 2026

Description

  • Introduces a new argument id for BasicCrawler. This argument controls the shared state.
  • Each new instance of the BasicCrawler gets an automatically incremented id to avoid unintentional sharing of state between crawlers.
  • If two crawlers should use the same state, then it is possible to pass the same id to the crawler __init__.

Issues

Closes: #1627

Testing

  • Added tests.

@github-actions github-actions bot added this to the 132nd sprint - Tooling team milestone Jan 12, 2026
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 12, 2026
@Pijukatel Pijukatel changed the title Add crawler fix: Do not share state between different crawlers unless requested Jan 12, 2026
@codecov
Copy link

codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.40%. Comparing base (0a0995c) to head (01335f0).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1669      +/-   ##
==========================================
- Coverage   92.41%   92.40%   -0.02%     
==========================================
  Files         157      157              
  Lines       10478    10488      +10     
==========================================
+ Hits         9683     9691       +8     
- Misses        795      797       +2     
Flag Coverage Δ
unit 92.40% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Pijukatel
Copy link
Collaborator Author

Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description)

@janbuchar , @barjin, @B4nan. Could you please share your point of view?

You can see the usage in code in the updated and new test in this PR.

@barjin
Copy link
Member

barjin commented Jan 12, 2026

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.

@Pijukatel
Copy link
Collaborator Author

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

This will be an easy and clear default without the need for extra arguments and maximum flexibility for custom use cases.

@janbuchar
Copy link
Collaborator

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

I have a hard time imagining that, could you sketch out some code samples?

Only for discussion, types ignored for now.
@Pijukatel
Copy link
Collaborator Author

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

I have a hard time imagining that, could you sketch out some code samples?

Please check the latest commit. I added an example of how this could be done. (Please do not focus on that specific example; it is just to demonstrate the idea. The question is whether the use_state should be some hardcoded internal that can be parametrized, or if it should be a component of the crawler that can be fully replaced by a custom implementation. )

@B4nan
Copy link
Member

B4nan commented Jan 15, 2026

Regarding this comment about Actor.useState vs Crawler.useState, I'd actually say they should point to the same state by default, let's call it id: 0. If you would create two crawlers in one context, the first would use the same state as the SDK.

Now the reasoning - we also have context.useState and I'd say it makes sense to allow people use Actor.useState and context.useState together.

If we don't do this, it feels like a nasty BC that might surprise many people. Imagine you init the state before you run the crawler via Actor.useState and then you use context.useState in the handler, but the initial state is not there, forcing you to define it in the handler again.

@Pijukatel Pijukatel requested a review from janbuchar January 15, 2026 12:33
@Pijukatel Pijukatel marked this pull request as ready for review January 15, 2026 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port fix of multiple crawler instances sharing state from JS version

5 participants