Conversation
|
Check out this pull request on ReviewNB: https://app.reviewnb.com/dask/dask-examples/pull/160. Review Jupyter notebook visual diffs & provide feedback on notebooks. Powered by ReviewNB |
|
There is already this:
https://examples.dask.org/machine-learning/text-vectorization.html
Maybe we can roll this into that somehow?
|
|
Merged them into a single "Working with text data" notebook that starts by comparing different vectorizers (HashingVectorizer, CountVectorizer) and ends with the full pipeline. |
| " toolz.sliding_window(2, lengths)]\n", | ||
| "# Notice the persist here! More details later.\n", | ||
| "documents = db.from_delayed([load_news(x) for x in slices]).persist()\n", | ||
| "documents" |
We could also call db.from_sequence(..., npartitions=10).persist() and then call client.rebalance().
Given that people are going to blindly copy-paste whatever we do anyway, I'd personally rather that they see this. I think it's a bit more in line with ordinary behavior.
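A rough sketch of what that suggestion might look like, assuming a small in-process cluster and a placeholder list of strings in place of the newsgroup posts (the `texts` list and the cluster settings are illustrative, not taken from the notebook):

```python
import dask.bag as db
from dask.distributed import Client, wait

# In-process cluster purely for illustration; the notebook would already
# have a client connected to its own cluster.
client = Client(
    processes=False, n_workers=2, threads_per_worker=1, dashboard_address=None
)

# Placeholder documents standing in for the real text data (assumption).
texts = [f"document number {i}" for i in range(20)]

# Build the bag from an in-memory sequence and keep it in cluster memory.
documents = db.from_sequence(texts, npartitions=10).persist()

# Wait for the partitions to exist, then spread them across the workers.
wait(documents)
client.rebalance()

# Count the documents while the client is still open.
n_docs = documents.count().compute()

client.close()
```

Compared with building the bag from delayed slices, this keeps the loading step to a single call, at the cost of materializing the sequence on the client first.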
| "import dask_ml.feature_extraction.text" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", |
I recommend merging adjacent code cells, if only to cut down on Ctrl-Enter pressing.
| "remote_vocabulary, = client.scatter([vocabulary], broadcast=True)\n", | ||
| "\n", | ||
| "vectorizer2 = dask_ml.feature_extraction.text.CountVectorizer(\n", | ||
| " vocabulary=remote_vocabulary\n", | ||
| ")" |
Is there a well-defined vocabulary that we can use somewhere? Maybe in nltk? I'm concerned that people will see this and think that they should copy the vocabulary off of one CountVectorizer and then pass it to another.
Also, do we need the scatter? Can you verify that if the vocabulary is passed directly as the vocabulary= keyword argument, it will occupy only a single task rather than being embedded in many of them?
I'm not sure about nltk, but it's probably not worth adding it to the environment just for this example. I noted that you'd probably get this from an external source in practice.
I think that as of dask/dask-ml#719, the answer to your question about a user-provided vocabulary being in one task is "yes". But that change hasn't been released yet.
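For readers skimming the thread, the reason a fixed vocabulary matters is that every partition has to map a given token to the same column. The following is a plain-Python sketch of that idea only; it is not how dask-ml implements CountVectorizer, and the `count_partition` helper and the toy vocabulary are hypothetical:

```python
from collections import Counter

# Hypothetical toy vocabulary mapping token -> column index.
vocabulary = {"dask": 0, "text": 1, "vector": 2}


def count_partition(docs, vocabulary):
    """Return one row of counts per document, with columns fixed by vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocabulary)
        for token, column in vocabulary.items():
            # Tokens missing from the document count as zero;
            # tokens missing from the vocabulary are ignored.
            row[column] = counts[token]
        rows.append(row)
    return rows


# Two partitions processed independently still agree on column meaning.
partition_a = ["dask text text", "unknown words only"]
partition_b = ["vector dask"]

rows_a = count_partition(partition_a, vocabulary)
rows_b = count_partition(partition_b, vocabulary)
```

Because the mapping is supplied up front, no pass over the data is needed to discover it, which is why the vocabulary would normally come from an external source instead of from another fitted vectorizer.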