Commit 0592f4f

Reflect that dill is no longer a default pickler. (#36903)
* Update python-pipeline-dependencies.md
* Update python-pipeline-dependencies.md
1 parent c490d2d commit 0592f4f

1 file changed: +6 −8 lines


website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md

Lines changed: 6 additions & 8 deletions
@@ -163,22 +163,20 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 ## Pickling and Managing the Main Session
 
 When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, is serialized (or pickled) into a bytecode using
-libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
-To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.
+libraries that perform the serialization (also called picklers). On Apache Beam 2.64.0 or earlier, the default pickler library was `dill`.
 
-By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job.
+When the `dill` pickler is used, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job by default.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner. To resolve this, supply the main session content with the pipeline by
 setting the `--save_main_session` pipeline option. This will load the pickled state of the global namespace onto the Dataflow workers (if using `DataflowRunner`).
 For example, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#name-error) to set the main session on the `DataflowRunner`.
 
-Managing the main session in Python SDK is only necessary when using `dill` pickler on any remote runner. Therefore, this issue will
-not occur in `DirectRunner`.
-
 Since serialization of the pipeline happens on the job submission, and deserialization happens at runtime, it is imperative that the same version of pickling library is used at job submission and at runtime.
-To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
-install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
+To ensure this, Beam users who use `dill` and choose to install a custom version of `dill` must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
 
+The `--save_main_session` pipeline option is not necessary when the `cloudpickle` pickler is used, which is the default pickler on Apache Beam 2.65.0 and later versions.
+To use the `cloudpickle` pickler on earlier Beam versions, supply the `--pickle_library=cloudpickle` pipeline option.
+
 ## Control the dependencies the pipeline uses {#control-dependencies}
 
 ### Pipeline environments
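
For context on the options discussed in the changed text, here is a minimal sketch of how a pipeline might supply `--save_main_session` (needed with the `dill` pickler) or `--pickle_library=cloudpickle`. The pipeline itself and the flag combinations are illustrative, not part of this commit.

```python
# Illustrative only: supplying the pickling-related pipeline options
# discussed in the updated documentation.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# With the dill pickler (the default on Beam 2.64.0 and earlier), save the main
# session so globals defined in the launching module are available on workers.
dill_style_options = PipelineOptions(flags=['--save_main_session'])

# On earlier Beam versions, opt in to cloudpickle explicitly; on Beam 2.65.0 and
# later it is already the default and --save_main_session is not needed.
cloudpickle_options = PipelineOptions(flags=['--pickle_library=cloudpickle'])

with beam.Pipeline(options=cloudpickle_options) as p:
    (p
     | beam.Create(['hello', 'beam'])
     | beam.Map(str.upper)
     | beam.Map(print))
```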
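The retained guidance about using the same pickler version at submission and at runtime could be applied roughly as follows; the requirements file name and the pinned version are hypothetical placeholders, not taken from the commit.

```python
# Hypothetical sketch: ship the launcher's custom dill version to the workers via
# --requirements_file so the pickler used at submission matches the one at runtime.
import dill
from apache_beam.options.pipeline_options import PipelineOptions

# Pin whatever dill version the job-submission environment actually has.
with open('requirements.txt', 'w') as f:  # placeholder file name
    f.write(f'dill=={dill.__version__}\n')

options = PipelineOptions(flags=[
    '--requirements_file=requirements.txt',
    '--save_main_session',
])
```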
