website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
6 additions & 8 deletions
@@ -163,22 +163,20 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
## Pickling and Managing the Main Session

When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into bytecode using
-libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
-To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.
+libraries that perform the serialization (also called picklers). In Apache Beam 2.64.0 and earlier, the default pickler library was `dill`.
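
What "pickled" means in practice, as a minimal illustrative sketch (Beam drives this internally; calling `dill` directly here is only for demonstration):

```python
import dill

def double(x):
    return x * 2

# At job submission, user code is serialized into bytes...
payload = dill.dumps(double)
# ...and at runtime the worker deserializes and calls it.
restored = dill.loads(payload)
assert restored(21) == 42
```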

-By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job.
+When the `dill` pickler is used, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job by default.
Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner. To resolve this, supply the main session content with the pipeline by
setting the `--save_main_session` pipeline option. This will load the pickled state of the global namespace onto the Dataflow workers (if using `DataflowRunner`).
For example, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#name-error) to set the main session on the `DataflowRunner`.
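
A minimal sketch of the failure mode and the fix (the `ExtractWordsFn` name and the sample data are invented for illustration):

```python
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        # `re` is a global import from the main module. Without
        # --save_main_session, dill-based serialization does not ship it,
        # and a remote worker raises: NameError: name 're' is not defined.
        return re.findall(r"[A-Za-z']+", element)

# --save_main_session pickles the main module's global namespace
# and restores it on the workers.
options = PipelineOptions(["--save_main_session"])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(["a dream within a dream"])
     | beam.ParDo(ExtractWordsFn()))
```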

-Managing the main session in Python SDK is only necessary when using `dill` pickler on any remote runner. Therefore, this issue will
-not occur in `DirectRunner`.
-
Since serialization of the pipeline happens at job submission, and deserialization happens at runtime, it is imperative that the same version of the pickling library is used at job submission and at runtime.
-To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
-install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
+To ensure this, Beam users who use `dill` and choose to install a custom version of `dill` must also ensure that they use the same custom version at runtime (e.g. in their custom container,
or by specifying a pipeline dependency requirement).
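
For example, a hypothetical sketch of pinning a custom `dill` for the workers (the version number and file name are placeholders, not Beam requirements):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# requirements.txt is assumed to exist next to the launcher and to contain
# the single line:
#   dill==0.3.1.1
# Workers then install the same dill version that was used at submission.
options = PipelineOptions([
    "--pickle_library=dill",
    "--requirements_file=requirements.txt",
])
```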

+The `--save_main_session` pipeline option is not necessary when the `cloudpickle` pickler is used, which is the default pickler in Apache Beam 2.65.0 and later versions.
+To use the `cloudpickle` pickler on earlier Beam versions, supply the `--pickle_library=cloudpickle` pipeline option.
+
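
For instance, a minimal sketch of opting in on an older Beam version (unnecessary on 2.65.0 and later, where `cloudpickle` is already the default):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Select cloudpickle explicitly; with it, --save_main_session is not needed.
options = PipelineOptions(["--pickle_library=cloudpickle"])

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(["hello"]) | beam.Map(str.upper)
```
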
## Control the dependencies the pipeline uses {#control-dependencies}