-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Allow users to specify trusted Avro serializable classes to Dataflow worker #36809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow users to specify trusted Avro serializable classes to Dataflow worker #36809
Conversation
…o Dataflow worker
Summary of ChangesHello @clairemcginty, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request proposes a solution to a runtime Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
3c20249 to
6bd13e1
Compare
6bd13e1 to
e027e2d
Compare
|
Assigning reviewers: R: @chamikaramj for label java. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
cc @Abacn - I think you merged the original upgrade from Avro 1.8 -> 1.11 :) |
| } | ||
| } | ||
| // Add trusted Avro serializable classes | ||
| if serializableClasses, ok := pipelineOptions.GetStructValue().GetFields()["avroSerializableClasses"]; ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have both; a default that matches avro and this option to override when needed. Ideally the avro default list should be public so users can extend/modify it as required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, for classes already caused regressions ("java.math.BigDecimal", and if there are others) we can add it here for users
...low-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java
Outdated
Show resolved
Hide resolved
| @Default.InstanceFactory(AvroSerializableClassesFactory.class) | ||
| List<String> getAvroSerializableClasses(); | ||
|
|
||
| void setAvroSerializableClasses(List<String> options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said this should be an "SDKHarnessOptions": https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/options/SdkHarnessOptions.java
56bce77 to
7cf5b2c
Compare
|
testWriteUnboundedWithCustomBatchParameters (org.apache.beam.sdk.io.TextIOWriteTest) failed not related to the change. Merging for now |
See: https://github.com/apache/avro/pull/3376/files
In Avro 1.11.4+, schemas that utilize the
java-classannotation throw runtime decoding errors.For example, applying a specific AvroCoder to a record with the following schema:
throws:
The fix, per https://github.com/apache/avro/pull/3376/files, is to set the sys prop
org.apache.avro.SERIALIZABLE_CLASSES. We can do that easily with DirectRunner, but there's no way to set it on Dataflow unless we create a custom worker imageThis PR adds a proposed worker harness option allowing the user to specify trusted classes.
An alternative would be to simply set a reasonable default; for example in the Avro project itself they set the following default value:
this would also solve my use case :)
(Another option would be to solve this with a custom
AvroDatumFactory, but that does involve some knowledge of Avro internals...)Let me know what you all think. If this looks good I can polish the javadoc/add tests/etc.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.