Add Iceberg CDC support to YAML #36641
Conversation
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment …
Assigning reviewers: R: @damccorm for label python. Note: If you would like to opt out of this review, comment … Available commands: …
The PR bot will only process comments in the main thread (not review comments).
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment …
catalog_name: "shipment_data"
catalog_properties:
  type: "rest"
  uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
So it seems like these are pure examples currently. Is the plan to parameterize these in the future so that they can be used in Flex template and Job builder blueprints? If so, can we add such parameterization now?
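For illustration, a Jinja-parameterized variant of the snippet above might look roughly like the following; the variable names are assumptions for the sketch, not something defined in this PR.

```yaml
# Hypothetical Jinja parameterization of the catalog config; the variable
# names here are illustrative assumptions, not the ones used in this PR.
catalog_name: "{{ catalog_name }}"
catalog_properties:
  type: "rest"
  uri: "{{ catalog_uri }}"
```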
This is a good point. I think we need to consider that the golden blueprints used in the Dataflow templates repo have to be properly tested. Ideally, we should probably have:
- One example blueprint that is properly filled out for users to use and that has mock testing.
- That same example blueprint with Jinja variable parameterization and integration testing, so that we know it works correctly.
- Any additional transform testing for precommits or postcommits.
What do you think about this:
- Add a golden Jinja YAML blueprint under extended_tests/blueprints that we can use both for integration testing and for eventually linking from the Dataflow templates repo.
- Create the example blueprints dynamically from these golden ones.
Thoughts?
That sounds good to me. Thanks!
I agree with these in the ideal scenario. But I think we cannot do exact integration tests for these YAML files, especially streaming pipelines, which never end. We are in any case writing integration tests in extended_tests that are close to the blueprints. Maybe we can add Jinja-parameterized blueprints directly to the blueprints folder?
To add to that, consider this CDC use case. First we need to have an Iceberg table, so we cannot just have a raw blueprint that is testable. We need to create the tables (which we already do in integration tests) or maintain them. Even then, I do not think every blueprint is testable on its own.
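As a rough illustration of that setup step, here is a minimal sketch of a YAML pipeline that seeds an Iceberg table before the CDC read is exercised; the table name, fields, and catalog details are assumptions for the sketch, not taken from this PR's tests.

```yaml
# Hypothetical seeding pipeline; table name, fields, and catalog details are
# illustrative assumptions, not the setup used by this PR's tests.
pipeline:
  transforms:
    - type: Create
      config:
        elements:
          - {label: "11a", rank: 0}
          - {label: "37a", rank: 1}
    - type: WriteToIceberg
      input: Create
      config:
        table: "shipment_db.shipments"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
```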
I think it's good to consider developing the main pipeline and the testing separately.
- The main pipeline should be something parameterized that we can use for templates / blueprints. This is our source of truth and contains the main code we need to keep correct via testing etc. It should not have specific parameters (references to specific databases etc.) unless we want to make those the defaults for the template/blueprint.
- A close-enough pipeline that we want to test, dynamically derived from the above. In some cases this can be the original pipeline (for example, Kafka, which can be run for a limited time). But for Pub/Sub we either need to inject mock transforms and/or cancel the pipeline using a runner-specific mechanism (for example, the Dataflow cancel operation [1]).
[1] https://docs.cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
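One way to derive the testable variant from the same source of truth is to gate the source on a Jinja flag, so the test build swaps the streaming source for an in-memory one. Below is a minimal sketch under that assumption; the flag name, transform names, and schema are illustrative, not part of this PR.

```yaml
# Hypothetical sketch: one blueprint yields either the real streaming source
# or an in-memory test source, selected by a Jinja flag. Names are assumptions.
pipeline:
  transforms:
{% if test_mode %}
    - type: Create
      name: Source
      config:
        elements:
          - {id: 1, payload: "test"}
{% else %}
    - type: ReadFromPubSub
      name: Source
      config:
        topic: "{{ input_topic }}"
        format: JSON
        schema:
          type: object
          properties:
            id: {type: integer}
            payload: {type: string}
{% endif %}
    - type: LogForTesting
      input: Source
```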
Chatted with Danny offline and already mentioned this to Tarun, but we are moving everything to the Dataflow repo as the source of truth, to reduce redundancy and complexity.
As per the discussion, I have removed the blueprints from this repo but left the integration tests. Since we define the YAML -> SDK mapping in this repo, we should still continue adding these tests. Specific integration tests for blueprints will go to the Templates repo.
@chamikaramj do we have to wait for a Beam release to start using the new YAML features added to this repo in the Templates repo?
No, we can add CDC support for YAML and tests here.
chamikaramj left a comment:
Thanks!
Could you also update the CL title and the description to match the updated version?
- type: AssertEqual
  config:
    elements:
      - {label: "11a", rank: 0}
Can we update the test to include more than one element? For example, a stream of data read from a predefined Iceberg table that spans multiple snapshots.
Added a filter on timestamp for this decade :)) and also added multiple conditions to get more than one record. I could not do streaming or snapshot filters in the existing setup; snapshots are catalog-level metadata.
Update: I'm adding a streaming pipeline integration test too, with a timestamp cutoff.
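For reference, a self-contained sketch of the shape such a test can take in Beam YAML, with an in-memory stand-in source, a timestamp-style filter, and an assertion on more than one element; the field names, values, and filter condition are illustrative assumptions, not the ones used in this PR.

```yaml
# Hypothetical test shape; fields, values, and the filter condition are
# illustrative assumptions rather than the actual test added in this PR.
pipeline:
  transforms:
    - type: Create
      config:
        elements:
          - {label: "11a", rank: 0, ts: "2024-01-01"}
          - {label: "37a", rank: 1, ts: "2025-06-01"}
          - {label: "00x", rank: 9, ts: "1999-01-01"}
    - type: Filter
      input: Create
      config:
        language: python
        keep: "ts >= '2020-01-01'"
    - type: AssertEqual
      input: Filter
      config:
        elements:
          - {label: "11a", rank: 0, ts: "2024-01-01"}
          - {label: "37a", rank: 1, ts: "2025-06-01"}
```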
Run Python_Coverage PreCommit 3.10
Expose IcebergCDC through YAML; add Iceberg-to-Iceberg streaming and batch integration tests.
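As a rough illustration of what this exposes, a minimal sketch of an Iceberg CDC read mirrored into another Iceberg table via YAML; the CDC transform name and its options here are assumptions based on the Managed Iceberg CDC read and may not match the exact surface added in this PR.

```yaml
# Hypothetical sketch only: the CDC transform name and its options are
# assumptions based on the Managed Iceberg CDC read, not this PR's exact API.
pipeline:
  transforms:
    - type: ReadFromIcebergCDC
      config:
        table: "shipment_db.shipments"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
        streaming: true
        poll_interval_seconds: 60
    - type: WriteToIceberg
      input: ReadFromIcebergCDC
      config:
        table: "shipment_db.shipments_mirror"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
options:
  streaming: true
```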