
Conversation

@tarun-google (Contributor) commented Oct 27, 2025

Expose IcebergCDC through YAML, Add Iceberg-to-Iceberg streaming and batch integration tests


GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@tarun-google tarun-google marked this pull request as ready for review October 30, 2025 16:53
@github-actions (Contributor) commented:

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@github-actions (Contributor) commented:

Assigning reviewers:

R: @damccorm for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@tarun-google (Contributor, Author) commented:

@github-actions (Contributor) commented:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

catalog_name: "shipment_data"
catalog_properties:
  type: "rest"
  uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
Contributor:

So it seems like these are pure examples currently. Is the plan to parameterize these in the future so that they can be used in Flex template and Job builder blueprints? If so, can we add such parameterization now?

@derrickaw (Collaborator) commented Nov 3, 2025

This is a good point. I think we need to consider that the golden blueprints that are used in the Dataflow templates repo have to be properly tested.

Ideally, we should probably have:

  1. One example blueprint that is properly filled out for users to use that has mock testing.
  2. That same example blueprint that has jinja variable parameterization and has integration testing so that we know it works correctly.
  3. Any additional transform testing for precommit or postcommits.

What do you think about this:

  1. Add a golden jinja yaml blueprint in the extended_tests/blueprints that we can use for both integration testing and for eventually linking in the Dataflow templates repo.
  2. Create the example blueprints dynamically, based on these golden ones.

Thoughts?
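For illustration, a parameterized golden blueprint along the lines of point 1 could look like the sketch below. The jinja variable names and transform layout here are assumptions for this sketch, not the actual blueprint in this PR:

```yaml
# Hypothetical golden blueprint with jinja parameterization.
# Variable names ({{ source_table }}, {{ catalog_uri }}, etc.) are
# illustrative only; the real blueprint may differ.
pipeline:
  type: chain
  transforms:
    - type: ReadFromIceberg
      config:
        table: "{{ source_table }}"
        catalog_name: "{{ catalog_name }}"
        catalog_properties:
          type: "rest"
          uri: "{{ catalog_uri }}"
    - type: WriteToIceberg
      config:
        table: "{{ sink_table }}"
        catalog_name: "{{ catalog_name }}"
        catalog_properties:
          type: "rest"
          uri: "{{ catalog_uri }}"
```

An integration test could render this with concrete values before submission, while the examples directory ships a rendered, default-filled form of the same file.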

Contributor:

That sounds good to me. Thanks!

@tarun-google (Contributor, Author) commented Nov 3, 2025

I agree with these in an ideal scenario, but I think we cannot do exact integration tests for these YAML files, especially streaming pipelines, which never end. We are anyway writing integration tests in extended_tests which are close to the blueprints. Maybe we can add jinja-parameterized blueprints directly to the blueprints folder?

@tarun-google (Contributor, Author) commented:

Adding more: consider this CDC use case. First we need to have an Iceberg table, so we cannot just have a raw blueprint that is testable. We need to create the tables (which we already do in integration tests) or maintain them. Still, I do not think every blueprint is testable on its own.

Contributor:

I think it's good to consider developing the main pipeline and testing separately.

  • Main pipeline should be something parameterized that we can use for templates / blueprints. This is our source of truth and contains the main code we need to keep correct via testing etc. This should not have specific parameters (references to specific databases etc.) unless we want to make that the default for the template/blueprint.

  • A close-enough pipeline that we want to test, dynamically derived from the above. In some cases this can be the original pipeline (for example, Kafka, which can be run for a limited time). But for Pub/Sub we either need to inject mock transforms and/or cancel the pipeline using a runner-specific mechanism (for example, the Dataflow cancel operation [1]).

[1] https://docs.cloud.google.com/dataflow/docs/guides/stopping-a-pipeline

Collaborator:

Chatted with Danny offline and already mentioned to Tarun, but we are moving everything to Dataflow repo as the source of truth to reduce redundancy and complexity.

@tarun-google (Contributor, Author) commented Nov 10, 2025

As per the discussion, I have removed the blueprints from this repo but left the integration tests. As we define the YAML->SDK mapping in this repo, we should still continue adding these tests. Specific integration tests for blueprints will go to the Templates repo.

@chamikaramj do we have to wait for a Beam release to start using the new YAML features added to this repo in the Templates repo?

Contributor:

No, we can add CDC support for YAML and tests here.

@chamikaramj (Contributor) left a comment:

Thanks!

Could you also update the CL title and description to match the updated version?

- type: AssertEqual
  config:
    elements:
      - {label: "11a", rank: 0}
Contributor:

Can we update the test to include more than one element? For example, a stream of data read from a predefined Iceberg table that spans multiple snapshots.
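For what it's worth, a multi-element assertion could look like the sketch below; the extra elements are made up for illustration and are not taken from the actual test data:

```yaml
# Hypothetical expected output spanning several records; labels and
# ranks beyond the first element are illustrative only.
- type: AssertEqual
  config:
    elements:
      - {label: "11a", rank: 0}
      - {label: "37a", rank: 1}
      - {label: "389a", rank: 2}
```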

@tarun-google (Contributor, Author) commented Nov 11, 2025

Added a filter on timestamp for this decade :)) and also added multiple conditions to get more than one record. Could not do streaming or snapshot filters in the existing setup; snapshot is catalog-level metadata.
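As a sketch of the kind of timestamp filter described here, Beam YAML's Filter transform can express a cutoff; the field name and cutoff value below are assumptions, not the exact config added in this PR:

```yaml
# Hypothetical filter keeping only rows from this decade.
# "created_at" is an assumed field name for illustration.
- type: Filter
  config:
    language: python
    keep: "created_at >= '2020-01-01T00:00:00'"
```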

@tarun-google (Contributor, Author) commented:

Update: adding a streaming pipeline integration test too, with a timestamp cutoff.


@tarun-google tarun-google changed the title Add Iceberg CDC support to YAML and Blueprints Add Iceberg CDC support to YAML Nov 11, 2025
@tarun-google (Contributor, Author) commented:

Run Python_Coverage PreCommit 3.10
