Add Iceberg CDC support to YAML #36641
Conversation
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment …
Assigning reviewers: R: @damccorm for label python. Note: If you would like to opt out of this review, comment … Available commands: …
The PR bot will only process comments in the main thread (not review comments).
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment …
catalog_name: "shipment_data"
catalog_properties:
  type: "rest"
  uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
So it seems like these are pure examples currently. Is the plan to parameterize these in the future so that they can be used in Flex template and Job builder blueprints? If so, can we add such parameterization now?
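For illustration, a Jinja-parameterized variant of the snippet above might look roughly like the following; the variable names are assumptions for the sketch, not something defined in this PR.

```yaml
# Hypothetical Jinja parameterization of the catalog config; the variable
# names here are illustrative assumptions, not the ones used in this PR.
catalog_name: "{{ catalog_name }}"
catalog_properties:
  type: "rest"
  uri: "{{ catalog_uri }}"
```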
This is a good point. I think we need to consider that the golden blueprints used in the Dataflow templates repo have to be properly tested. Ideally, we should probably have:
- One example blueprint that is properly filled out for users to use and that has mock testing.
- That same example blueprint with Jinja variable parameterization and integration testing, so that we know it works correctly.
- Any additional transform testing for precommits or postcommits.
What do you think about this:
- Add a golden Jinja YAML blueprint under extended_tests/blueprints that we can use both for integration testing and for eventually linking from the Dataflow templates repo.
- Create the example blueprints dynamically from these golden ones.
Thoughts?
That sounds good to me. Thanks!
I agree with these in the ideal scenario. But I think we cannot do exact integration tests for these YAML files, especially streaming pipelines, which never end. We are in any case writing integration tests in extended_tests that are close to the blueprints. Maybe we can add Jinja-parameterized blueprints directly to the blueprints folder?
To add to that, consider this CDC use case. First we need to have an Iceberg table, so we cannot just have a raw blueprint that is testable. We need to create the tables (which we already do in integration tests) or maintain them. Even then, I do not think every blueprint is testable on its own.
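As a rough illustration of that setup step, here is a minimal sketch of a YAML pipeline that seeds an Iceberg table before the CDC read is exercised; the table name, fields, and catalog details are assumptions for the sketch, not taken from this PR's tests.

```yaml
# Hypothetical seeding pipeline; table name, fields, and catalog details are
# illustrative assumptions, not the setup used by this PR's tests.
pipeline:
  transforms:
    - type: Create
      config:
        elements:
          - {label: "11a", rank: 0}
          - {label: "37a", rank: 1}
    - type: WriteToIceberg
      input: Create
      config:
        table: "shipment_db.shipments"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
```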
I think it's good to consider developing the main pipeline and the testing separately.
- The main pipeline should be something parameterized that we can use for templates / blueprints. This is our source of truth and contains the main code we need to keep correct via testing etc. It should not have specific parameters (references to specific databases etc.) unless we want to make those the defaults for the template/blueprint.
- A close-enough pipeline that we want to test, dynamically derived from the above. In some cases this can be the original pipeline (for example, Kafka, which can be run for a limited time). But for Pub/Sub we either need to inject mock transforms and/or cancel the pipeline using a runner-specific mechanism (for example, the Dataflow cancel operation [1]).
[1] https://docs.cloud.google.com/dataflow/docs/guides/stopping-a-pipeline
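One way to derive the testable variant from the same source of truth is to gate the source on a Jinja flag, so the test build swaps the streaming source for an in-memory one. Below is a minimal sketch under that assumption; the flag name, transform names, and schema are illustrative, not part of this PR.

```yaml
# Hypothetical sketch: one blueprint yields either the real streaming source
# or an in-memory test source, selected by a Jinja flag. Names are assumptions.
pipeline:
  transforms:
{% if test_mode %}
    - type: Create
      name: Source
      config:
        elements:
          - {id: 1, payload: "test"}
{% else %}
    - type: ReadFromPubSub
      name: Source
      config:
        topic: "{{ input_topic }}"
        format: JSON
        schema:
          type: object
          properties:
            id: {type: integer}
            payload: {type: string}
{% endif %}
    - type: LogForTesting
      input: Source
```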
Chatted with Danny offline and already mentioned this to Tarun, but we are moving everything to the Dataflow repo as the source of truth, to reduce redundancy and complexity.
As per the discussion, I have removed the blueprints from this repo but left the integration tests. Since we define the YAML -> SDK mapping in this repo, we should still continue adding these tests. Specific integration tests for blueprints will go to the Templates repo.
@chamikaramj do we have to wait for a Beam release to start using the new YAML features added to this repo in the Templates repo?
No, we can add CDC support for YAML and tests here.
chamikaramj left a comment:
Thanks!
Could you also update the CL title and the description to match the updated version?
- type: AssertEqual
  config:
    elements:
      - {label: "11a", rank: 0}
Can we update the test to include more than one element? For example, a stream of data read from a predefined Iceberg table that spans multiple snapshots.
Added a filter on timestamp for this decade :)) and also added multiple conditions to get more than one record. I could not do streaming or snapshot filters in the existing setup; snapshots are catalog-level metadata.
Update: I'm adding a streaming pipeline integration test too, with a timestamp cutoff.
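For reference, a self-contained sketch of the shape such a test can take in Beam YAML, with an in-memory stand-in source, a timestamp-style filter, and an assertion on more than one element; the field names, values, and filter condition are illustrative assumptions, not the ones used in this PR.

```yaml
# Hypothetical test shape; fields, values, and the filter condition are
# illustrative assumptions rather than the actual test added in this PR.
pipeline:
  transforms:
    - type: Create
      config:
        elements:
          - {label: "11a", rank: 0, ts: "2024-01-01"}
          - {label: "37a", rank: 1, ts: "2025-06-01"}
          - {label: "00x", rank: 9, ts: "1999-01-01"}
    - type: Filter
      input: Create
      config:
        language: python
        keep: "ts >= '2020-01-01'"
    - type: AssertEqual
      input: Filter
      config:
        elements:
          - {label: "11a", rank: 0, ts: "2024-01-01"}
          - {label: "37a", rank: 1, ts: "2025-06-01"}
```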
Run Python_Coverage PreCommit 3.10
Expose IcebergCDC through YAML; add Iceberg-to-Iceberg streaming and batch integration tests.
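As a rough illustration of what this exposes, a minimal sketch of an Iceberg CDC read mirrored into another Iceberg table via YAML; the CDC transform name and its options here are assumptions based on the Managed Iceberg CDC read and may not match the exact surface added in this PR.

```yaml
# Hypothetical sketch only: the CDC transform name and its options are
# assumptions based on the Managed Iceberg CDC read, not this PR's exact API.
pipeline:
  transforms:
    - type: ReadFromIcebergCDC
      config:
        table: "shipment_db.shipments"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
        streaming: true
        poll_interval_seconds: 60
    - type: WriteToIceberg
      input: ReadFromIcebergCDC
      config:
        table: "shipment_db.shipments_mirror"
        catalog_name: "shipment_data"
        catalog_properties:
          type: "rest"
          uri: "https://biglake.googleapis.com/iceberg/v1beta/restcatalog"
options:
  streaming: true
```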