[DCP Ingestion] Connect the CDC Data Job to the ingestion workflow #37
gmechali wants to merge 3 commits into datacommonsorg:main
Conversation
Code Review
This pull request updates the Google provider version, configures environment variables for Cloud Run jobs to support workflows, and modifies Spanner instance settings including the edition and processing units. Feedback highlights a logic error in conditional IAM role assignment for Spanner, suggests reducing default processing units to manage costs, recommends decoupling modules by avoiding hardcoded service account strings, and advises using more restrictive storage roles to follow the principle of least privilege.
| "roles/spanner.databaseUser", | ||
| "roles/workflows.invoker" |
The logic for conditionally granting roles/spanner.databaseUser appears to be incorrect. The role is currently included in the static list and then concatenated again if var.use_spanner is false, meaning it is granted regardless of the setting. If Spanner access should be conditional, the role should be removed from the static list and the ternary logic should be corrected to var.use_spanner ? ["roles/spanner.databaseUser"] : [].
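A minimal sketch of the corrected conditional, assuming the role list is assembled with concat() in a local value; the local name is illustrative:

```hcl
# Grant spanner.databaseUser only when Spanner is actually enabled.
locals {
  dcp_ingestion_roles = concat(
    [
      "roles/workflows.invoker",
    ],
    var.use_spanner ? ["roles/spanner.databaseUser"] : []
  )
}
```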
| description = "Spanner units for DCP" | ||
| type = number | ||
| default = 100 | ||
| default = 1000 |
Increasing the default dcp_spanner_processing_units from 100 to 1000 (equivalent to one full Spanner node) significantly increases the baseline cost of the deployment. If this capacity is only required for specific high-load ingestion scenarios, consider keeping a lower default and overriding it via environment-specific variables where needed (see the sketch after the references).
References
- Environment variables that are intentionally specific to one service should be defined directly within that service's configuration rather than in a shared configuration file.
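A sketch of the override pattern this suggests, assuming the variable lives in the dcp module and that per-environment .tfvars files exist; the file name is illustrative:

```hcl
# variables.tf: keep a conservative default for all environments
variable "dcp_spanner_processing_units" {
  description = "Spanner units for DCP"
  type        = number
  default     = 100
}
```

with the higher capacity applied only where the ingestion load requires it, for example in a hypothetical prod.tfvars:

```hcl
# prod.tfvars: override for the high-load environment only
dcp_spanner_processing_units = 1000
```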
| resource "google_storage_bucket_iam_member" "dataflow_bucket_access" { | ||
| bucket = google_storage_bucket.data_bucket.name | ||
| role = "roles/storage.objectAdmin" | ||
| member = "serviceAccount:${local.name_prefix}dcp-ingestion-sa@${var.project_id}.iam.gserviceaccount.com" |
Constructing the service account email as a string literal creates a tight coupling between the cdc and dcp modules. This approach is brittle and will break if the naming convention or resource ID in the dcp module changes. It is better to define an input variable for this service account email in the cdc module and pass the value from the dcp module's output in the root main.tf.
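A sketch of that decoupling, assuming the modules live under modules/dcp and modules/cdc and that the dcp module creates the service account; the output, variable, and resource names are assumptions:

```hcl
# modules/dcp/outputs.tf: expose the ingestion service account email
output "ingestion_sa_email" {
  value = google_service_account.dcp_ingestion.email
}

# modules/cdc/variables.tf: accept the email instead of rebuilding it from a naming convention
variable "dcp_ingestion_sa_email" {
  description = "Email of the DCP ingestion service account"
  type        = string
}

# modules/cdc/main.tf: reference the variable rather than a hardcoded string
resource "google_storage_bucket_iam_member" "dataflow_bucket_access" {
  bucket = google_storage_bucket.data_bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${var.dcp_ingestion_sa_email}"
}

# root main.tf: wire the dcp output into the cdc module
module "cdc" {
  source                 = "./modules/cdc"
  dcp_ingestion_sa_email = module.dcp.ingestion_sa_email
}
```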
| resource "google_storage_bucket_iam_member" "cdc_data_bucket_access" { | ||
| bucket = google_storage_bucket.data_bucket.name | ||
| role = "roles/storage.objectAdmin" |
Using roles/storage.objectAdmin grants broad permissions, including the ability to manage object ACLs and IAM policies. Following the principle of least privilege, roles/storage.objectUser is generally preferred for ingestion workflows: it provides full control over objects (create, read, update, delete) without granting administrative access to bucket or object permissions.
role = "roles/storage.objectUser"
Passes the environment variables relating to the ingestion workflow into the CDC data job.
Ensures the SA running the CDC data job can trigger the workflow, and that the SA running Dataflow can write to the GCS bucket.
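A minimal sketch of how those environment variables might be wired into the job, assuming it is defined as a google_cloud_run_v2_job; the variable and environment variable names are assumptions rather than the PR's actual code:

```hcl
resource "google_cloud_run_v2_job" "cdc_data_job" {
  name     = "cdc-data-job"
  location = var.region

  template {
    template {
      containers {
        image = var.cdc_data_job_image

        # Coordinates of the ingestion workflow the job needs in order to trigger it.
        env {
          name  = "INGESTION_WORKFLOW_NAME"
          value = var.ingestion_workflow_name
        }
        env {
          name  = "INGESTION_WORKFLOW_LOCATION"
          value = var.region
        }
        env {
          name  = "PROJECT_ID"
          value = var.project_id
        }
      }
    }
  }
}
```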