[DCP Ingestion] Connect the CDC Data Job to the ingestion workflow #37
gmechali wants to merge 3 commits into datacommonsorg:main
Conversation
Code Review
This pull request updates the Google provider version, configures environment variables for Cloud Run jobs to support workflows, and modifies Spanner instance settings including the edition and processing units. Feedback highlights a logic error in conditional IAM role assignment for Spanner, suggests reducing default processing units to manage costs, recommends decoupling modules by avoiding hardcoded service account strings, and advises using more restrictive storage roles to follow the principle of least privilege.
| "roles/spanner.databaseUser", | ||
| "roles/workflows.invoker" |
The logic for conditionally granting roles/spanner.databaseUser appears to be incorrect. The role is currently included in the static list and then concatenated again if var.use_spanner is false, meaning it is granted regardless of the setting. If Spanner access should be conditional, the role should be removed from the static list and the ternary logic should be corrected to var.use_spanner ? ["roles/spanner.databaseUser"] : [].
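A minimal sketch of the corrected conditional, assuming the role list is assembled with concat() in a local value; the local name is illustrative:

```hcl
# Grant spanner.databaseUser only when Spanner is actually enabled.
locals {
  dcp_ingestion_roles = concat(
    [
      "roles/workflows.invoker",
    ],
    var.use_spanner ? ["roles/spanner.databaseUser"] : []
  )
}
```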
| description = "Spanner units for DCP" | ||
| type = number | ||
| default = 100 | ||
| default = 1000 |
Increasing the default dcp_spanner_processing_units from 100 to 1000 (equivalent to one full Spanner node) significantly increases the baseline cost of the deployment. If this capacity is only required for specific high-load ingestion scenarios, consider keeping a lower default and overriding it via environment-specific variables where needed (see the sketch after the references).
References
- Environment variables that are intentionally specific to one service should be defined directly within that service's configuration rather than in a shared configuration file.
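A sketch of the override pattern this suggests, assuming the variable lives in the dcp module and that per-environment .tfvars files exist; the file name is illustrative:

```hcl
# variables.tf: keep a conservative default for all environments
variable "dcp_spanner_processing_units" {
  description = "Spanner units for DCP"
  type        = number
  default     = 100
}
```

with the higher capacity applied only where the ingestion load requires it, for example in a hypothetical prod.tfvars:

```hcl
# prod.tfvars: override for the high-load environment only
dcp_spanner_processing_units = 1000
```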
| resource "google_storage_bucket_iam_member" "dataflow_bucket_access" { | ||
| bucket = google_storage_bucket.data_bucket.name | ||
| role = "roles/storage.objectAdmin" | ||
| member = "serviceAccount:${local.name_prefix}dcp-ingestion-sa@${var.project_id}.iam.gserviceaccount.com" |
Constructing the service account email as a string literal creates a tight coupling between the cdc and dcp modules. This approach is brittle and will break if the naming convention or resource ID in the dcp module changes. It is better to define an input variable for this service account email in the cdc module and pass the value from the dcp module's output in the root main.tf.
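A sketch of that decoupling, assuming the modules live under modules/dcp and modules/cdc and that the dcp module creates the service account; the output, variable, and resource names are assumptions:

```hcl
# modules/dcp/outputs.tf: expose the ingestion service account email
output "ingestion_sa_email" {
  value = google_service_account.dcp_ingestion.email
}

# modules/cdc/variables.tf: accept the email instead of rebuilding it from a naming convention
variable "dcp_ingestion_sa_email" {
  description = "Email of the DCP ingestion service account"
  type        = string
}

# modules/cdc/main.tf: reference the variable rather than a hardcoded string
resource "google_storage_bucket_iam_member" "dataflow_bucket_access" {
  bucket = google_storage_bucket.data_bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${var.dcp_ingestion_sa_email}"
}

# root main.tf: wire the dcp output into the cdc module
module "cdc" {
  source                 = "./modules/cdc"
  dcp_ingestion_sa_email = module.dcp.ingestion_sa_email
}
```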
| resource "google_storage_bucket_iam_member" "cdc_data_bucket_access" { | ||
| bucket = google_storage_bucket.data_bucket.name | ||
| role = "roles/storage.objectAdmin" |
Using roles/storage.objectAdmin grants broad permissions, including the ability to manage object ACLs and IAM policies. Following the principle of least privilege, roles/storage.objectUser is generally preferred for ingestion workflows: it provides full control over objects (create, read, update, delete) without granting administrative access to bucket or object permissions.
role = "roles/storage.objectUser"
Passes the environment variables relating to the ingestion workflow into the CDC data job.
Ensures the SA running the CDC data job can trigger the workflow, and that the SA running Dataflow can write to the GCS bucket.
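A minimal sketch of how those environment variables might be wired into the job, assuming it is defined as a google_cloud_run_v2_job; the variable and environment variable names are assumptions rather than the PR's actual code:

```hcl
resource "google_cloud_run_v2_job" "cdc_data_job" {
  name     = "cdc-data-job"
  location = var.region

  template {
    template {
      containers {
        image = var.cdc_data_job_image

        # Coordinates of the ingestion workflow the job needs in order to trigger it.
        env {
          name  = "INGESTION_WORKFLOW_NAME"
          value = var.ingestion_workflow_name
        }
        env {
          name  = "INGESTION_WORKFLOW_LOCATION"
          value = var.region
        }
        env {
          name  = "PROJECT_ID"
          value = var.project_id
        }
      }
    }
  }
}
```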