Merged
6 changes: 5 additions & 1 deletion .gitignore
@@ -36,4 +36,8 @@ infra/dcp/.env
*.pyc
.venv/
.env
*.generated.yaml
*.generated.yaml

AGENTS.md
docs/plans
docs/designs
18 changes: 18 additions & 0 deletions infra/dcp/README.md
@@ -104,6 +104,24 @@ gcloud workflows run <namespace>-ingestion-orchestrator \

## Architecture & Troubleshooting

### Modular Structure
Stack composition is delegated to `modules/stack`, which manages smaller, dedicated sub-modules for various components of both the CDC and DCP stacks.

### Module Overview
* **`stack`**: Orchestrates sub-modules based on feature toggles ([modules/stack](modules/stack/main.tf)).
* **`cdc_data_ingestion_job`**: Ingestion Cloud Run v2 Job.
* **`cdc_iam`**: IAM and Secret Manager config for CDC.
* **`cdc_mysql`**: Cloud SQL MySQL instance and databases.
* **`cdc_network`**: VPC and serverless access connectors.
* **`cdc_redis`**: Memorystore Redis instance.
* **`cdc_services`**: CDC Cloud Run v2 web services ([modules/cdc_services](modules/cdc_services/main.tf)).
* **`dcp_ingestion_dataflow`**: Dataflow runner service account and IAM for DCP ([modules/dcp_ingestion_dataflow](modules/dcp_ingestion_dataflow/main.tf)).
* **`dcp_ingestion_helper`**: Helper Cloud Run service for DCP ([modules/dcp_ingestion_helper](modules/dcp_ingestion_helper/main.tf)).
* **`dcp_ingestion_workflow`**: Cloud Workflows for orchestration.
* **`dcp_service`**: DCP Cloud Run service.
* **`storage`**: GCS buckets for both CDC and DCP stacks ([modules/storage](modules/storage/main.tf)).
* **`spanner`**: Shared Cloud Spanner instance and databases.
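
As a rough illustration of how `stack` could turn feature toggles into sub-module instances, the sketch below declares grouped object inputs and gates two sub-modules with `count`. It is a minimal sketch under assumptions: the relative `source` paths, the trimmed attribute lists, the sub-module input names, and the gating conditions are hypothetical and may differ from the actual `modules/stack` code.

```hcl
# Hypothetical excerpt of modules/stack. Attribute lists are trimmed, and the
# sub-module inputs shown here are assumptions, not the real interfaces.

variable "shared" {
  type = object({
    project_id          = string
    region              = string
    namespace           = string
    deletion_protection = bool
  })
}

variable "toggles" {
  type = object({
    enable_dcp = bool
    enable_cdc = bool
  })
}

variable "cdc" {
  type = object({
    enable_redis        = bool
    redis_instance_name = string
  })
}

# Assumed gating: the shared Spanner instance exists only when DCP is enabled.
module "spanner" {
  source = "../spanner" # assumed sibling directory under modules/
  count  = var.toggles.enable_dcp ? 1 : 0

  project_id          = var.shared.project_id
  namespace           = var.shared.namespace
  deletion_protection = var.shared.deletion_protection
}

# Assumed gating: Redis requires both the CDC stack and its own opt-in flag.
module "cdc_redis" {
  source = "../cdc_redis" # assumed sibling directory under modules/
  count  = var.toggles.enable_cdc && var.cdc.enable_redis ? 1 : 0

  project_id    = var.shared.project_id
  region        = var.shared.region
  instance_name = var.cdc.redis_instance_name
}
```

Because a module with `count` becomes a list, other resources would reference an instance as `module.spanner[0]`, guarded by the same toggle.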

### Orchestrator Pattern
The ingestion pipeline uses Google Cloud Workflows as an orchestrator. The workflow receives the ingestion parameters, names the Dataflow job with a timestamp, launches the Dataflow Flex Template, and returns the job status. Callers therefore do not need to interact directly with the Dataflow APIs for standard ingestion tasks.
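
The steps above could be expressed roughly as follows. This is a minimal sketch, not the definition shipped in `dcp_ingestion_workflow`: the resource label, the `ingestion-` job name prefix, and the `template_gcs_path` and `parameters` input fields are hypothetical. Only the `<namespace>-ingestion-orchestrator` workflow name comes from the `gcloud` command earlier in this README.

```hcl
# Hypothetical sketch of the orchestrator workflow. The input field names and
# the job name prefix are assumptions for illustration.

variable "project_id" { type = string }
variable "region" { type = string }
variable "namespace" { type = string }

resource "google_workflows_workflow" "ingestion_orchestrator" {
  project = var.project_id
  region  = var.region
  name    = "${var.namespace}-ingestion-orchestrator"

  # "$${...}" escapes Terraform interpolation, so Workflows receives a literal
  # "${...}" expression. Plain "${var...}" is filled in by Terraform.
  source_contents = <<-EOT
    main:
      params: [args]
      steps:
        - name_job:
            assign:
              - job_name: $${"ingestion-" + string(int(sys.now()))}
        - launch_template:
            call: googleapis.dataflow.v1b3.projects.locations.flexTemplates.launch
            args:
              projectId: ${var.project_id}
              location: ${var.region}
              body:
                launchParameter:
                  jobName: $${job_name}
                  containerSpecGcsPath: $${args.template_gcs_path}
                  parameters: $${args.parameters}
            result: launch_result
        - return_status:
            return: $${launch_result.job}
  EOT
}
```

In this sketch the caller supplies only a template path and a parameter map. The workflow generates the job name and issues the Dataflow launch call itself.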

176 changes: 80 additions & 96 deletions infra/dcp/main.tf
@@ -8,6 +8,10 @@ terraform {
source = "hashicorp/null"
version = ">= 3.0"
}
random = {
source = "hashicorp/random"
version = ">= 3.0"
}
}
}

@@ -25,7 +29,6 @@ provider "google-beta" {
billing_project = var.billing_project_id != null ? var.billing_project_id : var.project_id
}

# Enable required APIs for both stacks
resource "google_project_service" "apis" {
for_each = toset(concat([
"apikeys.googleapis.com",
@@ -39,113 +42,94 @@ resource "google_project_service" "apis" {
"compute.googleapis.com"
], var.enable_dcp ? ["spanner.googleapis.com"] : [], var.dcp_deploy_data_ingestion_workflow ? [
"workflows.googleapis.com",
"workflowexecutions.googleapis.com"
"workflowexecutions.googleapis.com",
"dataflow.googleapis.com"
] : []))

service = each.key
disable_on_destroy = false
}

# --- Network Data Sources ---
data "google_compute_network" "default" {
name = var.cdc_vpc_network_name
}

data "google_compute_subnetwork" "default" {
name = var.cdc_vpc_network_subnet_name
region = var.region
}

# --- Data Commons Platform (DCP) Stack ---
module "dcp" {
source = "./modules/dcp"
count = var.enable_dcp ? 1 : 0

project_id = var.project_id
namespace = var.namespace
region = var.region
image_url = var.dcp_image_url
service_name = var.dcp_service_name
service_account_name = var.dcp_service_account_name
create_spanner_instance = var.dcp_create_spanner_instance
create_spanner_db = var.dcp_create_spanner_db
spanner_instance_id = var.dcp_spanner_instance_id
spanner_database_id = var.dcp_spanner_database_id
spanner_processing_units = var.dcp_spanner_processing_units
service_cpu = var.dcp_service_cpu
service_memory = var.dcp_service_memory
service_min_instances = var.dcp_service_min_instances
service_max_instances = var.dcp_service_max_instances
service_concurrency = var.dcp_service_concurrency
service_timeout_seconds = var.dcp_service_timeout_seconds
make_service_public = var.make_services_public
deletion_protection = var.deletion_protection
locals {
stack_shared = {
project_id = var.project_id
region = var.region
namespace = var.namespace
deletion_protection = var.deletion_protection
make_services_public = var.make_services_public
}

deploy_data_ingestion_workflow = var.dcp_deploy_data_ingestion_workflow
stack_toggles = {
enable_dcp = var.enable_dcp
enable_cdc = var.enable_cdc
}

stack_dcp = {
image_url = var.dcp_image_url
service_name = var.dcp_service_name
service_account_name = var.dcp_service_account_name
create_spanner_instance = var.dcp_create_spanner_instance
create_spanner_db = var.dcp_create_spanner_db
spanner_instance_id = var.dcp_spanner_instance_id
spanner_database_id = var.dcp_spanner_database_id
spanner_processing_units = var.dcp_spanner_processing_units
service_cpu = var.dcp_service_cpu
service_memory = var.dcp_service_memory
service_min_instances = var.dcp_service_min_instances
service_max_instances = var.dcp_service_max_instances
service_concurrency = var.dcp_service_concurrency
service_timeout_seconds = var.dcp_service_timeout_seconds
deploy_data_ingestion_workflow = var.dcp_deploy_data_ingestion_workflow
create_ingestion_bucket = var.dcp_create_ingestion_bucket
external_ingestion_bucket_name = var.dcp_external_ingestion_bucket_name
ingestion_lock_timeout = var.dcp_ingestion_lock_timeout
}

depends_on = [google_project_service.apis]
stack_cdc = {
dc_api_key = var.cdc_dc_api_key
maps_api_key = var.cdc_maps_api_key
disable_google_maps = var.cdc_disable_google_maps
google_analytics_tag_id = var.cdc_google_analytics_tag_id
gcs_data_bucket_name = var.cdc_gcs_data_bucket_name
gcs_data_bucket_input_folder = var.cdc_gcs_data_bucket_input_folder
gcs_data_bucket_output_folder = var.cdc_gcs_data_bucket_output_folder
gcs_data_bucket_location = var.cdc_gcs_data_bucket_location
mysql_instance_name = var.cdc_mysql_instance_name
mysql_database_name = var.cdc_mysql_database_name
mysql_database_version = var.cdc_mysql_database_version
mysql_cpu_count = var.cdc_mysql_cpu_count
mysql_memory_size_mb = var.cdc_mysql_memory_size_mb
mysql_user = var.cdc_mysql_user
vpc_connector_cidr = var.cdc_vpc_connector_cidr
vpc_network_name = var.cdc_vpc_network_name
web_service_image = var.cdc_web_service_image
web_service_min_instance_count = var.cdc_web_service_min_instance_count
web_service_max_instance_count = var.cdc_web_service_max_instance_count
web_service_cpu = var.cdc_web_service_cpu
web_service_memory = var.cdc_web_service_memory
data_job_image = var.cdc_data_job_image
data_job_cpu = var.cdc_data_job_cpu
data_job_memory = var.cdc_data_job_memory
data_job_timeout = var.cdc_data_job_timeout
enable_redis = var.cdc_enable_redis
redis_instance_name = var.cdc_redis_instance_name
redis_memory_size_gb = var.cdc_redis_memory_size_gb
redis_tier = var.cdc_redis_tier
redis_location_id = var.cdc_redis_location_id
redis_alternative_location_id = var.cdc_redis_alternative_location_id
redis_replica_count = var.cdc_redis_replica_count
search_scope = var.cdc_search_scope
enable_mcp = var.cdc_enable_mcp
}
}

# --- Custom Data Commons (CDC) Legacy Stack ---
module "cdc" {
source = "./modules/cdc"
count = var.enable_cdc ? 1 : 0
module "stack" {
source = "./modules/stack"

project_id = var.project_id
namespace = var.namespace
dc_api_key = var.cdc_dc_api_key
maps_api_key = var.cdc_maps_api_key
disable_google_maps = var.cdc_disable_google_maps
region = var.region
google_analytics_tag_id = var.cdc_google_analytics_tag_id
gcs_data_bucket_name = var.cdc_gcs_data_bucket_name
gcs_data_bucket_input_folder = var.cdc_gcs_data_bucket_input_folder
gcs_data_bucket_output_folder = var.cdc_gcs_data_bucket_output_folder
gcs_data_bucket_location = var.cdc_gcs_data_bucket_location
mysql_instance_name = var.cdc_mysql_instance_name
mysql_database_name = var.cdc_mysql_database_name
mysql_database_version = var.cdc_mysql_database_version
mysql_cpu_count = var.cdc_mysql_cpu_count
mysql_memory_size_mb = var.cdc_mysql_memory_size_mb
mysql_storage_size_gb = var.cdc_mysql_storage_size_gb
mysql_user = var.cdc_mysql_user
mysql_deletion_protection = var.deletion_protection
dc_web_service_image = var.cdc_web_service_image
dc_web_service_min_instance_count = var.cdc_web_service_min_instance_count
dc_web_service_max_instance_count = var.cdc_web_service_max_instance_count
dc_web_service_cpu = var.cdc_web_service_cpu
dc_web_service_memory = var.cdc_web_service_memory
make_dc_web_service_public = var.make_services_public
dc_data_job_image = var.cdc_data_job_image
dc_data_job_cpu = var.cdc_data_job_cpu
dc_data_job_memory = var.cdc_data_job_memory
dc_data_job_timeout = var.cdc_data_job_timeout
dc_search_scope = var.cdc_search_scope
enable_mcp = var.cdc_enable_mcp
vpc_network_name = var.cdc_vpc_network_name
vpc_network_subnet_name = var.cdc_vpc_network_subnet_name
enable_redis = var.cdc_enable_redis
redis_instance_name = var.cdc_redis_instance_name
redis_memory_size_gb = var.cdc_redis_memory_size_gb
redis_tier = var.cdc_redis_tier
redis_location_id = var.cdc_redis_location_id
redis_alternative_location_id = var.cdc_redis_alternative_location_id
redis_replica_count = var.cdc_redis_replica_count
vpc_connector_cidr = var.cdc_vpc_connector_cidr
vpc_network_id = data.google_compute_network.default.id
use_spanner = var.enable_dcp
spanner_instance_id = var.enable_dcp ? module.dcp[0].spanner_instance_id : ""
spanner_database_id = var.enable_dcp ? module.dcp[0].spanner_database_id : ""
deletion_protection = var.deletion_protection
shared = local.stack_shared
toggles = local.stack_toggles
dcp = local.stack_dcp
cdc = local.stack_cdc

depends_on = [google_project_service.apis]
}

# Ensure Spanner instance ID is provided when not creating a new one
check "spanner_instance_id_provided" {
assert {
condition = !var.enable_dcp || var.dcp_create_spanner_instance || var.dcp_spanner_instance_id != ""
error_message = "dcp_spanner_instance_id must be provided when reusing an existing instance (dcp_create_spanner_instance = false)."
}
}
109 changes: 0 additions & 109 deletions infra/dcp/modules/cdc/locals.tf

This file was deleted.
