ETL-PROJECT-PYTHON-GCP

This project contains a production-ready ETL pipeline built in Python and designed to run on Google Cloud Platform using BigQuery as the main storage layer. The pipeline focuses on Brazilian public financial data from SIAFEM and SIGEO systems, supporting ingestion from CSV files or HTTP APIs, transformation of raw values into standardized metrics and loading into curated tables for analytics.

The main goal is to provide a clean and reproducible data engineering flow that can be reused as a template for similar financial data projects.

What this project does

Reads raw financial data from SIAFEM and SIGEO systems via APIs or local files
Cleans and standardizes column names and numeric types
Validates budget execution, expenditure hierarchies and financial dimensions
Calculates helper metrics such as budget utilization rates and monthly variance
Validates required fields before loading
Loads the final dataset into a BigQuery table

How to run

Create a service account in GCP with BigQuery permissions and download the JSON key.
Set the environment variables:
- GCP_PROJECT_ID
- GCP_DATASET_ID
- GCP_TABLE_ID
- SOURCE_FILE (path to the CSV file, if used)
- SIAFEM_API_URL (optional: SIAFEM API endpoint)
- SIGEO_API_URL (optional: SIGEO API endpoint)

Install the dependencies and run the pipeline:

pip install -r requirements.txt
python main.py

Or run everything with Docker:

docker build -t etl-project-python-gcp .
docker run \
  -e GCP_PROJECT_ID=your-project \
  -e GCP_DATASET_ID=financial_metrics \
  -e GCP_TABLE_ID=budget_execution \
  -e SOURCE_FILE=/app/data/input.csv \
  -e SIAFEM_API_URL=https://api.siafem.gov.br/dados \
  -e SIGEO_API_URL=https://api.sigeo.gov.br/dados \
  -v /path/to/credentials.json:/app/credentials.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
  etl-project-python-gcp

Why this repository exists

I created this pipeline to automate the ingestion and standardization of economic indicators from various Brazilian government agencies and financial institutions while working for the Government of São Paulo.

This is the third version. I've incorporated improvements using Open Source technologies and what I was using at the time. I like to keep everything documented, like a personal data engineering library.

The project was used to reduce manual reconciliation of budget execution metrics and to feed dashboards and forecasting models with consistent financial series based on real work in data engineering.

Quick Start with Google Colab

You can run this pipeline interactively using Google Colab without setting up your local environment:

Open notebooks/ETL_Pipeline_Colab.ipynb in Google Colab
Follow the notebook cells to authenticate with GCP and run the pipeline
Visualize and analyze the results directly in Colab

This is perfect for data exploration, testing, and prototyping before running in production.

Made with ☕ by Isabel Cruz

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
notebooks		notebooks
src		src
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ETL-PROJECT-PYTHON-GCP

What this project does

How to run

Why this repository exists

Quick Start with Google Colab

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ETL-PROJECT-PYTHON-GCP

What this project does

How to run

Why this repository exists

Quick Start with Google Colab

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages