This project implements a data pipeline using industry-standard tools such as dbt, Snowflake, and Airflow. It covers the extract, load, and transform (ELT) process, enabling analytics and reporting within the organization.
Tools Used:
- Snowflake: Cloud-based data warehousing platform.
- dbt (Data Build Tool): SQL-based transformation and modeling tool.
- Airflow: Workflow orchestration platform.
Data Modeling Techniques:
- Fact tables and data marts (see the example model after this list).
- Snowflake Role-Based Access Control (RBAC) concepts.
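To make the fact-table idea concrete, a dbt model along these lines would live under `models/`; the model, table, and column names here are hypothetical and not taken from this repository:

```sql
-- models/marts/fct_orders.sql (hypothetical example)
-- Builds an order-level fact table on top of a staging model.
select
    o.order_id,
    o.customer_id,
    o.order_date,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price as gross_amount
from {{ ref('stg_orders') }} as o
```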
Snowflake Environment Setup:
- Create the Snowflake account, warehouse, database, and roles.
- Define the necessary schemas for staging and modeling (a sample setup script follows this list).
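A minimal setup script could look like the following sketch; the warehouse, database, schema, role, and user names are placeholders, and the exact grants will depend on your account's RBAC design:

```sql
-- Sketch of a Snowflake environment bootstrap (placeholder names throughout).
use role sysadmin;

create warehouse if not exists dbt_wh
    warehouse_size = 'XSMALL'
    auto_suspend = 60
    auto_resume = true;

create database if not exists analytics;
create schema if not exists analytics.staging;  -- landing/staging layer
create schema if not exists analytics.marts;    -- modeled fact tables and data marts

use role securityadmin;
create role if not exists dbt_role;
grant usage on warehouse dbt_wh to role dbt_role;
grant usage on database analytics to role dbt_role;
grant all on schema analytics.staging to role dbt_role;
grant all on schema analytics.marts to role dbt_role;
grant role dbt_role to user your_dbt_user;      -- placeholder user
```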
Configuration:
- Update `dbt_profile.yaml` with Snowflake connection details (a sample profile is sketched below).
- Configure the source and staging files in the `models/staging` directory.
- Define macros in `macros/pricing.sql` for reusable calculations (see the macro sketch below).
- Configure generic and singular tests for data quality (example tests follow this list).
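dbt's standard connection file is `profiles.yml`; this project refers to it as `dbt_profile.yaml`, but the Snowflake target generally follows the same layout. All values below are placeholders:

```yaml
# dbt_profile.yaml (placeholder values; keep real credentials out of version control)
data_pipeline:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier
      user: your_dbt_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: dbt_role
      warehouse: dbt_wh
      database: analytics
      schema: staging
      threads: 4
```

A reusable calculation in `macros/pricing.sql` would typically be a Jinja macro; the macro below is only a sketch of the idea, not the project's actual pricing logic:

```sql
-- macros/pricing.sql (hypothetical macro)
{% macro discounted_amount(amount_column, discount_column) %}
    ({{ amount_column }} * (1 - coalesce({{ discount_column }}, 0)))
{% endmacro %}
```

Generic tests are declared in a schema YAML file next to the models, while singular tests are standalone SQL files under `tests/` that fail if they return any rows. Both examples below use hypothetical model and column names:

```yaml
# models/staging/schema.yml (hypothetical)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

```sql
-- tests/assert_no_negative_amounts.sql (hypothetical singular test)
select *
from {{ ref('fct_orders') }}
where gross_amount < 0
```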
Airflow Deployment:
- Update the Dockerfile and `requirements.txt` for the Airflow deployment (a sample Dockerfile is sketched below).
- Add the Snowflake connection details in the Airflow UI.
- Create a DAG file (`dbt_dag.py`) to orchestrate the dbt jobs (a minimal DAG sketch follows this list).
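A common way to prepare the image is a thin Dockerfile on top of the official Airflow image that installs the project's Python requirements; the Airflow version tag here is an assumption:

```dockerfile
# Dockerfile (sketch; pin the Airflow version you actually deploy)
FROM apache/airflow:2.9.3
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```

One minimal way for `dbt_dag.py` to orchestrate the dbt jobs is to shell out to the dbt CLI with `BashOperator`; the project path, schedule, and task breakdown below are assumptions, and an operator- or Cosmos-based setup would work just as well:

```python
# dbt_dag.py (minimal Airflow 2.x sketch; DBT_PROJECT_DIR and the schedule are assumptions)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/airflow/dbt"  # placeholder path to the dbt project

with DAG(
    dag_id="dbt_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Build the models first, then run the data quality tests.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR}",
    )

    dbt_run >> dbt_test
```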
Project Structure:
- `models/`: Contains dbt models for staging, intermediate tables, and fact tables.
- `macros/`: Contains reusable SQL macros for calculations.
- `tests/`: Contains SQL scripts for generic and singular tests.
- `dbt_dag.py`: Airflow DAG configuration file.
Getting Started:
- Clone the repository:
  `git clone https://github.com/your_username/data-pipeline.git`
- Set up the Snowflake environment and configure the necessary files.
- Deploy Airflow with Docker and configure the connections.
- Start the Airflow scheduler and webserver:
  `docker-compose up -d`
- Access the Airflow UI and trigger the `dbt_dag` DAG for execution (or trigger it from the CLI, as shown below).
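If you prefer the command line to the UI, the DAG can also be triggered with the Airflow CLI; the service name below assumes the common docker-compose naming and may differ in your setup:

```bash
# Trigger the dbt DAG from inside the webserver container (service name is an assumption)
docker-compose exec airflow-webserver airflow dags trigger dbt_dag
```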