This project demonstrates a complete end-to-end data engineering pipeline using Microsoft Azure Databricks and Azure Data Lake Storage (ADLS Gen2).
It covers data ingestion, transformation, and visualization using Apache Spark (PySpark) in the Databricks environment.
The workflow follows an ELT (Extract, Load, Transform) pattern and is implemented using Databricks notebooks integrated with Azure services.
- Set up and configure Azure Databricks Workspace & Cluster
- Create and connect to Azure Data Lake Storage (ADLS Gen2)
- Ingest raw data (sales.csv) from a source container
- Perform data cleaning and transformation using PySpark
- Write transformed data back to a destination container in ADLS
- Register the processed data as a Delta Table in Databricks
- Query and visualize aggregated results directly within Databricks (illustrative PySpark sketches for these steps follow this list)
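
A minimal sketch of the ADLS Gen2 connection from a Databricks notebook is shown below. The storage account name, secret scope, and key name are placeholders rather than values from this project; this example authenticates with an account key kept in a Databricks secret scope (a service principal with OAuth would work equally well).

```python
# Sketch: configure direct ABFS access to ADLS Gen2 from a Databricks notebook.
# "datalakedemo", "adls-scope", and "storage-account-key" are placeholder names.
storage_account = "datalakedemo"

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="adls-scope", key="storage-account-key"),
)

# Paths for the two containers used in this project
source_path = f"abfss://source@{storage_account}.dfs.core.windows.net/sales.csv"
destination_path = f"abfss://destination@{storage_account}.dfs.core.windows.net/sales_transformed"
```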
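Ingestion and cleaning of sales.csv might then look like the sketch below (reusing source_path from the previous snippet). The column names Region, Amount, and OrderDate are assumed for illustration and should be adapted to the actual file schema.

```python
from pyspark.sql import functions as F

# Read the raw CSV from the source container with header and schema inference
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(source_path)
)

# Basic cleaning: remove duplicates, drop rows missing key fields, normalize types
clean_df = (
    raw_df
    .dropDuplicates()
    .na.drop(subset=["Region", "Amount"])
    .withColumn("Amount", F.col("Amount").cast("double"))
    .withColumn("OrderDate", F.to_date("OrderDate"))
)
```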
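Writing the transformed data back to the destination container and registering it as a Delta table could follow the pattern below; the table name sales_delta is an assumption.

```python
# Persist the cleaned data to the destination container in Delta format
(
    clean_df.write
    .format("delta")
    .mode("overwrite")
    .save(destination_path)
)

# Register the Delta location in the metastore so it can be queried with SQL
spark.sql(
    f"CREATE TABLE IF NOT EXISTS sales_delta USING DELTA LOCATION '{destination_path}'"
)
```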
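Aggregated results can then be queried and visualized directly in the notebook; display() renders the result set and exposes the built-in chart editor used for the bar and pie charts. The aggregation by Region is illustrative.

```python
# Aggregate total sales per region from the registered Delta table
sales_by_region = spark.sql("""
    SELECT Region, ROUND(SUM(Amount), 2) AS TotalSales
    FROM sales_delta
    GROUP BY Region
    ORDER BY TotalSales DESC
""")

# display() enables switching between table, bar, and pie chart views in Databricks
display(sales_by_region)
```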
```
Azure Portal
│
├── Resource Group (RG_DataLake)
│   ├── Storage Account (ADLS Gen2)
│   │   ├── Container: source
│   │   └── Container: destination
│   └── Databricks Workspace
│       ├── Cluster (Single Node)
│       ├── Notebook (PySpark code)
│       └── Metastore (Delta Table)
│
└── Visualization: Databricks Dashboards (Bar + Pie Charts)
```