
🧠 Azure Databricks Data Lake Integration Project

📘 Overview

This project demonstrates an end-to-end data engineering pipeline built on Azure Databricks and Azure Data Lake Storage (ADLS Gen2).
It covers data ingestion, transformation, and visualization using Apache Spark (PySpark) in the Databricks environment.
The workflow follows an ELT (Extract, Load, Transform) pattern and is implemented in Databricks notebooks integrated with Azure services.


🚀 Project Objectives

  • Set up and configure Azure Databricks Workspace & Cluster
  • Create and connect to Azure Data Lake Storage (ADLS Gen2)
  • Ingest raw data (sales.csv) from a source container
  • Perform data cleaning and transformation using PySpark
  • Write transformed data back to a destination container in ADLS
  • Register the processed data as a Delta Table in Databricks
  • Query and visualize aggregated results directly within Databricks
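The ingestion steps above can be sketched in a Databricks notebook as follows. This is a minimal sketch, not the project's exact code: `<storage_account>` is a placeholder, and the secret scope/key names (`adls-scope`, `storage-key`) are assumptions — `dbutils` is only available inside Databricks.

```python
# Authenticate to ADLS Gen2 with an account key pulled from a Databricks
# secret scope (scope/key names are hypothetical).
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="adls-scope", key="storage-key"),
)

# Read the raw sales.csv from the source container over the abfss:// protocol.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("abfss://source@<storage_account>.dfs.core.windows.net/sales.csv")
)
```

A service principal with OAuth is the usual alternative to an account key for production workspaces; the key-based config is simply the shortest path for a demo project.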

πŸ—οΈ Architecture Diagram

```
Azure Portal
│
├── Resource Group (RG_DataLake)
│    ├── Storage Account (ADLS Gen2)
│    │     ├── Container: source
│    │     └── Container: destination
│    └── Databricks Workspace
│          ├── Cluster (Single Node)
│          ├── Notebook (PySpark code)
│          └── Metastore (Delta Table)
│
└── Visualization: Databricks Dashboards (Bar + Pie Charts)
```
