Skip to content

jotstolu/Car-Sales-End-to-End-Data-Engineering-Project-using-Azure-Databricks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 

Repository files navigation

πŸš— Car Sales End-to-End Data Engineering Project using Azure & Databricks

This project presents a scalable end-to-end data pipeline designed for processing and analyzing car sales data using the Azure Cloud and Databricks ecosystem.


🧭 Project Overview

The pipeline ingests car sales data from a github repository into Azure SQL Database, processes it through a multi-layered transformation pipeline in Databricks, and outputs a structured star schema suitable for downstream reporting.

It automates:

  • Data ingestion from GitHub β†’ Azure SQL DB
  • ETL transformations using Delta Lake (Bronze β†’ Silver β†’ Gold)
  • Dimensional modeling with fact and dimension tables
  • Visualization/reporting via Power BI

πŸ—ΊοΈ Architecture

The architecture follows a modern lakehouse pattern, enabling scalable and secure data engineering.


βš™οΈ Azure Data Factory Pipeline

Azure Data Factory Pipeline

This pipeline automates the ingestion phase, pulling data from GitHub and loading it into Azure SQL Database.

  • Lastload and current_load lookups help manage incremental load logic.
  • Data is copied into the Bronze Layer (raw zone).
  • A stored procedure updates the watermark after successful ingestion.

πŸ” Databricks Workflow & Data Model

Databricks Data Model Pipeline)

The Databricks pipeline implements the medallion architecture using Delta Live Tables:

Layer Description
Silver Cleaned and normalized data from the bronze layer
Gold Dimension and fact tables ready for analytics/reporting

🧱 Notebook Structure

The project consists of the following transformation notebooks:

cars_sales_project/
β”œβ”€β”€ Silver_Notebook.python         # Initial silver layer transformations
β”œβ”€β”€ Gold DimBranch.python          # Branch dimension table
β”œβ”€β”€ Gold DimDate.python            # Date dimension table
β”œβ”€β”€ Gold DimDealer.python          # Dealer dimension table
β”œβ”€β”€ Gold DimModel.python           # Model dimension table
β”œβ”€β”€ Gold_fact_sales.python         # Fact table for car sales

Each notebook processes data from the silver layer and applies business logic for gold layer generation.


πŸ› οΈ Technologies Used

  • Azure Data Factory – Orchestrates data movement and ingestion
  • Azure SQL Database – Temporary data storage post-ingestion
  • Azure Data Lake Gen2 – Scalable and secure data storage layer
  • Databricks (Delta Live Tables) – Data transformation & processing
  • Delta Lake Format – Supports ACID transactions & versioning
  • Power BI – Reporting and visualization layer
  • GitHub – Source control for dataset and notebooks

πŸ“Š Output: Star Schema

The Gold Layer produces a Star Schema:

  • Central Fact_Sales table
  • Dimensions: Dim_Branch, Dim_Dealer, Dim_Model, Dim_Date

About

This project presents a scalable end-to-end data pipeline designed for processing and analysing car sales data using the Azure Cloud and Databricks ecosystem.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors