AZURE DATA ENGINEERING

Paris Olympics 2024 Data Engineering Pipeline

Stack

Azure Data FactoryDatabricksPySparkSynapse Analytics

Project Overview

This project demonstrates the design and implementation of a full-scale data engineering pipeline using Microsoft Azure. Inspired by the global scale and complexity of the 2024 Paris Olympics, I simulated a production-grade workflow that mirrors how organizations ingest raw data, apply scalable transformations, and enable analytics and reporting.

Inspiration and Problem Context

When the Paris Olympics dominated global headlines, I challenged myself to answer a practical question: could I design a cloud-native data solution similar to what businesses use to track performance, trends, and outcomes at scale?

This project became a self-directed deep dive into modern Azure data services, with an emphasis on orchestration, scalability, and analytics readiness rather than one-off data processing.

Data Sources and Model

The pipeline ingests multiple structured datasets representing different aspects of Olympic participation and performance, including:

Athlete details by country, gender, and event
Coaching assignments linked by nationality
Gender participation by sport
Medal counts by country and discipline

Solution Architecture

1. Ingestion

Built parameterized Azure Data Factory pipelines to ingest datasets
Implemented scheduling and versioning logic for repeatable loads
Ensured raw data was stored securely in Azure Data Lake Gen2

2. Processing and Transformation

Used Azure Databricks with PySpark for scalable data cleaning
Applied feature engineering and normalization logic
Optimized huge datasets for performance

3. Analytics Layer

Modeled transformed data into a star schema using Azure Synapse Analytics
Created external tables for efficient SQL-based querying
Prepared datasets for downstream BI consumption

Business Insights Enabled

The final datasets support dynamic analysis of:

Medal efficiency by country
Gender distribution across sports
Nation-level performance trends

Impact and Outcomes

Transformed 5+ raw datasets into a structured analytical model
Achieved end-to-end pipeline execution under three minutes
Delivered clean, validated datasets ready for reporting
Built a scalable foundation supporting additional data sources