ARSHPREET KAUR
AZURE DATA ENGINEERING

Paris Olympics 2024 Data Engineering Pipeline

Stack
Azure Data Factory, Databricks, PySpark, Synapse Analytics

Project Overview

This project demonstrates the design and implementation of a full-scale data engineering pipeline using Microsoft Azure. Inspired by the global scale and complexity of the 2024 Paris Olympics, I simulated a production-grade workflow that mirrors how organizations ingest raw data, apply scalable transformations, and enable analytics and reporting.

Inspiration and Problem Context

When the Paris Olympics dominated global headlines, I challenged myself to answer a practical question: could I design a cloud-native data solution similar to what businesses use to track performance, trends, and outcomes at scale?

This project became a self-directed deep dive into modern Azure data services, with an emphasis on orchestration, scalability, and analytics readiness rather than one-off data processing.

Data Sources and Model

The pipeline ingests multiple structured datasets representing different aspects of Olympic participation and performance, including:

  • Athlete details by country, gender, and event
  • Coaching assignments linked by nationality
  • Gender participation by sport
  • Medal counts by country and discipline

Solution Architecture

1. Ingestion

  • Built parameterized Azure Data Factory pipelines to ingest datasets
  • Implemented scheduling and versioning logic for repeatable loads
  • Ensured raw data was stored securely in Azure Data Lake Gen2
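
To illustrate the parameterized, versioned loads described above, here is a minimal Python sketch of how a landing path in Data Lake Gen2 could be derived from a dataset name, load date, and version. The storage account name, container name, dataset names, and folder layout are all hypothetical placeholders chosen for the example. Only the abfss:// URI shape comes from Azure. In the project itself this parameterization lives in Azure Data Factory pipelines, not in Python.

```python
from datetime import date

# Hypothetical storage names; the real account and container are not stated.
STORAGE_ACCOUNT = "olympicsdatalake"
CONTAINER = "raw"

# Hypothetical dataset names matching the four source areas listed earlier.
DATASETS = ["athletes", "coaches", "gender_by_sport", "medals"]


def raw_path(dataset: str, load_date: date, version: int = 1) -> str:
    """Build a date-partitioned, versioned landing path for one dataset."""
    if not dataset.isidentifier():
        raise ValueError(f"unexpected dataset name: {dataset!r}")
    return (
        f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"
        f"{dataset}/load_date={load_date.isoformat()}/v{version}/{dataset}.csv"
    )


def plan_load(load_date: date) -> dict[str, str]:
    """Return one target path per dataset for a single scheduled run."""
    return {name: raw_path(name, load_date) for name in DATASETS}
```

Because the path depends only on its parameters, rerunning a load for the same date and version writes to the same location, which is one simple way to keep loads repeatable.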

2. Processing and Transformation

  • Used Azure Databricks with PySpark for scalable data cleaning
  • Applied feature engineering and normalization logic
  • Optimized the processing of large datasets for performance
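
The kind of cleaning and normalization involved can be sketched in plain Python. The column names (name, country, gender, event) and the specific rules here are assumptions for illustration and not the project's actual logic. In Databricks the same steps would be expressed as PySpark DataFrame operations, so they scale across a cluster.

```python
def normalize_text(value):
    """Collapse repeated whitespace and trim; treat None as empty."""
    return " ".join((value or "").split())


# Hypothetical mapping of short gender codes to a single vocabulary.
GENDER_CODES = {"m": "male", "f": "female"}


def clean_athletes(rows):
    """Normalize text fields, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = normalize_text(row.get("name"))
        country = normalize_text(row.get("country"))
        event = normalize_text(row.get("event"))
        gender = normalize_text(row.get("gender")).lower()
        gender = GENDER_CODES.get(gender, gender)

        # A row without a name or country cannot be joined downstream.
        if not name or not country:
            continue

        # Case-insensitive key so "LEE" and "Lee" count as one athlete entry.
        key = (name.lower(), country.lower(), event.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {"name": name, "country": country, "gender": gender, "event": event}
        )
    return cleaned
```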

3. Analytics Layer

  • Modeled transformed data into a star schema using Azure Synapse Analytics
  • Created external tables for efficient SQL-based querying
  • Prepared datasets for downstream BI consumption
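
A minimal sketch of the star-schema idea: descriptive attributes move into dimension tables with surrogate keys, and the fact table keeps only keys and measures. The table shapes and field names (country, discipline, gold, silver, bronze) are assumptions for illustration. The actual model was built in Azure Synapse Analytics, and its exact tables are not listed here.

```python
def build_star(medal_rows):
    """Split flat medal rows into two dimensions and one fact table."""
    dim_country = {}      # country name -> surrogate key
    dim_discipline = {}   # discipline name -> surrogate key
    fact_medals = []

    for row in medal_rows:
        # setdefault returns the existing key, or assigns the next integer.
        country_key = dim_country.setdefault(row["country"], len(dim_country) + 1)
        discipline_key = dim_discipline.setdefault(
            row["discipline"], len(dim_discipline) + 1
        )
        fact_medals.append(
            {
                "country_key": country_key,
                "discipline_key": discipline_key,
                "gold": int(row["gold"]),
                "silver": int(row["silver"]),
                "bronze": int(row["bronze"]),
            }
        )
    return dim_country, dim_discipline, fact_medals
```

Keeping measures in a narrow fact table and names in small dimensions is what lets SQL queries group by country or discipline with simple key joins.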

Business Insights Enabled

The final datasets support dynamic analysis of:

  • Medal efficiency by country
  • Gender distribution across sports
  • Nation-level performance trends
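
As a rough illustration of two of these analyses, the sketch below computes medal efficiency and gender distribution in plain Python. "Medal efficiency" is not defined above, so the example assumes it means medals won per athlete sent. That definition, along with the field names sport and gender, is an assumption made for the example.

```python
from collections import Counter, defaultdict


def medal_efficiency(medals_by_country, athletes_by_country):
    """Medals per athlete for each country (assumed definition)."""
    result = {}
    for country, medals in medals_by_country.items():
        athletes = athletes_by_country.get(country, 0)
        # Skip countries with no athlete count to avoid dividing by zero.
        if athletes > 0:
            result[country] = medals / athletes
    return result


def gender_share(athletes):
    """Share of each gender within each sport, as fractions summing to 1."""
    counts = defaultdict(Counter)
    for athlete in athletes:
        counts[athlete["sport"]][athlete["gender"]] += 1
    return {
        sport: {gender: n / sum(c.values()) for gender, n in c.items()}
        for sport, c in counts.items()
    }
```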

Impact and Outcomes

  • Transformed 5+ raw datasets into a structured analytical model
  • Achieved end-to-end pipeline execution in under three minutes
  • Delivered clean, validated datasets ready for reporting
  • Built a scalable foundation supporting additional data sources