System Design and my solution to the problem

Dummy data preparation

Since no data was provided for this task, I used a Kaggle dataset located here and modified it to fit the given coding assignment.

The relevant notebooks are located in the data_prep directory. For some of the funnel steps, I sampled a subset of the previous step's records to simulate a drop-off at each step; see data_prep/data_prep.ipynb for the details. data_prep/sample_db_creator.ipynb prepares a SQLite DB that serves as the OLTP store.

1) Data Collection and Frontend/Backend Components

For the scenario described in the problem, I have assumed that some sort of event tracking is implemented on top of the OLTP database: the frontend and backend components record an event each time a user completes a funnel step.

A very basic preview of the event object is given below (the field names are illustrative assumptions, not an exact schema):
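```python
# Hypothetical event object as stored in the OLTP events table; the field
# names are illustrative assumptions, not the exact schema from the notebooks.
event = {
    "event_id": "c1a2b3d4",               # unique id for the event
    "user_id": 10234,                     # user who triggered the event
    "step": 4,                            # funnel step the user reached
    "page": "/checkout/payment",          # page visited at that step
    "channel": "organic",                 # acquisition channel
    "sex": "F",                           # demographic attribute used in the analysis
    "created_at": "2021-03-14T09:26:53Z", # event timestamp (UTC)
}
```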

2) How and where would you store the data

I envision this data pipeline as a distributed system that avoids overwhelming the OLTP databases, and hence the user-facing products. My idea of the pipeline and data storage is in line with the following: a scheduled job snapshots the OLTP tables into a raw Data Lake, and the modeling jobs build warehouse tables from the lake instead of querying the OLTP database directly.
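A minimal sketch of that extraction step, with assumed paths and table names:

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative extraction job (paths, table and column names are assumptions):
# snapshot the OLTP events table into the raw Data Lake so that analytical
# reads never touch the user-facing database.
spark = SparkSession.builder.appName("oltp_to_lake").getOrCreate()

events = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlite:/data/oltp/sample.db")  # SQLite DB built by sample_db_creator.ipynb
    .option("dbtable", "events")
    .option("driver", "org.sqlite.JDBC")  # needs the sqlite-jdbc jar on the classpath
    .load()
)

# Land the raw snapshot as Parquet, partitioned by event date, so downstream
# modeling DAGs read from the lake instead of hitting the OLTP database.
(
    events
    .withColumn("created_date", F.to_date("created_at"))
    .write.mode("overwrite")
    .partitionBy("created_date")
    .parquet("/datalake/raw/events")
)
```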

Airflow DAG dependencies

It's worth noting that the modeling jobs have to wait until the Data Lake extraction is complete. To make that possible, I wired the Airflow DAGs together with an ExternalTaskSensor, which blocks the downstream DAG until the upstream DAG's run has finished. I find this technique very useful for modeling data flows like this one.
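A minimal sketch of that wiring, assuming Airflow 2.x and illustrative DAG and task ids:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

# Downstream modeling DAG that waits for the Data Lake extraction DAG.
# DAG ids and task ids here are illustrative, not the repo's actual names.
with DAG(
    dag_id="warehouse_modeling",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Both DAGs must share a schedule (or the sensor needs execution_delta /
    # execution_date_fn) so the sensor can match up the DAG runs.
    wait_for_extraction = ExternalTaskSensor(
        task_id="wait_for_lake_extraction",
        external_dag_id="datalake_extraction",  # upstream DAG to wait on
        external_task_id=None,                  # None = wait for the whole DAG run
        mode="reschedule",                      # free the worker slot while waiting
    )

    build_dim_tables = EmptyOperator(task_id="build_dim_tables")

    wait_for_extraction >> build_dim_tables
```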

(Airflow DAG screenshots: image-2.png, image-3.png, image.png)

System diagram

(System diagram: image-2.png)

3) How will you generate the Funnel Analysis

The following queries show some of the funnel analyses I could think of, using the dimensional model tables stored in what I call the data warehouse. The headings are self-explanatory about what each analysis does.

Adding and modifying columns before generating tables in the warehouse

Before loading tables into the warehouse, I generated and modified a few columns based on the funnel analysis I expected to do. This is where I believe a raw Data Lake shines: teams can create warehouse tables according to their own needs.
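A minimal sketch of those derivations, assuming the lake/warehouse paths from the earlier sketch and the column names used in the queries below:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("warehouse_prep").getOrCreate()

# Derive analysis-friendly columns before writing the warehouse table;
# the column names are assumptions based on the funnel analyses below.
raw = spark.read.parquet("/datalake/raw/events")

events_dim = (
    raw
    .withColumn("created_at", F.to_timestamp("created_at"))
    .withColumn("month", F.date_format("created_at", "yyyy-MM"))  # month bucket for trend queries
    .withColumn("channel", F.lower(F.trim("channel")))            # normalize channel labels
)

events_dim.write.mode("overwrite").parquet("/warehouse/events_dim")
```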

Load tables in Spark
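Something along these lines, with the warehouse path and view name carried over as assumptions from the sketch above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("funnel_analysis").getOrCreate()

# Register the warehouse table as a temp view so the funnel queries
# below can be expressed in plain SQL.
spark.read.parquet("/warehouse/events_dim").createOrReplaceTempView("events_dim")
```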

Page visit trend for step 6 by sex, channel and month
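A query of roughly this shape produces the trend, assuming the events_dim view and columns sketched above:

```python
# Monthly page-visit trend for funnel step 6, split by sex and channel.
step6_trend = spark.sql("""
    SELECT sex, channel, month, COUNT(*) AS page_visits
    FROM events_dim
    WHERE step = 6
    GROUP BY sex, channel, month
    ORDER BY month, sex, channel
""")
step6_trend.show()
```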

Percentage drop-off from step 5 to step 6 by sex and channel
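One way to express the drop-off, under the same column-name assumptions: count distinct users at each of the two steps per segment, then take the percentage that did not continue.

```python
# Of the users seen at step 5 in each sex/channel segment, what percentage
# never reached step 6?
dropoff_5_to_6 = spark.sql("""
    SELECT
        sex,
        channel,
        COUNT(DISTINCT CASE WHEN step = 5 THEN user_id END) AS step5_users,
        COUNT(DISTINCT CASE WHEN step = 6 THEN user_id END) AS step6_users,
        100.0 * (1 - COUNT(DISTINCT CASE WHEN step = 6 THEN user_id END)
                     / COUNT(DISTINCT CASE WHEN step = 5 THEN user_id END)) AS pct_drop_off
    FROM events_dim
    WHERE step IN (5, 6)
    GROUP BY sex, channel
""")
dropoff_5_to_6.show()
```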

Percentage drop-off from step 4 to step 5 by sex, channel and month
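The step 4 to step 5 variant follows the same pattern as the query above, shifted one step earlier and with month added to both the SELECT list and the GROUP BY clause.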

4) Assumptions made for the data pipeline

5) Tech used

6) Directories worth reviewing