System Design for Funnel Analysis¶

Dummy data preparation¶

I used a Kaggle data set, located here, and modified it to fit the given coding assignment.

Relevant notebooks are located in the data_prep directory. For some of the steps, I sampled a subset of the previous step's data in order to simulate a drop-off at each step. You can look at the notebook at data_prep/data_prep.ipynb. data_prep/sample_db_creator.ipynb prepares a SQLite DB for OLTP.

1) Data Collection and Frontend and Backend Components¶

For the scenario described in the problem, I have assumed that some sort of event tracking is implemented in the OLTP database as follows:

• Whenever a logged-in user visits a page in the funnel, an event is recorded with the user id, date, sex, channel, device and page. This happens on each page that is part of the funnel
• For now, I have assumed a single product (e.g. an insurance policy purchase), but this can be extended to multiple products as well
• Ideally, I would want each event to be tracked via some sort of cart session so that multiple sessions per user can be distinguished, but this isn't the case for now

A very basic preview of the event object is given below:
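Since the preview cell is not rendered here, a hypothetical event object might look like the following. The field names are assumptions based on the attributes listed above, not the exact schema used in the notebook:

```python
# Illustrative funnel event; field names are assumed from the description
# above (user id, date, sex, channel, device, page), not the real schema.
event = {
    "user_id": 12345,
    "date": "2021-03-01",
    "sex": "F",
    "channel": "organic",
    "device": "mobile",
    "page": "home_page",  # one of the funnel steps
}
```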

2) How and where would you store the data¶

I am envisioning this data pipeline as a distributed system that takes care not to overwhelm the OLTP databases and hence the user-facing products. My idea of this pipeline and data storage is in line with the following:

• An OLTP database such as MySQL stores application data, including the funnel event data
• Spark jobs batch-load events from OLTP on a nightly basis and write them in Parquet format into what I call a Data Lake. All jobs are scheduled via Airflow. This is, for the most part, raw unprocessed data stored in a file system analogous to Hadoop
• My Spark jobs are not fully production-ready. Ideally, I would add partitioning and throttling to make sure they don't bring down the OLTP database, and have them extract deltas rather than doing a full load of the table every time
• The Data Lake lets us offload work from OLTP and gives us a rather unbounded plane to extract, slice and dice data, rinse and repeat, etc.
• Once the data is in the Data Lake, we can model it however we want, based on each team's needs. In my case, I am applying some basic dimensional modeling (the dimension_models package) to make my queries simpler and easier to follow
• Modeling is performed by Spark jobs (scheduled via Airflow) that load data from the Data Lake and generate new dimension model tables. My understanding is that these dimension models can then be loaded into Snowflake or Redshift so that downstream teams can slice and dice the data. Sample queries on the dimension models are shown later in this notebook
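The delta-extraction idea mentioned above can be sketched as follows, using SQLite as a stand-in for the OLTP database. The table and column names here are hypothetical, not the ones used in the repo:

```python
import sqlite3

def extract_delta(conn, last_loaded_date):
    """Pull only events newer than the last successful load,
    instead of doing a full table scan on every run."""
    cur = conn.execute(
        "SELECT user_id, event_date, page FROM funnel_events "
        "WHERE event_date > ?",
        (last_loaded_date,),
    )
    return cur.fetchall()

# Demo with an in-memory database standing in for OLTP
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funnel_events (user_id INT, event_date TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO funnel_events VALUES (?, ?, ?)",
    [(1, "2021-03-01", "home_page"),
     (2, "2021-03-02", "home_page"),
     (2, "2021-03-03", "payment_page")],
)
rows = extract_delta(conn, "2021-03-01")  # only the two newer events
```

In a real run, last_loaded_date would come from the previous Airflow execution's watermark rather than a hard-coded string.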

Airflow DAG dependencies¶

It's worth noting that the modeling jobs have to wait until the Data Lake extraction is complete. To make that possible, I have modeled the DAGs in Airflow using ExternalTaskSensor, which waits for the upstream DAG to complete before the downstream DAG starts. I find this technique very useful for modeling data flows like this.
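As a rough sketch of this wiring, a sensor in the modeling DAG can point at the extraction DAG. The dag_id and task_id values below are placeholders, not the actual DAGs in this repo:

```python
# Sketch only: dag/task ids and dates are placeholders. The sensor blocks
# the modeling DAG until the named task of the upstream extraction DAG has
# succeeded for the same execution date (both DAGs share a schedule).
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG("modeling_dag", schedule_interval="@daily",
         start_date=datetime(2021, 1, 1)) as dag:
    wait_for_extraction = ExternalTaskSensor(
        task_id="wait_for_data_lake_extraction",
        external_dag_id="data_lake_extraction_dag",
        external_task_id="extract_events",  # final task of the upstream DAG
    )
```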

3) How will you generate Funnel Analysis¶

The following queries show some funnel analysis I could think of, using the dimension model tables stored in what I call the data warehouse. The headings are self-explanatory about what each analysis does.
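To illustrate the kind of query I mean, here is a simple step-to-step conversion count, using SQLite and hypothetical per-step fact tables in place of the real warehouse tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical per-step fact tables: one row per user who reached that step
conn.execute("CREATE TABLE step1_fact (user_id INT, visited INT)")
conn.execute("CREATE TABLE step2_fact (user_id INT, visited INT)")
conn.executemany("INSERT INTO step1_fact VALUES (?, 1)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO step2_fact VALUES (?, 1)", [(1,), (3,)])

# Count users who reached step 1 vs. those who continued to step 2;
# the LEFT JOIN keeps step-1 users who dropped off (NULL in s2.user_id)
step1_visits, step2_visits = conn.execute("""
    SELECT COUNT(s1.user_id), COUNT(s2.user_id)
    FROM step1_fact s1
    LEFT JOIN step2_fact s2 ON s1.user_id = s2.user_id
""").fetchone()
conversion = step2_visits / step1_visits  # fraction that made it to step 2
```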

Adding and modifying columns before generating tables in warehouse¶

Before loading tables into the warehouse, I modified and generated a few columns based on the funnel analysis I was expecting to do. This is where I believe a raw Data Lake shines: teams can create warehouse tables according to their own needs.

• Converted date strings to date type using a Spark UDF (this assumes from datetime import datetime, from pyspark.sql.functions import udf and from pyspark.sql import types; the lambda returns a date to match DateType)
string_to_date = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').date(), types.DateType())
• Added an extra column to mark a visit. Note that we could even pre-aggregate this fact table to get faster queries when performing joins
step1_fact = step1_fact.withColumn("visited", lit(1))

4) Assumptions made for the data pipeline¶

• Input data is a single load, which is why I haven't shown any partitioning in the SELECT queries
• Input data is stored in SQLite. This would never be the case in production; this is just a prototype
• Dimension tables would ideally be stored using conventional data warehousing tools such as Snowflake or Redshift

5) Tech used¶

• Airflow for job scheduling (helm chart included for k8s deployment)