Prithu Barnwal

Creating a Basic DQM Framework


Why do we even need one - When I started working for a major retail client, for the first three months we had to prepare 10 datasets after applying all the transformations. These had to be fed to the analytics teams in our org and the client's org, and they were supposed to be refreshed daily. There were four people in the team, including me as the lead, which meant around three datasets per person. As far as the client was concerned, the team was brand new; only one of us had prior experience. We created the datasets in adherence to our timelines.


During those and the next few months, whenever there were issues in the pipeline, we were able to trace them and apply patches quickly. It was easy to keep track when every person owned around three datasets, and there was a lead in the project to monitor all of this centrally. As time went on, 10 more datasets landed in our basket over the next two months, and then 10 more in the two months after that. Things now started to become more complex: apart from developing pipelines for the new datasets, we had to take care of maintaining the older datasets and pipelines as well.


There were regular, unforeseen issues -

Source table not updated on time

Image expired

Source table having corrupted data, etc.

Transformation process going wrong for corner cases


Pipeline failure alerts were already in place, but we needed an alert and failure-identification mechanism at the data level. For example: the percentage of returns came out to be 10% this week, while it had always been in the range of 22-25%. The reason was that one of the source tables did not have data for the whole week. Each of these things can also be caught inside the ETL pipeline, but we cannot put a validation function on everything - it would increase the pipeline run time, and not every check should be accommodated there. Some of these errors do not even occur frequently; they come up once in a while.


Resolution - The solution for the example could be implemented in the pipeline itself: we could add a condition check such as "fail if the max date in the table does not cover the last date of the previous week" (a sketch of such a check follows below).
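A minimal sketch of such an in-pipeline check, assuming PySpark and a hypothetical source table named orders with a DateType column order_date (all names are illustrative, not from the original pipeline):

```python
from datetime import date, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; the real pipeline would read its own tables.
orders = spark.table("orders")

# Last date of the previous week, assuming weeks end on Sunday.
today = date.today()
last_sunday = today - timedelta(days=today.isoweekday())

# Max value of the date column that should cover the previous week.
max_order_date = orders.agg(F.max("order_date").alias("max_dt")).collect()[0]["max_dt"]

if max_order_date is None or max_order_date < last_sunday:
    # Failing here keeps bad data out of the downstream steps, but it also
    # keeps the cluster waiting -- the downside discussed below.
    raise ValueError(
        f"orders is stale: max order_date = {max_order_date}, expected >= {last_sunday}"
    )
```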

Downside -

If we learn of an error in the middle of a pipeline run, say a table not getting updated, we are forced to keep the cluster running while the next set of operations is put on hold until we fix the corresponding source table pipeline; costs keep building up. It would be far better to know of the issue beforehand.

Not every check should be part of the main pipeline; functionalities should be independent of each other. Embedding every check would increase the runtime, and most of these issues come up only once in a while.

There would always be some scenarios that our in-pipeline checks do not cover.


There was a need for a system that could alert us, within the day, about any corruption in the quality of the data.


What can DQM resolve, and what are its foremost properties -

  1. Uniqueness

  2. Timeliness

  3. Accuracy

  4. Consistency

  5. Validity

  6. Completeness


Preparing a full-fledged DQM framework is time consuming and needs the support of all the developers and stakeholders. And there are times when the client would consider it a waste of time and de-prioritize it. Then, whenever there was an escalation, all the guns would come out blazing at the offshore team. For the next few days, everybody would talk of prioritizing the DQM framework, and one week later the same set of seniors would ask us to de-prioritize it. I would recommend that stakeholders, managers and leads build the most basic version of the framework as soon as the number of datasets increases, if you don't want a heart attack from time to time.


Let us zoom in on the most common issues on the output/target side -


  1. The table's level is not being maintained - the level of the table is order id, but it is not getting maintained.

  2. The table's date columns are not updated to the latest date - many a time we check the main date column of the table (e.g. run date, order date), but we forget to check the other, not-so-important date columns - return date, cancel date, etc.

  3. Some of the metrics are not following their usual trends - for example, the return % should be around 20% on weekdays but this time came out to be a mere 5%. It could be because of some error, though in a few cases it can be a genuine issue as well.

  4. Some of the columns have more nulls than the allowed percentage.


Dealing with each of the above-mentioned issues needs implementation of the following checks -


1. The table's level is not being maintained (Uniqueness) - For every table, we already know the level. Post completion of the pipeline, we run a script that checks the level for all the target tables. It can be a simple group by statement; a sketch follows after the tables below.

Input Base Table




Summarized DQM Output Table -
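A minimal sketch of how this uniqueness check could be scripted, assuming PySpark and a hypothetical target table orders_target whose declared level is order_id (table and column names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical target table and its declared level (grain).
table_name = "orders_target"
level_cols = ["order_id"]

df = spark.table(table_name)

# The level is maintained when no combination of the level columns repeats.
duplicate_keys = (
    df.groupBy(*level_cols)
      .count()
      .filter(F.col("count") > 1)
      .count()
)

status = "PASS" if duplicate_keys == 0 else "FAIL"
print(table_name, "uniqueness:", status, f"({duplicate_keys} duplicated keys)")
```

The PASS/FAIL status and the duplicate count would then be written into the Summarized DQM Output Table.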




2. The table's date columns are not updated to the latest date (Timeliness) - For any table, list all its date columns in one column of the input DQM table, and keep another column holding what each of those date columns should contain (rundate or rundate-1). A sketch of this check follows after the tables below.

Input Table -


Output Tables -

Timeliness Output Table -




Summarized DQM Output Table -
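A minimal sketch of the timeliness check, assuming PySpark, a run_date supplied by the scheduler, and a hypothetical slice of the input DQM table holding (table, date column, expected value) rows - all names are illustrative:

```python
from datetime import date, timedelta

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

run_date = date.today()  # in practice, passed in by the scheduler

# Hypothetical config: which date columns to check on which tables,
# and what they are expected to contain.
timeliness_config = [
    {"table": "orders_target", "date_col": "order_date",  "expected": "rundate-1"},
    {"table": "orders_target", "date_col": "return_date", "expected": "rundate-1"},
]

results = []
for rule in timeliness_config:
    expected_date = run_date if rule["expected"] == "rundate" else run_date - timedelta(days=1)
    max_dt = (
        spark.table(rule["table"])
             .agg(F.max(rule["date_col"]).alias("max_dt"))
             .collect()[0]["max_dt"]
    )
    status = "PASS" if max_dt is not None and max_dt >= expected_date else "FAIL"
    results.append((rule["table"], str(run_date), rule["date_col"], str(max_dt), status))

# Rows for the Timeliness Output Table.
timeliness_out = spark.createDataFrame(
    results, ["tablename", "run_date", "date_column", "max_date_found", "status"]
)
timeliness_out.show()
```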


3. Some of the metrics are not following their usual trends (Accuracy) -

Consider a metric, no_of_orders at day level, for a table called t. If we create a view of no_of_orders at day level, we will see a usual pattern in the increase/decrease of the number of orders on weekdays and weekends.

Based on this view, we can say that no_of_orders increases by 15% to 25% on weekdays and by around 47% to 52% on weekends. Now, if on any day the % increase is 78%, we can flag a failure in the output table for that day.

For every important metric on the table, we can decide on upper and lower limits - we can hardcode them, or we can recalculate them daily or weekly (depending on how aggressive we want the framework to be). The IQR can be used to calculate the limits; a sketch follows after the tables below.

Input table -

Output Tables -

Accuracy Output Table (can be partitioned on tablename, run_date) -










Summarized DQM Output Table -
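A minimal sketch of the accuracy check, assuming PySpark and a hypothetical table t with order_date and order_id columns. For simplicity it applies the IQR fence to the daily metric value itself rather than to the day-over-day % increase, and the 1.5 x IQR multiplier is an illustrative choice:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metric: no_of_orders at day level on table t.
daily = (
    spark.table("t")
         .groupBy("order_date")
         .agg(F.countDistinct("order_id").alias("no_of_orders"))
)

# Upper/lower limits from the IQR of the metric's history.
q1, q3 = daily.approxQuantile("no_of_orders", [0.25, 0.75], 0.0)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the latest day's value against the limits.
latest = daily.orderBy(F.col("order_date").desc()).first()
status = "PASS" if lower <= latest["no_of_orders"] <= upper else "FAIL"

# One row for the Accuracy Output Table.
print("t", latest["order_date"], "no_of_orders", latest["no_of_orders"], lower, upper, status)
```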

4. Some of the columns have more nulls than the allowed percentage (Completeness) -

We can decide the maximum null % occupancy for each column. For example, the return number is filled in for only around 14 out of 100 orders, so for this column the null % will naturally sit around 86% and should not go much beyond that. Or we can decide on the maximum null % at the dataset level. The simplest way would be to define one number at the project level; for example, in many projects 90% was set as the maximum null %. A sketch of this check follows after the tables below.

Output Tables -

Completeness Output Table (can be partitioned on tablename, run_date) -









Summarized DQM Output Table -
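A minimal sketch of the completeness check, assuming PySpark and a hypothetical per-column null threshold config (column names and thresholds are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical max allowed null % per column of a target table.
null_thresholds = {"order_id": 0.0, "order_date": 0.0, "return_no": 90.0}

df = spark.table("orders_target")
total_rows = df.count()

results = []
for col_name, max_null_pct in null_thresholds.items():
    null_rows = df.filter(F.col(col_name).isNull()).count()
    null_pct = 100.0 * null_rows / total_rows if total_rows else 0.0
    status = "PASS" if null_pct <= max_null_pct else "FAIL"
    results.append(("orders_target", col_name, round(null_pct, 2), max_null_pct, status))

# Rows for the Completeness Output Table.
completeness_out = spark.createDataFrame(
    results, ["tablename", "column_name", "null_pct", "max_null_pct", "status"]
)
completeness_out.show()
```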

To summarize, for implementing the most basic version of a Data Quality framework, we can start with the following schema -

  1. No. of input tables - 1

Input DQM Table

This table will have table name and metric as its primary key. If there is more than one metric for a given table, the rows can be repeated - or we can have a separate table for metrics as well.

This table will form the basis of the other calculations (see the sketch below).
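As an illustration, rows of such an input DQM table could look like the following Python dicts; the exact columns are an assumption and should be adapted to whichever checks are enabled:

```python
# Illustrative rows of the input DQM table; one row per table/metric combination.
input_dqm_rows = [
    {
        "tablename": "orders_target",
        "level_cols": "order_id",                                    # for Uniqueness
        "date_cols": "order_date:rundate-1,return_date:rundate-1",   # for Timeliness
        "metric": "no_of_orders",                                    # for Accuracy
        "metric_lower_limit": None,                                  # None => derive via IQR
        "metric_upper_limit": None,
        "max_null_pct": 90.0,                                        # project-level Completeness default
    },
]
```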


Output tables

No. of output tables required - 4

  1. DQM Summarized Output Table (At table and run date level)

  2. Timeliness Output table

  3. Accuracy Output Table

  4. Completeness Output Table
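For illustration, a row of the DQM Summarized Output Table, into which the three check-level tables roll up, could look like this (column names are assumptions):

```python
# Illustrative row of the DQM Summarized Output Table: one row per table,
# run date and check type, rolled up from the detailed output tables.
summarized_row = {
    "tablename": "orders_target",
    "run_date": "2024-01-15",
    "check_type": "Completeness",   # Uniqueness / Timeliness / Accuracy / Completeness
    "status": "FAIL",
    "failed_records_or_columns": 2,
}
```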


Note: This is the basic version of a DQM framework - it can be evolved to cover many more scenarios. When the DQM framework marks something as a failure, it can be an actual failure or it can be a genuine case. For example, during Diwali the number of orders can spike a lot at day level. Do not treat DQM as the validation framework. The results also depend on the assumptions with which we have built the framework.
