At a recent presentation I delivered as a guest of Experian Data Quality, I was asked the following question:
“What is the business case for using Data Quality tools over a SQL based approach?”
This is one of those timeless questions I get asked frequently, so I wanted to provide a series of steps to help you arrive at the right answer.
Step 1: Define Your Data Quality Requirements
This first step sounds obvious, but it's a common pitfall for organisations to base their data quality efforts on whatever tools and skills they have available, rather than on what they actually need to deliver a data quality strategy.
For example, many SQL developers can hand-code simple data profiling routines for statistics such as column completeness, foreign key violations and even pattern analysis. When people see this, they often question the need to invest in specialist profiling tools.
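To make this concrete, here is a minimal sketch of the kind of hand-coded profiling routine a SQL developer might write. The customers table and its columns are hypothetical, and Python's sqlite3 module stands in for a real database connection:

```python
import sqlite3

# Build a small, hypothetical 'customers' table to profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", "UK"), (2, None, "UK"), (3, "c@example.com", None)],
)

# Column completeness: COUNT(col) ignores NULLs, COUNT(*) does not,
# so the ratio gives the share of populated values per column.
for col in ("id", "email", "country"):
    total, filled = conn.execute(
        f"SELECT COUNT(*), COUNT({col}) FROM customers"
    ).fetchone()
    print(f"{col}: {filled / total:.0%} complete")
```

This works fine for a handful of known columns, which is exactly the point: the developer has to name each column and each check in advance.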
Data profiling is an important activity, but of course it makes up only one piece of the data quality puzzle. The purpose of data profiling is to discover data quality rules, and this is where SQL fails miserably. SQL only really adds value when you already know which columns you want to deal with and which rules to apply. Data quality tools allow you to discover rules across thousands of attributes, hundreds of tables and many systems.
For example, you might need capabilities such as:

- data profiling across thousands of attributes, hundreds of tables and many systems
- discovery of data quality rules
- automatic relationship discovery between systems
- ongoing data quality monitoring
Whilst these might appear excessive at the outset, you have to ask yourself:
"What are we trying to achieve now and for the long term?"
Don't just focus on basic checks like simple column statistics. Think about broader data quality tasks, because nine times out of ten what you actually need just isn't feasible with SQL.
Step 2: Create a Requirements vs Capability Matrix
Now list your core requirements on one axis of a spreadsheet and the candidate product options on the other. Naturally, the SQL approach will be one of the solution columns.
I like to use a scoring system for each feature:

- 0: the solution cannot deliver the feature
- 1: the solution partially delivers the feature, or needs a workaround
- 2: the solution fully delivers the feature
You can adopt a colour scheme for each value to create a simple visual tool for spotting which solution has the fewest gaps. I typically use a Red, Amber and Green system:

- Red: not supported
- Amber: partially supported
- Green: fully supported
This also allows you to score each solution by totalling each column. You can refine the scoring by introducing a weighting factor for the criticality of each feature.
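As an illustration, the matrix and its weighted totals can be sketched in a few lines. The features, weights and scores below are hypothetical, and the 0/1/2 scale simply means not / partially / fully supported:

```python
# Hypothetical requirements-vs-capability matrix.
# Scores: 0 = not supported, 1 = partial/workaround, 2 = fully supported.
features = {
    # feature: (criticality weight, SQL score, DQ-tool score)
    "column profiling":                 (1, 2, 2),
    "rule discovery":                   (3, 0, 2),
    "automatic relationship discovery": (3, 0, 2),
    "ongoing monitoring":               (2, 1, 2),
}

def weighted_total(solution_index):
    """Sum of (criticality weight x score) for one solution column."""
    return sum(w * scores[solution_index]
               for w, *scores in features.values())

print("SQL total:    ", weighted_total(0))
print("DQ tool total:", weighted_total(1))
```

Even with made-up numbers, the pattern is typical: SQL scores well on the basic checks and collapses on the discovery and monitoring rows, which is where the weights tend to be highest.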
Step 3: Create a Justification Description for Each Gap
Where you find a feature (e.g. automatic relationship discovery) that can't be achieved with SQL, you can bet that someone will push back with the question:
"Do we really need this requirement?"
I find it helps to prepare for this inevitable rebuttal by building a case for each feature. For automatic relationship discovery, you could say something like:
"We have inherited a poorly documented, silo driven IT landscape. We know that our service depends on functions working well between systems, we just don't know how these functions are linked between systems or how bad the data is that supports them.
Using a relationship discovery function, we can not only find and document these missing links but also quantify their health, helping us find areas for improvement.
What's more we can automatically monitor the data quality of these links moving forward.
SQL fails to provide any of these functions at a practical scale or level of performance."
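To illustrate what such a tool automates, here is a toy sketch of one common relationship discovery technique: treating column pairs whose value sets overlap heavily as candidate links between systems. The crm and billing tables, their columns and the 0.7 threshold are all hypothetical; real tools apply this kind of analysis at scale across thousands of columns:

```python
# Toy relationship discovery: propose candidate links between columns in
# different systems by measuring how much their value sets overlap.
crm     = {"customer_id": {"C1", "C2", "C3", "C4"}, "region": {"N", "S"}}
billing = {"cust_ref": {"C1", "C2", "C3", "C9"}, "invoice_id": {"I1", "I2"}}

def candidate_links(left, right, threshold=0.7):
    """Return (left_col, right_col, containment) pairs above the threshold."""
    links = []
    for lcol, lvals in left.items():
        for rcol, rvals in right.items():
            if not lvals or not rvals:
                continue
            # Containment: shared values relative to the smaller column.
            containment = len(lvals & rvals) / min(len(lvals), len(rvals))
            if containment >= threshold:
                links.append((lcol, rcol, round(containment, 2)))
    return links

print(candidate_links(crm, billing))
```

Here customer_id and cust_ref share three of four values, so they surface as a candidate link worth documenting and monitoring, without anyone having to know in advance that the two systems were related.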
In my experience, when you list out all the functions required of a data quality tool and then score your SQL approach against this matrix, you will soon realise just how impractical and ill-equipped SQL is for delivering data quality improvement and management projects.
Everyone knows that SQL is a poor choice for reasons such as the cost of building and maintaining hand-coded scripts and its inability to scale across hundreds of tables and many systems. However, the biggest failure I see with SQL is that it relies on the developer knowing roughly where the problems lie. It lacks the ability to measure and improve data quality on a much broader, enterprise scale.
Using my capability matrix idea above you should be able to create a clear justification for dropping SQL for a more practical and long-term solution.
The final proof comes from implementing a pilot project and comparing a data quality tool against the SQL approach. When you do this, I guarantee you'll 'put a nail in the SQL coffin' once and for all, because the mismatch will be stark.
The business in particular will not want to adopt an outdated method of hand-coding scripts when they’ve witnessed how much control and productivity they gain using specialist data quality tools.
Copyright ©, 2014-2017. All rights reserved.