If you are a data quality professional then you have more than likely heard the terms Data Lake, Data Swamp, Data Ocean, Data Pond and even Data Puddle. In fact, stick the word ‘data’ in front of any word used to name a body of water and you’ve more than likely found a commonly used term in the industry (although I have yet to hear of a Data Paddling Pool).
My point is that we are surrounded by similar-sounding terms, from broader areas such as data quality, data governance and data profiling – to more specific expressions such as prototyping and reconciliation. Each has its own nuance and they can be easily confused, but it is important to know the difference between them. This matters more than ever now that data strategies commonly provide the foundations for everyday business decisions, which means more non-technical staff need to learn the language – in plain English. As the gatekeeper of our ever-growing Glossary section, I have picked out some of the most commonly mistaken terms – and with help from our team of experts, I’ve explained how we define them.
We have a blog that explores the differences between data quality and data governance in more detail, but the key thing to know is that, although they are most definitely related, they are separate data management disciplines.
Data quality refers specifically to the accuracy and integrity of any given dataset, including its completeness, how up to date it is and, therefore, how useful it is. Considering data is used to make strategic decisions, build automated decision-making systems and much more, it’s a key consideration for any company. Data governance, on the other hand, refers to how your data is controlled and managed. This covers the delegation of responsibilities (including roles such as Data Controllers and Data Stewards) and how the data is proactively looked after going forward.
The reason these two disciplines are often confused is that you can’t have an effective data quality strategy without a well-implemented data governance strategy, and vice versa. Whilst a large-scale cleansing effort across addresses, emails and mobile numbers, for example, will improve your data quality for a certain period, it is data governance that will maintain this integrity over time. Meanwhile, a well-organised governance initiative becomes pointless if the data that its people, processes and policies relate to is of poor quality in the first place.
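To make the quality dimensions mentioned above (completeness, how up to date the data is) concrete, here is a minimal sketch of how such checks might look in Python. The records, field names and cutoff date are purely illustrative, not taken from any real system:

```python
from datetime import date

# Hypothetical customer records; all names and dates are illustrative only.
records = [
    {"name": "A. Smith", "email": "a.smith@example.com", "last_updated": date(2017, 6, 1)},
    {"name": "B. Jones", "email": None,                  "last_updated": date(2015, 2, 14)},
    {"name": "C. Brown", "email": "c.brown@example.com", "last_updated": date(2017, 1, 9)},
]

def completeness(rows, field):
    """Share of records where the given field is populated."""
    filled = sum(1 for r in rows if r.get(field))
    return filled / len(rows)

def stale(rows, field, cutoff):
    """Records not updated since the cutoff date."""
    return [r for r in rows if r[field] < cutoff]

print(completeness(records, "email"))                          # 2 of 3 emails present
print(len(stale(records, "last_updated", date(2016, 1, 1))))   # 1 record not touched since 2016
```

A one-off cleanse would fix the missing email; it is the governance process (who owns the field, how often it is reviewed) that stops the gaps creeping back in.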
They may sound similar, but these two terms each have a very specific meaning. So if you’re looking at Data Management platforms, it’s important to be clear on what each will deliver.
Data profiling is the overarching term for finding the content, structure and relationships in a given data source using statistical processing. It can encompass a number of separate activities, including data discovery, data reconciliation and impact analysis.
In layperson’s terms, data profiling involves analysing your data through reports (often using dedicated data profiling software). The aim is to uncover insights that can drive strategic decisions, or to create validation rules that keep the data clean.
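As a rough idea of what a profiling report contains, the sketch below computes some typical column statistics (fill rate, distinct values, how many values look numeric) over a toy dataset. Dedicated profiling software does far more than this; the rows and field names here are invented for illustration:

```python
from collections import Counter

# Toy dataset; in practice this would come from a real source system.
rows = [
    {"country": "UK", "age": "34"},
    {"country": "UK", "age": ""},
    {"country": "FR", "age": "52"},
    {"country": None, "age": "abc"},
]

def profile(rows):
    """Basic per-column profile: fill rate, distinct values, most common value,
    and a count of values that parse as whole numbers."""
    report = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        filled = [v for v in values if v]          # drop None and empty strings
        report[field] = {
            "fill_rate": len(filled) / len(values),
            "distinct": len(set(filled)),
            "top": Counter(filled).most_common(1),
            "numeric": sum(1 for v in filled if str(v).isdigit()),
        }
    return report

for field, stats in profile(rows).items():
    print(field, stats)
```

Even this crude report surfaces candidate validation rules: `country` is missing in a quarter of rows, and one populated `age` value is not numeric.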
Data discovery, on the other hand, is a buzzword from the world of Business Intelligence. For the most part it involves using graphical tools such as charts, maps and pivot tables to explore pre-prepared data sets for useful patterns or specific data items. The key difference is the emphasis on interactive reports and on finding patterns that drive business goals.
Another pair of terms that are linked but definitely different are data prototyping and data reconciliation. Data prototyping is where one or several sources of data are transformed with the aim of creating a high-quality resulting data source, without affecting any operational systems. Transformation rules are tested against this prototype, which makes the final data source more agile and easier to run data profiling activities against.
A common situation where data prototyping is useful is during Data Migrations: the migration analyst can see what happens when multiple sets of data are consolidated, and adapt the data for a smooth migration.
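The essence of prototyping – trialling transformation rules on copies while the operational systems stay untouched – can be sketched in a few lines. The two source systems, their fields and the candidate rule below are all hypothetical:

```python
import copy

# Two hypothetical source systems holding customer emails in inconsistent formats.
crm = [{"id": 1, "email": "A.Smith@Example.com"}]
billing = [{"id": 2, "email": " b.jones@example.com "}]

def prototype_merge(*sources):
    """Apply candidate transformation rules to deep copies of the sources,
    so the operational data is never modified."""
    merged = []
    for source in sources:
        for row in copy.deepcopy(source):                 # originals stay untouched
            row["email"] = row["email"].strip().lower()   # candidate cleansing rule
            merged.append(row)
    return merged

result = prototype_merge(crm, billing)
print(result)                       # inspect the consolidated prototype
print(crm[0]["email"])              # source system is unchanged
```

If the consolidated prototype profiles well, the same rules can be promoted into the real migration; if not, they are adjusted and the prototype rebuilt, at no risk to live systems.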
Data reconciliation is another activity commonly involved in Data Migrations. The aim is to verify data before and after an activity, to ensure that all original records and values are present and correct once the work has been done. In projects that involve large amounts of data this is incredibly important, since missing information is easily overlooked and will damage your data’s integrity without you even knowing.
Data Migration and Quality technology is often used for this process, drawing on other profiling activities (such as prototyping and discovery) to identify any missing data through reports.
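At its simplest, reconciliation means snapshotting the data before and after the migration step and diffing the two by key. The snapshots below are invented for illustration, but the shape of the check is the point – lost and silently changed records are reported rather than discovered months later:

```python
# Hypothetical snapshots keyed by customer ID, taken before and after a migration step.
before = {"C001": "Smith", "C002": "Jones", "C003": "Brown"}
after  = {"C001": "Smith", "C003": "Braun"}

def reconcile(before, after):
    """Report records that went missing, and records whose values
    no longer match, after the migration."""
    missing = sorted(set(before) - set(after))
    changed = sorted(k for k in before.keys() & after.keys()
                     if before[k] != after[k])
    return missing, changed

missing, changed = reconcile(before, after)
print("missing:", missing)   # records lost during the migration
print("changed:", changed)   # values that no longer match the originals
```

Real tooling reconciles counts, sums and checksums as well as individual values, but every version of the exercise answers the same question: is everything that went in still there, and still correct?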
Now, this is a big one. The differences between the outgoing Data Protection Act (DPA) and the incoming General Data Protection Regulation (GDPR) are significant. We have another blog that explains this in more detail, but the important thing to remember is that, at its core, the GDPR is all about putting the customer at the centre of your data strategy.
Whilst getting ready for the May 2018 compliance deadline may require a certain amount of preparation, those who get ahead of the game will benefit from the opportunities that transforming their data quality strategy will provide. For more on this, read our blog about how the GDPR could boost your business.
Hopefully these definitions have helped you see the wood for the trees in our fairly jargon-heavy industry. Our Data Quality Glossary is ever-growing, so if you don’t know your X from your Y, it’s a great resource for getting to the bottom of any other phrases or terms you hear. And if we haven’t defined it, let us know in the comments below or in the suggestions box and we can help you out.