What is Data Matching and how to do it

What is data matching?

Data matching is the process of identifying and merging duplicate data records. This can be done across databases to ensure matching data is aligned.

In previous data matching articles, we talked about the fundamentals of data matching, and both the art and the science of building matching rules based on the context of your end goal. In this final article, I want to discuss some of the more advanced aspects of record matching: matching on more than one set of criteria, probabilistic versus deterministic matching, and consolidating matched records into a master record. I also want to show how each of these can provide business value.

What types of data matching are there?

Sometimes, data matching on a single set of criteria won’t show you all the possible relationships among your records. Look at these examples:

Record 1              Record 2              Record 3
John Smith            John Smith            John Smith
123 Main Street       123 Main Street       53 State Street
Amesbury, MA 01913    Amesbury, MA 01913    Boston, MA 02109
(555) 123-4567        (555) 999-9999        (555) 123-4567

Matching on name and address brings records 1 and 2 together. Matching on name and phone brings records 1 and 3 together. But to get a comprehensive view of the relationship (say, a single individual with records at his home and business addresses, and a mixture of landline and cell phone numbers), you need to run both sets of matching and combine the results, so that records 1, 2, and 3 all end up in the same group. Looking at your data under a variety of grouping methodologies can be important to getting a clear view of the relationships.
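To make the idea of combining match passes concrete, here is a minimal Python sketch using the three records above. The function names (match_pairs, combine_groups) and the exact-match comparison are my own assumptions for illustration, not a reference to any particular product. The sketch runs one pass per set of criteria and then merges any groups that share a record.

```python
from collections import defaultdict

# The three example records from the table above.
records = {
    1: {"name": "John Smith", "address": "123 Main Street, Amesbury, MA 01913", "phone": "(555) 123-4567"},
    2: {"name": "John Smith", "address": "123 Main Street, Amesbury, MA 01913", "phone": "(555) 999-9999"},
    3: {"name": "John Smith", "address": "53 State Street, Boston, MA 02109", "phone": "(555) 123-4567"},
}

def match_pairs(records, fields):
    """Hypothetical helper: pair up record ids that agree exactly on every field in `fields`."""
    buckets = defaultdict(list)
    for rid, rec in records.items():
        key = tuple(rec[f].strip().lower() for f in fields)
        buckets[key].append(rid)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

def combine_groups(record_ids, *pair_sets):
    """Merge the results of several match passes with a simple union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pairs in pair_sets:
        for a, b in pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for rid in record_ids:
        groups[find(rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())

name_address = match_pairs(records, ["name", "address"])  # links records 1 and 2
name_phone = match_pairs(records, ["name", "phone"])      # links records 1 and 3
groups = combine_groups(records, name_address, name_phone)
```

Neither pass on its own links record 2 to record 3. They land in the same group only because both are linked to record 1, which is exactly the relationship a single set of criteria would miss.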

What is the difference between probabilistic and deterministic matching?

Another aspect of data matching is the difference between probabilistic and deterministic matching. There are plenty of articles on the web that go into the details of what constitutes probabilistic matching vs. deterministic matching. Let me summarize by saying that probabilistic matching takes some statistic into consideration as part of the match criteria: What is the average percentage of closeness among the compared elements? Or how unique are the items I’m comparing? (Smith vs. Smith isn’t as relevant to a match as Schladenhauffen vs. Schladenhauffen, because a rare name agreeing is stronger evidence than a common one.) Deterministic matching, on the other hand, is based more directly on the individual data comparisons. Certain fields need to pass certain comparison thresholds to get a match. If the fields don’t pass the right combination of thresholds, then you don’t get a match, or sometimes you get a “possible” match, which allows for further analysis and potential tweaking of the rules.
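The contrast can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the rule (name close, plus address or phone identical), the 0.9 threshold, and the field weights are all made up, and the similarity measure is simply the ratio from the standard library's difflib.SequenceMatcher rather than anything a real matching engine would necessarily use.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Closeness of two strings, from 0.0 to 1.0 (case and outer whitespace ignored)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def deterministic_match(r1, r2):
    """Hypothetical rule: each field must pass its own threshold, in the right combination."""
    name_ok = similarity(r1["name"], r2["name"]) >= 0.9
    address_ok = similarity(r1["address"], r2["address"]) == 1.0
    phone_ok = r1["phone"] == r2["phone"]
    if name_ok and (address_ok or phone_ok):
        return "match"
    if name_ok:
        return "possible"  # set aside for review and rule tuning
    return "no match"

def probabilistic_score(r1, r2, weights):
    """Weighted average closeness across fields; the caller compares it to one overall cutoff."""
    total = sum(weights.values())
    return sum(w * similarity(r1[f], r2[f]) for f, w in weights.items()) / total

rec1 = {"name": "John Smith", "address": "123 Main Street, Amesbury, MA 01913", "phone": "(555) 123-4567"}
rec2 = {"name": "John Smith", "address": "123 Main Street, Amesbury, MA 01913", "phone": "(555) 999-9999"}
rec3 = {"name": "John Smith", "address": "53 State Street, Boston, MA 02109", "phone": "(555) 123-4567"}

# Assumed weights; a fuller version might raise the weight of a field when its value is rare.
weights = {"name": 2.0, "address": 1.0, "phone": 1.0}

rule_1_vs_2 = deterministic_match(rec1, rec2)   # name and address agree
rule_2_vs_3 = deterministic_match(rec2, rec3)   # only the name agrees
score_1_vs_2 = probabilistic_score(rec1, rec2, weights)
```

The deterministic function answers with a category decided by which fields passed, while the probabilistic function returns a single number that you compare against a cutoff of your choosing.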

What are the pros and cons of probabilistic and deterministic data matching?

Each of these two types of data matching comes with pros and cons. Probabilistic matching can be quicker to implement, because you set match percentages rather than detailed match combinations, but it can also average out unique situations. Say all your data is exactly the same except for the generation distinctions of Sr. and Jr.: you want to make sure those records stay separate, yet their overall closeness score will be very high. Deterministic matching allows for very granular process tuning, but requires much more testing to create all the match criteria. In some ways, a combination of both methods can help mitigate the cons and accentuate the pros!
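One way to picture such a combination is a deterministic veto placed in front of a similarity score. The sketch below is a hypothetical illustration: the suffix list, the 0.9 threshold, and the function names are my assumptions, and the closeness measure is again the standard library's difflib.SequenceMatcher.

```python
import re
from difflib import SequenceMatcher

# Assumed list of generational suffixes; extend as your data requires.
SUFFIXES = {"sr", "jr", "ii", "iii", "iv"}

def split_suffix(name):
    """Return (name without suffix, suffix or None), lowercased and without periods or commas."""
    tokens = re.sub(r"[.,]", " ", name.lower()).split()
    if tokens and tokens[-1] in SUFFIXES:
        return " ".join(tokens[:-1]), tokens[-1]
    return " ".join(tokens), None

def hybrid_name_match(name1, name2, threshold=0.9):
    """Deterministic veto first, then a closeness score on what remains."""
    base1, suffix1 = split_suffix(name1)
    base2, suffix2 = split_suffix(name2)
    # Deterministic rule: two different generational suffixes never match,
    # no matter how close the rest of the name is.
    if suffix1 and suffix2 and suffix1 != suffix2:
        return False
    # Score-based rule: otherwise compare the remaining name against a threshold.
    return SequenceMatcher(None, base1, base2).ratio() >= threshold

senior_vs_junior = hybrid_name_match("John Smith Sr.", "John Smith Jr.")
same_person = hybrid_name_match("John Smith Sr", "John Smith, Sr.")
```

The veto keeps the Sr. and Jr. records apart even though a plain closeness score on the full names would clear the same threshold.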

The last item to mention is consolidating your matched records into a single master record (sometimes referred to as a “golden record”). This final step of deduplication can be done in many ways, but the purpose is to create the most accurate, representative record of the relationship you were trying to determine (household, individual, company, product, etc.).

Some people will simply take one record in the matched group and call it the master record. It may be selected by having the highest account balance, the most recent open date, or some other user-defined criterion. In this case, whatever data is contained on that record becomes the “single version of the truth” for that relationship. Another strategy is to look at each field independently across the records in the group, and select the “best” data in each case. This may mean taking the last name from the most recently opened record, the first name from the longest populated first name field, or the phone number that occurs most often in the matched group.
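Both consolidation strategies can be sketched briefly in Python. The matched group below is invented purely for illustration, as are the field names and the function names; the field-level rules are the three mentioned above.

```python
from collections import Counter
from datetime import date

# A made-up matched group for one individual (illustrative data only).
group = [
    {"first_name": "Jon", "last_name": "Smith", "phone": "(555) 123-4567",
     "opened": date(2015, 3, 1), "balance": 2500.0},
    {"first_name": "John", "last_name": "Smith", "phone": "(555) 999-9999",
     "opened": date(2019, 7, 15), "balance": 100.0},
    {"first_name": "J", "last_name": "Smyth", "phone": "(555) 123-4567",
     "opened": date(2021, 1, 10), "balance": 900.0},
]

def pick_master(group, key):
    """Strategy 1: promote one whole record, chosen by a user-defined criterion."""
    return max(group, key=key)

def build_golden_record(group):
    """Strategy 2: choose the 'best' value for each field independently."""
    newest = max(group, key=lambda r: r["opened"])
    return {
        # Last name from the most recently opened record.
        "last_name": newest["last_name"],
        # First name from the longest populated first name field.
        "first_name": max((r["first_name"] for r in group), key=len),
        # Phone number that occurs most often in the group.
        "phone": Counter(r["phone"] for r in group).most_common(1)[0][0],
    }

master = pick_master(group, key=lambda r: r["balance"])  # highest account balance
golden = build_golden_record(group)
```

Notice that the golden record here combines values from all three source records, so it does not correspond to any single original record. Whether that is desirable depends on how the consolidated record will be used.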

There’s no right or wrong answer to the best way to consolidate the information. It will be based (again) on your goals as a business and how you will be using the information in the future.

This concludes our three-part series on data matching. As you can see, there are a lot of things to consider to make a data matching process work for your business, but it is worth the effort, as it can be the most important piece of a complete view of your customers and a true data quality methodology in your organization!
