This blog post is part one of three in a mini-series we are calling The art and science of matching your data.
Matching is a term used commonly throughout data management, but it is also known by several other terms: linking, deduplication, joining, aggregation, and so on. For the purposes of this discussion, let’s define matching at the process in which I can determine a relevant association between two or more individual data records.
What is a “relevant association”? Well, the easiest examples are with contact data. A relevant association may be a household relationship, such as identifying two or more people living together at the same address. You may want to see an individual relationship; identifying two or more records related to a single person. There are other relationships as well—customer relationships, corporate relationships, franchise relationships, product relationships—pretty much anything that involves associating two or more data records together based on similar content in similar data fields.
Matching is arguably the most important process in data quality. All other activities lead to this end – standardizing, verifying, enriching – all provide benefits to your data, but are also the main ingredients to the most accurate matching process possible. Assuming you’ve gone through these efforts, and your data is in the best shape it can be, now is the time to get to matching.
First and foremost, we need to take some basic efficiency steps. If you’ve got 100s of millions of data records, you don’t necessarily want to compare every record to every other record. Not only would that be incredibly heavy on the processing side, but it also doesn’t make sense knowing that most records will have no association with most other records. The best way to avoid this is to group records based on some baseline commonality. If you intend to match records based on address, then there is no need to try to match records from different U.S. states together. If you intend to match records based on name and phone number, there is no need to try to match records from different area codes together. Based on your overall matching strategy, it should be relatively straightforward to group records based on a virtual key (ZIP code sectional center or first character and first consonant of last name, for example) to bring your match groups down to a reasonably efficient size for actual record-by-record comparisons.
Now that we have logical groups of records to match against one another, we need to figure out what fields make sense to determine a match. This relates back to the different relationships that we described—are you looking for households, individuals, companies, products? Based on what you want to identify, you can select the appropriate fields upon which to match. For contact data, this will likely be some combination of name, address, and geographic information. Usually you’ll want to compare some granular components (think house numbers, street names, street types), assuming these were standardized during earlier processing. In our example, if we grouped records based on sectional center, then there is no need to have state as a field to match on – they are already grouped at a lower level than state. If we were matching products, we may want to use fields like unit, size, color, type, or other descriptive information.
When matching elements together, you are typically using some type of character comparison to see how close the spellings of two terms are to each other. For example, if I have a record with street name “Maple” and another record with street name “Malpe”, those would tend to be considered pretty close when compared to each other. Most matching algorithms are looking for common typing mistakes – transposed characters, doubled characters, missing letters, etc. These types of differences are common, and are the usual suspects in keeping records from matching together without using “fuzzy matching” capabilities that account for such typos.
Another issue to keep an eye out for when matching elements is the length of the field and the type of field, both of which are illustrated with company names. A company name tends to be a longer field, which can incorporate more distinctions. You may also get abbreviations that have not been standardized (IBM vs International Business Machines). You may also have location information that can keep elements from looking the same (Kelly Automotive of Lynnfield vs Kelly Automotive of Danvers). Some of these issues can be handled by stronger standardization, and some by more flexible comparison routines (maybe checking full names with acronyms, or limiting comparisons to the first n number of characters).
So that’s the 411 on the science of data matching, stay tuned to check out the follow-up blog, where we’ll delve more deeply into the art of data matching.
Interested in learning more about the ins and outs of data matching?