Skip to main content

Aperture Data Studio - Harmonize duplicates step

Overview

Harmonization is the process of deduplication - taking multiple records that are deemed to be for the same underlying real-world subject and producing a single resulting record.

For example, a customer is represented in multiple systems but each system has a different amount of detail or even conflicting information on record. Similarly, a single system may have multiple instances for the same underlying real-world person.

The Harmonize duplicates step takes groups of records that are considered to be duplicates (we call them clusters) and allows you to create a single resulting record for each cluster by applying sophisticated business rules.

Cluster ID

A cluster is a group of records that are considered to be duplicates (i.e. represent the same underlying real-world subject). For the Harmonize duplicates step to identify which clusters have to be harmonized into a single record, a column containing a unique identifier for each cluster is required.

The Find duplicates step automatically creates a Duplicates: Cluster Id column, so a subsequent Harmonize duplicates step will automatically select this column.

However, if your clusters have been identified already (using Data Studio or other means), you can simply select the column that contains your cluster IDs without using the Find duplicates step.

Record processing

Depending on your requirements, you can choose one of these methods to process the duplicate records:

  • Select best record (default option): identifies a single record in the group as the ‘surviving record’. All other records are ignored.

  • Select and complete best record: identifies a single record in the group as the ‘surviving record’. If there are any missing values in the surviving record, the best values from the remaining records are used.

  • Merge best values: no single record is identified as the ‘surviving’ one; it's formed by combining the best values from each of the candidate records.

Record/value priority

This determines the criteria that will be applied when selecting the best record/values. There are four methods for selecting which records/values to give priority to.

Most populated (default option)

This option is only available when Select best record or Select and complete best record processing methods are selected.

Selects the record with the highest number of non-null column values. This option is essentially selecting the ‘most complete’ record without any regard to the quality of the data within it.

Total length
Selects the record/value that has the highest total character count (the length of the value in each column added together).

Score based
This is the most powerful and flexible of all the options - it's essentially a custom method for determining the priority of records/values.  You identify a column which will contain a score for the record/value:

  • a numeric column - the record with the highest value will be selected
  • a column containing true/false values -  the priority will be given to records with true values

The Lowest score first option allows you to flip the priority so that records with the lowest score (or that are false) are selected where possible.

Typically, a score column (or set of columns) will be added in a previous Transform step, where all of the power of Data Studio functions can be used to execute the business logic required to determine the suitability of a record or its values.

Example 1:
Your records contain a Date last updated column and you want the surviving record to be based on whichever record was updated the most recently. To do this, add a new Days since updated column in a Transform step and using the appropriate functions derive this value from the Date last updated column. This new column can now be used as the ‘score’ column.

Example 2:
You have a column indicating the source system for the record and you have a preference for certain systems in terms of the quality of the data they're likely to contain. A new 'score' column could be added that derives a score on the basis of the source system. You can apply even more complex rules, such as preferring different source systems for customer records from different categories or geographies.

Most common

This option is only available when Merge best values record processing method is selected.

When there are more than two records in a cluster and they contain inconsistent values, it often makes sense to choose the value that appears the highest number of times

For independent columns, this is straightforward and the value that occurs most frequently within the column will be selected. However, processing multiple columns together is more complex. Firstly, the most common non-null value in the first column in the group is identified, then any rows that do not share the value are discarded. This is repeated on the second and the subsequent columns until a unique row has been identified.

Tie-breaking

The processing rules can always result in two records/values being tied (same number of non-null columns, same length values, same score, etc).

In this case, we first tie-break by using the Total length rule. If we’re still tied after that, then we sort the values of the first column, then second, and so on, until we have a record at the top of the sorting.

Unselected records

The Unselected records output allows you to view records that weren't selected as the best version of an entity in the harmonization process.

Note that these records were judged to be non-best according to the default record processing rules, so this output may contain values that were merged into the best/'surviving' record.

If the Merge best values column processing was selected, this output will be empty because there is no concept of a single best record.

Column grouping

When combining values from multiple records (i.e. either the Select and complete best record or Merge best values processing method was chosen), we strongly recommend that your harmonization rules are applied to column groups instead of individual columns.

For example, you wouldn't want to use Address Line 1 column from one record and Address Line 2 from another as they could represent two entirely separate addresses. Instead, you'd group all the address columns together and apply business rules to the group in order to find the best overall address. Similarly, you may want to group columns such as {Latitude and Longitude}, and {DOBYear, DOBMonth, DOBDay}.

To create a column group, click Additional options in the Harmonize duplicates step then New group.

There are two ways to specify which columns will be added to the group:

  • Column List: pick the individual columns by column name (click Edit);
  • Data Tag: select all columns with the required tag.

The column group allows you to choose any of the value priority types to apply to all columns in the group together.  You will often want to apply different priority types to different column groups (for example you might favor using mailing addresses from your e-commerce delivery system, but email addresses from a social media system).

All columns not targeted by a column group will be processed by the default processing behavior as defined in the Harmonize duplicates step.

Overlapping groups

Each column can only be processed once but could be targeted by multiple column groups (e.g. if it has multiple data tags applied). In this case, the priority is given to the first column group in the list, and that column will be ignored by any subsequent column groups.

To make this more obvious, each column group has a list of Affected columns containing the names of all columns targeted by this group. If any columns in this list are also targeted by any column groups higher up the list, then they are shown to be crossed-out because they will be ignored in this instance.

If you have more than one column group, you can move the groups up/down in the list to change the order they will be processed in.

Ignoring incorrect values

In some cases, it's possible to identify low quality/incorrect values. If you specify a Filter column for a column group, then only the records that contain the Filter value in this column will be considered.

For example, you may have an email column in your data but want to make sure that no invalid emails make it into your 'surviving' record. To achieve this, you would need to create a column that dictates whether the email is valid or not (see the Validate emails step). Then a column group containing just the ‘Email’ column could select the ‘Is Valid Email’ column as the filter column, and the filter value would be true. The resulting harmonization would ignore any values from the ‘Email’ column where the ‘Is Valid Email’ column does not contain true, so all emails in the output are valid. This is used in situations where you would rather have no email address than an invalid one. Filter columns are applied before the prioritization process takes place (so, for example, the ‘most common’ priority rule would only count occurrences of valid emails, and ignore all invalid emails).

Another situation where Filter Columns are useful is where a source system outputs an ‘inactive’ or ‘do not use’ flag against legacy or outdated records, where you may want to avoid using such data in some or all fields.

If you want to apply more complex criteria when deciding which values should be ignored by harmonization, it's often easiest to add a Transform step earlier in the workflow, to output a new column indicating whether or not each value should be considered.