Aperture Data Studio - Find duplicates step

Overview

We strongly recommend that you tag data before using this step. This will allow the relevant columns to be automatically selected.

The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.

The Find duplicates step is most commonly used in a process to create a single customer view (SCV). The step helps you establish a duplicate store, which allows you to:

  • Locate duplicate records within existing systems.
  • Establish linkage across data silos.

Once you've established your duplicate store, you can use the Find duplicates REST API to search for and maintain records within it. Find out more.
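
As an illustration only, a search against a retained duplicate store from your own code might look like the Python sketch below. The base URL, endpoint path, payload fields and authentication header are assumptions made for this example, not the documented API contract; refer to the Find duplicates REST API documentation for the real endpoints and request shapes.

```python
import requests

# All names below are assumptions for illustration; check the Find duplicates
# REST API documentation for the actual endpoint paths, payload and headers.
BASE_URL = "https://your-data-studio-host/api/find-duplicates"  # hypothetical base URL
STORE_NAME = "customer_scv"   # the name given when the duplicate store was retained
API_KEY = "your-api-key"      # hypothetical authentication token

# Search the duplicate store for records similar to a given contact.
response = requests.post(
    f"{BASE_URL}/duplicate-stores/{STORE_NAME}/search",  # hypothetical path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "record": {
            "Forenames": "Jane",
            "Surname": "Smith",
            "Postal Code": "SW1A 1AA",
        }
    },
    timeout=30,
)
response.raise_for_status()

# A successful search would return candidate clusters with their cluster IDs
# and match levels (Exact, Close, Probable, Possible).
for match in response.json().get("matches", []):
    print(match)
```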

To apply your Find duplicates step license in Data Studio, click your username, select Update license and enter your license key in the dialog.

Key concepts

This section covers key concepts related to the Find duplicates step.

Cluster ID

A cluster is a collection of records that have been identified as representing the same entity using the Find duplicates rules. Each cluster is identified by a unique cluster ID.

Match status/level

Each match between two records will have one of the following confidence levels:

  • Exact (0) - Each individual field that makes up the record matches exactly.
  • Close (1) - Records might have some fields that match exactly and some fields that are very similar.
  • Probable (2) - Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more.
  • Possible (3) - The majority of fields in the records have a number of similarities, but do not match exactly.
  • None (4) - Records do not match.
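
To make cluster IDs and match levels concrete, here is a minimal, invented example of what clustered output could look like: two input records describing the same person with a small spelling difference share one cluster ID and a Close match level. The column names, values and output shape are illustrative only.

```python
# Numeric codes correspond to the match statuses listed above.
MATCH_LEVELS = {0: "Exact", 1: "Close", 2: "Probable", 3: "Possible", 4: "None"}

# Invented rows for illustration: both describe the same person with a minor
# spelling difference, so they share a cluster ID and a Close (1) match level.
clustered_records = [
    {"Forenames": "Jane",  "Surname": "Smith", "Postal Code": "SW1A 1AA", "cluster_id": 1, "match_level": 1},
    {"Forenames": "Jayne", "Surname": "Smith", "Postal Code": "SW1A 1AA", "cluster_id": 1, "match_level": 1},
]

for record in clustered_records:
    print(record["cluster_id"], MATCH_LEVELS[record["match_level"]], record["Forenames"], record["Surname"])
```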

Tags

If your data has columns tagged already, this step will recognize the tagged columns and list them as Selected columns.  

This step will only recognize the following system-defined tags:

  • Address
    • City
    • Country
    • County
    • Locality
    • Postal Code
    • Premise And Street
    • Province
    • State
    • Zip Code
  • Date
  • Email
  • Generic String
  • Phone
  • Name
    • Forenames
    • Surname
    • Title
  • Unique Id

Group IDs

You can apply different rulesets to columns with the same tag by using group IDs. 

For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as an address, but create separate group IDs, allowing you to apply different rulesets: only accept an exact match for the billing address, but a close one for the delivery address.

To apply a group ID to one or more columns, use the left-hand side menu in Workflow Designer:

  1. Right-click on the column.
  2. Select Configure column and enter the value for Group ID.
  3. Click Apply to save the changes. 
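
As a conceptual sketch of the billing/delivery example above (this is not Data Studio's configuration format), the snippet below shows two groups of columns that share the Address tag but carry different group IDs, so each group can be pointed at a different ruleset. The column, group and ruleset names are invented.

```python
# Conceptual illustration only -- not Data Studio's configuration format.
# Both groups carry the Address tag, but their different group IDs allow a
# different ruleset to be applied to each.
column_groups = {
    "billing_address": {   # group ID applied via Configure column
        "tag": "Address",
        "columns": ["Billing Premise And Street", "Billing City", "Billing Postal Code"],
        "ruleset": "exact_match_only",   # hypothetical ruleset name
    },
    "delivery_address": {
        "tag": "Address",
        "columns": ["Delivery Premise And Street", "Delivery City", "Delivery Postal Code"],
        "ruleset": "allow_close_match",  # hypothetical ruleset name
    },
}

for group_id, config in column_groups.items():
    print(f"{group_id}: {config['tag']} columns -> ruleset '{config['ruleset']}'")
```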

Rulesets and blocking keys

The Find duplicates step creates blocks of similar records to assist with the generation of suitable candidate record pairs for scoring. Blocks are created from records that have the same blocking key values.

Blocking keys are created for each input record from combinations of the record’s elements that have been keyed. Keying is the process of encoding individual elements to the same representation so that they can be matched despite minor differences in spelling.
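
To illustrate the general blocking idea (a simplified sketch, not the keying algorithm the step actually uses), the example below encodes each record's surname and outward postcode into a toy blocking key, then groups records that share a key into blocks of candidate pairs.

```python
from collections import defaultdict

# Invented sample records; field names and values are illustrative only.
records = [
    {"id": 1, "Surname": "Smith", "Postal Code": "SW1A 1AA"},
    {"id": 2, "Surname": "Smyth", "Postal Code": "SW1A 1AA"},
    {"id": 3, "Surname": "Patel", "Postal Code": "M1 2AB"},
]

def blocking_key(record):
    """Toy keying: drop vowels/Y from the surname and keep the outward postcode.

    The real keying used by the Find duplicates step is far more sophisticated;
    this only shows how similar spellings can be encoded to the same value.
    """
    surname = record["Surname"].upper()
    keyed_surname = surname[0] + "".join(c for c in surname[1:] if c not in "AEIOUY")
    outward_postcode = record["Postal Code"].split()[0]
    return f"{keyed_surname}|{outward_postcode}"

# Records sharing a blocking key fall into the same block; pairs within a block
# become candidates for scoring against the ruleset.
blocks = defaultdict(list)
for record in records:
    blocks[blocking_key(record)].append(record["id"])

print(dict(blocks))  # {'SMTH|SW1A': [1, 2], 'PTL|M1': [3]}
```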

Click Undefined blocking keys to specify a blocking key set. 

To view the default and define your own blocking key sets, go to Glossary > Find Duplicates blocking keys. Find out how to create your own blocking keys.

A ruleset is a set of logical expressions (rules) that control how records are compared and how match statuses/levels are decided.

Click Undefined ruleset to specify a ruleset. 

To view the default and define your own rulesets, go to Glossary > Find Duplicates rulesets. Find out how to create your own rules.

Note: Consumer users will not be able to access the Glossary. Find out about user roles.

The following default blocking keys and rulesets are available:

  • Individual - groups records with similar names at similar addresses. For example, GBR_Individual_Default will find individuals in Great Britain. Note that emails, phone numbers, and other identifiers will not be taken into account, but can be added manually.
  • Household - groups records with the same or similar family names at a similar address. For example, GBR_Household_Default will find households in Great Britain.
  • Location - groups records with similar addresses or locations. For example, GBR_Location_Default will find locations in Great Britain.

Retaining a duplicate store

You can retain your duplicate store to disk, so it can be used for searching and maintenance operations.

Duplicate stores are retained in your machine's Data Studio repository, within the experianmatch sub-directory. However, if you have configured a separate instance of the Find duplicates server, duplicate stores will be retained on the machine hosting that server.

To retain a duplicate store when using the Find duplicates step:

  1. Tick the Retain the duplicate store checkbox.
  2. Enter a name for your duplicate store.
  3. Click Show data to run the step and retain the duplicate store to disk.

Note that executing the entire workflow (instead of running the Find duplicates step separately within the workflow) will retain the duplicate store in the same way.

Connecting to a Find duplicates server

You can connect to either an embedded (in Data Studio) or a separate instance of the Find duplicates server. By default, Data Studio will connect to the embedded instance, which will run automatically together with the Data Studio service. 

If this workflow step was disabled at startup, or you have previously pointed it at a separate Find duplicates server, you will have to restart Data Studio to connect back to the embedded instance.

Note: There is currently a 1 million record processing limit when using the default deployment settings for the Find duplicates step.

If you've made any changes to the workflow or your data source after running the Find duplicates step, you may have to click Clear saved results to clear the cached results. Note that this option is disabled if there are no stored results or once the cache has been cleared.

Troubleshooting

If you can't find an answer or a solution, contact support.

Find duplicates job failed

If a Find duplicates job fails to run or start, you will see the following error: An error occurred running Find Duplicates step: Failed to connect to Standardization server.

To fix the issue:

  1. Search for services from the start menu or go to Control Panel > System and Security > Administrative Tools > Services.
  2. Locate Experian Aperture Data Studio {version number} Standardize Server, right-click it and select Properties.
  3. Open the Log On tab.
  4. Select This account and enter NT AUTHORITY\NETWORK SERVICE as the account.
  5. Click Apply to save changes.
  6. Right-click on Experian Aperture Data Studio {version number} Standardize Server and select Start.