Aperture Data Studio - Find duplicates step

Use the step

We strongly recommend that you tag data before using this step. This will allow the relevant columns to be automatically selected.

This step is powered by Experian Match and allows you to identify similar contacts by appending a cluster ID and match status to each record.  

Use Data Studio to apply your Experian Match license: click on your username, select the Update license dialog and enter your license key.

Cluster ID

A cluster is a collection of records that have been identified as representing the same entity using the Find Duplicates rules. Each cluster is identified by the unique cluster ID.

Match status/level

This is the confidence level in the match. Each match between two records will have one of the following five levels:

  • Exact - each individual field that makes up the record matches exactly.
  • Close - records might have some fields that match exactly and some fields that are very similar.
  • Probable - records might have some fields that match exactly, some fields that are very similar and some fields that differ a little more.
  • Possible - records contain the majority of fields that have a number of similarities but do not match exactly. 
  • None - records do not match.
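As a rough illustration of how the appended columns might be used downstream, the sketch below orders the five levels listed above and groups exported rows by cluster ID. The column names (name, cluster_id, match_status) and the sample rows are assumptions made for this example, not the step's actual output format.

```python
from collections import defaultdict
from enum import IntEnum


class MatchLevel(IntEnum):
    """The five match levels, ordered from weakest to strongest."""
    NONE = 0
    POSSIBLE = 1
    PROBABLE = 2
    CLOSE = 3
    EXACT = 4


def clusters_at_or_above(rows, minimum):
    """Group rows by cluster ID, keeping rows whose level is at least `minimum`."""
    clusters = defaultdict(list)
    for row in rows:
        if MatchLevel[row["match_status"].upper()] >= minimum:
            clusters[row["cluster_id"]].append(row["name"])
    return dict(clusters)


# Hypothetical exported rows; the column names are assumptions.
rows = [
    {"name": "Jon Smith", "cluster_id": 1, "match_status": "Close"},
    {"name": "John Smith", "cluster_id": 1, "match_status": "Close"},
    {"name": "J Smyth", "cluster_id": 1, "match_status": "Possible"},
    {"name": "Ann Lee", "cluster_id": 2, "match_status": "None"},
]
result = clusters_at_or_above(rows, MatchLevel.PROBABLE)
print(result)
```

Treating the levels as an ordered scale makes it easy to apply a threshold such as "Probable or better" when reviewing clusters.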

If your data has columns tagged already, this step will recognize the tagged address or name columns and list them as Selected columns.  

This step will only recognize the following system-defined tags:

  • Address
    • City
    • Country
    • County
    • Locality
    • Postal Code
    • Premise And Street
    • Province
    • State
    • Zip Code
  • Date
  • Email
  • Generic String
  • Phone
  • Name
    • Forenames
    • Surname
    • Title

Group IDs

You can apply different rulesets to columns with the same tag by using group IDs. 

For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as an address but create separate group IDs, allowing you to apply different rulesets: only accept an exact match for the billing but a close one for the delivery address.

To apply a group ID to one or more columns, use the left-hand side menu in Workflow Designer: right-click on the column, select Configure column and enter the value for Group Id. Click Apply to save changes. 

Click Undefined blocking keys to specify an Experian Match blocking key set. Experian Match creates blocks of similar records to assist with the generation of suitable candidate record pairs for scoring. Blocks are created from records that have the same blocking key values. Blocking keys are created for each input record from combinations of the record’s elements that have been keyed.

To view the default and define your own blocking key sets, go to Glossary > Experian Match blocking keys
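The general idea of blocking can be sketched in a few lines. The example below is a conceptual illustration only: the key used here (first three letters of the surname plus the postal code) and the field names are hypothetical, and it does not reproduce how Experian Match builds its keys.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three letters of the surname plus the postal code."""
    surname = record["surname"][:3].upper()
    postcode = record["postcode"].replace(" ", "").upper()
    return (surname, postcode)


def candidate_pairs(records):
    """Return ID pairs for records that share a blocking key value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(ids, 2))
    return pairs


records = [
    {"id": 1, "surname": "Smith", "postcode": "SW1A 1AA"},
    {"id": 2, "surname": "Smithe", "postcode": "sw1a1aa"},
    {"id": 3, "surname": "Jones", "postcode": "SW1A 1AA"},
]
pairs = candidate_pairs(records)
print(pairs)
```

Because only records within the same block are paired, the number of comparisons passed on for scoring stays far below the number of all possible record pairs.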

Click Undefined ruleset to specify an Experian Match ruleset. A ruleset is a set of logical expressions (rules) that control how records are compared and how match statuses/levels are decided.

To view the default and define your own rulesets, go to Glossary > Experian Match rulesets

The following default blocking keys and rulesets are available:

  • Household - groups records with the same or similar family names at a similar address. For example, GBR_Household_Default will find households in Great Britain.
  • Individual - groups records with similar names at similar addresses. For example, GBR_Individual_Default will find individuals in Great Britain. Note that emails, phone numbers, and other identifiers will not be taken into account but can be added manually.
  • Location -  groups records with similar addresses or locations. For example, GBR_Location_Default will find locations in Great Britain.

You can connect to either a local or a remote instance of Experian Match. By default, Data Studio will connect to the locally installed instance which will be run automatically together with the Data Studio service.

If this workflow step was disabled at startup, or it was previously configured to point at a remote Experian Match server, you will have to restart Data Studio to connect to the local instance.

Note that the local Experian Match instance (or a remote instance running on the same machine as Data Studio) can currently process a maximum of 100,000 records.

If you've made any changes to the workflow or your data source after running a matching job, you may have to click Clear saved results to clear the cache containing results from the previous job. Note that this option is disabled when there are no stored results, including after the cache has been cleared.

Install a remote instance

To connect Data Studio to a remote instance of Experian Match you have to:

  1. Install and configure the Experian Match instance.
  2. Configure the connection in Data Studio.

Install Experian Match

1. Install and configure the Experian Match instance

Requirements

In order to deploy Experian Match, you have to deploy the API under an application server, install and configure the Standardisation service, and install your database drivers.

The system requires access to a JDBC compliant database to store its index tables. You may use your existing infrastructure, or deploy a dedicated instance.

Software requirements

  1. Windows Server 2012 or higher
  2. .NET Framework 4.5
  3. Java JRE 8
  4. Application server capable of deploying a war file. We recommend Apache Tomcat 8.5.

Hardware requirements

The matching system is highly multi-threaded and will benefit from running on a machine with multiple cores.

Minimum:

  • 2 CPU cores @ 2GHz+
  • 8GB RAM
  • 100MB HDD space (to install Standardisation reference data for only 1 country)

Recommended:

  • 8 CPU cores @ 2GHz+
  • 32GB RAM
  • 600MB disk (to install Standardisation reference data for all countries)

 

Install Standardize

Experian Match uses an external service, GdqStandardizeServer, to perform its input standardisation. This service has to be running for matching to work correctly.

1) Install data. Copy the Standardize directory to C:\Program Files\Experian\Standardize. By default, GdqStandardizeServer data should be copied to the Data folder within the install directory. 

2) Add the license key. Open the licence.ini file and paste in the license key we’ve provided.

3) Install and run the Standardize service:

  1. In an Administrator PowerShell prompt, navigate to the installation location.
  2. Run .\GdqStandardizeServiceManager.ps1 install. This will register the service in the Windows Services console.
  3. Run .\GdqStandardizeServiceManager.ps1 start. This will attempt to start the service.

 

Advanced configuration

You can modify the following parameters using the Experian.Gdq.Standardize.Standalone.config file:

FilePath

Path to the data which GdqStandardizeServer requires to function.

Default: './Data'

hostIp

The IP address which GdqStandardizeServer will use.

Default: 127.0.0.1 (i.e. localhost)

port

The port which GdqStandardizeServer will use.

Default: 5000

defaultCountry

The default country which GdqStandardize will use when processing records.

Default: GBR

defaultCountryInfluence

The default level of influence which GdqStandardize will use when processing records.

We recommend leaving this unchanged in most cases, particularly if input records cover multiple countries.

A higher value, for example 500, can be used to force Experian Match to treat all input records as coming from the defaultCountry.

Default: 50

Override alias to rootname mappings

It's possible to override the file containing the alias to rootname mappings using standardisation.rootname.file.path.

Match standardisation port

By default, Experian Match is configured to use GdqStandardizeServer on 127.0.0.1:5000.

To change this, edit the application.properties file in the deployment directory. Add the properties standardisation.host and standardisation.port, set to the required values.
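Before changing these values, it can help to confirm that something is listening on the host and port you intend to use. The following is a minimal sketch that assumes the service accepts plain TCP connections on its configured port; the helper name is_listening is made up for this example.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Defaults taken from this page; change them to match your configuration.
    print(is_listening("127.0.0.1", 5000))
```

A result of False only tells you that no connection could be opened; it does not say whether the cause is the service, the port or a firewall.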

Country processing

When standardising records, GdqStandardize needs to know what country the data in the record is referring to, in order to derive more information. Proper, country-specific standardisation affects the rest of the matching process, as it changes how potential matches are found.

The defaultCountry setting determines which country the standardisation system will assume the record is from. However, this can be overridden on a per-record basis by specifying an ISO 3166-1 alpha-3 code in the input data. The data should be mapped to the COUNTRY data type. For example:

Country          ISO 3166-1 alpha-3
United Kingdom   GBR
United States    USA
Australia        AUS
France           FRA
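If your source data holds country names rather than codes, you could derive the alpha-3 code before running the step. The sketch below covers only the four example countries above; the function name and the choice to return None for unknown values are assumptions for illustration.

```python
# Mapping limited to the four example countries; extend it for your own data.
ALPHA3_BY_NAME = {
    "united kingdom": "GBR",
    "united states": "USA",
    "australia": "AUS",
    "france": "FRA",
}


def to_alpha3(country):
    """Return the alpha-3 code for a known country name or code, else None."""
    value = country.strip()
    if value.upper() in ALPHA3_BY_NAME.values():
        return value.upper()  # already a known alpha-3 code
    return ALPHA3_BY_NAME.get(value.lower())


print(to_alpha3(" France "))
print(to_alpha3("gbr"))
print(to_alpha3("Atlantis"))
```

Returning None for unrecognised values lets you decide whether to leave the country empty, in which case the defaultCountry setting described above applies.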

Deploy

Experian Match REST API has to be deployed under an application server.

Instructions below are for Apache Tomcat.

Install the latest stable version of Tomcat (currently 8.5) according to the Apache installation instructions.

Experian Match REST API is deployed like any other web application by copying the supplied war file to the CATALINA_HOME\webapps directory.

To check that your deployment was successful, navigate to http://localhost:{port}/matching-rest-api-{VersionNumber}/swagger-ui.html. The default Tomcat port is 8080.
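The same check can be scripted. This sketch builds the URL pattern given above and requests it with the Python standard library; the version string "1.2.3" is a placeholder, so substitute the version number of the war file you deployed.

```python
from urllib.request import urlopen


def swagger_url(version, port=8080, host="localhost"):
    """Build the Swagger UI URL from the pattern described above."""
    return f"http://{host}:{port}/matching-rest-api-{version}/swagger-ui.html"


def is_deployed(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return False


if __name__ == "__main__":
    url = swagger_url("1.2.3")  # placeholder version number
    print(url, is_deployed(url))
```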

Memory tuning

Ensure that you have as much memory allocated to the Tomcat JVM as possible.

We recommend setting the minimum heap size to 1GB using the -Xms setting. Using the -Xmx setting, set the maximum heap size as high as possible while leaving sufficient memory for the operating system and any other running processes. For example:

export CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx12g" 

Database drivers

If you are planning to connect to a SQL database, you will have to make sure that you have installed the relevant JDBC database drivers.

If your configurations only include flat files, Mongo, or HSQL, the following steps are not required and can be skipped. Installation of the SQL drivers can be done at a later date by following the steps and restarting the application.

To install the SQL Server drivers:

  1. Download the JDBC Driver 6.0 for SQL Server.
  2. Unzip this to a location of your choice and navigate to this location.
  3. Copy the sqljdbc42.jar file located in {EXTRACT LOCATION}\sqljdbc_6.0\enu\jre8 to your CATALINA_HOME\lib directory.

To install the Oracle drivers:

  1. Download the ojdbc6.jar file.
  2. Copy the ojdbc6.jar file to your CATALINA_HOME\lib directory.

You will have to supply the relevant driver name when making an API request that includes the connectionSettings object.
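For reference, the sketch below lists the driver class names that ship in the two jar files mentioned above, together with the usual JDBC URL shapes for each database. The host, database and service names are hypothetical, and the structure of the connectionSettings object itself is not shown because it is not described on this page.

```python
# Driver class names provided by sqljdbc42.jar and ojdbc6.jar respectively.
JDBC_DRIVERS = {
    "sqlserver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "oracle": "oracle.jdbc.OracleDriver",
}


def sqlserver_url(host, database, port=1433):
    """Build a SQL Server JDBC URL (1433 is the SQL Server default port)."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"


def oracle_url(host, service, port=1521):
    """Build an Oracle thin-driver JDBC URL using a service name."""
    return f"jdbc:oracle:thin:@//{host}:{port}/{service}"


# Hypothetical server and database names, for illustration only.
print(JDBC_DRIVERS["sqlserver"], sqlserver_url("db01", "matchstore"))
print(JDBC_DRIVERS["oracle"], oracle_url("db02", "ORCLPDB1"))
```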

 

Configure

Experian Match has the following configuration properties:

  • matchstore.purge (type: Boolean, default: False) - removes tables created in the JDBC match store after job completion. We advise setting this to true for Data Studio users.
  • matching.configLocation (type: java.lang.String, default: configuration will not be persisted) - a directory where configuration settings will be persisted.
  • standardisation.rootname.file.path (type: java.lang.String, default: the internal root name file will be used) - the location of an alias to rootName file.

Add configuration using Tomcat

  1. Create an XML file under CATALINA_HOME\conf\Catalina\localhost\. This file should have the same name as the deployed WAR file, with the .xml extension in place of .war. If the WAR has been deployed under ROOT, the configuration file should be called ROOT.xml.
  2. Add the required configuration as an Environment property in the <Context> block. For example:
<Context>
    <Environment name="matching.configLocation" value="c:\Experian\Match\config" type="java.lang.String" override="true" />
    <Environment name="standardisation.rootname.file.path" value="c:\Experian\Match\rootNames.txt" type="java.lang.String" override="true" />
</Context>

A Tomcat restart is required to load the settings after this file is created or modified.

matching.configLocation

We recommend that the save location is set to a directory outside the Tomcat installation directory, as this is often a volatile directory, and not suitable for user data.

This directory should be included as part of a standard backup policy.

standardisation.rootname.file.path

The file should contain no header and only two columns separated by a comma:

  1. Alias - the first column contains the name aliases. These should be unique. If an alias appears more than once, only the last entry that appears in the file will be used during Standardisation.
  2. RootName - the second column contains the root name, which the alias maps to. The same root name can appear multiple times in this column.

For example:

ABBY,ABIGAIL
ABDLE,ABDUL
ABDOU,ABDUL
ABDUH,ABDUL
ABY,ABIGAIL

Tomcat has to be restarted after this file is created or modified.
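Because only the last entry for a repeated alias is used, it can be worth checking the file for duplicates before restarting Tomcat. The following is a minimal sketch of such a check; the function name and the extra ABBY,ABBEY line in the sample are made up for illustration.

```python
import csv
import io


def load_rootnames(text):
    """Parse alias,rootname lines; return the mapping and any repeated aliases."""
    mapping, duplicates = {}, []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        alias, rootname = row[0].strip(), row[1].strip()
        if alias in mapping:
            duplicates.append(alias)
        mapping[alias] = rootname  # a later entry replaces an earlier one
    return mapping, duplicates


# The last line is a hypothetical duplicate added to show the behaviour.
sample = "ABBY,ABIGAIL\nABDLE,ABDUL\nABY,ABIGAIL\nABBY,ABBEY\n"
mapping, duplicates = load_rootnames(sample)
print(mapping["ABBY"], duplicates)
```

To check a real file, read its contents and pass the text to load_rootnames; any alias reported in the second return value appears more than once.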

Persistent configuration

By default, configurations created via the REST API are not saved to disk. As such, when the application server shuts down (or is redeployed) any configurations are lost, and must be recreated on start-up.

To enable persistent configuration, set a path using the matching.configLocation property. 

You should now configure the connection in Data Studio. 

Configure Data Studio

Once you've installed and configured the Experian Match instance, the second and final step is to configure the connection in Data Studio. 

2. Configure the connection in Data Studio

  1. Go to Configuration > Step settings > Find duplicates.
  2. Click Remote Match: Enabled.
  3. Specify the following to connect to your Experian Match server:
    • Remote Match: Hostname - the IP address
    • Remote Match: Path - the location of the folder that was created when the .war file was deployed
    • Remote Match: Port - the port number (8080 by default) used by the service. Depending on your setup, you may also have to unblock port 61616.
  4. Click JDBC Match store connection: Enabled.
  5. Specify the following:
    • JDBC Match store: Connection string
    • JDBC Match store: Driver name
    • JDBC Match store: Password
    • JDBC Match store: Username 

If your Data Studio and Match instances are on different networks (e.g. on a VM):

  1. Click Aperture Server Hostname (for Match) and enter the hostname of the machine with Data Studio installed. Note that this hostname has to be reachable from the machine with Experian Match installed.

 

Troubleshooting

If you can't find an answer or a solution, contact support.

Service 'Experian GDQ Standardize Server' (GdqStandardizeServer) failed to start. Verify that you have sufficient privileges to start system services.

To start the service:

  1. Search for services from the start menu or go to Control Panel > System and Security > Administrative Tools > Services.
  2. Locate the Experian GDQ Standardize Server, right-click and select Properties.
  3. Open the Log On tab.
  4. Select This account and enter NT AUTHORITY\NETWORK SERVICE as the account.
  5. Click Apply to save changes.
  6. Right-click on the Experian GDQ Standardize Server service and select Start.