
Find duplicates step - Installing a separate instance

Overview

The Find duplicates step utilizes a server that's embedded within Data Studio and therefore uses the same resources (e.g. CPU/memory) that have been allocated to Data Studio.

To allow dedicated resources to be allocated and make the step run faster, you can install a separate instance of the Find duplicates server and connect it to Data Studio.

A separate instance can be: 

  • local – the server is deployed on the same machine as Data Studio, or
  • remote – the server is deployed on a separate machine.

There's currently a 1 million record processing limit when using an embedded or local instance of the Find duplicates server. To process volumes above 1 million records, you have to configure a remote instance.

To install a separate instance of the Find duplicates server and connect it to Data Studio, have a look at the requirements and then follow these steps:

  1. Install and configure the Standardize service (only applies to remote instances).
  2. Deploy the server.
  3. Connect the server to Data Studio.

If you upgrade your version of Data Studio, you must also manually upgrade your separate Find duplicates server to the latest version to maintain compatibility. If you're using a remote instance, you must also upgrade the Standardize directory.

Requirements

Before installing a separate instance of the Find duplicates server, make sure you meet the requirements below.

Software requirements

  1. .NET Core 2.1 Runtime or above. 
  2. Java JRE 8.
  3. An application server capable of deploying a war file. We recommend Apache Tomcat.

Hardware requirements

The Find duplicates server is highly multi-threaded and benefits from running on an enterprise-grade server with as many CPU cores and as much memory as possible.

Hardware requirements differ depending on the number of records the Find duplicates step has to process, the quality of data, and its level of duplication.

Number of records | Requirements
Up to 1 million | Small workload requirements
1–10 million | Medium workload requirements
10+ million | Large workload requirements

Installing and configuring Standardize

These instructions only apply if you're installing a remote instance of the Find duplicates server. 

The Find duplicates step uses an external service named GdqStandardizeServer to perform input standardization. This service has to be running for the Find duplicates step to work correctly. 

Setup

Prior to installing or starting the service, you need to copy the Standardize directory:

  1. On the machine where you have installed Data Studio, find the Standardize directory, e.g. C:\Program Files\Experian\Aperture Data Studio {version number}\Standardize.
  2. Copy the Standardize directory to a location of your choice on your remote machine (see the example after this list). We recommend:
    • C:\Program Files\Experian\Standardize if you are using Windows.
    • /home/<user>/experian/Standardize if you are using Linux.
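
For example, if the remote machine is also running Windows and its administrative share is reachable, the whole copy can be done in one PowerShell command (a minimal sketch; the machine name "remote-machine" and the share path are assumptions to adapt to your environment):

Copy-Item -Recurse "C:\Program Files\Experian\Aperture Data Studio {version number}\Standardize" "\\remote-machine\c$\Program Files\Experian\Standardize"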

Installing the service

Once you have copied the Standardize directory, you need to install Experian.Gdq.Standardize.Web as a service. Follow the instructions below.

Windows

  1. In an Administrator PowerShell prompt, navigate to the Standardize directory.
  2. Run .\GdqStandardizeServiceManager.ps1 install. This will register the service in the Windows Services console.
  3. Run .\GdqStandardizeServiceManager.ps1 start. This will attempt to start the service.
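
To verify that the service was registered and has started, you can query it from the same prompt (a quick check; this assumes the service is registered under the name GdqStandardizeServer, as referenced elsewhere in this guide):

Get-Service GdqStandardizeServer

The Status column should report Running once the service is up.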

Linux

  1. Navigate to the Standardize directory.
  2. As an administrator, run ./install_linux_prereqs.sh. This will install further prerequisites required for the service.
  3. As an administrator, run ./deploy_gdqs.sh. This will install and start the service.

Deploying the Find duplicates server

Install the latest stable version of your chosen application server. We recommend using Tomcat.

The Find duplicates server is deployed like any other web application by copying the supplied war file to the CATALINA_HOME\webapps directory.

You can find the war file in your Data Studio installation directory: C:\Program Files\Experian\Aperture Data Studio {version number}\findDuplicates.
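
Once the war file has been copied to the Tomcat machine, deployment on Linux is a single copy (a sketch; this assumes the war file is named match-rest-api-{VersionNumber}.war, matching the deployed path referenced below):

cp match-rest-api-{VersionNumber}.war "$CATALINA_HOME/webapps/"

Tomcat will unpack and deploy the application automatically.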

If you're using Linux, ensure you have full access to:

  • Tomcat's installation directory.
  • Tomcat's installation sub-directories, including /conf, /webapps, /work, /temp, /db and /logs.

To check that your deployment was successful, go to: http://localhost:{port}/match-rest-api-{VersionNumber}/match/docs/index.html. The default Tomcat port is 8080.
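
You can also verify the deployment from the command line (a sketch; substitute your port and version number):

curl -I "http://localhost:8080/match-rest-api-{VersionNumber}/match/docs/index.html"

An HTTP 200 response indicates that the application deployed successfully.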

Memory tuning

Ensure that you have as much memory allocated to the Tomcat JVM as possible.

The minimum heap size (set with -Xms) cannot be lower than 1 GB.

The maximum heap size (set with -Xmx) should be as high as possible while leaving sufficient memory for the operating system and any other running processes:

export CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx12g" 
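
To make these settings persist across restarts, a common approach is to put them in Tomcat's setenv script, which the startup scripts pick up automatically (a sketch for Linux; create the file if it doesn't exist, and adjust -Xmx to the memory available on your machine):

# $CATALINA_HOME/bin/setenv.sh
export CATALINA_OPTS="$CATALINA_OPTS -Xms1g -Xmx12g"

On Windows, the equivalent file is %CATALINA_HOME%\bin\setenv.bat.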

Encrypting the connection

A remote instance of the Find duplicates server can be set up to support an encrypted connection (HTTPS). Follow the steps in your application server's documentation to achieve this.

The JRE used by Data Studio will validate certificate trust. By default, the certificate must have a valid trust chain referencing a public Certificate Authority (CA). If a private CA is used to create the certificate, it must be added to the Aperture JRE keystore. This can be achieved as follows:

C:\Program Files\Experian\Aperture Data Studio {version number}\java64\jre\lib\security>..\..\bin\keytool.exe -importcert -keystore cacerts -file c:\path\to\ca\ca.cert.pem

The default keystore password is "changeit".
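
To confirm that the CA certificate was imported, you can list the keystore contents from the same directory (you'll be prompted for the password above):

..\..\bin\keytool.exe -list -keystore cacerts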

The keystore is overwritten when Data Studio is upgraded, so the CA will have to be re-added afterwards.

Logging

The instructions in this section only apply if you want to configure logging in greater detail.

When the Find duplicates server has been deployed using Tomcat, the findDuplicates.log and findDuplicatesCore.log files can be found in CATALINA_HOME\logs. Logging is handled by the log4j framework. The logging behaviour can be changed by updating the deployed log4j2.xml file, as described below.

On Linux, the log file path(s) must be specified explicitly in the log4j2.xml configuration file as shown below:

<Property name="LOG_DIR">${sys:catalina.home}/logs</Property>
<Property name="ARCHIVE">${sys:catalina.home}/logs/archive</Property>

Log levels

The log level is specified for each major component of the deduplication process within its own section of the log4j2 configuration file under the XML section <Loggers>. For example:

<Logger name="com.experian.match.rest.api" level="WARNING" additivity="false">
<AppenderRef ref="findDuplicatesLog"/>
</Logger>
<Logger name="com.experian.match.actorsys" level="WARNING" additivity="false">
<AppenderRef ref="findDuplicatesCoreLog"/>
</Logger>

This specifies that the logs will have a log level of WARN, which is the recommended default for all components. Each component's level can be raised or lowered to change the granularity of the log file.
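
For example, to get more detail from the core deduplication logic alone, you could raise that component's level to DEBUG while leaving the others at WARN (a sketch editing the same <Loggers> section shown above):

<Logger name="com.experian.match.actorsys" level="DEBUG" additivity="false">
    <AppenderRef ref="findDuplicatesCoreLog"/>
</Logger>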

The components that may be individually configured are:

Component | Description
com.experian.match.rest.api | The overall application's web controllers; the level set here is the default applied if none of the others are configured.
com.experian.match.actorsys | The core deduplication logic.
com.experian.standardisation | The API that interfaces with the standalone standardisation component.

The log levels in the log4j2.xml file follow the hierarchy presented in the table below. Therefore, if you set the log level to DEBUG, you will get all the levels below DEBUG as well. 

We recommend using the TRACE and ALL levels only for investigative purposes. They should not be used when processing large volumes of data through the Find duplicates step.

Level | Description
ALL | All levels.
TRACE | Finer-grained informational events than DEBUG.
DEBUG | Fine-grained informational events that are most useful when debugging.
INFO | Informational messages that highlight the progress of the application at a coarse-grained level.
WARN | Potentially harmful situations.
ERROR | Error events that might still allow the application to continue running.
FATAL | Severe error events that will presumably lead the application to abort.
OFF | The highest possible rank; intended to turn logging off.

Logging outputs

By default, the Find duplicates server is set to output the logs to CATALINA_HOME\logs within two separate log files called findDuplicates.log and findDuplicatesCore.log.

To change this, edit the below section of the log4j2.xml file:

<RollingFile name="findDuplicatesLog"
fileName="${LOG_DIR}/findDuplicates.log"
filePattern="${ARCHIVE}/findDuplicates.log.%d{yyyy-MM-dd}.gz">
<PatternLayout pattern="${PATTERN}"/>
<Policies>
<TimeBasedTriggeringPolicy/>
<SizeBasedTriggeringPolicy size="1 MB"/>
</Policies>
<DefaultRolloverStrategy max="2000"/>
</RollingFile>
<RollingFile name="findDuplicatesCoreLog"
fileName="${LOG_DIR}/findDuplicatesCore.log"
filePattern="${ARCHIVE}/findDuplicatesCore.log.%d{yyyy-MM-dd}-%i.gz">
<PatternLayout pattern="${PATTERN}"/>
<Policies>
<TimeBasedTriggeringPolicy/>
<SizeBasedTriggeringPolicy size="1000 MB"/>
</Policies>
<DefaultRolloverStrategy max="100"/>
</RollingFile>

Adjusting the fileName attribute allows you to change a log file's name and location; for example, you may choose to output logging from all components into a single file, or to use different file names from the ones above.
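
For instance, to collect logging from all components in a single file, you could point both loggers at the same appender rather than renaming files (a sketch reusing the elements shown above; only the AppenderRef values change):

<Logger name="com.experian.match.rest.api" level="WARN" additivity="false">
    <AppenderRef ref="findDuplicatesLog"/>
</Logger>
<Logger name="com.experian.match.actorsys" level="WARN" additivity="false">
    <AppenderRef ref="findDuplicatesLog"/>
</Logger>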

Configuring Data Studio

Once you've installed a separate instance of the Find duplicates server, you can configure it in Data Studio:

  1. Go to Configuration > Step settings > Find duplicates.
  2. Toggle on Remote server: Enabled.
  3. Specify the following to connect to your Find duplicates server:
    Item | Description
    Remote server: Hostname | The IP address or the machine name. Don't include the protocol (http://).
    Remote server: Path | The name of the folder created when the war file was deployed. Default: match-rest-api-{VersionNumber}
    Remote server: Port | The port number (8080 by default) used by the service.
  4. If you deployed the Find duplicates server to use HTTPS, toggle on Remote server: Use https (TLS/SSL) to encrypt the connection.
  5. Click Test connection to ensure that the server information has been entered correctly and a connection can be made. If you receive a licensing error, the server was found but still needs to be licensed.

Troubleshooting

If you can't find an answer or a solution, contact support.

Find duplicates step failed to run

These instructions apply to scenarios where you have installed a remote instance of the Find duplicates server. If you have installed a local version of the server, check out these instructions.

If Standardize fails to start, you will see an error in Data Studio when trying to run the Find duplicates step: An error occurred running Find Duplicates step: Failed to connect to Standardization server.

To fix the issue, you need to check you have sufficient privileges to start system services:

  1. Search for "services" in the Start menu, or go to Control Panel > System and Security > Administrative Tools > Services.
  2. Locate GdqStandardizeServer, right-click it and select Properties.
  3. Open the Log On tab.
  4. Select This account and enter NT AUTHORITY\NETWORK SERVICE as the account.
  5. Click Apply to save changes.
  6. Right-click on the GdqStandardizeServer service and select Start.