Deduplicate

Overview

The Deduplicate function is a processor that you can use to find duplicate records among the collected records. In the stream editor, it must be connected to a collector or processor. You can configure the function to check for duplicate records based on all columns or based on specific columns.

For each record, the values in the columns you select are compared with every record in the cache. If all of the selected columns match a record that was processed earlier within the same cache period, the record is identified as a duplicate. The duplicate can then either be discarded from the stream or sent to a separate output channel.

Unique records are stored in the cache. Records in the cache must be forwarded for further processing. Records are discarded from the cache after the number of days that you specify in Records removed from cache after.
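The caching behavior described above can be illustrated with a small Python sketch. This is a toy in-memory model, not the product's implementation: the class name `DedupCache`, its methods, and the record layout are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


class DedupCache:
    """Toy model of the cache described above (hypothetical, for illustration only)."""

    def __init__(self, retention_days=1, columns=None):
        self.retention = timedelta(days=retention_days)
        self.columns = columns  # None means "compare every column"
        self._seen = {}  # record key -> time the unique record was cached

    def _key(self, record):
        # Build a hashable key from the compared columns only.
        names = sorted(record) if self.columns is None else self.columns
        return tuple((name, record.get(name)) for name in names)

    def _evict_expired(self, now):
        # Drop entries that have been cached for the retention period or longer.
        expired = [k for k, cached_at in self._seen.items()
                   if now - cached_at >= self.retention]
        for k in expired:
            del self._seen[k]

    def is_duplicate(self, record, now):
        """Return True if a matching record is still cached; otherwise cache this one."""
        self._evict_expired(now)
        key = self._key(record)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False


cache = DedupCache(retention_days=1, columns=["user", "event"])
t0 = datetime(2024, 1, 1, 12, 0)

# First occurrence: unique, so it is cached.
first = cache.is_duplicate({"user": "a", "event": "login", "bytes": 10}, t0)
# Same selected columns one hour later: duplicate, even though "bytes" differs.
second = cache.is_duplicate({"user": "a", "event": "login", "bytes": 99},
                            t0 + timedelta(hours=1))
# Two days later the cached entry has expired, so the record counts as unique again.
third = cache.is_duplicate({"user": "a", "event": "login", "bytes": 99},
                           t0 + timedelta(days=2))
```

In this sketch only the selected columns take part in the comparison, and a record is treated as a duplicate only while a matching entry is still inside the cache period.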

Configuration

The Deduplicate function has the following settings:

General Setting

Options

Description

Checking options

Every column

Select to check for duplicates in all columns.

Specific columns

Select to check for duplicates in specific columns only. Select the columns from the drop-down list in Check these columns, or type the field names one by one into the Add field box and click the + button.

Records removed from cache after

Specify the number of days after which records are removed from the cache. The default value is 1 day. Cached records can be stored in the database for up to 70 days.

Caution!

Keep this number as low as possible for performance reasons.

Handling duplicates

Discard

Select to discard the duplicates.

Create new output

Select to add a second output channel for duplicates. This lets you examine the duplicates and act upon them if required.
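As a rough illustration of how the checking options and the duplicate-handling options interact, the following Python sketch splits a list of records into a main output and an optional duplicates output. The function name `deduplicate`, the `handling` values, and the sample records are hypothetical; they mirror the settings above but are not part of the product.

```python
def deduplicate(records, columns=None, handling="discard"):
    """Split records into (main_output, duplicates_output).

    columns=None compares every column; a list compares only those columns.
    handling="discard" drops duplicates; handling="new_output" collects them
    in a second list, mimicking a separate output channel.
    """
    if handling not in ("discard", "new_output"):
        raise ValueError(f"unknown handling: {handling!r}")
    seen = set()
    main_output, duplicates_output = [], []
    for record in records:
        names = sorted(record) if columns is None else columns
        key = tuple((name, record.get(name)) for name in names)
        if key not in seen:
            seen.add(key)
            main_output.append(record)
        elif handling == "new_output":
            duplicates_output.append(record)
    return main_output, duplicates_output


records = [
    {"id": 1, "user": "a"},
    {"id": 2, "user": "a"},
    {"id": 1, "user": "a"},
]

# Every column: only the third record repeats an earlier one.
every_column = deduplicate(records, handling="new_output")
# Specific column "user": the second and third records are duplicates and are dropped.
by_user = deduplicate(records, columns=["user"], handling="discard")
```

With every column checked, two records pass through and one lands in the duplicates output. With only `user` checked and duplicates discarded, a single record remains.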

Note!

When you modify the settings, the records that are already stored in the cache are not affected. Your changes to the settings only apply to records that are processed after you have changed the settings.
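One possible way to picture this note, using the cache period as the setting that changes, is a cache in which each entry fixes its removal time when it is stored. This is an assumption made for illustration only; the source does not describe how the cache is implemented, and the `Cache` class below is hypothetical.

```python
from datetime import datetime, timedelta


class Cache:
    """Hypothetical cache where each entry keeps the expiry computed at insertion."""

    def __init__(self, retention_days):
        self.retention_days = retention_days
        self._expires = {}  # key -> expiry time, fixed when the entry is stored

    def add(self, key, now):
        self._expires[key] = now + timedelta(days=self.retention_days)

    def contains(self, key, now):
        expiry = self._expires.get(key)
        return expiry is not None and now < expiry


t0 = datetime(2024, 1, 1)
cache = Cache(retention_days=1)
cache.add("a", t0)            # stored under the old setting (1 day)

cache.retention_days = 5      # setting changed afterwards
cache.add("b", t0)            # stored under the new setting (5 days)

later = t0 + timedelta(days=2)
a_still_cached = cache.contains("a", later)
b_still_cached = cache.contains("b", later)
```

In this model the entry stored before the change still expires after one day, while the entry stored after the change follows the new five-day setting.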