Deduplicate

Overview

The Deduplicate function is a processor that finds duplicate records among the collected records. In the stream editor, it must be connected to a collector or another processor. You can configure the function to check for duplicate records based on all columns or on specific columns.

For each record, the selected columns are compared against all records in the cache to check for duplicates. If all specified columns match a record previously processed within the same cache period, the record is identified as a duplicate. The duplicate record can then be either discarded from the stream or handled in a separate output channel.

Unique records are stored in the cache and forwarded for further processing. Records are removed from the cache after the number of days that you specify in Records removed from cache after.
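To make this behavior concrete, the following is a minimal sketch in Python of column-based duplicate detection with time-based cache eviction. It is illustrative only; all names are hypothetical and this is not the product's implementation.

```python
import hashlib
import time

def record_key(record: dict, columns=None) -> str:
    """Build a cache key from the checked columns (all columns if None)."""
    cols = sorted(columns or record.keys())
    joined = "|".join(f"{c}={record.get(c)}" for c in cols)
    return hashlib.sha256(joined.encode()).hexdigest()

class DedupCache:
    """Illustrative cache: unique record keys are kept for a retention period."""

    def __init__(self, retention_days: float = 1.0):
        self.retention = retention_days * 86400   # days -> seconds
        self.seen = {}                             # key -> last-seen time

    def is_duplicate(self, record: dict, columns=None) -> bool:
        now = time.time()
        # Evict entries older than the retention period.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t <= self.retention}
        key = record_key(record, columns)
        if key in self.seen:
            return True       # all checked columns match a cached record
        self.seen[key] = now  # unique record: cache it and forward it
        return False
```

For example, DedupCache().is_duplicate({"id": 1, "name": "a"}, columns=["id"]) returns False for the first record and True for any later record with the same id within the retention window.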

Configuration

You can configure the Deduplicate function by setting the following options:


Checking options

Every column: Select this option to check for duplicates in all columns.

Specific columns: Select the columns from the dropdown list in Check these columns, or type the field names one by one into the Add field box and click the + button.

Records removed from cache after: Specify the number of days after which records are removed from the cache. The default value is 1 day, and records can be kept in the database for up to 70 days.

Caution!

For performance reasons, keep this number as low as possible.

Max size of the cache (maxcache): Specify the size of the in-memory cache, in thousands of records, that holds the latest records to be checked for duplicates. The default size is 5,000 thousand records (5 million), and the minimum size is 100 thousand records.

Unlike the previous option, which controls how long records are kept in the database, this option limits how many records are held in memory during stream execution.

Example maxcache

If Max size of the cache is set to 100 thousand records, the cache memory holds the latest 100 thousand records during stream execution. If more records are generated than this limit, only the latest 100 thousand records are checked for duplicates.
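The maxcache behavior can be illustrated with a hypothetical bounded cache that keeps only the latest N record keys; again, this is a sketch, not the product's implementation:

```python
from collections import OrderedDict

class BoundedDedupCache:
    """Illustrative sketch: keeps only the latest max_size record keys."""

    def __init__(self, max_size: int = 100_000):   # e.g. 100 thousand records
        self.max_size = max_size
        self.keys = OrderedDict()                  # insertion-ordered keys

    def is_duplicate(self, key: str) -> bool:
        if key in self.keys:
            self.keys.move_to_end(key)             # refresh: latest again
            return True
        self.keys[key] = None
        if len(self.keys) > self.max_size:
            self.keys.popitem(last=False)          # evict the oldest key
        return False
```

Once such a cache is full, each new unique record evicts the oldest one, so a duplicate of an evicted record would no longer be detected.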

Handling duplicates

Discard: Select this option to discard the duplicates.

Create new output: Select this option to add a second output channel for duplicates. This lets you examine the duplicates and act on them if required, as sketched below.
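The two handling options amount to a routing decision. The function below is hypothetical and only illustrates the choice between discarding duplicates and sending them to a second output channel:

```python
from typing import Callable, Optional

def route(record: dict,
          is_duplicate: Callable[[dict], bool],
          discard_duplicates: bool = True) -> Optional[tuple]:
    """Illustrative routing: duplicates are dropped or sent to a
    second output channel; unique records continue downstream."""
    if is_duplicate(record):
        if discard_duplicates:
            return None                     # Discard
        return ("duplicates", record)       # Create new output
    return ("output", record)               # unique record
```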

Note!

When you modify the settings, records that are already stored in the cache are not affected. Your changes apply only to records processed after the change.
