Deduplicate

Overview

The Deduplicate function is a processor that filters duplicate records from the collected records. In the stream editor, it must be connected to a collector or another processor. You can configure whether all of the columns in your collected records are checked for duplicates, or only specific columns.

For each record, the columns that you have selected are compared against all the records in the cache to check for duplicates. If all of the specified columns match a record previously processed within the same cache period, the record is identified as a duplicate. A duplicate record can either be discarded from the stream or handled in a separate output channel.

Unique records, that is, records for which no match is found, are stored in the cache and forwarded for further handling. Each record is discarded from the cache after the number of days that you have specified in Records removed from cache after.
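To illustrate the flow described above, here is a minimal sketch in Python. It is not the product's implementation; the names dedup, key_columns, and cache_days are hypothetical, and a real processor would persist its cache rather than keep it in memory.

```python
from datetime import datetime, timedelta

def dedup(records, key_columns, cache_days=1):
    """Sketch of column-based duplicate detection with a time-bounded cache."""
    cache = {}  # key tuple -> time the record entered the cache
    unique, duplicates = [], []
    for record in records:
        now = datetime.now()
        # Evict entries older than the configured cache period
        # ("Records removed from cache after").
        expiry = now - timedelta(days=cache_days)
        for key in [k for k, t in cache.items() if t < expiry]:
            del cache[key]
        # Only the selected columns are compared.
        key = tuple(record[c] for c in key_columns)
        if key in cache:
            duplicates.append(record)  # discard, or route to a second output
        else:
            cache[key] = now           # unique records are stored in the cache
            unique.append(record)      # ...and forwarded for further handling
    return unique, duplicates
```

Checking Every column corresponds to passing all column names as key_columns; Specific columns corresponds to passing only the columns selected in Check these columns.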

Configuration

You can configure the Deduplicate function by setting the following options:

Checking options

- Every column: Select to check for duplicates in all columns.
- Specific columns: Select to check for duplicates in specific columns only, and select the name of the column(s) in Check these columns.

Records removed from cache after

Specify the number of days after which records are removed from the cache. The default value is 1 day. Records in this cache can be stored in the database for a maximum of 30 days.

Max size of the cache

Specify the size, in thousands of records, of the cache that holds the latest records to be checked for duplicates. The default size is 100 thousand records, and the allowed maximum is 5 million records (5000 thousand).

This is not the same as the previous option, where the cache is stored in the database and removed after a set period of time; this setting controls the cache memory in which records are stored during stream execution.

For example, if Max size of the cache is set to 100 thousand records, the cache memory stores the latest 100 thousand records during stream execution. If more records are generated than the specified limit, only the latest 100 thousand records are checked for duplicates (see the sketch after these settings).

Handling duplicates

- Discard: Select to discard the duplicates.
- Create new output: Select to add a second output channel for duplicates. This lets you examine the duplicates and act on them if required.
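The bounded in-memory cache can be sketched with an ordered structure that evicts the oldest entry once the limit is reached. This is an illustrative assumption about the behavior, not the product's actual implementation; max_size here stands in for Max size of the cache.

```python
from collections import OrderedDict

class BoundedDedupCache:
    """Keeps only the latest max_size keys, evicting the oldest first.

    Sketches the "Max size of the cache" behavior: once more records
    have been seen than the limit allows, only the latest ones remain
    available for duplicate checking.
    """

    def __init__(self, max_size=100_000):  # default: 100 thousand records
        self.max_size = max_size
        self.entries = OrderedDict()

    def seen_before(self, key):
        if key in self.entries:
            return True  # duplicate: the key is still in the cache
        self.entries[key] = None
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the oldest entry
        return False
```

With Discard, records for which seen_before returns True would simply be dropped; with Create new output, they would be written to the second output channel instead.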


Note!

Modifying the settings does not affect records that are already stored in the cache. Your changes only apply to records that are processed after you have changed the settings.