Deduplicate
Overview
The Deduplicate function is a processor that you can use to find duplicate records among the collected records. In the stream editor, it must be connected to a collector or processor. You can configure the function to check for duplicate records based on all columns or based on specific columns.
For each record, the columns you select are compared to all the records in the cache to check for duplicates. If all specified columns match a record previously processed within the same cache period, the record is identified as a duplicate record. The duplicate record can then be either discarded from the stream or handled in a separate output channel.
Unique records are stored in the cache. Records in the cache must be forwarded for further processing. Records are removed from the cache after the number of days that you specify in Records removed from cache after.
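The matching and cache behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the function's actual implementation: the `DedupCache` class, its `is_duplicate` method, and the in-memory dictionary are hypothetical names chosen for this example, and the sketch assumes that each entry's expiry time is fixed at the moment the record is cached.

```python
from datetime import datetime, timedelta


class DedupCache:
    """Toy in-memory stand-in for the Deduplicate cache (hypothetical, not the product API)."""

    def __init__(self, retention_days=1, columns=None):
        self.retention = timedelta(days=retention_days)
        self.columns = columns  # None means "compare every column"
        self._expiry_by_key = {}

    def _key(self, record):
        # Build a hashable key from the selected columns (or from all columns).
        names = self.columns if self.columns is not None else sorted(record)
        return tuple((name, record.get(name)) for name in names)

    def is_duplicate(self, record, now):
        # Drop entries whose cache period has passed.
        self._expiry_by_key = {
            key: expiry for key, expiry in self._expiry_by_key.items() if expiry > now
        }
        key = self._key(record)
        if key in self._expiry_by_key:
            return True
        # First occurrence: remember it until its cache period ends.
        self._expiry_by_key[key] = now + self.retention
        return False


cache = DedupCache(retention_days=1, columns=["id"])
start = datetime(2024, 1, 1)
first = cache.is_duplicate({"id": 7, "value": "a"}, start)
repeat = cache.is_duplicate({"id": 7, "value": "b"}, start + timedelta(hours=1))
after_expiry = cache.is_duplicate({"id": 7, "value": "c"}, start + timedelta(days=2))
```

In this sketch the second record counts as a duplicate because only `id` is compared, while the third record is treated as unique again because the cached entry expired after one day.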
Configuration
The Deduplicate function has the following settings:
General Setting | Options | Description |
---|---|---|
Checking options | Every column | Select to check for duplicates in all columns. |
Checking options | Specific columns | Select specific columns from the drop-down list in Check these columns, or type the fields one by one into the Add field box and click the + button. |
Records removed from cache after | | Specify the number of days after which records are removed from the cache. The default value is 1 day. Records in the cache can be stored in the database for up to 70 days. Caution! Keep this number as low as possible for performance reasons. |
Handling duplicates | Discard | Select to discard the duplicates. |
Handling duplicates | Create new output | Select to add a second output channel for duplicates. This lets you examine the duplicates and act upon them if required. |
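As a rough illustration of how the Checking options and Handling duplicates settings interact, the following Python sketch routes records to a main output and, optionally, to a second output for duplicates. The function name `deduplicate`, its parameters, and the values `"discard"` and `"new_output"` are assumptions made for this example. They do not correspond to an actual API of the function, and the cache retention period is left out for brevity.

```python
def deduplicate(records, columns=None, handling="discard"):
    """Return (main_output, duplicates_output).

    columns=None mimics "Every column"; a list mimics "Specific columns".
    duplicates_output is None when duplicates are discarded.
    """
    if handling not in ("discard", "new_output"):
        raise ValueError(f"unknown handling option: {handling}")

    seen = set()
    main_output, duplicates_output = [], []
    for record in records:
        names = columns if columns is not None else sorted(record)
        key = tuple((name, record.get(name)) for name in names)
        if key in seen:
            # Duplicate: keep it on the second channel only if one was requested.
            if handling == "new_output":
                duplicates_output.append(record)
            continue
        seen.add(key)
        main_output.append(record)

    return main_output, (duplicates_output if handling == "new_output" else None)


records = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},
    {"id": 2, "value": "a"},
]
main, duplicates = deduplicate(records, columns=["id"], handling="new_output")
```

With `columns=["id"]` the second record is routed to the duplicates output. With `columns=None` all three records differ in at least one column, so none of them is a duplicate.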
Note!
When you modify the settings, the records that are already stored in the cache are not affected. Your changes to the settings only apply to records that are processed after you have changed the settings.