Duplicate Filter Collection Strategy (5.0)
This section describes the Duplicate Filter Collection Strategy that is used with the Disk, FTP, FTPS, HDFS, SCP and SFTP Collection agents.
Overview
The Duplicate Filter Collection Strategy enables you to configure a collection agent to collect files from a directory without collecting the same files more than once.
Mechanism
When the agent reads the input folder and collects files, the system inserts the most recently collected files into a list, known as the File List. The number of collected files kept in this File List at any one time is determined by the File List Size setting. This File List is used to check whether an input file is a duplicate.
When the Duplicate Filter Collection Strategy is enabled, the system collects input files by their Modification Timestamp in ascending (ASC) order. Since the Duplicate Filter has its own sorting mechanism, it is important that you do not enable Sort Order in the agent when selecting the Duplicate Filter Collection Strategy.
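The code sketches in this section are illustrative only; the class and method names are not part of the product's API. This first sketch shows the fixed sort order, assuming plain java.io.File access to a local input folder (the real agents read local or remote file systems through their own transports):

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class AscendingCollectionOrder {
    // Return the input files in the order the Duplicate Filter would
    // process them: by Modification Timestamp, ascending (oldest first).
    public static File[] collectionOrder(File inputFolder) {
        File[] files = inputFolder.listFiles(File::isFile);
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files, Comparator.comparingLong(File::lastModified));
        return files;
    }
}
```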
There are two ways to determine whether a file is a duplicate during collection:
Duplicate Criteria - Filename
If the file exists in the File List, the agent does not collect it. This process is illustrated by the flow chart below:
Duplicate Criteria - Filename
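The sketch below illustrates the Filename criterion, assuming the File List is held as a simple collection of filenames; the names are hypothetical and do not reflect the product's internals:

```java
import java.util.List;

public class FilenameCriterion {
    // A file is a duplicate exactly when its name is already present in
    // the File List, regardless of timestamps.
    public static boolean isDuplicate(List<String> fileList, String filename) {
        return fileList.contains(filename);
    }

    public static void main(String[] args) {
        List<String> fileList = List.of("11.txt", "12.txt", "13.txt");
        System.out.println(isDuplicate(fileList, "12.txt")); // true  -> skipped
        System.out.println(isDuplicate(fileList, "16.txt")); // false -> collected
    }
}
```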
Duplicate Criteria - Filename and Timestamp
If the file exists in the File List and it has a more recent Modification Timestamp than its previously recorded Modification Timestamp, the agent re-collects the newer file. This process is illustrated by the flow chart below:
Duplicate Criteria - Filename and Timestamp
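A corresponding sketch of the Filename and Timestamp criterion, assuming the File List maps each collected filename to the Modification Timestamp seen at the previous collection (again, an illustration rather than the product's internals):

```java
import java.util.Map;

public class FilenameTimestampCriterion {
    // A known file stays a duplicate unless its current Modification
    // Timestamp is newer than the one recorded at the previous collection.
    public static boolean isDuplicate(Map<String, Long> fileList,
                                      String filename, long modified) {
        Long previous = fileList.get(filename);
        return previous != null && modified <= previous;
    }

    public static void main(String[] args) {
        Map<String, Long> fileList = Map.of("11.txt", 1_000L);
        System.out.println(isDuplicate(fileList, "11.txt", 1_000L)); // true  -> skipped
        System.out.println(isDuplicate(fileList, "11.txt", 2_000L)); // false -> updated, re-collected
    }
}
```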
The input folder may contain more files than the File List can hold. When an input file does not exist in the File List, the system checks the Modification Timestamp of the oldest file in the list. The system collects the input file only if its Modification Timestamp is newer than that of the oldest file in the File List, since the Duplicate Filter always collects files in ascending (ASC) Modification Timestamp order.
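A sketch of this check, with the timestamps reduced to plain long values for illustration:

```java
public class UnknownFileCheck {
    // oldestInList is the Modification Timestamp of the oldest File List
    // entry. A file older than that is assumed to have been collected
    // earlier and merely evicted from the list, so it is skipped; a file
    // newer than it is genuinely new and is collected.
    public static boolean shouldCollect(long modified, long oldestInList) {
        return modified > oldestInList;
    }

    public static void main(String[] args) {
        long oldest = 11_000L;                              // e.g. 11.txt
        System.out.println(shouldCollect(10_000L, oldest)); // false -> skipped
        System.out.println(shouldCollect(16_000L, oldest)); // true  -> collected
    }
}
```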
Example!
Assume that the File List Size is set to 5 and that there are currently 15 files in the input folder, with Modification Timestamps in ascending (ASC) order as shown in the picture below.
When the workflow is executed, the last 5 collected files are inserted into the File List. Any new files arriving in the input folder must be newer than the oldest file in the File List, in this case 11.txt, to be collected.
Every time the workflow executes and collects new files, the oldest files in the File List are removed and the most recently collected files are inserted.
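The trace below reproduces this example, modelling the File List as a bounded deque that evicts its oldest entry when full (a sketch, not the product's data structure):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FileListTrace {
    public static void main(String[] args) {
        int fileListSize = 5;
        Deque<String> fileList = new ArrayDeque<>();
        // Collect 1.txt .. 15.txt in ascending timestamp order.
        for (int i = 1; i <= 15; i++) {
            if (fileList.size() == fileListSize) {
                fileList.removeFirst();   // evict the oldest entry
            }
            fileList.addLast(i + ".txt"); // remember the newly collected file
        }
        // The list now holds the 5 newest files, with 11.txt the oldest:
        System.out.println(fileList);     // [11.txt, 12.txt, 13.txt, 14.txt, 15.txt]
    }
}
```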
Example - when input files > File List Size
Note!
Estimate how many files will arrive in the input folder per second and set the File List Size to a number larger than this estimate. If the total number of input files exceeds the File List Size and all of the files' Modification Timestamps are identical, duplicate files may be collected.
Tip!
It is good practice to clean up the source folder from which your workflow collects the input files. This makes the Duplicate Filter run faster during workflow execution, since it does not have to repeatedly search a huge file list to check for duplicates.
Configuration
You configure the Duplicate Filter Collection Strategy from the Disk tab for the Disk Collection agent, the HDFS tab for the HDFS Collection agent, and the Source tab for the FTP, FTPS, SCP and SFTP Collection agents in the agent configuration dialog.
To Configure the Duplicate Filter Collection Strategy:
The Duplicate Filter configuration dialog
Setting | Description |
---|---|
Collection Strategy | From the drop-down list, select Duplicate Filter. |
Directory | Absolute pathname of the source directory on the remote host, where the source files reside. The pathname might also be given relative to the home directory of the User Name account. |
Include Subfolders | Select this check box if you have subfolders in the source directory from which you want files to be collected. Note! Subfolders that are in the form of a link are not supported. If you select Enable Sort Order in the Sort Order tab, the sort order selected will also apply to subfolders. |
Filename | Name of the source files on the remote host. Regular expressions according to Java syntax apply. For further information, see https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html. Example: to match all file names beginning with a certain prefix, such as `data`, enter `data.*` (see also the example after this table). |
Compression | Compression type of the source files. Determines if the agent will decompress the files before passing them on in the workflow. |
Duplicate Criteria - Filename | Select this option to have only the filename compared for the duplicate check. If the filename is in the list of files which have already been collected once, the file is ignored by the agent. Refer to the Mechanism section above for more information. |
Duplicate Criteria - Filename and Timestamp | Select this option to have both the filename and the time stamp of the last modification compared when checking for duplicates. If the file has already been collected once, it is collected again only if the duplicate check reveals that the file has been updated since the previous collection. Refer to the Mechanism section above for more information. |
File List Size | Enter a value to specify the maximum size of the list of already collected files. This list of files is compared to the input files in order to detect duplicates and prevent them from being collected by the agent. Refer to the Mechanism section above for more information. When this collection strategy is used with a multiple server connection strategy, each host has its own duplicate list. If a server is removed from the multiple server configuration, the collection strategy automatically drops the list of duplicates for that host at the next successful collection. |
Use File Reference/Route FileReferenceUDR | Select this check box to route a FileReferenceUDR instead of the raw data. |
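The example below illustrates how a Java regular expression in the Filename setting matches file names; the pattern and file names are hypothetical:

```java
import java.util.List;
import java.util.regex.Pattern;

public class FilenamePatternExample {
    public static void main(String[] args) {
        // A Filename value of data.* matches every name beginning with "data".
        Pattern pattern = Pattern.compile("data.*");
        for (String name : List.of("data_001.csv", "data_002.csv", "readme.txt")) {
            // matches() requires the whole name to match the pattern,
            // so "readme.txt" is rejected.
            System.out.println(name + " -> " + pattern.matcher(name).matches());
        }
    }
}
```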