Duplicate UDR (3.0)

The Duplicate UDR profile is loaded when you start a workflow that depends on it. Changes to the profile become effective when you restart the workflow.

Configuration

To create a new Duplicate UDR profile configuration:

  1. Click Configuration in the upper left part of the Desktop window.
  2. Select Duplicate UDR Profile from the menu.


The Edit menu is specific to Duplicate UDR profile configurations.

Item    Description

External References

Select this menu item to enable External References in an agent profile field. For further information, see Enabling External References in an Agent Profile Field in External Reference (3.0).

External References can be used with the fields:

  • Directory
  • Max Cache Age
  • Max Cache Size

The Duplicate UDR profile configuration contains the following settings:

Setting    Description

Storage Host

Select the preferred storage host, where the duplicate UDRs are to be stored. The available options are:

  • Specific EC
  • Automatic. If Automatic is selected, the EC used by the running workflow is selected. When the Duplicate UDR Inspector is used, the EC is selected automatically.

Note!

The workflow must run on the same EC where its storage resides; otherwise, the Duplicate UDR agent will refuse to run. If the storage host is set to Automatic, its corresponding directory must be on a file system shared between all the ECs.

Directory


An absolute path to the directory on the selected storage host, where the duplicate cache is stored.

If this field is greyed out and shows a directory, the directory path has been hard-coded using the mz.present.dupUDR.storage.path property. This property is set to false by default.

For further information about all available system properties, see Properties (3.0) (/wiki/spaces/MD/pages/2950266).

Max Cache Age (days)

The maximum number of days to keep UDRs in the cache. The age of a UDR stored in the cache is calculated either from the system time or from the Indexing Field (timestamp) of a UDR in the latest processed batch file, depending on whether Based on System Arrival Time or Based on Latest Timestamp in Cache is selected.

If the Date Field option, below, is not selected for the indexing field, this field is deactivated and ignored, and the cache size can only be configured using the Max Cache Size setting. The default value is 30 days.

Note!

Duplicate checking is not performed if the processed UDRs are too old; this is logged in the System Log. However, the age calculation cannot be performed if the cache is empty.

Based on System Arrival Time

When this radio button is selected (default), the age of the cached UDRs is calculated from the time when a new batch is processed.

After a longer system idle time, this setting may have a major impact on which UDRs are removed from the cache. For more information about the difference between Based on System Arrival Time and Based on Latest Timestamp in Cache when calculating the UDR age, see the section below, Using Indexing Field Instead of System Time.

Based on Latest Timestamp in Cache

When this radio button is selected, the age of the cached UDRs is calculated relative to the latest Indexing Field (timestamp) of a UDR that is included in the previously processed batch files.

For more information about the difference between Based on System Arrival Time and Based on Latest Timestamp in Cache when calculating the UDR age, see the section below, Using Indexing Field Instead of System Time.

Max Cache Size (thousands)

The maximum number of UDRs to store in the duplicate cache. The value must be in the range 100-9999999 (thousands); the default is 5000 (thousands). The cache is made up of containers covering 50 seconds each, and every incoming UDR is assigned to one of these containers.

During the initialization phase, the agent checks whether the cache is full or not. If the check indicates that less than 10% of the cache will be available, cache containers are cleared, starting with the oldest container, until 10% of the cache is free. Because containers are cleared as a whole, the number of UDRs cleared depends on how many UDRs are stored in each container. If the indexing field happens to have the same value in all the UDRs, all of the UDRs in the cache will be cleared.
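The clearing behavior can be illustrated with a minimal Python sketch. It is not the agent's actual implementation: the function names, the assumption that the indexing field is expressed in seconds, and the use of integer division to pick a container are all hypothetical. Only the 50-second container span and the 10% free threshold come from the description above.

```python
from collections import defaultdict

CONTAINER_SPAN_SECONDS = 50  # each container covers 50 seconds (from the text)


def container_id(index_seconds):
    """Map an indexing-field value (assumed to be in seconds) to a container."""
    return index_seconds // CONTAINER_SPAN_SECONDS


def build_cache(index_values):
    """Group indexing-field values into containers (hypothetical helper)."""
    containers = defaultdict(list)
    for value in index_values:
        containers[container_id(value)].append(value)
    return containers


def clear_until_free(containers, max_size):
    """Drop whole containers, oldest first, until at least 10% is free.

    Returns the number of UDRs that were cleared.
    """
    used = sum(len(udrs) for udrs in containers.values())
    cleared = 0
    # Integer form of "free space is less than 10% of max_size".
    while containers and (max_size - used) * 10 < max_size:
        oldest = min(containers)
        removed = len(containers.pop(oldest))
        used -= removed
        cleared += removed
    return cleared


# 100 UDRs spread evenly over 20 containers, in a cache that holds 100.
spread = build_cache(range(0, 1000, 10))
print(clear_until_free(spread, 100))  # 10

# 100 UDRs that all share the same index value end up in one container.
same = build_cache([500] * 100)
print(clear_until_free(same, 100))    # 100
```

The second call shows why identical index values are costly in this model: the only container is also the oldest one, so clearing it empties the cache.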

Note!

If you have a very large cache size, it may be a good idea to split the workflows in order to preserve performance. 


Enable Separate Storage Per Workflow

This option enables each workflow to have a separate storage that is checked for duplicates. This allows multiple workflows to run simultaneously using the same Duplicate UDR profile. However, if this checkbox is selected, a UDR in a workflow will not be checked against UDRs in a different workflow.

Type

The UDR type the agent will process.

Indexing Field

The UDR field used as an index in the duplicate comparison. Fields of type long and date are valid for selection.

For performance reasons, this field should preferably be either an increasing sequence number, or a timestamp with good locality. This field will always be implicitly evaluated.

For further information, see the section below, Using Indexing Field Instead of System Time.

Date Field

If selected (default), the indexing field is treated as a timestamp instead of a sequence number. This option must be selected to be able to set the maximum age of UDRs to keep in the cache in the Max Cache Age (days) field above.

Note!

If the selected indexing field is a timestamp that is configured to be 24 h or more ahead of the system time, the workflow will abort.
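As a rough illustration of the 24 h limit, the following Python sketch compares an indexing-field timestamp with the system time. The function name and the example dates are hypothetical, and how the agent performs this check internally is not described here.

```python
from datetime import datetime, timedelta

MAX_AHEAD = timedelta(hours=24)  # limit stated in the note above


def too_far_ahead(index_timestamp, system_time):
    """Return True when the timestamp is 24 h or more ahead of system time."""
    return index_timestamp - system_time >= MAX_AHEAD


now = datetime(2024, 1, 10, 12, 0)
print(too_far_ahead(datetime(2024, 1, 11, 11, 59), now))  # False
print(too_far_ahead(datetime(2024, 1, 11, 12, 0), now))   # True
```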

Checked Fields

The fields to use for the duplication evaluation, when deciding whether or not a UDR is a duplicate.

Note!

If the Checked Fields or the Indexing Field are modified after an agent has been executed, the information already stored will be considered useless the next time the workflow is activated. Hence, duplicates will never be found among the old information, since the metadata it was based on has been replaced by another type of metadata.
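The following Python sketch shows one way to picture the comparison, and why changing the field selection makes old entries unusable. It is an assumption-laden model, not the agent's storage format: the key layout, the function names, and the UDR field names (seq, a_number, b_number, duration) are all hypothetical.

```python
def duplicate_key(udr, indexing_field, checked_fields):
    """Build a comparison key from the indexing field and the checked fields.

    The field names are part of the key, so a key built with one field
    selection never matches a key built with another (an assumption made
    for this sketch).
    """
    fields = (indexing_field, *checked_fields)
    return fields, tuple(udr[name] for name in fields)


def check_and_store(cache, udr, indexing_field, checked_fields):
    """Return True if an equal key is already cached; otherwise store it."""
    key = duplicate_key(udr, indexing_field, checked_fields)
    if key in cache:
        return True
    cache.add(key)
    return False


cache = set()
call = {"seq": 1, "a_number": "111", "b_number": "222", "duration": 30}

# First occurrence: not a duplicate, the key is stored.
print(check_and_store(cache, call, "seq", ["a_number", "b_number"]))  # False

# Same checked fields, different unchecked field: reported as a duplicate.
resent = dict(call, duration=99)
print(check_and_store(cache, resent, "seq", ["a_number", "b_number"]))  # True

# After changing the checked fields, the old entry no longer matches.
print(check_and_store(cache, call, "seq", ["a_number", "duration"]))  # False
```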

Using Indexing Field Instead of System Time

The "cache time window" (see the figure below) decides whether a UDR shall be removed from the cache or not. The maximum number of days to store a UDR in the cache is retrieved from the setting Max Cache Age (days) each time a new batch file is processed (and the age of the UDRs is calculated). The "cache time window" will be moved forward and old UDRs will be removed.

Calculation of the UDR age can be done in two ways:

  • Using the latest indexing field (timestamp) of a UDR that is included in the previously processed batch files.
  • Using system time.

The following figure illustrates the difference:

UDR removed from cache based on indexing field or system time

If the system has been idle for an extended period of time, there will be a "delay" in time. When a new batch file is then processed, and system time is used for the UDR age calculation, the "cache time window" will be moved forward with the delay included. This might result in all UDRs being removed from the cache, as shown in the figure above. The consequence is that later copies of the improperly removed UDRs will be considered non-duplicates and, hence, might be processed even though they are duplicates.

If the indexing field is used instead, the calculation will be more accurate, since the "system delay time" is excluded. In this case only UDR 1 and UDR 2 will be removed.
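The difference between the two calculations can be sketched in Python. This is a simplified model under stated assumptions: the function name, the example dates, the 3-day Max Cache Age, and the inclusive window boundary are all hypothetical and chosen only to reproduce the outcome described above.

```python
from datetime import datetime, timedelta


def keep_in_window(cached, max_cache_age_days, reference_time):
    """Keep only the UDRs whose timestamp lies inside the cache time window.

    The window ends at reference_time and spans max_cache_age_days.
    """
    cutoff = reference_time - timedelta(days=max_cache_age_days)
    return {name: ts for name, ts in cached.items() if ts >= cutoff}


# Five cached UDRs, one per day, 1-5 January at 12:00.
cached = {f"UDR {day}": datetime(2024, 1, day, 12, 0) for day in range(1, 6)}

latest_timestamp = datetime(2024, 1, 5, 18, 0)  # latest indexing field seen
system_time = datetime(2024, 1, 20, 9, 0)       # first batch after a long idle period

by_index = keep_in_window(cached, 3, latest_timestamp)
by_system = keep_in_window(cached, 3, system_time)

print(sorted(by_index))   # ['UDR 3', 'UDR 4', 'UDR 5']
print(sorted(by_system))  # []
```

With the latest timestamp as the reference, only UDR 1 and UDR 2 fall outside the window. With system time as the reference, the idle period pushes the window past every cached UDR.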