9.20.2 Duplicate UDR Detection Profile

A Duplicate UDR Detection agent is configured in two steps. First, a profile is defined; then the regular configuration of the agent is made.

The Duplicate UDR profile is loaded when you start a workflow that depends on it. Changes to the profile become effective when you restart the workflow.

Configuration

To create a new Duplicate UDR profile configuration, click the New Configuration button in the upper left part of the Desktop, and then select Duplicate UDR Profile from the menu.

Duplicate UDR profile dialog

The contents of the menus in the menu bar may change depending on which configuration type has been opened. The Duplicate UDR profile uses the standard menu items and buttons that are visible for all configurations, and these are described in 2.1 Menus and Buttons.

The Edit menu is specific for Duplicate UDR profile configurations.

Item    Description

External References

Select this menu item to Enable External References in an agent profile field. Refer to Enabling External References in an Agent Profile Field in 8.11 External Reference Profile for further information.

External References can be used with the fields:

  • Directory
  • Max Cache Age
  • Max Cache Size

The Duplicate UDR profile configuration contains the following settings:

Setting    Description

Storage Host

In the drop-down menu, select the preferred storage host, where the duplicate UDRs are to be stored. The duplicate repositories can be stored either on a specific EC or on an automatically selected one. If Automatic is selected, the same EC used by the running workflow is selected, or, when the Duplicate UDR Inspector is used, the EC is selected automatically.

Note!

The workflow must run on the same EC where its storage resides; otherwise, the Duplicate UDR Detection agent will refuse to run. If the storage is configured to be Automatic, its corresponding directory must be on a file system shared between all the ECs.

Directory


An absolute path to the directory on the selected storage host, in which to store the duplicate cache.

If this field is greyed out with a stated directory, it means that the directory path has been hard-coded using the mz.preset.dupUDR.storage.path property. This property is set to false by default.

Example - Using the mz.preset.dupUDR.storage.path property

To enable the property and state the directory to be used:

mzsh topo set val:common.mz.preset.dupUDR.storage.path '/mydirectory/dupudr'


To disable the property:

mzsh topo unset val:common.mz.preset.dupUDR.storage.path

For further information about all available system properties, see 2.6 System Properties.

Max Cache Age (days)

The maximum number of days to keep UDRs in the cache. The age of a UDR stored in the cache is calculated either from the system time or from the Indexing Field (timestamp) of a UDR in the latest processed batch file, depending on whether Based on System Arrival Time or Based on Latest Time Stamp in Cache is selected.

If the Date Field option, below, is not selected for the indexing field, this field is deactivated and ignored, and the cache size can only be configured using the Max Cache Size setting. The default value is 30 days.

Note!

Duplicate checking is not performed if the processed UDRs are too old; this is logged in the System Log. However, the age calculation cannot be performed if the cache is empty.

Based on System Arrival Time

When this radio button is selected (default), the age of a cached UDR is calculated from the time at which a new batch is processed.

In case of a longer system idle time, this setting may have a major impact on which UDRs are removed from the cache. For more information about the difference between Based on System Arrival Time and Based on Latest Time Stamp in Cache when calculating the UDR age, see the section below, Using Indexing Field Instead of System Time.

Based on Latest Time Stamp in Cache

When this radio button is selected, the age of a cached UDR is calculated relative to the latest Indexing Field (timestamp) of a UDR included in the previously processed batch files.

For more information about the difference between Based on System Arrival Time and Based on Latest Time Stamp in Cache when calculating the UDR age, see the section below, Using Indexing Field Instead of System Time.

Max Cache Size (thousands)

The maximum number of UDRs to store in the duplicate cache. The value must be in the range 100 to 9999999 (thousands); the default is 5000 (thousands). The cache is made up of containers covering 50 seconds each, and for every incoming UDR, it is determined in which cache container the UDR will be stored.

During the initialization phase, the agent checks whether the cache is full or not. If the check indicates that less than 10% of the cache will be available, cache containers are cleared, starting with the oldest container, until 10% of the cache is free. Because the number of UDRs stored in each container varies, the number of UDRs cleared depends on the setup. If the index field happens to have the same value in all the UDRs, all of the UDRs in the cache will be cleared.
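The clearing rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the agent's actual implementation: the names `container_key` and `clear_until_free` are hypothetical, index values are assumed to be timestamps in epoch seconds, and the cache is modelled as a plain mapping from container to UDR count.

```python
CONTAINER_SPAN_SECONDS = 50  # each container covers 50 seconds (from the text above)


def container_key(index_seconds):
    """Return the container an index value falls into (assumes epoch seconds)."""
    return index_seconds // CONTAINER_SPAN_SECONDS


def clear_until_free(containers, max_cache_size):
    """Clear the oldest containers until at least 10% of the cache is free.

    containers: dict mapping container key -> number of UDRs stored in it.
    max_cache_size: maximum number of UDRs (actual count, not thousands).
    Returns the keys of the cleared containers, oldest first.
    """
    used = sum(containers.values())
    cleared = []
    for key in sorted(containers):
        # Stop once the free space is at least 10% of the maximum.
        if (max_cache_size - used) * 10 >= max_cache_size:
            break
        used -= containers.pop(key)
        cleared.append(key)
    return cleared


# Hypothetical cache: three containers, 950 of 1000 slots used.
cache = {0: 400, 1: 300, 2: 250}
print(clear_until_free(cache, 1000), cache)  # [0] {1: 300, 2: 250}

# All UDRs share one index value, so they share one container.
same_index = {7: 950}
print(clear_until_free(same_index, 1000), same_index)  # [7] {}
```

The second call illustrates why identical index values are costly: the only container available for clearing holds every UDR, so the whole cache is emptied.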

Note!

If you have a very large cache size, it may be a good idea to split the workflows in order to preserve performance. 


Type

The UDR type the agent will process.

Indexing Field

The UDR field to use as an index in the duplicate comparison. Fields of type long and date are valid for selection.

For performance reasons, this field should preferably be either an increasing sequence number or a timestamp with good locality. This field will always be implicitly evaluated.

For further information, see the section below, Using Indexing Field Instead of System Time.

Date Field

If selected (default), the indexing field is treated as a timestamp instead of a sequence number. This option must be selected to be able to set the maximum age of UDRs to keep in the cache in the Max Cache Age (days) field above.

Note!

If the selected indexing field is a timestamp that is configured to be 24 h or more ahead of the system time, the workflow will abort.
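The 24-hour rule in the note above amounts to a simple comparison. The following is a minimal sketch, assuming the indexing field and the system time are both available as `datetime` values; the function name `is_too_far_ahead` is hypothetical and not part of the product.

```python
from datetime import datetime, timedelta

MAX_AHEAD = timedelta(hours=24)  # limit stated in the note above


def is_too_far_ahead(index_timestamp, system_time):
    """Return True if the timestamp is 24 h or more ahead of the system time."""
    return index_timestamp - system_time >= MAX_AHEAD


# Hypothetical values for illustration.
now = datetime(2023, 1, 1, 12, 0, 0)
print(is_too_far_ahead(now + timedelta(hours=24), now))  # True
print(is_too_far_ahead(now + timedelta(hours=23), now))  # False
```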

Checked Fields

The fields to use for the duplication evaluation, when deciding whether or not a UDR is a duplicate.

Note!

If the Checked Fields or the Indexing Field is modified after an agent has been executed, the already stored information will be considered useless the next time the workflow is activated. Hence, duplicates will never be found amongst the old information, since another type of metadata has replaced it.
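To illustrate how the Indexing Field and the Checked Fields work together, the following sketch builds a comparison key from both and remembers the keys it has seen. It is a simplified model, not the agent's implementation: the class `DuplicateCache`, the dictionary representation of a UDR, and the field names are all hypothetical.

```python
class DuplicateCache:
    """Toy duplicate detector keyed on the indexing field plus checked fields."""

    def __init__(self, indexing_field, checked_fields):
        self.indexing_field = indexing_field
        self.checked_fields = tuple(checked_fields)
        self._seen = set()

    def _key(self, udr):
        # The indexing field is always part of the comparison,
        # followed by the configured checked fields.
        checked = tuple(udr[name] for name in self.checked_fields)
        return (udr[self.indexing_field],) + checked

    def is_duplicate(self, udr):
        """Return True if an equal key was seen before; otherwise store the key."""
        key = self._key(udr)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


# Hypothetical UDRs represented as dictionaries.
cache = DuplicateCache("seq", ["a_number", "b_number"])
first = {"seq": 1, "a_number": "100", "b_number": "200", "duration": 30}
again = {"seq": 1, "a_number": "100", "b_number": "200", "duration": 45}
other = {"seq": 2, "a_number": "100", "b_number": "200", "duration": 30}

print(cache.is_duplicate(first))  # False
print(cache.is_duplicate(again))  # True: "duration" is not a checked field
print(cache.is_duplicate(other))  # False: the indexing field differs
```

In this model, keys built from one set of fields cannot match keys built from another, which is consistent with the note above: after the fields are changed, previously stored information no longer yields matches.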

Using Indexing Field Instead of System Time

The "cache time window" (see the figure below) decides whether a UDR shall be removed from the cache or not. The maximum number of days to store a UDR in the cache is retrieved from the setting Max Cache Age (days) each time a new batch file is processed (and the age of the UDRs is calculated). The "cache time window" will be moved forward and old UDRs will be removed.

Calculation of the UDR age can be done in two ways:

  • Using the latest indexing field (timestamp) of a UDR that is included in the previously processed batch files.
  • Using system time.

The following figure illustrates the difference:

UDR removed from the cache based on indexing field or system time

If the system has been idle for an extended period of time, there will be a "delay" in time. When a new batch file is then processed, and system time is used for the UDR age calculation, the "cache time window" is moved forward with the delay included, which might result in all UDRs being removed from the cache, as shown in the figure above. The consequence is that the improperly removed UDRs will be treated as non-duplicates and, hence, might be processed even though they are still duplicates.

If the indexing field is used instead, the calculation is more accurate, since the "system delay time" is excluded. In this case, only UDR 1 and UDR 2 will be removed.
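The difference between the two reference times can be sketched with a small example. The dates, the four-UDR cache, and the function name `expired_udrs` are hypothetical and chosen only to reproduce the situation described above; the figure's actual values are not known here.

```python
from datetime import datetime, timedelta


def expired_udrs(cache, max_cache_age_days, reference_time):
    """Return the names of UDRs that fall outside the cache time window.

    cache: dict mapping UDR name -> indexing-field timestamp.
    reference_time: system time or latest timestamp in the cache.
    """
    window_start = reference_time - timedelta(days=max_cache_age_days)
    return [name for name, stamp in cache.items() if stamp < window_start]


# Hypothetical cache contents.
cache = {
    "UDR 1": datetime(2023, 1, 10),
    "UDR 2": datetime(2023, 1, 20),
    "UDR 3": datetime(2023, 2, 10),
    "UDR 4": datetime(2023, 3, 1),
}
max_age_days = 30  # default Max Cache Age (days)

# Based on Latest Time Stamp in Cache: the idle period is excluded.
latest_in_cache = max(cache.values())
print(expired_udrs(cache, max_age_days, latest_in_cache))
# ['UDR 1', 'UDR 2']

# Based on System Arrival Time: the system was idle until 1 May.
system_time = datetime(2023, 5, 1)
print(expired_udrs(cache, max_age_days, system_time))
# ['UDR 1', 'UDR 2', 'UDR 3', 'UDR 4']
```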