HTTP Batch Agent Configuration(3.3)
The HTTP Batch agent contains the following tabs:
- Connection
- Source
- Advanced
- Duplicate Check
Connection
HTTP batch collection agent configuration dialog - Connection tab
Field | Description |
---|---|
URL | The full URL of the file to be collected. Note! If the collected file contains links to other pages, these are only followed if Enable Index Based Collection is selected. Refer to Enable Index Based Collection in the Source section below. |
Username | HTTP authorization username used in requests |
Password | HTTP authorization password used in requests |
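
As an illustration of how the three Connection fields could combine in a request, here is a minimal Python sketch. It assumes the credentials are sent using HTTP Basic authentication, which this page does not state explicitly. The function name `build_request` and the example URL are hypothetical, and the sketch only builds the request object without contacting any server.

```python
import base64
import urllib.request


def build_request(url, username, password):
    """Build a GET request carrying Basic credentials (assumed scheme)."""
    # Basic authentication encodes "username:password" in Base64.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical values standing in for the URL, Username and Password fields.
request = build_request("http://example.com/data/file.csv", "user", "secret")
print(request.get_header("Authorization"))
```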
Source
HTTP batch collection agent configuration dialog - Source tab
Item | Description |
---|---|
Compression | Select whether the agent should try to decompress the collected data before routing it into the workflow. The options are 'No Compression' and 'Gzip'. |
Enable Index Based Collection | Select to enable index-based collection. All URLs linked from the HTML-formatted document are collected. The document is the one specified in the URL field in the Connection section above. |
URL Pattern | Either leave empty or enter a regular expression filtering the full URL. If empty all files are collected, otherwise files matching the URL Pattern will be collected. The URL itself will not be routed into the workflow. |
Enable Control File Based Collection | When selected, the agent only collects files for which a control file is present. The expected control filename is defined by the Position and Control File Extension settings. |
Position | The control filename consists of an extension added either before or after the shared filename part. There are two choices: Prefix or Suffix. Refer to the example below, Control File Extensions, for more information. |
Control File Extension | The Control File Extension is used to define when the data file should be collected. A data file is only collected if the corresponding control file exists. The text entered in this field is the expected extension to the shared filename. The Control File Extension is attached to the shared filename according to the setting in the Position field. Refer to the example below, Control File Extensions, for more information. |
Data File Extension | The Data File Extension is an optional field that is used when a stricter definition of the files to be collected is needed. It is only applicable if the Position is set to Suffix. Refer to the example below for more information. Example - Control File Extensions: Consider a directory containing 5 files:
|
Enable HTTP DELETE | When selected, the agent instructs the web server to delete the file and the control file after the file has been successfully collected. If not selected, the file is ignored after collection, that is, the file is left on the web server. |
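
The selection rules in the Source tab can be sketched in Python as follows. This is an illustration under stated assumptions, not the agent's actual implementation: it assumes the URL Pattern must match the whole URL, that a Prefix control file is named extension + shared filename, and that a Suffix control file is named shared filename + extension, with the data file being the shared filename plus the optional Data File Extension. All function names and file names here are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(index_html, index_url):
    """Return absolute URLs for all links found in the index document."""
    parser = _LinkParser()
    parser.feed(index_html)
    return [urljoin(index_url, href) for href in parser.hrefs]


def filter_by_pattern(urls, url_pattern=""):
    """Keep every URL if the pattern is empty, else only full matches (assumed)."""
    if not url_pattern:
        return list(urls)
    regex = re.compile(url_pattern)
    return [u for u in urls if regex.fullmatch(u)]


def collectable_files(names, position, control_ext, data_ext=""):
    """Return the data files that have a matching control file.

    Assumes a non-empty control_ext and the naming rules described above.
    """
    present = set(names)
    result = []
    for name in names:
        if position == "Prefix":
            if name.startswith(control_ext):
                continue  # this entry is itself a control file
            control = control_ext + name
        else:  # "Suffix"
            if name.endswith(control_ext):
                continue  # this entry is itself a control file
            if data_ext and not name.endswith(data_ext):
                continue  # stricter filter: wrong data file extension
            shared = name[: len(name) - len(data_ext)] if data_ext else name
            control = shared + control_ext
        if control in present:
            result.append(name)
    return result


# Hypothetical directory listing: only a.dat has a control file (a.ok).
print(collectable_files(["a.dat", "a.ok", "b.dat"], "Suffix", ".ok", ".dat"))
```

With Position set to Suffix and no Data File Extension, the sketch treats the shared filename itself as the data file, which is one plausible reading of the field descriptions above.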
Advanced
HTTP batch collection agent configuration dialog - Advanced tab
Item | Description |
---|---|
Use Security Profile | Enable this option to allow the HTTP Batch agent to use HTTPS. |
Security Profile | Browse for the security profile that the HTTP Batch agent should use. |
Read Timeout (ms) | The maximum time, in milliseconds, to wait for a response from the server. 0 (zero) means to wait forever. |
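
The Read Timeout semantics, where zero means no limit, can be illustrated with a small Python helper. The function name `read_timeout_seconds` is hypothetical. The sketch only shows how a millisecond setting could map onto a Python-style socket timeout, where `None` means block indefinitely.

```python
def read_timeout_seconds(read_timeout_ms):
    """Convert a Read Timeout (ms) setting to seconds; 0 means wait forever."""
    if read_timeout_ms < 0:
        raise ValueError("Read Timeout must not be negative")
    # Python socket APIs use None to express "no timeout".
    return None if read_timeout_ms == 0 else read_timeout_ms / 1000.0


print(read_timeout_seconds(1500))
print(read_timeout_seconds(0))
```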
Duplicate Check
HTTP batch collection agent configuration dialog - Duplicate Check tab
The Duplicate Check feature is only used when Enable Index Based Collection, in the Source section above, is selected.
Item | Description |
---|---|
Enable Duplicate Check | When selected, the agent stores every collected URL for a configurable number of days. The storage is checked to make sure that no URL is collected again as long as it remains in the storage. |
Database Profile | Each collected URL is stored in the database defined in the selected profile. The schema must contain a table called "duplicate_check". For more information about this table, refer to HTTP Batch Appendix - Database Requirements for Duplicate Check(3.3). |
Max Cache Age (Days) | The number of days to keep collected URLs in the database. When the workflow starts, it deletes entries that are older than this number of days. Note! If a duplicate-check workflow runs on more than one EC on separate servers, and the system clocks are not synchronized, there is a risk that UDR duplicates are prematurely deleted. For example: If two system clocks are 12 hours apart and Max Cache Age is set to 1 day, duplicate UDRs might be deleted after only 12 hours, instead of 24. |
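
The clock-skew risk described for Max Cache Age can be worked through with a short Python sketch. It models the duplicate store as an in-memory dictionary of URL to storage timestamp, which is an assumption made only for illustration; the real storage is the database table referenced above. The `prune` function and the example URL are hypothetical.

```python
from datetime import datetime, timedelta


def prune(entries, now, max_cache_age_days):
    """Drop entries whose timestamp is older than the configured age."""
    cutoff = now - timedelta(days=max_cache_age_days)
    return {url: ts for url, ts in entries.items() if ts >= cutoff}


real_time = datetime(2024, 1, 1, 12, 0)
url = "http://example.com/files/a.dat"  # hypothetical collected URL

# Server A's clock runs 12 hours behind, so it stamps the entry too early.
skewed = {url: real_time - timedelta(hours=12)}
# A server with a correct clock stamps the entry with the real time.
in_sync = {url: real_time}

# Server B, with a correct clock, starts a workflow 13 hours later.
later = real_time + timedelta(hours=13)
print(url in prune(skewed, later, 1))   # entry already removed
print(url in prune(in_sync, later, 1))  # entry still kept
```

In the skewed case the entry looks 25 hours old to server B after only 13 real hours, so it is removed and the same URL could be collected again.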