HTTP Batch Appendix - Database Requirements for Duplicate Check

The Duplicate Check feature stores the collected URLs in an external database referenced by a Database profile. The schema of this database must contain a table whose definition matches what the agent expects.

Table and Column Names

The table must be named "duplicate_check". It must contain all of the columns listed in this table:

Table column    Description

txn             The transaction ID of the batch that collected the URL. If the file is split into several chunks using hintEndBatch, this is the final transaction ID.

tstamp          The timestamp when the URL was committed by the workflow.

workflow_key    An ID that uniquely identifies the workflow collecting the URL. It allows workflows to be renamed without changing the table data.

url             The full absolute URL collected.
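One way to confirm that a schema meets these naming requirements is to compare the table's columns against the required set. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the external database, the helper name missing_columns is hypothetical, and PRAGMA table_info is SQLite-specific, so another database would need its own catalog query.

```python
import sqlite3

# Column names required by the Duplicate Check table, as listed above.
REQUIRED_COLUMNS = {"txn", "tstamp", "workflow_key", "url"}


def missing_columns(conn, table="duplicate_check"):
    """Return the required columns that the table lacks (empty set if none)."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # It returns no rows at all if the table does not exist.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1].lower() for row in rows}
    return REQUIRED_COLUMNS - present


# SQLite stand-in for the external database. Column types are omitted here
# because SQLite does not require them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE duplicate_check (txn, tstamp, workflow_key, url)")

print(missing_columns(conn))  # set()
```

An empty result means all four columns are present. A non-empty result names the columns that still need to be added.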


Column Types

The actual column types depend on how the specific JDBC driver maps JDBC types to the native types of the database.

  • The txn column is a JDBC BIGINT type.

  • The tstamp column is a JDBC TIMESTAMP type.

  • The workflow_key and url columns are of JDBC VARCHAR type.

Oracle Example

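-- Table definition usable for Oracle.
-- txn is declared as NUMBER(19) because Oracle's LONG is a legacy
-- character type, not a 64-bit integer.
CREATE TABLE duplicate_check(
   txn number(19),
   tstamp timestamp,
   workflow_key varchar2(32),
   url varchar2(256)
);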
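To illustrate how a table with these four columns can support a duplicate check, here is a minimal sketch. It is not the agent's implementation, which this appendix does not describe. The assumptions are: a URL counts as a duplicate when the same workflow_key has already stored it, SQLite stands in for the external database, and the helper names is_duplicate and record_url as well as the sample values are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def is_duplicate(conn, workflow_key, url):
    """Return True if this workflow has already recorded the URL (assumed rule)."""
    row = conn.execute(
        "SELECT 1 FROM duplicate_check WHERE workflow_key = ? AND url = ?",
        (workflow_key, url),
    ).fetchone()
    return row is not None


def record_url(conn, txn, workflow_key, url):
    """Store a collected URL with its transaction ID and commit timestamp."""
    conn.execute(
        "INSERT INTO duplicate_check (txn, tstamp, workflow_key, url) "
        "VALUES (?, ?, ?, ?)",
        (txn, datetime.now(timezone.utc).isoformat(), workflow_key, url),
    )
    conn.commit()


# SQLite stand-in; column types are omitted because SQLite does not need them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE duplicate_check (txn, tstamp, workflow_key, url)")

url = "http://example.com/data/file1.csv"  # hypothetical sample URL
if not is_duplicate(conn, "wf-001", url):
    record_url(conn, 1001, "wf-001", url)

print(is_duplicate(conn, "wf-001", url))  # True
print(is_duplicate(conn, "wf-002", url))  # False
```

Because the lookup is keyed on workflow_key rather than on a workflow name, renaming a workflow would not affect which URLs are treated as already collected.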