HDFS Collection Agent

The HDFS collection agent collects files from an HDFS (Hadoop Distributed File System), the primary distributed storage used by Hadoop applications, and inserts them into a workflow. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. Initially, the source directory is scanned for all files matching the current filter. In addition, the Filename Sequence and Sort Order services may be used to further manage the matching of files. However, the two services must not be used at the same time, since doing so causes the workflow to abort. All files found are fed one after the other into the workflow.
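
The scan-and-filter step described above can be pictured with a small sketch. This is an illustration only, not the agent's implementation: the function name select_files, its parameters, the glob-style filter syntax, and the sample file names are all hypothetical, and a plain Python list stands in for an HDFS directory listing.

```python
import fnmatch


def select_files(listing, pattern, sort_key=None, use_sequence=False):
    """Return the names in `listing` that match `pattern`, in feed order.

    Hypothetical helper: `listing` is a plain list of file names standing in
    for the contents of the HDFS source directory.
    """
    if sort_key is not None and use_sequence:
        # The two services are mutually exclusive. The real agent aborts the
        # workflow; this sketch raises an exception instead.
        raise ValueError("Filename Sequence and Sort Order cannot be combined")

    # Keep only the names that match the (assumed glob-style) filter.
    matches = [name for name in listing if fnmatch.fnmatchcase(name, pattern)]

    # Optionally order the matches before they are fed into the workflow.
    if sort_key is not None:
        matches.sort(key=sort_key)
    return matches


# Illustrative data only.
listing = ["cdr_002.csv", "cdr_001.csv", "notes.txt", "cdr_003.csv.gz"]
selected = select_files(listing, "cdr_*.csv*", sort_key=str)
```

With the sample listing, notes.txt is filtered out and the remaining three names are returned in alphabetical order, which is the order in which they would be handed to the workflow one by one.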

When a file has been successfully processed by the workflow, the agent can move, rename, remove, or ignore the original file. The agent can also be configured to keep files for a set number of days. In addition, the agent can decompress gzip-compressed files after they have been collected. When all files have been successfully processed, the agent stops and awaits the next activation, whether scheduled or manually initiated.
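
As a rough sketch of the options above, the following shows optional gzip decompression of a collected payload and the four after-collection actions. Everything here is an assumption made for illustration: the function names read_payload and after_collection, the action strings, and the ".done" suffix are hypothetical, and the local file system stands in for HDFS.

```python
import gzip
import os
import shutil


def read_payload(data: bytes, decompress: bool = False) -> bytes:
    """Return the collected bytes, gunzipped if decompression is enabled."""
    return gzip.decompress(data) if decompress else data


def after_collection(path, action, move_to=None, suffix=".done"):
    """Apply one after-collection action to a processed file.

    Hypothetical helper operating on local paths; returns the file's new
    location, or None if the file was removed.
    """
    if action == "move":
        if move_to is None:
            raise ValueError("'move' requires a target directory")
        os.makedirs(move_to, exist_ok=True)
        return shutil.move(path, os.path.join(move_to, os.path.basename(path)))
    if action == "rename":
        new_path = path + suffix  # assumed naming scheme for this sketch
        os.rename(path, new_path)
        return new_path
    if action == "remove":
        os.remove(path)
        return None
    if action == "ignore":
        return path  # leave the original file untouched
    raise ValueError(f"unknown action: {action}")
```

In this sketch the action is applied only after the workflow has reported success for the file, which mirrors the order described above.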