HDFS Agents Preparations

Hadoop mzp Package

Note!

It is recommended that you first determine whether the HDFS agents work out of the box with your Hadoop setup, before deciding whether you need to apply the steps below.

The Apache Hadoop jar files and dependencies required by the HDFS agents are bundled in an mzp package that is included when the HDFS agents are installed. The bundled Apache Hadoop package currently contains the version 3.3.4 Hadoop jars together with version-compatible dependencies. It has been tested against an Apache Hadoop 3.3.4 setup, and in most cases the HDFS agents work out of the box without the need to apply the steps below to provide a custom Apache Hadoop package.

However, if you experience issues (e.g. NoClassDefFoundError or ClassNotFoundException) when running an HDFS agent workflow, the cause is likely one of the following scenarios, and you may require a different set of Hadoop jar files and dependencies:

  • Several different distributions of Hadoop are available, so you may encounter compatibility issues if you are using a distribution other than the one available at hadoop.apache.org .

  • Your Hadoop setup is of a much earlier or later version compared to the version of the Hadoop jars in the bundled Apache Hadoop package.
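To judge the second scenario, you can compare your cluster's Hadoop version against the bundled 3.3.4 jars. The sketch below hard-codes an illustrative cluster version; on a real cluster node you would take it from the output of the standard `hadoop version` command, as shown in the comment:

```shell
#!/bin/sh
# Sketch: compare the cluster's Hadoop version against the bundled 3.3.4
# to judge whether a custom Hadoop mzp package may be needed.
BUNDLED="3.3.4"
# On a cluster node, CLUSTER can be derived from `hadoop version`, e.g.:
#   CLUSTER="$(hadoop version | head -n 1 | awk '{print $2}')"
CLUSTER="3.1.0"   # illustrative value for this sketch
if [ "$CLUSTER" != "$BUNDLED" ]; then
  echo "Hadoop $CLUSTER differs from bundled $BUNDLED: a custom package may be needed"
fi
```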

To create and commit the Hadoop mzp package:

  1. Copy the set of jar files for the Hadoop version you want to use to the machine that MediationZone is running on.

    The set of jar files comprises hadoop-auth, hadoop-common, hadoop-hdfs, commons-collections and version compatible dependencies of these jars. You can refer to the set of jars used in the tested example below and include additional dependencies as needed. If your HDFS agents do not work with your custom Hadoop mzp package, contact support.

    Depending on the file structure, the files may be located in different folders, but typically they are in a folder called hadoop or hadoop-common, where the hadoop-common.jar file is placed in the root directory and the remaining jar files in a subdirectory called lib.
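As a sketch of this step, the snippet below collects all jars from such a layout into a single staging folder. The paths are illustrative stand-ins (created here as temporary directories with demo files so the snippet is self-contained), not fixed locations:

```shell
#!/bin/sh
# Sketch, assuming the typical layout described above: hadoop-common.jar in
# the root of the Hadoop folder and the remaining jars under lib/.
SRC="$(mktemp -d)"      # stand-in for your Hadoop installation folder
DEST="$(mktemp -d)"     # staging folder on the MediationZone host
mkdir -p "$SRC/lib"
# Demo files standing in for the real jars:
touch "$SRC/hadoop-common-3.1.0.jar" "$SRC/lib/hadoop-auth-3.1.0.jar"
# Collect the root jar and everything under lib/ into one staging folder:
find "$SRC" -name '*.jar' -exec cp {} "$DEST" \;
ls "$DEST"
```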

  2. Set a variable called FILES listing all the different jars.

    This example shows how this is done for the Cloudera Distribution of Hadoop 4.

    FILES="-exported 3.1.0 file=hadoop-auth-3.1.0.jar \
    -exported 3.1.0 file=hadoop-common-3.1.0.jar \
    -exported 3.1.0 file=hadoop-hdfs-3.1.0.jar \
    -exported 3.1.0 file=hadoop-aws-3.1.0.jar \
    -exported 3.1.0 file=hadoop-annotations-3.1.0.jar \
    file=hadoop-hdfs-client-3.1.0.jar \
    file=stax2-api-3.1.4.jar \
    file=commons-collections-3.2.2.jar \
    file=htrace-core4-4.1.0-incubating.jar \
    file=woodstox-core-5.0.3.jar \
    file=commons-configuration2-2.1.1.jar \
    file=httpclient-4.5.2.jar \
    file=commons-logging-1.1.3.jar \
    file=protobuf-java-2.5.0.jar \
    file=guava-11.0.2.jar \
    file=re2j-1.1.jar \
    file=aws-java-sdk-bundle-1.11.271.jar"

  3. Create the mzp package:

    mzsh pcreate "Apache Hadoop" "<distribution>" apache_hadoop_cdh4.mzp -level platform -osgi true $FILES

    The following example shows how this could look for the Cloudera Distribution of Hadoop 4.

    mzsh pcreate "Apache Hadoop" "4.4" apache_hadoop_cdh4.mzp -level platform -osgi true $FILES

  4. Commit the new package.

  5. Restart the Platform and the ECs.
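Steps 4 and 5 could be sketched as follows. The pcommit and restart subcommands, and the EC name ec1, are assumptions based on common mzsh usage; verify them against the command reference for your MediationZone version:

```shell
# Sketch of steps 4-5 (subcommand names and EC name are assumptions):
mzsh pcommit apache_hadoop_cdh4.mzp   # commit the new package
mzsh restart platform                 # restart the Platform
mzsh restart ec1                      # restart each EC (ec1 is illustrative)
```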

Hadoop with Kerberos

Kerberos Use

It is possible to use manually created Kerberos tickets, obtained with the kinit command; the UserGroupInformation class can access them from the ticket cache. In this case, however, the tickets cannot be auto-renewed.

We do not recommend creating tickets manually; instead, allow the system to handle user logins and ticket renewal.
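Should you nevertheless need to create a ticket manually, the standard Kerberos commands look like this. The principal name is illustrative, and the commands require a reachable KDC and a valid krb5 configuration:

```shell
# Manually obtaining a Kerberos ticket (not recommended for production,
# since such tickets are not auto-renewed). The principal is illustrative.
kinit mzadmin@EXAMPLE.COM   # prompts for the principal's password
klist                       # inspect the ticket cache the agents will read
```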