File System (3.0)
The File System Profile is used for making file system specific configurations, currently used by:

The configurations will vary depending on the selected file system, and each file system is described separately below.

The External Reference button is specific to the File System profile configurations.

External References
Select this menu item to enable the use of External References in the File System profile configuration. This can be used to configure the following fields:
Amazon S3 file systems
GCP Storage file systems
HDFS file systems

When you select Amazon S3 as the file system, you will see two tabs: General and Advanced.

The following settings are available in the General tab in the Amazon S3 File System profile:

File System Type

Access Key
Enter the access key for the user who owns the Amazon S3 account in this field. If you want to set a parameter, select the Parameterized checkbox and enter the parameter name using ${} syntax. See Profiles (3.0) for more information on how parameterization works. In this mode, the regular access key field is disabled.

Secret Key
Enter the secret key for the stated access key in this field. If you want to set a parameter, select the Parameterized checkbox and enter the parameter name using ${} syntax. See Profiles (3.0) for more information on how parameterization works. In this mode, the regular secret key field is disabled.

Bucket
Enter the name of the Amazon S3 bucket in this field.

In the Advanced tab, you can configure properties for the Amazon S3 File System client. For information on how to configure the properties for the Amazon S3 File System client, refer to https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl.

When you select GCP Storage as the file system, you will see the General tab.

GCP Storage File System - General Tab

The following settings are available when you have selected Use Json File as the Input Option in the GCP Profile.
GCP Profile - Use Json File configuration

Environment-Provided Service Account
When the system is deployed in the GCP environment, such as in Compute Engine, you can enable this option to allow it to retrieve the Service Account credentials provided by the environment.

Input Option
Allows you to select the method for connecting to the GCP service. For Use JSON File, you need to create the GCP Service Account Key as a JSON file and download it to the Platform and EC servers.

Credentials File
The location of the GCP Service Account JSON file containing the credential keys.

Note! The JSON file option is not recommended for production deployments. It is meant to make it easier for the workflow designer to test the GCP Profile during development.

The following settings are available when you have selected Form as the Input Option in the GCP Profile.

GCP Profile - Form configuration

Environment-Provided Service Account
When the system is deployed in the GCP environment, such as in Compute Engine, you can enable this option to allow it to retrieve the Service Account credentials provided by the environment.

Project Id
The GCP Project Id that hosts the GCP service to be accessed.

Location

Bucket

Use GCP Profile
Select the checkbox and then choose an existing GCP Profile if the Authentication Details should be derived from a GCP Profile instead of adding them directly in this profile.

When you select HDFS as the file system, you will see two tabs: General and Advanced.

The following settings are available in the General tab in the HDFS File System profile:

File System Type

Hadoop Mode
Select the type of Hadoop from the drop-down box:

Host
Enter the IP address or hostname of the NameNode in this field. See the Apache Hadoop Project documentation for further information about the NameNode.

Port
Enter the port number of the NameNode in this field.

The Advanced tab contains Advanced Properties for the configuration of Kerberos authentication.

Kerberos is an authentication technology that uses a trusted third party to authenticate one service or user to another.
Within Kerberos, this trusted third party is commonly referred to as the Key Distribution Center, or KDC. For HDFS, this means that the HDFS agent authenticates with the KDC using a user principal, which must be predefined in the KDC. The HDFS cluster must be set up to use Kerberos, and the KDC must contain service principals for the HDFS NameNodes. For information on how to set up an HDFS cluster with Kerberos, see the Hadoop Users Guide at http://www.hadoop.apache.org.

In order to perform authentication towards the KDC without a password, the HDFS agent requires a keytab file.

You can set the advanced properties in the Advanced Properties dialog to activate and configure Kerberos authentication.

The following advanced properties are related to Kerberos authentication. Refer to the Advanced Properties dialog for examples.

Note! Due to limitations in the Apache Hadoop client libraries, if you change the hadoop.security.authentication property, you may be required to restart the ECs where workflows containing the HDFS agent are going to run.

The following properties are also included in the Advanced tab, but only apply if you have selected the HA version of Hadoop in the General tab:

Note! If you are using Kerberos authentication, it is recommended that you only run the HDFS agents toward one HDFS cluster per EC. This is because the Kerberos client library of HDFS relies on static properties and configurations that are global for the whole JVM. This means that one workflow running the HDFS agents could impact another workflow running the HDFS agents within the same EC process. Due to this limitation, you must also restart the EC for some configuration changes to the Advanced Properties.

Create a properties file containing the advanced configurations.

Example - Properties file with advanced configurations

Note! All "=" characters in the advanced property values need to be escaped as "\=".

Configuration
Menus
Item Description
Amazon S3
General Tab
Setting Description

File System Type
Select which file system type this profile should be applied for. You can choose Amazon S3, GCP Storage, or HDFS.

Credentials from Environment
Select this check box to pick up the credentials from the environment instead of entering them in this profile. If this check box is selected, the Access Key and Secret Key fields will be disabled.

Region from Environment
Select this check box to pick up the region from the environment instead of entering the region in this profile. If this check box is selected, the Region field will be disabled.

Region
Enter the name of the Amazon S3 region in this field.

Advanced Tab
GCP Storage
Json File
Setting Description

Environment-Provided Service Account

Input Option

Credentials File

Form
Setting Description

Environment-Provided Service Account

Import Credentials from File
Click this button to import credentials from a GCP Service Account JSON file containing the credential keys. The credentials will then be populated in the fields listed below.

Input Option
Allows you to select the method for connecting to the GCP service. For Form, the GCP Profile takes the role of the Service Account Key file. It parses all the credentials in order to connect to the GCP service.

Project Id

Private Key Id
The Private Key Id to be used for the service account.

Private Key
The full content of the private key.

Client Email
The email address given to the service account.

Client Id
The ID for the service account client.

Other Information
The Auth URI, Token URI, and information about the certificates are to be added in this field.

Field Description

Bucket
Enter the name of the GCP Storage bucket in this field.

Use GCP Profile

HDFS
General Tab
Field Description

File System Type
Select which file system type this profile should be applied for. You can choose Amazon S3, GCP Storage, or HDFS.

Hadoop Mode

Replication
Enter the replication factor that HDFS should use. Replication is used for fault tolerance. More information regarding replication can be found at: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

Advanced Tab
Property Description
hadoop.security.authentication
Set the value to kerberos to activate Kerberos authentication.
dfs.namenode.kerberos.principal
This sets the service principal to use for the HDFS NameNode. This must be predefined in the KDC. The service principal is expected to be in the form of nn/<host>@<REALM> where <host> is the host where the service is running and <REALM> is the name (in uppercase) of the Kerberos realm.
java.security.krb5.kdc
This specifies the hostname of the Key Distribution Center.
java.security.krb5.realm
This sets the name of the Kerberos realm. Uppercase only.
dr.kerberos.client.keytabfile
This sets the keytab file to use for authentication. A keytab must be predefined using Kerberos tools. The keytab must be generated for the user principal in dr.kerberos.client.principal. This file path must be on a file system that can be reached from the EC process. The user that launches the EC must also have read permissions for this file.
dr.kerberos.client.principal
This sets the user principal that the HDFS agent authenticates as. This must be predefined in the KDC. User principals are expected to be in the form of <user>@<REALM> where <user> is typically a username and <REALM> is the name (in uppercase) of the Kerberos realm.
sun.security.krb5.debug
Set this value to true to activate debug output for Kerberos.
Property Description
fs.defaultFS
This sets the HDFS filesystem path prefix.
dfs.nameservices
This sets the logical name for the name services.
dfs.ha.namenodes.<nameservice ID>
This sets the unique identifiers for each NameNode in the name service.
dfs.namenode.rpc-address.<nameservice ID>.<name node ID>
This sets the fully-qualified RPC address for each NameNode to listen on.
dfs.namenode.http-address.<nameservice ID>.<name node ID>
This sets the fully-qualified HTTP address for each NameNode to listen on.
dfs.client.failover.proxy.provider.<nameservice ID>
This sets the Java class that HDFS clients use to contact the Active NameNode.
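To illustrate how the HA properties above fit together, here is a minimal Python sketch that assembles them for one name service. The function name ha_properties, the name service ID mycluster, the NameNode IDs nn1 and nn2, and the host names and ports are all hypothetical example values. The hdfs://<nameservice> prefix and the ConfiguredFailoverProxyProvider class follow common Apache Hadoop HA setups and should be checked against your own cluster configuration.

```python
def ha_properties(nameservice, namenodes):
    """Build the HA-related advanced properties for one name service.

    namenodes maps each NameNode ID to a (rpc_address, http_address) pair.
    All values passed in are assumptions supplied by the caller.
    """
    props = {
        # Path prefix that refers to the logical name service, not a single host.
        "fs.defaultFS": f"hdfs://{nameservice}",
        "dfs.nameservices": nameservice,
        # Comma-separated list of the NameNode IDs in this name service.
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
    }
    for nn_id, (rpc_address, http_address) in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = rpc_address
        props[f"dfs.namenode.http-address.{nameservice}.{nn_id}"] = http_address
    # Class commonly used by HDFS clients to locate the Active NameNode.
    props[f"dfs.client.failover.proxy.provider.{nameservice}"] = (
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    )
    return props


# Hypothetical example cluster with two NameNodes.
props = ha_properties("mycluster", {
    "nn1": ("namenode1.example.com:8020", "namenode1.example.com:9870"),
    "nn2": ("namenode2.example.com:8020", "namenode2.example.com:9870"),
})
for name, value in props.items():
    print(f"{name}={value}")
```

Each printed line corresponds to one property and value pair that could be entered in the Advanced tab when the HA version of Hadoop is selected.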
The Advanced Properties can also be configured using External References by following these steps:
ADV_PROP=hadoop.security.authentication\=kerberos\n\
java.security.krb5.kdc\=kdc.example.com\n\
dr.kerberos.client.principal\=mzadmin@EXAMPLE.COM\n\
dr.kerberos.client.keytabfile\=/home/mzadmin/keytabs/ex.keytab
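The escaping used in the properties file above can be error-prone to write by hand. The following Python sketch shows one way to generate such an entry from a set of advanced properties. The helper name build_adv_prop is hypothetical, and the format is inferred only from the example above: every "=" inside the value is written as "\=", and the properties are separated by a literal "\n" followed by a line-continuation backslash.

```python
def build_adv_prop(properties, key="ADV_PROP"):
    """Render advanced properties as a single external reference entry.

    The layout mirrors the example file: each "name=value" pair has its
    equals sign escaped, and pairs are joined by backslash-n plus a
    trailing backslash that continues the entry on the next line.
    """
    entries = []
    for name, value in properties.items():
        pair = f"{name}={value}"
        # Escape every "=" that belongs to the value of the entry.
        entries.append(pair.replace("=", "\\="))
    # Separator: backslash, "n", backslash, then a real line break.
    return f"{key}=" + "\\n\\\n".join(entries)


# Values taken from the example properties file.
kerberos_props = {
    "hadoop.security.authentication": "kerberos",
    "java.security.krb5.kdc": "kdc.example.com",
    "dr.kerberos.client.principal": "mzadmin@EXAMPLE.COM",
    "dr.kerberos.client.keytabfile": "/home/mzadmin/keytabs/ex.keytab",
}
line = build_adv_prop(kerberos_props)
print(line)
```

The "=" that separates ADV_PROP from its value is left unescaped, as in the example file. Only the equals signs inside the value are escaped.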