Data Hub Functions

Data Hub allows MediationZone to integrate with Big Data solutions by letting an agent forward large amounts of data for storage in data lakes or other data stores. Data Hub uses Cloudera Data Platform (CDP) for storage access, together with HDFS and Apache Impala.

Any data manipulation, such as enrichment, formatting, normalization, and correlation, can be done before the data is sent to the big data storage. The stored data can then be accessed and searched using the Data Hub web UI.

The Data Hub task agent removes old records by way of partitions in Impala. Removal is based on the date and time set in the task agent, which is compared with the values stored in the table partitions. To date, all testing has been done on CDP version 7.1.7.
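As an illustration of partition-based cleanup, the sketch below builds an Impala statement that drops every partition older than a cutoff date. It is not the task agent's actual implementation. The table name `cdr_db.usage_records`, the partition column `event_date`, and its `YYYY-MM-DD` string format are assumptions made for this example only.

```python
from datetime import datetime, timedelta


def build_drop_partition_sql(table, partition_column, cutoff):
    """Return an Impala statement that drops partitions older than `cutoff`."""
    # Assumption: the partition column stores dates as 'YYYY-MM-DD' strings,
    # so a plain string comparison orders partitions chronologically.
    cutoff_value = cutoff.strftime("%Y-%m-%d")
    return (
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION ({partition_column} < '{cutoff_value}')"
    )


# Hypothetical retention rule: keep the last 30 days of data.
reference_time = datetime(2024, 3, 31, 12, 0, 0)
statement = build_drop_partition_sql(
    "cdr_db.usage_records", "event_date", reference_time - timedelta(days=30)
)
print(statement)
```

Dropping whole partitions avoids row-level deletes, which is why the cutoff has to be expressed in terms of the partition column rather than an arbitrary field.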

Overview of Data Hub solution

Data Hub Profile

The Data Hub profile contains the Impala connection details, database selection, and HDFS details, as well as advanced configuration for LDAP and Kerberos.


Example of Data Hub profile

Data Hub currently supports connecting to Cloudera with LDAP, Kerberos, or both. The table below shows which authentication methods are supported for Impala and HDFS in the Cloudera framework.



          LDAP        Kerberos    No Authentication
Impala    Supported   Supported   Supported
HDFS      -           Supported   Supported

UDRs are mapped to the designated Impala table once the table name is selected in the Database section of the Impala tab. The mapping can be done automatically if the field names in the Ultra decoder match the column names in the Impala table.
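The name-based matching described above can be sketched roughly as follows. This is an illustration only, not the profile's actual logic. The function name `auto_map_fields` and the sample field and column names are hypothetical. Matching is done case-insensitively here, which is an assumption based on Impala treating column names as case-insensitive.

```python
def auto_map_fields(udr_fields, table_columns):
    """Pair each UDR field with the table column of the same name.

    Returns the mapping plus the list of fields that found no column
    and would therefore need to be mapped manually.
    """
    # Assumption: names are compared case-insensitively.
    columns_by_key = {column.lower(): column for column in table_columns}
    mapping = {}
    unmapped = []
    for field in udr_fields:
        column = columns_by_key.get(field.lower())
        if column is None:
            unmapped.append(field)
        else:
            mapping[field] = column
    return mapping, unmapped


# Hypothetical decoder fields and Impala columns.
mapping, unmapped = auto_map_fields(
    ["imsi", "startTime", "bytesUp"],
    ["imsi", "starttime", "bytes_up"],
)
print(mapping)
print(unmapped)
```

In this sample, `bytesUp` has no identically named column, so it is reported as unmapped rather than guessed.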

Example of Data Hub Profile - Tables Mapping Tab

The Data Hub forwarding agent can send UDRs collected from any source, connecting to the Impala table to insert and commit them. A standard Data Hub forwarding workflow will likely involve a collection agent, an Analysis or Aggregation agent to perform the UDR enrichment, followed by the Data Hub forwarding agent itself to output the UDRs into Impala.

Example of Workflow with the Data Hub forwarding agent

The Data Hub forwarding agent performs the following series of tasks when initiated by the workflow. A temporary CSV file is created locally in MediationZone and, once complete, is uploaded to the HDFS staging area. The JDBC driver then calls the Impala database to load the CSV file into the designated Parquet table. Only after the contents of the file are fully committed to the table does the agent remove the temporary file. If the workflow aborts, the temporary file remains locally in MediationZone, much like the standard behavior of most forwarding agents.
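The ordering of these steps, and in particular the fact that the temporary file is removed only after a successful load, can be sketched as below. This is a simplified illustration, not the agent's code. The callables `upload_to_staging` and `load_into_table` are hypothetical stand-ins for the HDFS upload and the JDBC call to Impala, and the staging path is invented for the example.

```python
import csv
import os
import tempfile


def forward_batch(rows, upload_to_staging, load_into_table):
    """Write rows to a temporary CSV, stage it, load it, then clean up."""
    # Step 1: write the batch to a temporary CSV file on local disk.
    fd, local_path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)

    # Step 2: upload the completed file to the staging area.
    staged_path = upload_to_staging(local_path)

    # Step 3: load the staged file into the target table.
    load_into_table(staged_path)

    # Step 4: reached only if steps 2 and 3 raised no error, so the
    # local file is removed only after a successful load.
    os.remove(local_path)
    return local_path


# Stub callables standing in for the real HDFS and Impala interactions.
events = []
finished = forward_batch(
    [["imsi-1", 42]],
    lambda path: events.append("upload") or "/staging/batch.csv",
    lambda staged: events.append("load:" + staged),
)
print(events)
```

If either callable raises, the function exits before the cleanup step and the temporary file stays on disk, mirroring the abort behavior described above.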

Data Hub Forwarding Agent process

You use the Data Hub profile to select the tables that should be available for querying. In the Data Hub web UI, you can then query or export data from any of the Impala tables specified in the Data Hub profile, without any knowledge of SQL.

Example of Data Hub UI