Data Hub(4.3)

Data Hub provides the ability to store and query large amounts of data processed by the system.

Typical usage of Data Hub includes:

  • Data tracing

  • Temporary archiving

  • Analytics

  • Integration with external systems

  • Staging data for further processing 

Data Hub requires access to Cloudera Impala, which provides high-performance, low-latency SQL queries on data stored in the Hadoop Distributed File System (HDFS).

The https://infozone.atlassian.net/wiki/spaces/UEPE4D/pages/304354170 bulk loads data in CSV files to HDFS and then inserts it into a Parquet table in the Impala database specified by a https://infozone.atlassian.net/wiki/spaces/UEPE4D/pages/304403690 profile. The table data is then available for query via https://infozone.atlassian.net/wiki/spaces/UEPE4D/pages/304354338.
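As a rough illustration of the CSV-to-Parquet pattern described above, the following Python sketch builds the kind of Impala SQL statements such a load typically involves. The function name, table names, column list, and HDFS path are all hypothetical, and this is not necessarily how Data Hub itself issues its statements.

```python
def build_load_statements(staging_table, target_table, columns, hdfs_dir):
    """Return Impala SQL statements for a generic CSV-to-Parquet load.

    Every name passed in is a hypothetical example; nothing here is taken
    from the Data Hub implementation.
    """
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    column_names = ", ".join(name for name, _ in columns)
    return [
        # Expose the CSV files already uploaded to HDFS as a text table.
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {staging_table} ({column_defs}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'",
        # Make sure the Parquet-backed target table exists.
        f"CREATE TABLE IF NOT EXISTS {target_table} ({column_defs}) "
        f"STORED AS PARQUET",
        # Copy the rows from the CSV staging table into the Parquet table.
        f"INSERT INTO {target_table} ({column_names}) "
        f"SELECT {column_names} FROM {staging_table}",
    ]


if __name__ == "__main__":
    for statement in build_load_statements(
        "usage_staging",                            # hypothetical staging table
        "usage_records",                            # hypothetical Parquet table
        [("id", "BIGINT"), ("payload", "STRING")],  # hypothetical columns
        "/tmp/datahub/staging",                     # hypothetical HDFS directory
    ):
        print(statement + ";")
```

The sketch only builds the statement strings. Running them would require an Impala client and a reachable cluster, which are outside its scope.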

In a production environment, it is recommended that the size of each collected file is between 1 and 100 MB. Although it is possible to collect and process smaller batches, the overhead of handling a large number of files will have a significant impact on performance.
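The size recommendation above could be checked before files are handed over for collection. The sketch below is a minimal illustration only. The helper names are hypothetical, and it assumes that "MB" means 1024 * 1024 bytes, which the recommendation does not specify.

```python
# Assumed bounds: 1 MB and 100 MB, interpreted as binary megabytes.
MIN_BYTES = 1 * 1024 * 1024
MAX_BYTES = 100 * 1024 * 1024


def classify_size(size_bytes):
    """Classify one file size against the recommended 1 to 100 MB range."""
    if size_bytes < MIN_BYTES:
        return "too_small"
    if size_bytes > MAX_BYTES:
        return "too_large"
    return "ok"


def summarize_sizes(sizes):
    """Count how many file sizes fall below, inside, and above the range."""
    summary = {"too_small": 0, "ok": 0, "too_large": 0}
    for size in sizes:
        summary[classify_size(size)] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical file sizes in bytes: 10 KB, 5 MB, and 250 MB.
    print(summarize_sizes([10 * 1024, 5 * 1024 * 1024, 250 * 1024 * 1024]))
```

A large "too_small" count would suggest merging files upstream before collection, since many small files are where the overhead described above comes from.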

You may remove old data from the Impala database with the https://infozone.atlassian.net/wiki/spaces/UEPE4D/pages/304354312.

Prerequisites

The reader of this document should be familiar with:

 
