Data Hub provides the ability to store and query large amounts of data processed by the system.

Typical usage of Data Hub includes:

  • Data tracing
  • Temporary archiving
  • Analytics
  • Integration with external systems
  • Staging data for further processing 

Data Hub requires access to Cloudera Impala, which provides high-performance, low-latency SQL queries on data stored in a Hadoop Distributed File System (HDFS).

The Data Hub Forwarding Agent bulk loads data as CSV files into HDFS and then inserts it into a Parquet table in the Impala database specified by a Data Hub Profile. The table data is then available for querying via Data Hub Query.
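The two-stage load described above (CSV files in HDFS first, then rows in a Parquet table) can be pictured with a minimal sketch of the Impala SQL involved. This only illustrates the general pattern and is not necessarily what the Forwarding Agent executes internally. The helper name, the HDFS path, and both table names are hypothetical. The sketch also assumes that a text-format staging table matching the CSV layout already exists.

```python
from typing import List


def build_load_statements(hdfs_csv_path: str,
                          staging_table: str,
                          parquet_table: str) -> List[str]:
    """Return Impala SQL for a staged CSV-to-Parquet load (illustrative only)."""
    return [
        # Move CSV files that are already in HDFS into the staging table.
        f"LOAD DATA INPATH '{hdfs_csv_path}' INTO TABLE {staging_table}",
        # Copy the staged rows into the Parquet table; the rows are written
        # in the destination table's file format.
        f"INSERT INTO {parquet_table} SELECT * FROM {staging_table}",
    ]


# Hypothetical path and table names, for illustration only.
for statement in build_load_statements("/datahub/incoming/batch_0001",
                                       "datahub_staging",
                                       "datahub_events"):
    print(statement)
```

Splitting the load this way keeps the upload step a plain file transfer and leaves the conversion to the columnar Parquet format to Impala.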

In a production environment, it is recommended that collected files be between 1 and 100 MB in size. Although it is possible to collect and process smaller batches, the overhead of handling a large number of files has a significant impact on performance.
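As a rough illustration of the size recommendation, the following sketch classifies a file by its size in bytes. The function name and the returned labels are hypothetical, and the sketch assumes 1 MB means 1024 * 1024 bytes, which the recommendation itself does not specify.

```python
MB = 1024 * 1024          # assumption: binary megabytes
MIN_BYTES = 1 * MB        # lower bound of the recommended range
MAX_BYTES = 100 * MB      # upper bound of the recommended range


def classify_file_size(size_bytes: int) -> str:
    """Classify a file size against the recommended 1 to 100 MB range."""
    if size_bytes < MIN_BYTES:
        return "too small"
    if size_bytes > MAX_BYTES:
        return "too large"
    return "ok"


# Example: a 10 KB file falls below the recommended range.
print(classify_file_size(10 * 1024))
```

A check like this could be run on collected files before forwarding, for example with `os.path.getsize(path)` as the input, to spot batches that are likely to add per-file overhead.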

You may remove old data from the Impala database with the Data Hub Task Agent.
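Because the Task Agent removes old data by dropping partitions, a cleanup of this kind can be sketched as a single Impala statement built from a retention period. This is an illustration under stated assumptions, not the Task Agent's actual implementation. The function name, the table name, and the partition column `event_date` are hypothetical. The sketch assumes the partition column holds ISO-formatted date strings, and that the Impala version in use accepts comparison operators in the `PARTITION` clause.

```python
from datetime import date, timedelta


def build_drop_old_partitions(table: str,
                              partition_column: str,
                              retention_days: int,
                              today: date) -> str:
    """Return Impala SQL that drops partitions older than the retention period."""
    # Partitions strictly older than the cutoff date are removed.
    cutoff = today - timedelta(days=retention_days)
    return (
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION ({partition_column} < '{cutoff.isoformat()}')"
    )


# Hypothetical table and column names, for illustration only.
print(build_drop_old_partitions("datahub_events", "event_date", 30, date(2024, 3, 31)))
```

Dropping whole partitions avoids row-by-row deletes, which is why retention is usually aligned with the table's partitioning scheme.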

Prerequisites

The reader of this document should be familiar with:

The Data Hub forwarding agent is a batch agent that bulk loads data to an Impala database specified by a Data Hub profile. The Data Hub task agent is used to automatically remove old partitions from the database.
