1. Data Hub Overview
Data Hub provides the ability to store and query large amounts of processed data.
Typical usage of Data Hub includes:
- Data tracing
- Temporary archiving
- Analytics
- Integration with external systems
- Staging data for further processing
Data Hub requires access to Cloudera Impala, which provides high-performance, low-latency SQL queries over data stored in the Hadoop Distributed File System (HDFS).
The Data Hub agent bulk loads data from CSV files into HDFS and then inserts it into a Parquet table in the Impala database. The table data is then available for query through the Web UI.
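The load cycle described above can be sketched as two Impala SQL statements: one that attaches a CSV file already uploaded to HDFS to a text-format staging table, and one that rewrites the staged rows into the Parquet table. The table names, HDFS path, and helper function below are illustrative assumptions, not the product's actual identifiers.

```python
def build_load_statements(csv_hdfs_path: str, staging_table: str,
                          parquet_table: str) -> list[str]:
    """Return the Impala SQL statements for one CSV-to-Parquet load cycle.

    All names are hypothetical; the real Data Hub agent manages its own
    table and path naming.
    """
    return [
        # Attach the uploaded CSV file to the text-format staging table.
        f"LOAD DATA INPATH '{csv_hdfs_path}' INTO TABLE {staging_table}",
        # Rewrite the staged rows into the columnar Parquet table,
        # which is what the Web UI queries.
        f"INSERT INTO {parquet_table} SELECT * FROM {staging_table}",
    ]

for stmt in build_load_statements("/staging/batch_0001.csv",
                                  "datahub_staging", "datahub_events"):
    print(stmt)
```

Staging through a text table and converting with `INSERT ... SELECT` is the standard Impala pattern for producing Parquet data, since Impala cannot load CSV directly into a Parquet table.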
In a production environment, it is recommended that collected files range between 1 MB and 100 MB in size. Although it is possible to collect and process smaller batches, the overhead of handling a large number of small files will significantly degrade performance.
You can remove old data from the Impala database with the Data Hub task agent.
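As a minimal sketch of how such a cleanup might be expressed, the helper below builds an Impala `ALTER TABLE ... DROP PARTITION` statement, assuming the Parquet table is partitioned by a load-date column. The table name, partition column, and retention policy are illustrative assumptions; the actual task agent's behavior is configured in the product, not written by hand.

```python
from datetime import date, timedelta

def build_retention_statement(table: str, retention_days: int,
                              today: date) -> str:
    """Return an Impala statement dropping partitions older than the cutoff.

    Assumes the table is partitioned by a `load_date` string column in
    ISO format (an illustrative schema, not the product's actual one).
    """
    cutoff = today - timedelta(days=retention_days)
    # Impala drops every partition matching the comparison in one statement.
    return (f"ALTER TABLE {table} "
            f"DROP PARTITION (load_date < '{cutoff.isoformat()}')")

print(build_retention_statement("datahub_events", 90, date(2024, 4, 10)))
```

Dropping whole partitions is generally much cheaper than row-level deletes in Impala, which is why date-partitioned retention is a common design for tables that accumulate batch loads.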