In this example you will configure and run a Data Hub workflow that reads usage records in CSV format and uploads the data to Impala. The partition column is created based on a date string in the input data.
Follow the steps below to run the example on a single machine that hosts both Cloudera Manager and CDH.
- Install Cloudera Manager and CDH. For this example you may use a VirtualBox or Docker image, which you can download from www.cloudera.com.
- Open the file browser in Hue and create the staging directory, /user/cloudera/uploads. Update the permissions on the directory to make it available to the UNIX user(s) used to start the ECs.
Open the Impala query editor in Hue and create a database.
CREATE DATABASE test;
Create a table to be used with Data Hub.
CREATE TABLE IF NOT EXISTS test.usage (
    orderdate STRING,
    userid BIGINT,
    productid INT,
    description STRING,
    volume INT
)
PARTITIONED BY (partitiondate INT)
STORED AS PARQUET;
Open the Desktop and create the following Ultra configuration:
external Usage: terminated_by('\n') {
    ascii orderDate: terminated_by(',');
    ascii userId: long(base10), terminated_by(',');
    ascii productId: int(base10), terminated_by(',');
    ascii description: terminated_by(',');
    ascii volume: int(base10), terminated_by('\n');
};

internal UsageInt {
    string orderDate;
    int partitionDate;
    long userId;
    int productId;
    string description;
    int volume;
};

in_map Usage_inmap: external(Usage), internal(UsageInt) {
    i:orderDate and e:orderDate;
    i:userId and e:userId;
    i:productId and e:productId;
    i:description and e:description;
    i:volume and e:volume;
};

decoder Usage_decoder : in_map(Usage_inmap);
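To illustrate what the decoder is expected to do with each input line, here is a minimal Python sketch. It is not part of Data Hub or Ultra; it only mirrors the field order and types declared in the configuration above. The class and function names follow the Ultra names loosely, and the sample record is hypothetical, since the contents of the input file are not shown in this example.

```python
from dataclasses import dataclass


@dataclass
class UsageInt:
    """Mirrors the internal UsageInt format from the Ultra configuration."""
    order_date: str
    user_id: int
    product_id: int
    description: str
    volume: int
    partition_date: int = 0  # not in the input; filled in later by the Analysis agent


def decode_usage(line: str) -> UsageInt:
    """Split one newline-terminated record into its five comma-separated fields."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    order_date, user_id, product_id, description, volume = fields
    return UsageInt(
        order_date=order_date,
        user_id=int(user_id, 10),        # long(base10) in the external format
        product_id=int(product_id, 10),  # int(base10)
        description=description,
        volume=int(volume, 10),          # int(base10)
    )


# Hypothetical sample record (assumed layout: orderDate,userId,productId,description,volume).
record = decode_usage("201503171230,1001,42,Widget,7\n")
print(record)
```

Note that partitionDate exists only in the internal format, which is why it has no entry in the in_map and is left at a default value here.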
Create a Data Hub profile with the following settings (for the VirtualBox/Docker image):
In the Impala tab:
Host: localhost
Port: 21050
Authentication: None
In the HDFS tab:
HDFS URI: localhost
Staging Path: /user/cloudera/uploads
MZ Temp Path: <Any path in the platform server>
- Click the Refresh button in the profile and select the database test.
- Click the Tables Mapping tab and then select the internal UDR format defined above (UsageInt).
- Select usage from the Table drop-down list.
- Click the Auto Map button. The cells in the UDR Field column will be set.
Select YYYYMMDD from Date Hint in the partitionDate row.
- Download the CSV file INFILE01.csv.
- Create a batch workflow configuration consisting of a Disk collection agent, a Decoder agent, an Analysis agent, and a Data Hub agent:
- Configure the Disk collection to read the downloaded input file.
- Configure the Decoder agent to use the decoder in the Ultra configuration above (Usage_decoder).
Add the following APL code to the Analysis agent.
consume {
    date d;
    strToDate(d, input.orderDate, "yyyyMMddHHmm");
    input.partitionDate = dateGetYear(d)*10000 + dateGetMonth(d)*100 + dateGetDay(d);
    debug(input);
    udrRoute(input);
}
- Configure the Data Hub agent to use the profile that you created in the previous steps.
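The partitionDate arithmetic in the APL code for the Analysis agent can be checked outside the workflow. The following Python sketch reproduces the same calculation, assuming that dateGetMonth returns 1 to 12 and dateGetDay returns the day of the month. The function name and the sample value are illustrative only.

```python
from datetime import datetime


def partition_date(order_date: str) -> int:
    """Turn a yyyyMMddHHmm string into an integer of the form YYYYMMDD."""
    d = datetime.strptime(order_date, "%Y%m%d%H%M")
    # Same arithmetic as the APL code: year*10000 + month*100 + day.
    return d.year * 10000 + d.month * 100 + d.day


# Hypothetical order date: 17 March 2015, 12:30.
print(partition_date("201503171230"))  # 20150317
```

The result is an integer whose digits read as YYYYMMDD, which matches the INT type of the partitiondate column and the YYYYMMDD Date Hint selected in the profile.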
Run the workflow with debug enabled. When the batch is complete, you may use the Web UI to query the data.
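One way to check the upload is to compare the number of rows per partition in Impala with the number of records per day in the input file. The Python sketch below counts records per partition date in a CSV stream. It assumes the first column holds the order date in yyyyMMddHHmm form; the function name and the sample rows are hypothetical and do not reflect the actual contents of INFILE01.csv.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def expected_partition_counts(csv_file) -> Counter:
    """Count input records per partition date (an integer of the form YYYYMMDD)."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        d = datetime.strptime(row[0], "%Y%m%d%H%M")
        counts[d.year * 10000 + d.month * 100 + d.day] += 1
    return counts


# Hypothetical sample data standing in for the real input file.
sample = io.StringIO(
    "201503171230,1001,42,Widget,7\n"
    "201503172245,1002,43,Gadget,1\n"
    "201503180015,1001,42,Widget,3\n"
)
counts = expected_partition_counts(sample)
for day, n in sorted(counts.items()):
    print(day, n)
```

To run it against the real file, pass an open file object instead of the sample, then compare the counts with the result of a query such as SELECT partitiondate, COUNT(*) FROM test.usage GROUP BY partitiondate ORDER BY partitiondate;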