...
Follow the steps below to run the example on a single machine that hosts the platform, Cloudera Manager, and CDH.
- Install Cloudera Manager and CDH. For this example, you may use a VirtualBox or Docker image, which you can download from www.cloudera.com.
- Open the file browser in Hue and create the staging directory, /user/cloudera/uploads. Update the permissions on the directory so that it is accessible to the UNIX user(s) used to start the ECs.
- Open the Impala query editor in Hue and create a database.

    CREATE DATABASE test;

- Create a table to be used with Data Hub.

    CREATE TABLE IF NOT EXISTS test.usage (
        orderdate STRING,
        userid BIGINT,
        productid INT,
        description STRING,
        volume INT)
    PARTITIONED BY (partitiondate INT)
    STORED AS PARQUET;

- Open the Desktop and create the following Ultra configuration:

    external Usage: terminated_by('\n') {
        ascii orderDate: terminated_by(',');
        ascii userId: long(base10), terminated_by(',');
        ascii productId: int(base10), terminated_by(',');
        ascii description: terminated_by(',');
        ascii volume: int(base10), terminated_by('\n');
    };

    internal UsageInt {
        string orderDate;
        int partitionDate;
        long userId;
        int productId;
        string description;
        int volume;
    };

    in_map Usage_inmap: external(Usage), internal(UsageInt) {
        i:orderDate and e:orderDate;
        i:userId and e:userId;
        i:productId and e:productId;
        i:description and e:description;
        i:volume and e:volume;
    };

    decoder Usage_decoder : in_map(Usage_inmap);

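To make the record layout concrete, here is a minimal Python sketch of what the Ultra configuration above describes: one comma-separated line per record, decoded into typed fields. It is only an illustration of the format, not how the decoder is implemented. The names `decode_usage` and `SAMPLE_LINE` are hypothetical, and the sample record is invented for illustration; the actual contents of INFILE01.csv may differ.

```python
from dataclasses import dataclass


@dataclass
class UsageInt:
    """Mirrors the fields of the internal UsageInt format."""
    orderDate: str
    userId: int
    productId: int
    description: str
    volume: int
    # Not present in the input file; it has no mapping in Usage_inmap
    # and is filled in later by the Analysis agent.
    partitionDate: int = 0


def decode_usage(line: str) -> UsageInt:
    """Split one newline-terminated, comma-separated record into typed fields."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    order_date, user_id, product_id, description, volume = fields
    return UsageInt(
        orderDate=order_date,
        userId=int(user_id, 10),        # long(base10)
        productId=int(product_id, 10),  # int(base10)
        description=description,
        volume=int(volume, 10),         # int(base10)
    )


# Hypothetical sample record (assumption, not taken from INFILE01.csv).
SAMPLE_LINE = "201701151230,1001,42,Data package,500\n"
record = decode_usage(SAMPLE_LINE)
print(record)
```

Because every field except the last is terminated by a comma, a description that itself contains a comma would shift the remaining fields, so the input file should avoid commas inside values.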
- Create a Data Hub profile with the following settings (for the VirtualBox/Docker image):
  - In the Impala tab:
    - Host: localhost
    - Port: 21050
    - Authentication: None
  - In the HDFS tab:
    - HDFS URI: localhost
    - Staging Path: /user/cloudera/uploads
    - MZ Temp Path: <Any path in the platform server>
- Click the Refresh button in the profile and select the database test.
- Click the Tables Mapping tab and then select the internal UDR format defined above (UsageInt).
- Select usage from the Table drop-down list.
- Click the Auto Map button. The cells in the UDR Field column are then populated.
- Select YYYYMMDD from Date Hint in the partitionDate row.
- Download the CSV file INFILE01.csv.
- Create the following batch workflow configuration:
- Configure the Disk collection to read the downloaded input file.
- Configure the Decoder agent to use the decoder in the Ultra configuration above (Usage_decoder).
- Add the following APL code to the Analysis agent.

    consume {
        date d;
        // Parse the order timestamp from the decoded record.
        strToDate(d, input.orderDate, "yyyyMMddHHmm");
        // Build the partition key as an integer on the form YYYYMMDD.
        input.partitionDate = dateGetYear(d)*10000 + dateGetMonth(d)*100 + dateGetDay(d);
        debug(input);
        udrRoute(input);
    }

- Configure the Data Hub agent to use the profile that you created in the previous steps.
Run the workflow with debug enabled. When the batch is complete, you may use the Web UI to query the data.
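
When checking the query results, it helps to know which partitiondate value to expect for a given orderdate. The following Python sketch reproduces the arithmetic from the Analysis agent code outside the workflow. It assumes that the pattern "yyyyMMddHHmm" used in the APL code corresponds to Python's "%Y%m%d%H%M"; the function name `partition_date` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def partition_date(order_date: str) -> int:
    """Return the YYYYMMDD integer for an order date given as yyyyMMddHHmm."""
    parsed = datetime.strptime(order_date, "%Y%m%d%H%M")
    # Same arithmetic as the Analysis agent: year*10000 + month*100 + day.
    return parsed.year * 10000 + parsed.month * 100 + parsed.day


# Hypothetical sample timestamp: 15 January 2017, 12:30.
print(partition_date("201701151230"))  # 20170115
```

The result is a plain integer, which matches the INT type of the partitiondate column and the YYYYMMDD Date Hint selected in the profile.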