6. Data Hub Example

In this example you will configure and run a Data Hub workflow that reads usage records in CSV format and uploads the data to Impala. The partition column is created based on a date string in the input data.

Follow the steps below to run the example on a single machine that hosts both Cloudera Manager and CDH.

  1. Install Cloudera Manager and CDH. For this example you may use a VirtualBox or Docker image, which you can download from www.cloudera.com.
  2. Open the file browser in Hue and create the staging directory, /user/cloudera/uploads. Update the permissions on the directory to make it available to the UNIX user(s) used to start the ECs.
  3. Open the Impala query editor in Hue and create a database.

    CREATE DATABASE test;
  4. Create a table to be used with Data Hub.

    CREATE TABLE IF NOT EXISTS test.usage (
        orderdate STRING, userid BIGINT, productid INT, description STRING, volume INT)
    PARTITIONED BY (partitiondate INT)
    STORED AS PARQUET;
  5. Open the Desktop and create the following Ultra configuration:

    external Usage: terminated_by('\n') {
        ascii orderDate: terminated_by(',');
        ascii userId: long(base10), terminated_by(',');
        ascii productId: int(base10), terminated_by(',');
        ascii description: terminated_by(',');
        ascii volume: int(base10), terminated_by('\n');
    };
    
    internal UsageInt {
        string orderDate;
        int partitionDate;
        long userId;
        int productId;
        string description;
        int volume;
    };
    
    in_map Usage_inmap: external(Usage), internal(UsageInt) {          
            i:orderDate and e:orderDate;
            i:userId and e:userId;
            i:productId and e:productId;
            i:description and e:description;
            i:volume and e:volume;
    };
    
    decoder Usage_decoder : in_map(Usage_inmap);
  6. Create a Data Hub profile with the following settings (for the VirtualBox/Docker image):

    In the Impala tab:

    Host: localhost
    Port: 21050
    Authentication: None


    In the HDFS tab:

    HDFS URI: localhost
    Staging Path: /user/cloudera/uploads
    MZ Temp Path: <Any path in the platform server>

     

  7. Click the Refresh button in the profile and select the database test.
  8. Click the Tables Mapping tab and then select the internal UDR format defined above (UsageInt).
  9. Select usage from the Table drop-down list.
  10. Click the Auto Map button. The cells in the UDR Field column will be set. 
  11. Select YYYYMMDD from Date Hint in the partitionDate row.

  12. Download the CSV file INFILE01.csv.
     
  13. Create a batch workflow configuration that connects the following agents in sequence: Disk collection, Decoder, Analysis, and Data Hub.
  14. Configure the Disk collection agent to read the downloaded input file.
  15. Configure the Decoder agent to use the decoder in the Ultra configuration above (Usage_decoder). 
     
  16. Add the following APL code to the Analysis agent.

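    consume {
        // Parse the order date string into a date value.
        date d;
        strToDate(d, input.orderDate, "yyyyMMddHHmm");
        // Build the integer partition key in YYYYMMDD form.
        input.partitionDate = dateGetYear(d)*10000 + dateGetMonth(d)*100 + dateGetDay(d);
        debug(input);
        udrRoute(input);
    }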
  17. Configure the Data Hub agent to use the profile that you created in the previous steps.
     
  18. Run the workflow with debug enabled. When the batch is complete, you may use the Web UI to query the data.
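
To check the expected partition values before running the workflow, the decoding and partition date calculation from the steps above can be reproduced outside the product. The following Python sketch is an illustration only and is not part of Data Hub. It assumes that each input line holds the five comma-separated fields of the Usage format and that orderDate uses the yyyyMMddHHmm pattern from the APL code. The function names and the sample record are hypothetical.

```python
from datetime import datetime

# Field order taken from the external Usage format in the Ultra configuration.
FIELDS = ("orderDate", "userId", "productId", "description", "volume")


def decode_usage(line):
    """Split one comma-separated record into named fields (plain split, no quoting)."""
    parts = line.rstrip("\n").split(",")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    # Numeric fields are base-10 integers, as declared in the Ultra format.
    for name in ("userId", "productId", "volume"):
        record[name] = int(record[name])
    return record


def partition_date(order_date):
    """Return the YYYYMMDD integer computed the same way as in the APL code."""
    d = datetime.strptime(order_date, "%Y%m%d%H%M")
    return d.year * 10000 + d.month * 100 + d.day


def process(line):
    """Decode a record and add the partitionDate field."""
    record = decode_usage(line)
    record["partitionDate"] = partition_date(record["orderDate"])
    return record


if __name__ == "__main__":
    # Hypothetical sample record; the real content of INFILE01.csv may differ.
    print(process("201703151230,1001,42,Data package,500\n"))
```

For the hypothetical sample record, partitionDate is 20170315, which is the value that would be written to the partitiondate column of test.usage for that row.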