Data Hub Example

In this example you will configure and run a Data Hub workflow that reads usage records in CSV format and uploads the data to Impala. The partition column is created based on a date string in the input data.

Follow these steps below to run the example on one machine that hosts both , Cloudera Manager, and CDP.

  1. Install Cloudera Manager and CDP. For this example you may use the Cloudera Public Cloud or Private Cloud Trial Installation, which you may Download from https://www.cloudera.com/
  2. Open the file browser in Hue and create the staging directory, /user/cloudera/uploads. Update the permissions on the directory to make it available to the UNIX user(s) that is used to start the ECs.
  3. Open the Impala query editor in Hue and create a database.

    CREATE DATABASE test;
  4. Create a table to be used with Data Hub.

    CREATE TABLE IF NOT EXISTS test.usage (
    orderdate STRING, userid BIGINT, productid INT, description STRING, volume INT) PARTITIONED BY 
    (partitiondate INT) STORED AS PARQUET
    TBLPROPERTIES ('transactional'='false');
  5. Open the Desktop and create the following Ultra configuration:

    external Usage: terminated_by('\n') {
        ascii orderDate: terminated_by(',');
    	ascii userId: long(base10), terminated_by(',');	
        ascii productId: int(base10), terminated_by(',');
        ascii description: terminated_by(',');	
    	ascii volume: int(base10), terminated_by('\n');
    };
    
    internal UsageInt {
        string orderDate;
        int partitionDate;
    	long userId;
    	int  productId;
    	string description;
    	int volume;
    };
    
    in_map Usage_inmap: external(Usage), internal(UsageInt) {          
            i:orderDate and e:orderDate;
            i:userId and e:userId;
            i:productId and e:productId;
            i:description and e:description;
            i:volume and e:volume;
    };
    
    decoder Usage_decoder : in_map(Usage_inmap);
  6. Create a Data Hub profile with the following settings:

    In the Impala tab:

    Hostlocalhost
    Port21050
    Authentication: None


    In the HDFS tab:

    HDFS URI: localhost
    Staging Path: /user/cloudera/uploads
    MZ Temp Path: <Any path in the platform server>

     

  7. Click the Refresh button in the profile and select the database test.
  8. Click the Tables Mapping tab and then select the internal UDR format defined above (UsageInt).
  9. Select usage from the Table drop-down list
  10. Click the Auto Map button. The cells in the UDR Field column will be set. 
  11. Select YYYYMMDD from Date Hint in the partitionDate row.

  12. Download the CSV file INFILE01.csv .
     
  13. Create the following batch workflow configuration:
  14. Configure the Disk collection to read the downloaded input file.
  15. Configure the Decoder agent to use the decoder in the Ultra configuration above (Usage_decoder). 
     
  16. Add the following APL code to the Analysis agent.

    consume {
        date d;
        strToDate(d, input.orderDate,"yyyyMMddHHmm");  
        input.partitionDate = dateGetYear(d)*10000 + dateGetMonth(d)*100 + dateGetDay(d);   
        debug(input);
        udrRoute(input);
    }
  17. Configure the Data Hub agent to use the profile that you created in the previous steps.
     
  18. Run the workflow with debug enabled. When the batch is complete, you may use the Data Hub Query to query the data.