Handling Erroneous And Duplicated Records

Data collected from various sources may contain duplicate or erroneous records, which you will want to remove before processing the final data. This stream uses the electricity usage example to show how to clean up such records with the Deduplicate and Validate functions before the data is sent for billing.

 

Functions used in this stream and their purpose:

  • Count - Counter that triggers the stream to run a defined number of times.

  • Simulate Data (Script) - Simulates sample electricity usage data. This step substitutes real data input.

  • Deduplicate - Filters out duplicated records.

  • Validate - Filters out erroneous records. In this example, records that do not have the kWhCharged column, or whose userTechnicalId has an invalid format, are filtered out.

  • Log - Stores data in a log. This step substitutes data being sent for billing.
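
The Deduplicate and Validate steps can be illustrated outside the product with a minimal Python sketch. It is not the stream's actual implementation. The field names kWhCharged and userTechnicalId come from the example, but the sample records, the deduplicate and validate helpers, and the "user-" plus digits ID format are assumptions made for illustration only.

```python
import re

# Assumption: a valid userTechnicalId looks like "user-" followed by digits.
# The real format is whatever the Validate function is configured to accept.
USER_ID_PATTERN = re.compile(r"^user-\d+$")


def deduplicate(records):
    """Keep the first occurrence of each record and drop exact repeats."""
    seen = set()
    unique = []
    for record in records:
        # A sorted tuple of items gives a hashable, order-independent key.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def validate(records):
    """Split records into (valid, rejected) using the two example rules."""
    valid, rejected = [], []
    for record in records:
        has_charge = "kWhCharged" in record
        user_id = record.get("userTechnicalId")
        has_valid_id = (
            isinstance(user_id, str)
            and USER_ID_PATTERN.match(user_id) is not None
        )
        if has_charge and has_valid_id:
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected


# Hypothetical sample data standing in for the simulated usage records.
records = [
    {"userTechnicalId": "user-001", "kWhCharged": 12.5},
    {"userTechnicalId": "user-001", "kWhCharged": 12.5},  # exact duplicate
    {"userTechnicalId": "user-002"},                       # kWhCharged missing
    {"userTechnicalId": "??", "kWhCharged": 3.0},          # malformed ID
]

valid, rejected = validate(deduplicate(records))
```

In this sketch, valid holds the records that would continue on to billing, while rejected holds the records that failed validation and would need correcting before they can be used.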

 

Validate Function

All records that are filtered out by the Validate function can be further processed and corrected with the help of Data Correction.
