Handling erroneous and duplicated records

Data collected from multiple sources can contain duplicate or erroneous records, which should be removed before the data is processed further. This stream uses electricity usage data to demonstrate how to clean up such records with the Deduplicate and Validate functions, so that only accurate data is sent for billing.

Functions used in this stream and their purpose:

  • Count - Counter that triggers the stream to run a defined number of times.

  • Simulate Data (Script) - Simulates sample electricity usage data. This step substitutes real data input.

  • Deduplicate - Filters out duplicated records.

  • Validate - Filters out erroneous records. In this example, records that are missing the kWhCharged column or that have an invalidly formatted userTechnicalId are filtered out.

  • Log - Stores data in a log. This step substitutes data being sent for billing.
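
The Deduplicate and Validate steps listed above can be illustrated outside the product with a minimal Python sketch. This is not how the functions are implemented. It only mirrors the behavior described here, under two assumptions: duplicates are records whose fields are all identical, and a valid userTechnicalId matches the hypothetical pattern `user-` followed by four digits (the actual format rule is not stated in this document). The function names and sample records are illustrative only.

```python
import re

# Hypothetical format rule: the real userTechnicalId format is not
# specified in this document, so this pattern is an assumption.
USER_ID_PATTERN = re.compile(r"^user-\d{4}$")


def deduplicate(records):
    """Keep the first occurrence of each record, comparing all fields."""
    seen = set()
    unique = []
    for record in records:
        # Build a hashable key from the record's field/value pairs.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def validate(records):
    """Split records into (valid, rejected) lists."""
    valid, rejected = [], []
    for record in records:
        has_usage = "kWhCharged" in record
        user_id = record.get("userTechnicalId")
        has_valid_id = (
            isinstance(user_id, str)
            and USER_ID_PATTERN.fullmatch(user_id) is not None
        )
        if has_usage and has_valid_id:
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected


# Simulated electricity usage data (stands in for the Simulate Data step).
sample = [
    {"userTechnicalId": "user-0001", "kWhCharged": 12.5},
    {"userTechnicalId": "user-0001", "kWhCharged": 12.5},  # exact duplicate
    {"userTechnicalId": "user-0002"},                      # missing kWhCharged
    {"userTechnicalId": "bad id", "kWhCharged": 3.0},      # malformed id
    {"userTechnicalId": "user-0003", "kWhCharged": 7.25},
]

valid, rejected = validate(deduplicate(sample))
```

In this sketch, `valid` holds the records that would go on to billing (the Log step in the stream), while `rejected` holds the records that would be candidates for later correction.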

Note!

Validate Function

All records that are filtered out by the Validate function can be further processed and corrected using Data Correction.