Data aggregator - Configuration tab
The Data Aggregator is a processor function, meaning it operates on data as it passes through a stream, transforming it before forwarding it to the next function. The Data Aggregator configuration contains the following sections:
Group Fields: Define the fields the Data aggregator will use to group records during the aggregation process.
Aggregation Fields: Specify the fields on which the aggregation operations (sum, count, min, max, or average) will be performed.
Flush by: Configure how and when the aggregated data should be flushed (forwarded) to the next function in the stream.
Note!
A note on TTL (Time to Live) for aggregated data sessions: Aggregated sessions are stored for a maximum of 180 days. This means that if a session is not updated for 180 days, all the stored data from that session will be permanently deleted.
Configuration field | Description |
---|---|
Group fields | |
Fields | Specify which fields will be used to group records together for aggregation. If two or more records have the same values in the fields you select, they are grouped into the same session for processing. You must add one or more fields for the data aggregator to work properly. Example - Group user records by user field: suppose Record 1 and Record 3 have the user value A, while Record 2 has the user value B. Record 1 and Record 3 are grouped into one session because they share the same user value (A), while Record 2 is placed in a separate session with its own user value (B). If you are using a COUNT operation, for example, the total for the session with user A will be 2 and the total for the session with user B will be 1. You can either type the field names manually or select them from the drop-down menu. A code sketch after this table illustrates this grouping. |
Group based on date/time | This field allows you to specify a defined period for aggregation based on date and time fields. When selected, you can add period fields to group your data by specific time intervals. You must choose a field that contains date/time information, then select a period type (Minute, Hour, Day, Month, or Year) for that field. You can add multiple fields to create more granular time-based groupings. Example - Group based on date/time: extending the user field example above, suppose you have five records with timestamps. If you group by the user field and enable Group based on date/time with a Period type of "Hour", records are placed in the same session only when they share both the same user value and the same hour. Even though there are multiple records for each user, they are split into separate sessions based on the hour of their timestamps; with five records this could yield four sessions, and a COUNT operation would show a count of 1 for each single-record session and 2 for the session holding two records from the same user and hour. This grouping allows you to aggregate data not just by user but also by specific time periods, giving you more granular control over your data analysis. |
Field | If you choose to group by date/time, specify each field and then select a time duration (Period) for it. |
Period | Select a time duration from the drop-down: Minute, Hour, Day, Month, or Year. |
Interval | Available when 'Minute' is the selected period. Choose either a 15- or 30-minute interval. |
Target field | Specify the name of the output field that represents the grouping criteria. This is the field by which the data is grouped in the output. The system adds a default target field based on the Field and Period type; you can customize this field. |
Aggregation fields | |
Field | Specify the name of the field(s) on which the aggregation operation will be performed. |
Target field | Specify the name of the output field that contains the aggregated values. |
Operation | Select the aggregation operation from the drop-down menu; it will be applied to the chosen field. The available operations (for example sum, count, min, max, or average) are grouped into three categories: Numeric, General, and Date. Click + Add field to add more fields. |
Flush by | |
Flush by | Select how and when to flush the aggregated data to the next function in the stream. The options are End of transaction, End of stream, and Timeout. See below for more information on these options and how Flush by works. |
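To make the grouping and aggregation behavior described in the table above concrete, here is a minimal Python sketch of the same logic. It is an illustration of the semantics only, not the product's implementation, and all field names (user, timestamp, sheets, sum_sheets) are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative input records; field names are hypothetical.
records = [
    {"user": "A", "timestamp": "2024-05-01T09:15:00", "sheets": 10},
    {"user": "B", "timestamp": "2024-05-01T09:40:00", "sheets": 5},
    {"user": "A", "timestamp": "2024-05-01T09:55:00", "sheets": 7},
    {"user": "A", "timestamp": "2024-05-01T10:05:00", "sheets": 3},
]

sessions = defaultdict(list)
for rec in records:
    # Group field: "user". Date/time grouping: truncate the timestamp
    # to the hour (Period type "Hour").
    hour = datetime.fromisoformat(rec["timestamp"]).strftime("%Y-%m-%d %H:00")
    sessions[(rec["user"], hour)].append(rec)

for (user, hour), recs in sessions.items():
    # Aggregation fields: COUNT of records and SUM of "sheets",
    # written to illustrative target fields.
    print({"user": user, "period_hour": hour,
           "count": len(recs),
           "sum_sheets": sum(r["sheets"] for r in recs)})
```

Running this produces one session per user-and-hour combination, matching the grouping described above: user A gets two sessions (hours 09 and 10, with counts 2 and 1) and user B gets one.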
Flush by
The Data Aggregator collects and processes data internally until it is "flushed." Flushing means the data is finalized and sent on for further processing. For details, see https://infozone.atlassian.net/wiki/x/y4JLDg. There are three options in the Data Aggregator configuration for how and when the data can be flushed.
Flush by ‘End of transaction’
This option flushes the aggregated data once a transaction is completed or a condition is met, even if the overall stream is still running, which allows for more frequent, smaller data outputs. If the data comes from multiple files, each file's data is handled as an individual transaction within the stream. After each file's data is processed and the aggregation logic applied, the results for that particular file are flushed immediately.
Flush by ‘End of stream’
In this case, no data is flushed until the entire stream has finished running or a condition is met. Data continues to be aggregated throughout the entire duration of the stream, which ensures that the output represents the whole stream's data.
Example - telecom billing system
You have four CSV files with data from different regions (North, South, East, and West); each file contains 10,000 call records from the last hour. The data must be aggregated before it is sent to the billing system. If you set Flush by to…
Flush by End of transaction - When the system finishes processing the first file (e.g., the North region), the aggregated sessions with total calls, minutes, and so on are flushed downstream immediately for that file. The same behavior occurs for the other regions' files.
Flush by End of stream - The system processes all 40,000 records into the sessions before flushing the aggregated results downstream.
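The difference can be sketched in a few lines of Python, assuming each region's file is a single transaction and a COUNT per caller is the aggregation. The file names, fields, and flush function are hypothetical, not the product's API.

```python
from collections import Counter

# Hypothetical record batches; each file is one transaction in the stream.
files = {
    "north.csv": [{"caller": "A"}, {"caller": "B"}],
    "south.csv": [{"caller": "A"}],
}

def flush(label, sessions):
    # Stand-in for forwarding aggregated sessions to the next function.
    print(f"flush ({label}):", dict(sessions))

# Flush by End of transaction: aggregate and flush once per file.
for name, recs in files.items():
    flush(f"end of transaction: {name}", Counter(r["caller"] for r in recs))

# Flush by End of stream: aggregate across all files, flush once at the end.
flush("end of stream",
      Counter(r["caller"] for recs in files.values() for r in recs))
```

With End of transaction, the downstream function receives one partial result per file; with End of stream, it receives a single result covering all records.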
Note!
The Flush by options, End of transaction and End of stream, do not apply to real-time streams.
Flush by 'Timeout'
In this case, the system flushes the aggregated data after a specific timeout period has elapsed or a condition is met. This can be useful when the data should be output at regular intervals.
In batch streams, the timeout is passive: the data is flushed on the next stream execution after the timeout period has elapsed.
In real-time streams, the timeout is actively monitored: the system checks every 60 seconds and automatically flushes the data once the timeout has been reached.
Example of a timeout
“Record the sum of sheets of paper used by a subscriber to a printing service on the 10th of every month.”
In this case, the data is flushed strictly based on time: on the 10th of each month.
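As a rough sketch of the two monitoring modes described above, here is a hedged Python illustration built around a flush_due predicate for the monthly scenario; none of these function names come from the product, and the real-time loop is deliberately simplified.

```python
import time
from datetime import datetime, timezone

def flush_due(now: datetime) -> bool:
    # Example timeout: the 10th of every month (as in the scenario above).
    return now.day == 10

def on_batch_execution():
    # Batch stream (passive): the check only happens when the stream runs.
    if flush_due(datetime.now(timezone.utc)):
        print("flush aggregated data downstream")

def monitor_realtime():
    # Real-time stream (active): the system re-checks every 60 seconds.
    while True:
        if flush_due(datetime.now(timezone.utc)):
            print("flush aggregated data downstream")
            break
        time.sleep(60)

# Demonstrate the passive check; the active monitor would block until the 10th.
on_batch_execution()
```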
Timeout type
This can be one of the following options:
Timeout type | Description |
---|---|
Hour | Select the hour interval after which the data will time out. Example - Timeout type with 1-hour intervals: with a 1-hour interval, each session's aggregated data (for example, per account) times out and is flushed after one hour. |
Day | Data is timed out daily at a specific time you set. You specify both the exact time of day and the timezone when this should happen. Example - Timeout type set to 'Day'. |
Month | Data is timed out monthly at a specific time you set. You specify the exact day of the month (e.g., the 1st, the 15th, or 'last day of the month') and the time of day when the timeout should occur, along with the timezone. Example - Timeout type set to 'Month'. |
Based on timestamp field | This setting uses the event timestamp from the input data to determine when to flush the aggregated results. Example - Timeout type set to 'Based on timestamp field'. |
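To show how the timeout types could translate into a concrete next flush time, here is a hedged Python sketch. The fixed times (00:00, the 10th) are example settings, and the computation is one plausible reading of the table above, not the product's exact scheduling logic.

```python
from datetime import datetime, timedelta

def next_flush(last_update: datetime, timeout_type: str) -> datetime:
    if timeout_type == "hour":
        # Hour: time out after a 1-hour interval.
        return last_update + timedelta(hours=1)
    if timeout_type == "day":
        # Day: daily at a fixed time of day (00:00 in this example).
        candidate = last_update.replace(hour=0, minute=0,
                                        second=0, microsecond=0)
        return candidate + timedelta(days=1) if candidate <= last_update else candidate
    if timeout_type == "month":
        # Month: monthly on a fixed day (the 10th at 00:00 in this example).
        candidate = last_update.replace(day=10, hour=0, minute=0,
                                        second=0, microsecond=0)
        if candidate <= last_update:  # already past this month's flush day
            month = 1 if candidate.month == 12 else candidate.month + 1
            year = candidate.year + (1 if month == 1 else 0)
            candidate = candidate.replace(year=year, month=month)
        return candidate
    raise ValueError(f"unknown timeout type: {timeout_type}")

print(next_flush(datetime(2024, 5, 1, 9, 30), "hour"))   # 2024-05-01 10:30:00
print(next_flush(datetime(2024, 5, 1, 9, 30), "month"))  # 2024-05-10 00:00:00
```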
Flush by custom conditions
You can add custom conditions for when you want the flush to occur by clicking + Add condition.
Example - Custom condition added to Flush by
Scenario: Record the sum of sheets of paper used by a subscriber to a printing service on the 10th of every month, or if the 70-page limit is reached. The 70-page limit is the custom condition.
The data will be flushed either when the 70-page limit is reached or on the 10th of the month, whichever happens first. This is what the custom condition would look like in the Data aggregator configuration:
OR condition configuration | Description |
---|---|
Based on | Select which input field or aggregated field you want to apply the condition to. Input Fields shows all the input fields configured in the stream, and Aggregated Fields shows the fields that you have selected to perform aggregation on. In this example, the only aggregated field is 'sum_sheets'. |
Type of field | Select the type of field. Your selection determines the configuration options that follow. |
Operator | Use this field to choose how you want to compare the selected field with a specific value. The available options depend on the Type of field selected. |
Value | Specify the value that the selected field is compared against. The options for this field change depending on the Type of field selected. |
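Putting the scenario together, here is a minimal sketch of the combined OR condition, assuming a session dictionary that carries the aggregated sum_sheets target field; should_flush is an invented name used only for illustration.

```python
from datetime import datetime

def should_flush(session: dict, now: datetime) -> bool:
    # Flush on the 10th of the month (the Timeout trigger) OR when the
    # aggregated sheet count reaches the 70-page limit (the custom
    # condition), whichever happens first.
    return now.day == 10 or session["sum_sheets"] >= 70

print(should_flush({"subscriber": "A", "sum_sheets": 72},
                   datetime(2024, 5, 3)))    # True: page limit hit early
print(should_flush({"subscriber": "B", "sum_sheets": 12},
                   datetime(2024, 5, 10)))   # True: it is the 10th
```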