...
Overview
Usage Engine Private Edition now supports horizontal scaling of batch workflows, making it possible to increase or decrease processing capacity as needed without manual intervention. As a general concept, batch scaling is a way to speed up processes by splitting the workload between multiple “workers” or resources, enabling them to complete tasks in parallel rather than sequentially. Usage Engine’s solution consists of two new agents, a Scalable Inter Workflow Forwarding agent and a Scalable Inter Workflow Collection agent (Scalable InterWF). Two new profiles have also been created - the Partition Profile and the Scalable Inter Workflow Profile. The feature uses the existing agents, Data Aggregator and Duplicate UDR, which have been updated to support a Kafka storage type. Kafka must be configured for all storage within your batch scaling solution. Add something here about recommended use cases as per the note above?
How it works
Scalable workflows operate by splitting batch data into partitions so that multiple workflows can cooperate to process a batch. Each scaled workflow is assigned one or more partitions and processes all the data assigned to it. When workflows are started or stopped, a rebalance is performed in which the partitions are reassigned to the new set of workflows.
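The rebalance described above can be illustrated with a minimal sketch. The source does not state which assignment strategy the product uses, so the round-robin scheme, the function name `assign_partitions`, and the workflow names below are assumptions made purely for illustration.

```python
def assign_partitions(num_partitions, workflow_ids):
    """Spread partition numbers over the running workflows, round-robin.

    Illustrative only: the real rebalance strategy is not documented here.
    """
    if not workflow_ids:
        return {}
    assignment = {wf: [] for wf in workflow_ids}
    for partition in range(num_partitions):
        owner = workflow_ids[partition % len(workflow_ids)]
        assignment[owner].append(partition)
    return assignment


# Two workflows share six partitions.
before = assign_partitions(6, ["wf-1", "wf-2"])

# A third workflow starts, so the partitions are reassigned to the new set.
after = assign_partitions(6, ["wf-1", "wf-2", "wf-3"])

print(before)
print(after)
```

In this sketch every partition always has exactly one owner, both before and after the rebalance, which is the property that lets the workflows cooperate on one batch without processing the same data twice.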
This example shows a batch processing setup where you collect files and perform duplication checks and aggregation. We want to make this solution scalable to improve the processing times of our data during periods of high usage. We have set up two workflows in our batch scaling solution.
...
I think we should add something here to explain what a partition is. It may also be helpful to link to this doc: Automatic Scale Out and Rebalancing (4.3).
...
For example, if you have a huge list of data to process, a partitioned system could split that list into sections and process each one in parallel, speeding up the work and allowing the system to handle more data without slowing down.
...
In the File collection workflow, the Scalable InterWF Forwarding agent sends data to the partitions. It uses one or more unique ID fields (e.g. customer ID) to determine which partition a UDR belongs to.
The maximum number of partitions created is determined by the Max Scale Factor parameter in the Partition Profile.
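As a rough sketch of how ID fields could map a UDR to a partition: the source does not say which function the Forwarding agent applies, so the hash-and-modulo approach, the function name `partition_for`, and the field names below are assumptions, not the product's implementation.

```python
import hashlib


def partition_for(udr, id_fields, max_scale_factor):
    """Map a UDR to a partition number in the range 0..max_scale_factor-1.

    Assumption: the ID field values are joined into one key and hashed.
    """
    key = "|".join(str(udr[field]) for field in id_fields)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_scale_factor


# Hypothetical UDRs; only the ID field (customer_id) decides the partition.
udr_a = {"customer_id": "C-1001", "bytes": 512}
udr_b = {"customer_id": "C-1001", "bytes": 2048}

partition_a = partition_for(udr_a, ["customer_id"], max_scale_factor=3)
partition_b = partition_for(udr_b, ["customer_id"], max_scale_factor=3)
print(partition_a == partition_b)
```

The point of the sketch is that UDRs with the same ID field values always land in the same partition, and that the partition number never exceeds what Max Scale Factor allows.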
Note!
The number of partitions will be the same across all topics. Data is stored at several points, for example:
With the passing of UDRs between workflows.
When duplicate UDR keys are detected.
For aggregated sessions.
The Duplication Check workflow(s) will check for duplicates across all partitions. Checked UDRs are placed in an additional topic with the same partitions as the corresponding Collection workflow topic. (The duplicate keys are saved in a separate topic that has the same number of partitions and uses the same ID fields.)
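One way to read the paragraph above is that, because the duplicate keys are partitioned by the same ID fields as the UDRs, a repeated UDR always meets its original in the same partition. The sketch below illustrates that idea with in-memory sets standing in for the key topic. The hash function, the field names, and the value of `MAX_SCALE_FACTOR` are all hypothetical.

```python
import zlib

MAX_SCALE_FACTOR = 4  # hypothetical value for illustration


def partition_for(customer_id):
    """Assumed partitioning: checksum of the ID field modulo partition count."""
    return zlib.crc32(customer_id.encode("utf-8")) % MAX_SCALE_FACTOR


# One key store per partition, standing in for the duplicate key topic.
seen_keys = [set() for _ in range(MAX_SCALE_FACTOR)]


def is_duplicate(udr):
    """Check a UDR only against the keys stored for its own partition."""
    partition = partition_for(udr["customer_id"])
    key = (udr["customer_id"], udr["record_id"])
    if key in seen_keys[partition]:
        return True
    seen_keys[partition].add(key)
    return False


udrs = [
    {"customer_id": "C-1", "record_id": 1},
    {"customer_id": "C-2", "record_id": 1},
    {"customer_id": "C-1", "record_id": 1},  # repeat of the first UDR
]
results = [is_duplicate(udr) for udr in udrs]
print(results)  # [False, False, True]
```

Under these assumptions, each workflow only needs the key store for the partitions it owns, yet duplicates are still caught regardless of which workflow handles them.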
The Aggregation workflow(s) will collect data from an inter-workflow topic and use a separate aggregation session storage topic.
Prerequisites for Kafka/batch scaling?
...
The Processing workflow is the workflow that scales; that is, you can run from one up to the Max Scale Factor number of workflows, which cooperate to do the processing. In this example, records go through a duplication check and are aggregated. Persistent storage for the Duplicate UDR check and aggregation is also partitioned.
...
In comparison, a cache is a direct storage solution intended for fast access to data. Topics, on the other hand, are about managing and distributing data streams efficiently across systems.
...
From chat with Michal:
How does the new solution differ from what users can configure now? The information on Automatic Scale Out and Rebalancing (4.3) is not related to batch scaling. It references Kafka doing some partitioning work based on what is configured in the Kafka agent. DR's new batch scaling solution does the partitioning work within the inter-WF agents.
How does the new solution know when to scale? Is it based on the number of raw data files that get collected at any one time? Right now, you have to manually configure your ECD to scale based on a known metric, i.e. if the data file amount is over 1000 files then…
Look at the example image from the doc:
Is it the File collection workflow that creates the partitions? Not really. It is, in effect, the Scalable InterWF Forwarding agent or, as Michal says, any agent using the Partition Profile.
Does it create the partitions based on the Max Scale Factor parameter? True, says Michal. This also sets the maximum number of parallel workflows.
Where is the Max scale factor parameter located? In the Partition Profile configuration.
Our example shows 3 workflows. Does there have to be exactly 3 workflows in a solution? Is there a minimum or maximum number of workflows needed to create a working solution? There is no maximum or minimum number of workflows required.
...