...

This example shows a batch processing setup where you collect files and perform duplication checks and aggregation. We want to make this solution scalable to improve processing times during periods of high usage. We want to set up two to three workflows in our batch scaling solution; in this example, we use three.

...

I think we should add something here to explain what a partition is. It may also be helpful to link to this doc: Automatic Scale Out and Rebalancing (4.3).

Info

Partitions - From Chat GPT:

In software scaling, a partition is a way of breaking up large sets of data or tasks into smaller, more manageable parts. Each partition handles a subset of the total data or workload, which allows the system to process different parts at the same time, using multiple resources (like servers or processors). This makes the overall process faster and more efficient, especially as the data or workload grows.

For example, if you have a huge list of data to process, a partitioned system could split that list into sections and process each one in parallel, speeding up the work and allowing the system to handle more data without slowing down.
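
As a loose illustration of the idea (not DR's implementation), the following Python sketch splits a list of records into partitions and processes them in parallel. The partition count and the per-record work are invented for the example.

  from concurrent.futures import ProcessPoolExecutor

  NUM_PARTITIONS = 3  # illustrative only; in the product this comes from configuration

  def process_partition(records):
      # Placeholder for whatever per-record work a workflow would do.
      return [r.upper() for r in records]

  def partition(records, n):
      # Round-robin split of the full record list into n smaller lists.
      buckets = [[] for _ in range(n)]
      for i, record in enumerate(records):
          buckets[i % n].append(record)
      return buckets

  if __name__ == "__main__":
      data = [f"record-{i}" for i in range(10)]
      with ProcessPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
          results = list(pool.map(process_partition, partition(data, NUM_PARTITIONS)))
      print(results)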

  1. The scalable InterWF Forwarder agent in the File Collection workflow(s) manages the inter-workflow (InterWF) partitions. It uses an ID Field (e.g. customer ID) to determine which partition a UDR belongs to (see the sketch after this list).

  2. The maximum number of partitions created is determined by the Max Scale Factor parameter, which is configured in the Partition profile.
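
To make points 1 and 2 concrete, here is a minimal sketch of how an ID Field could be mapped to a partition index capped by the Max Scale Factor. The hashing scheme is an assumption for illustration, not necessarily what the InterWF agents do internally.

  import hashlib

  MAX_SCALE_FACTOR = 3  # from the Partition profile in this example

  def partition_for(id_field: str, max_scale_factor: int = MAX_SCALE_FACTOR) -> int:
      # A stable hash guarantees that every UDR with the same ID field
      # (e.g. customer ID) always lands in the same partition.
      digest = hashlib.md5(id_field.encode("utf-8")).hexdigest()
      return int(digest, 16) % max_scale_factor

  print(partition_for("customer-1001"))  # same customer, same partition, every time

Because the mapping is deterministic, all records for a given customer are handled by the same scaled-out workflow instance.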

Note!

The number of partitions will be the same across all storage buckets/caches/topics. Storage occurs at several points, for example:

  • With the passing of UDRs between workflows.

  • When duplicate UDR keys are detected.

  • For aggregated sessions.
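
For illustration only, the snippet below creates three Kafka topics with an identical partition count using the confluent-kafka AdminClient; the topic names are invented for this example and are not names the product generates.

  from confluent_kafka.admin import AdminClient, NewTopic

  NUM_PARTITIONS = 3  # must be identical for every topic the solution uses

  admin = AdminClient({"bootstrap.servers": "localhost:9092"})

  # Hypothetical topics for the three points of storage listed above.
  topics = [
      NewTopic("interwf-udrs", num_partitions=NUM_PARTITIONS, replication_factor=1),
      NewTopic("dedup-keys", num_partitions=NUM_PARTITIONS, replication_factor=1),
      NewTopic("aggregation-sessions", num_partitions=NUM_PARTITIONS, replication_factor=1),
  ]

  for name, future in admin.create_topics(topics).items():
      future.result()  # raises if the topic could not be created
      print(f"created {name} with {NUM_PARTITIONS} partitions")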

...

  1. The Aggregation workflow(s) will collect data from an inter-workflow topic and use a separate aggregation session storage topic.
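
A rough sketch of that flow, using plain confluent-kafka clients and the same invented topic names; in the product this is configured through the Aggregation agent and its Kafka storage profile rather than hand-written code.

  import json
  from confluent_kafka import Consumer, Producer

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",
      "group.id": "aggregation-wf",
      "auto.offset.reset": "earliest",
  })
  consumer.subscribe(["interwf-udrs"])  # inter-workflow topic (hypothetical name)
  producer = Producer({"bootstrap.servers": "localhost:9092"})

  sessions = {}  # in-memory view; the durable copy lives in the session storage topic

  while True:
      msg = consumer.poll(1.0)
      if msg is None or msg.error():
          continue
      udr = json.loads(msg.value())
      key = udr["customer_id"]  # the ID field that links the records
      session = sessions.setdefault(key, {"count": 0, "total": 0})
      session["count"] += 1
      session["total"] += udr.get("amount", 0)
      # Persist the updated session to the separate aggregation session storage topic.
      producer.produce("aggregation-sessions", key=key, value=json.dumps(session))
      producer.poll(0)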

Prerequisites for Kafka/batch scaling?

Are there any prerequisites required to configure batch scaling using Kafka storage? Your workflow has to be designed in a way that supports batch processing; for example, there has to be at least one common denominator in the data that links individual records.
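
As a trivial illustration of that prerequisite, the check below verifies that every record in a batch carries the same linking field (a hypothetical customer_id) before partitioned processing is attempted.

  def has_common_denominator(records, field="customer_id"):
      # True only if every record carries a non-empty value for the linking field.
      return all(record.get(field) for record in records)

  batch = [
      {"customer_id": "1001", "amount": 5},
      {"customer_id": "1002", "amount": 7},
  ]
  assert has_common_denominator(batch)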

Subsections

This section contains the following subsections:

You use the new InterWF Collector agent to pick up the files from the external system/InterWF storage (the InterWF partition). You also need duplication checks, after which you use the InterWF Forwarder to take the non-duplicated files and feed them to the aggregation partitions (pretty common processes in any workflow group). You use the existing Deduplicate and Data Aggregator agents; however, they have a new Kafka storage profile option, which you need to configure. Finally, you use the other new agent …
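
The duplicate check and forwarding step could look roughly like this, assuming a simple in-memory set of previously seen UDR keys and the invented topic names used earlier; the real Deduplicate and InterWF Forwarder agents are configured through their profiles, not coded by hand.

  import json
  from confluent_kafka import Producer

  producer = Producer({"bootstrap.servers": "localhost:9092"})
  seen_keys = set()  # stand-in for the Deduplicate agent's Kafka-backed key storage

  def forward_if_new(udr: dict) -> None:
      # Skip duplicates, forward new UDRs keyed on the partitioning ID field.
      record_key = udr["record_id"]  # hypothetical per-UDR duplicate key
      if record_key in seen_keys:
          return  # duplicate detected, do not forward
      seen_keys.add(record_key)
      producer.produce(
          "interwf-udrs",           # inter-workflow topic (hypothetical name)
          key=udr["customer_id"],   # the ID field decides the partition
          value=json.dumps(udr),
      )

  forward_if_new({"record_id": "r-1", "customer_id": "1001", "amount": 5})
  producer.flush()

Producing with the ID field as the message key lets the broker's default partitioner keep all records for the same customer in the same partition.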

Info

From Chat GPT re: Topics - For draft purposes only:

In a software context, especially in messaging and streaming platforms like Kafka, a topic isn’t a type of storage in the traditional sense, like a cache or database. Instead, it refers to a "channel" or "feed" where messages (like UDRs) are grouped and published for consumers to read from. While a topic involves data persistence (messages are stored temporarily or longer-term, depending on configuration), it's more about organizing and transmitting data rather than being a storage unit itself.

In comparison, a cache is a direct storage solution intended for fast access to data. Topics, on the other hand, are about managing and distributing data streams efficiently across systems.
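
To make the contrast concrete, the snippets below compare a direct cache lookup with publishing to a topic; both are only illustrative.

  from confluent_kafka import Producer

  # A cache is direct storage: put a value, read it back immediately by key.
  cache = {}
  cache["session:1001"] = {"count": 3}
  print(cache["session:1001"])

  # A topic is a feed: the producer publishes a message, and any number of
  # consumers read it later, in order, from wherever their offsets stand.
  producer = Producer({"bootstrap.servers": "localhost:9092"})
  producer.produce("aggregation-sessions", key="1001", value='{"count": 3}')
  producer.flush()  # the message is now persisted by the broker, not by this process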

Info

From chat with Michal:

How does the new solution differ from what users can configure now? The information in Automatic Scale Out and Rebalancing (4.3) is not related to batch scaling; it refers to Kafka doing some partitioning work based on what is configured in the Kafka agent. DR's new batch scaling solution does the partitioning work within the inter-WF agents.

How does the new solution know when to scale? Is it based on the number of raw data files collected at any one time? Right now you have to manually configure your ECD to scale based on a known metric, e.g. if the number of data files is over 1,000, then …

Look at the example image from the doc: 

Is it the File Collection workflow that creates the partitions? Not really; it is the scalable InterWF Forwarder agent or, as Michal says, any agent using the Partition profile.

Does it create the partitions based on the Max Scale Factor parameter? True, says Michal; this also sets the maximum number of parallel workflows.

Where is the Max Scale Factor parameter located? In the Partition profile configuration.

Our example shows three workflows. Does there have to be exactly three workflows in a solution? Is there a minimum or maximum number of workflows needed to create a working solution? There is no minimum or maximum number of workflows required.

Are there any prerequisites required to configure batch scaling using Kafka storage? Yes, your workflow has to be designed in a way that supports batch processing; for example, there has to be at least one common denominator in the data that links individual records.