Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Info
By default, when working with files, Usage Engine Cloud Edition will either successfully process an entire input file, or not process the input file at all. This is called transactional processing. This transactional processing helps ensure that data is neither lost nor duplicated.

Anchor
batchMultipleFiles
batchMultipleFiles
Batching multiple files per transaction

If you are processing many small files, the performance overhead of having one transaction per file can reduce performance.

It is possible to batch multiple files into a single transaction when collecting files with the Amazon S3 collector.

In the configuration:

  • Set the Transaction Batch Size to a value greater than 1 to process multiple files per transaction.

...

Note
titleNote!
The number of stream replicas must be chosen when you create your stream. It cannot be changed once set. This is because of persistent states in Cloud Edition such as aggregation sessions and de-duplication information is stored separately per replica.

...

Replicas and performance

When using replicas, the processing is done in parallel. If you partition your input files well, this can significantly speed up processing. This way of parallelizing processing is called horizontal scaling, or scaling out.

When not using replicas, a single stream instance processes all input files in sequence, as shown at the top of the illustration. The total stream execution time is the cumulative of the time taken for each file to be processed.

...

However, this almost never happens in real life. When using replicas you will see a reduction of stream execution time that is typically somewhere between 20% and 99%. Generally, the more replicas, the shorter the execution time, but you will eventually hit a point where adding more replicas don’t doesn’t reduce the stream execution time further.

This will depend on what your stream does. For example, if it requires a warmup phase to cache data, how evenly input files are partitioned , if it accesses external systems that introduce bottlenecks, and so on.

...

To do this, you need to create a copy of your stream with more replicas. Then you must ensure you flush out all persistent state states from the old running version - particularly aggregation sessions and de-duplication information. Then you can start processing data with the new stream version that has more partitions.

...

Replicas that don’t find any input files will not be executed for more than a brief moment, so a simple way is to just to stop sending input files to some replicas. If you do need to reduce the number of replicas, follow the same steps taken when increasing the number of replicas above.

Why can’t Usage Engine Cloud Edition scale

...

out and scale

...

in for me automatically?

We aim to provide elastic scaling in future versions of Usage Engine Cloud Edition. For now, the number of stream replicas is static and can’t be changed.