KPI Management Scaling Considerations
This section describes details of the KPI Management implementation that you should take into account when deploying KPI Management in a production environment.
Operating System
The default maximum number of file descriptors and processes in your OS may be insufficient for the Spark cluster.
If the number of file descriptors is exceeded, it will be indicated by the error message "too many open files" in the Spark driver log. The procedure below describes how to increase the limit on a Linux system.
The error "java.lang.OutOfMemoryError: unable to create new native thread" indicates that the maximum number of processes is exceeded. To increase this number, you typically need to change the settings in the file /etc/security/limits.conf
. For further information about how to update this file, see your operating system documentation.
Spark
To scale out the Spark applications, you can increase the number of slave hosts for the Spark cluster or increase the number of CPUs on existing hosts.
Follow these steps to scale out the Spark applications:
Stop the KPI Management workflows.
Shut down the Spark cluster.
Shutdown all Kafka and Zookeepers.
Update the Spark service configuration. If you are not only adding CPUs on existing host but also new slave hosts, you need to add these to the configuration.
Increase the number of Kafka partitions. In order for the Spark service to work, the required number of partitions for each topic must be equal to the setting of the property
spark.default.parallelism
in the Spark application configuration.Remove the checkpoint directory.
$ rm -rf <checkpoint directory>/*
Restart Kafka and Zookeeper.
Submit the Spark application.
Start the KPI Management workflows.
ZooKeeper
The Spark application connects to the ZooKeeper service and you must ensure that the maximum number of client connections is not exceeded since this will cause errors during the processing. The default maximum number of connections is 60 per client host. The required number of connections directly corresponds to the total number of running Spark executors.