Performance monitoring

To help you plan the required capacity for your deployment, we monitored Threat View performance. We used a fixed set of performance test parameters to describe performance in a reference setup that you can apply to your production infrastructure.

The parameters and results described in this article were obtained in performance tests executed with Threat View version 1.2.

In our reference setup, we analyzed the computing resources and storage used by the Events and Audit databases and Apache ActiveMQ, with the following focus:

  • The throughput, expressed as the number of requests handled per second, and

  • The response time, expressed as the number of milliseconds within which 95% of all requests were processed

The overall goal was to define the resource requirements to meet a peak load of 105 requests per second for data collection.
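These two metrics can be sketched in code. The latency samples below are made up for illustration, and the nearest-rank percentile convention is an assumption matching common load-testing tools:

```python
# Sketch: computing throughput and a percentile response time from
# per-request latencies collected over a test window (hypothetical data).

def throughput(total_requests, duration_seconds):
    """Requests handled per second over the whole test window."""
    return total_requests / duration_seconds

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies = [12, 15, 9, 20, 18, 11, 14, 22, 13, 19]  # illustrative samples
print(throughput(len(latencies), 0.1))  # 10 requests in 0.1 s -> 100.0 req/s
print(percentile(latencies, 95))
```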

Setup and required infrastructure

The infrastructure used for our setup was built in Amazon Web Services (AWS), using an AWS Elastic Kubernetes Service (EKS) cluster with managed services to persist the data.

| Resource | Number | Instance type | vCPU | Memory (GiB) |
| --- | --- | --- | --- | --- |
| Worker node¹ | 1 | m4.xlarge² | 16 | 64 |
| Amazon MQ | 2 | mq.m5.large³ | 2 | 8 |
| RDS Aurora Postgres | 1 | db.t3.medium⁴ | 2 | 4 |
| RDS MariaDB | 1 | db.m7g.large⁴ | 2 | 8 |
| RDS MariaDB | 2 | db.t3.micro⁴ | 2 | 1 |

¹ Node is part of the EKS/Kubernetes cluster
² For more information, refer to Supported CPU options for Amazon EC2 instance types
³ For more information, refer to Amazon MQ for ActiveMQ broker instance types
⁴ For more information, refer to Hardware specifications for DB instance classes

Vertical / Horizontal Scaling

We assume that the infrastructure can only be scaled vertically, up or down, by selecting different instance types. Serverless variants may technically use horizontal scaling, but their performance is generally less predictable. For that reason they were not used in the tests and are, in turn, not a recommended option for your Threat View deployment.

Test run

The test run was carried out under the following conditions:

  1. Started with a 2-minute warm-up run to give the Java pods time to settle after startup.

  2. Immediately after that, the test plan was run for 10 minutes. The test plan sent an Application startup event.

This yielded the following results:

| Test | Samples | Failed | Error % | Average (ms) | Min (ms) | Max (ms) | Median (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms) | Transactions/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| collector-events | 63000 | 0 | 0.00% | 17.34 | 1 | 393 | 12 | 19 | 20 | 23 | 105.19 |

Result

Threat View can process 105 transactions per second, with 95% of requests processed within 20 ms.
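As a sanity check, the reported throughput follows directly from the sample count and the test duration:

```python
# 63,000 samples over the 10-minute (600 s) test plan corresponds to roughly
# the reported 105.19 transactions/s; the small difference comes from the
# actual elapsed wall-clock time being slightly under 10 minutes.
samples = 63_000
duration_s = 10 * 60
print(samples / duration_s)  # -> 105.0
```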

Helm charts

The Threat View zip file included in the product package contains a Helm chart that automatically applies the following:

  • Minimum 2 pods per service for production stability

  • Maximum 1 pod for services that do not support horizontal scaling

    This applies to the Data Processor and Audit Logger Worker services.

  • Requested CPU core-equivalents and memory capacities are set for each service

    These capacities are set to meet requirements as per the performance profile of each service:

    • The Data Processor service uses more CPU capacity than the Data Collector service and consequently requests more CPU capacity per pod.

    • The Threat View Administration Interface, and the Identity Management, Tokens, and Visual Renderer services are not profiled and use a generic setting.

  • Autoscaling enabled

    This allows Threat View to scale up automatically to meet peak load and scale down during off-peak periods. It balances processing lower volumes against giving the Java pods enough CPU request capacity to be compatible with standard autoscaling settings; setting the CPU requests too low could cause the autoscaler to trigger too often.

For recommendations for a more cost-effective infrastructure, see Helm overrides.
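The scaling behavior described above follows the standard Kubernetes Horizontal Pod Autoscaler rule, desired = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch, with illustrative min/max bounds that are not taken from the Threat View chart:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Standard HPA rule: scale replica count proportionally to metric usage,
    clamped to the configured bounds (bounds here are illustrative)."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# CPU utilization at 90% against a 60% target on 2 pods -> scale to 3 pods.
print(desired_replicas(2, 0.9, 0.6))  # -> 3
```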

Required compute and/or memory capacities

| Service | Replicas | CPU requests | CPU requests * pods | Memory requests (MiB) | Memory requests * pods (MiB) |
| --- | --- | --- | --- | --- | --- |
| Administration Interface | 2 | 0.5 | 1 | 128 | 256 |
| Audit Logger | 2 | 1 | 2 | 760 | 1520 |
| Audit Logger Worker | 1 | 1 | 1 | 760 | 760 |
| Data Collector | 2 | 1 | 2 | 760 | 1520 |
| Data Processor | 1 | 3 | 3 | 1024 | 1024 |
| Identity Management | 2 | 1 | 2 | 760 | 1520 |
| Tokens | 2 | 1 | 2 | 760 | 1520 |
| Visual Renderer | 2 | 1 | 2 | 760 | 1520 |
| Total | | | 15 | | 9640 |

Accordingly, for Kubernetes you need a worker node with 15 CPU core-equivalents and 9.6 GB of memory.
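The totals can be reproduced from the per-service requests (values copied from the capacity table above):

```python
# Per service: (replicas, CPU request per pod, memory request per pod in MiB),
# copied from the capacity table above.
services = {
    "Administration Interface": (2, 0.5, 128),
    "Audit Logger":             (2, 1, 760),
    "Audit Logger Worker":      (1, 1, 760),
    "Data Collector":           (2, 1, 760),
    "Data Processor":           (1, 3, 1024),
    "Identity Management":      (2, 1, 760),
    "Tokens":                   (2, 1, 760),
    "Visual Renderer":          (2, 1, 760),
}

total_cpu = sum(replicas * cpu for replicas, cpu, _ in services.values())
total_mem_mib = sum(replicas * mem for replicas, _, mem in services.values())
print(total_cpu)      # -> 15.0 CPU core-equivalents
print(total_mem_mib)  # -> 9640 MiB, i.e. ~9.6 GB
```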

Required storage

As AWS does not provide an option to set a storage limit when creating an Amazon MQ broker for ActiveMQ, no conclusive statements can be made about the required storage for ActiveMQ.

MySQL Audit database

The storage test was run at a rate of 105 requests per second via the Event Simulator to obtain a range of different payloads to store.

| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
| --- | --- | --- | --- | --- |
| audit | audit events | ~17 GB | ~527 GB | ~1581 GB |

Postgres Events database

The storage test was run at a rate of 105 requests per second via the Event Simulator to obtain a range of different payloads to store.

| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
| --- | --- | --- | --- | --- |
| events | events | ~8 GB | ~244 GB | ~733 GB |
| facts | facts | ~3 GB | ~21 GB | ~21 GB |
| facts | aggregate_staging | ~3 GB | ~3 GB | ~3 GB |
| facts | facts_to_process | ~2 GB | ~2 GB | ~2 GB |
| facts | events_by_country_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | device_models_dim | < 1 MB | < 1 MB | < 1 MB |
| Total | | ~23 GB | ~270 GB | ~759 GB |
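To size storage for a retention period other than the ones listed, a simple linear extrapolation from the per-day figures is a reasonable first estimate (illustrative helper, not part of the product):

```python
def storage_estimate(gb_per_day, retention_days):
    """Linear extrapolation of storage growth. Real growth may differ,
    because aggregate and staging tables do not grow linearly."""
    return gb_per_day * retention_days

# e.g. the events table at ~8 GB/day over a 90-day retention window:
print(storage_estimate(8, 90))  # -> 720 (GB), in line with the ~733 GB measured
```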

Archiving

You can configure an archiving process that runs on a regular basis if you wish to retain data before it is deleted by the configurable cleanup job.

By default, the cleanup job is disabled, so it must be enabled in addition to archiving. The cleanup job drops old partitions according to your configuration. For instance, if you configure the lowest possible value, 1, the cleanup job drops any partitions older than 1 month. This means you need to configure your archiving process according to the settings of the cleanup job, so that your data is archived before the cleanup job deletes it.
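The interplay between archiving and the cleanup job can be sketched as a cutoff computation. This is an illustrative model only; the actual cleanup job operates on database partitions:

```python
from datetime import date

def partitions_to_drop(partition_months, today, retention_months):
    """Return the monthly partitions older than the retention setting.

    partition_months: list of (year, month) tuples, one per partition.
    retention_months: the cleanup job setting; 1 is the lowest value,
    meaning partitions older than 1 month are dropped.
    """
    def age_in_months(ym):
        year, month = ym
        return (today.year - year) * 12 + (today.month - month)
    return [ym for ym in partition_months if age_in_months(ym) > retention_months]

parts = [(2024, 1), (2024, 2), (2024, 3), (2024, 4)]
# Archive these partitions before the cleanup job runs:
print(partitions_to_drop(parts, date(2024, 4, 15), 1))
# -> [(2024, 1), (2024, 2)]
```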

Helm overrides

The standard Threat View setup exceeds a capacity of 105 transactions per second. If required, you can optimize costs, for example by using cheaper infrastructure, and set Helm overrides to run Threat View more cost-effectively:

To set Helm overrides

  • Opt out of the default replica set with 2 pods and run 1 pod per service instead.

    Changing from the default replica set with 2 pods to a set with 1 pod can impact the availability of the service.

    To opt out, set the defaultReplicas value from 2 to 1 in your Helm overrides.

  • Opt out of auditing the Data Collector service. This shrinks the auditing requirements to a micro level and can be done by setting a blank value for AUDITLOGGER_SERVICE_URL on data-collector.

  • Opt out of autoscaling, then reconfigure resource requests of individual services to a lower value.
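Taken together, a values override file might look like the following sketch. The defaultReplicas value and the AUDITLOGGER_SERVICE_URL variable come from this article; the surrounding key structure is an assumption and must be verified against your chart version:

```yaml
# Hypothetical Helm override values (verify key paths against your chart):
defaultReplicas: 1            # run 1 pod per service instead of the default 2

data-collector:
  env:
    AUDITLOGGER_SERVICE_URL: ""   # blank value opts out of auditing
```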