Performance monitoring

To help you plan the required capacity for your deployment, we monitored Threat View performance. We used a fixed set of performance test parameters to describe performance in a reference setup that you can apply to your production infrastructure.

The parameters and results described in this article were obtained in performance tests executed with Threat View version 1.2.

In our reference setup, we analyzed the computing resources and storage used by the Events and Audit databases and Apache ActiveMQ, with the following focus:

  • The throughput, expressed as the number of requests handled per second, and

  • The response time, expressed as the number of milliseconds within which 95% of all requests were processed

The overall goal was to define the resource requirements to meet a peak load of 105 requests per second for data collection.
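These two metrics can be sketched in code. The latency samples below are made up for illustration, and the nearest-rank percentile convention is an assumption matching common load-testing tools:

```python
# Sketch: computing throughput and a percentile response time from
# per-request latencies collected over a test window (hypothetical data).

def throughput(total_requests, duration_seconds):
    """Requests handled per second over the whole test window."""
    return total_requests / duration_seconds

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies = [12, 15, 9, 20, 18, 11, 14, 22, 13, 19]  # illustrative samples
print(throughput(len(latencies), 0.1))  # 10 requests in 0.1 s -> 100.0 req/s
print(percentile(latencies, 95))
```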

Setup and required infrastructure

The infrastructure used for our setup was built in Amazon Web Services (AWS), using an AWS Elastic Kubernetes Service (EKS) cluster with managed services to persist the data.

| Resource | Number | Instance type | vCPU | Memory (GiB) |
| --- | --- | --- | --- | --- |
| Worker node¹ | 1 | m4.xlarge² | 16 | 64 |
| Amazon MQ | 2 | mq.m5.large³ | 2 | 8 |
| RDS Aurora Postgres | 1 | db.t3.medium⁴ | 2 | 4 |
| RDS MariaDB | 1 | db.m7g.large⁴ | 2 | 8 |
| RDS MariaDB | 2 | db.t3.micro⁴ | 2 | 1 |

¹ Node is part of the EKS/Kubernetes cluster
² For more information, refer to Supported CPU options for Amazon EC2 instance types
³ For more information, refer to Amazon MQ for ActiveMQ broker instance types
⁴ For more information, refer to Hardware specifications for DB instance classes

Vertical / Horizontal Scaling

We assume that the infrastructure can only be scaled vertically, up or down, by selecting different instance types. Serverless variants may technically use horizontal scaling, but their performance is generally less predictable. For that reason they were not used in the tests and are, in turn, not a recommended option for your Threat View deployment.

Test run

The test run was carried out under the following conditions:

  1. Started with a 2-minute warm-up run to give the Java pods time to settle after startup.

  2. Immediately after that, the test plan was run for 10 minutes. The test plan sent an Application startup event.

This yielded the following results:

| Test | Samples | Failed | Error % | Average (ms) | Min (ms) | Max (ms) | Median (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms) | Transactions/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| collector-events | 63000 | 0 | 0.00% | 17.34 | 1 | 393 | 12 | 19 | 20 | 23 | 105.19 |

Result

Threat View can process 105 transactions per second, with 95% of requests processed within 20 ms.
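As a sanity check, the reported throughput follows directly from the sample count and the test duration:

```python
# 63,000 samples over the 10-minute (600 s) test plan corresponds to roughly
# the reported 105.19 transactions/s; the small difference comes from the
# actual elapsed wall-clock time being slightly under 10 minutes.
samples = 63_000
duration_s = 10 * 60
print(samples / duration_s)  # -> 105.0
```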

Helm charts

The Threat View zip file included in the product package contains a Helm chart that automatically applies the following:

  • Minimum 2 pods per service for production stability

  • Maximum 1 pod for services that do not support horizontal scaling

    This applies to the Data Processor and Audit Logger Worker services.

  • Requested CPU core-equivalents and memory capacities are set for each service

    These capacities are set to meet requirements as per the performance profile of each service:

    • The Data Processor service uses more CPU capacity than the Data Collector service and consequently requests more CPU capacity per pod.

    • The Threat View Administration Interface, and the Identity Management, Tokens, and Visual Renderer services are not profiled and use a generic setting.

  • Autoscaling enabled

    This allows Threat View to scale up automatically to meet peak load and scale down during off-peak periods. It balances processing lower volumes against giving the Java pods enough CPU request capacity to be compatible with standard autoscaling settings; setting the CPU requests too low could cause the autoscaler to trigger too often.

For recommendations for a more cost-effective infrastructure, see Helm overrides.
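The scaling behavior described above follows the standard Kubernetes Horizontal Pod Autoscaler rule, desired = ceil(currentReplicas × currentMetric / targetMetric). A minimal sketch, with illustrative min/max bounds that are not taken from the Threat View chart:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Standard HPA rule: scale replica count proportionally to metric usage,
    clamped to the configured bounds (bounds here are illustrative)."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# CPU utilization at 90% against a 60% target on 2 pods -> scale to 3 pods.
print(desired_replicas(2, 0.9, 0.6))  # -> 3
```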

Required compute and/or memory capacities

| Service | Replicas | CPU requests | CPU requests * pods | Memory requests (MiB) | Memory requests * pods (MiB) |
| --- | --- | --- | --- | --- | --- |
| Administration Interface | 2 | 0.5 | 1 | 128 | 256 |
| Audit Logger | 2 | 1 | 2 | 760 | 1520 |
| Audit Logger Worker | 1 | 1 | 1 | 760 | 760 |
| Data Collector | 2 | 1 | 2 | 760 | 1520 |
| Data Processor | 1 | 3 | 3 | 1024 | 1024 |
| Identity Management | 2 | 1 | 2 | 760 | 1520 |
| Tokens | 2 | 1 | 2 | 760 | 1520 |
| Visual Renderer | 2 | 1 | 2 | 760 | 1520 |
| Total | | | 15 | | 9640 |

Accordingly, for Kubernetes you need a worker node with 15 CPU core-equivalents and 9.6 GB of memory.
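The totals can be reproduced from the per-service requests (values copied from the capacity table above):

```python
# Per service: (replicas, CPU request per pod, memory request per pod in MiB),
# copied from the capacity table above.
services = {
    "Administration Interface": (2, 0.5, 128),
    "Audit Logger":             (2, 1, 760),
    "Audit Logger Worker":      (1, 1, 760),
    "Data Collector":           (2, 1, 760),
    "Data Processor":           (1, 3, 1024),
    "Identity Management":      (2, 1, 760),
    "Tokens":                   (2, 1, 760),
    "Visual Renderer":          (2, 1, 760),
}

total_cpu = sum(replicas * cpu for replicas, cpu, _ in services.values())
total_mem_mib = sum(replicas * mem for replicas, _, mem in services.values())
print(total_cpu)      # -> 15.0 CPU core-equivalents
print(total_mem_mib)  # -> 9640 MiB, i.e. ~9.6 GB
```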

Required storage

As AWS does not provide an option to set a storage limit when creating an Amazon MQ broker for ActiveMQ, no conclusive statements can be made about the required storage for ActiveMQ.

MySQL Audit database

The storage test was run at a rate of 105 requests per second via the Event Simulator to obtain a range of different payloads to store.

| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
| --- | --- | --- | --- | --- |
| audit | audit events | ~17 GB | ~527 GB | ~1581 GB |

Postgres Events database

The storage test was run at a rate of 105 requests per second via the Event Simulator to obtain a range of different payloads to store.

| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
| --- | --- | --- | --- | --- |
| events | events | ~8 GB | ~244 GB | ~733 GB |
| facts | facts | ~3 GB | ~21 GB | ~21 GB |
| facts | aggregate_staging | ~3 GB | ~3 GB | ~3 GB |
| facts | facts_to_process | ~2 GB | ~2 GB | ~2 GB |
| facts | events_by_country_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | device_models_dim | < 1 MB | < 1 MB | < 1 MB |
| Total | | ~23 GB | ~270 GB | ~759 GB |
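To size storage for a retention period other than the ones listed, a simple linear extrapolation from the per-day figures is a reasonable first estimate (illustrative helper, not part of the product):

```python
def storage_estimate(gb_per_day, retention_days):
    """Linear extrapolation of storage growth. Real growth may differ,
    because aggregate and staging tables do not grow linearly."""
    return gb_per_day * retention_days

# e.g. the events table at ~8 GB/day over a 90-day retention window:
print(storage_estimate(8, 90))  # -> 720 (GB), in line with the ~733 GB measured
```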

Archiving

You can configure an archiving process that runs on a regular basis if you wish to retain data before it is deleted by the configurable cleanup job.

By default, the cleanup job is disabled, so it must be enabled in addition to archiving. The cleanup job drops old partitions according to your configuration. For instance, if you configure the lowest possible value, 1, the cleanup job drops any partitions older than 1 month. This means you need to configure your archiving process according to the settings of the cleanup job, so that your data is archived before the cleanup job deletes it.
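The interplay between archiving and the cleanup job can be sketched as a cutoff computation. This is an illustrative model only; the actual cleanup job operates on database partitions:

```python
from datetime import date

def partitions_to_drop(partition_months, today, retention_months):
    """Return the monthly partitions older than the retention setting.

    partition_months: list of (year, month) tuples, one per partition.
    retention_months: the cleanup job setting; 1 is the lowest value,
    meaning partitions older than 1 month are dropped.
    """
    def age_in_months(ym):
        year, month = ym
        return (today.year - year) * 12 + (today.month - month)
    return [ym for ym in partition_months if age_in_months(ym) > retention_months]

parts = [(2024, 1), (2024, 2), (2024, 3), (2024, 4)]
# Archive these partitions before the cleanup job runs:
print(partitions_to_drop(parts, date(2024, 4, 15), 1))
# -> [(2024, 1), (2024, 2)]
```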

Helm overrides

The standard Threat View setup exceeds a capacity of 105 transactions per second. If required, you can optimize costs, for example by using cheaper infrastructure, and set Helm overrides to run Threat View more cost-effectively:

To set Helm overrides

  • Opt out of the default replica set with 2 pods and run 1 pod per service instead.

    Changing from the default replica set with 2 pods to a set with 1 pod can impact the availability of the service.

    To opt out, set the defaultReplicas value from 2 to 1 in your Helm overrides.

  • Opt out of auditing the Data Collector service. This shrinks the auditing requirements to a micro level and can be done by setting a blank value for AUDITLOGGER_SERVICE_URL on data-collector.

  • Opt out of autoscaling, then reconfigure resource requests of individual services to a lower value.
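Taken together, a values override file might look like the following sketch. The defaultReplicas value and the AUDITLOGGER_SERVICE_URL variable come from this article; the surrounding key structure is an assumption and must be verified against your chart version:

```yaml
# Hypothetical Helm override values (verify key paths against your chart):
defaultReplicas: 1            # run 1 pod per service instead of the default 2

data-collector:
  env:
    AUDITLOGGER_SERVICE_URL: ""   # blank value opts out of auditing
```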