To help you plan the required capacities for your deployment, we monitored Threat View performance. We used a fixed set of performance test parameters to describe the performance of a reference setup that can be applied to your production infrastructure.
The parameters and results described in this article were obtained in the performance tests executed with Threat View version 1.2.
In our reference setup, we analyzed the computing resources used and the storage used by the Events and Audit databases and Apache ActiveMQ, with the following focus:

- The throughput, expressed as the number of requests handled per second
- The response time, expressed as the number of milliseconds within which 95% of all requests were processed
The overall goal was to define the resource requirements to meet a peak load of 105 requests per second for data collection.
Setup and required infrastructure
The infrastructure used for our setup was built in Amazon Web Services (AWS), using an AWS Elastic Kubernetes Service (EKS) cluster with managed services to persist the data.
| Resource | Number | Instance Type | vCPU | Memory (GiB) |
|---|---|---|---|---|
| Worker node¹ | 1 | m4.xlarge² | 16 | 64 |
| Amazon MQ | 2 | mq.m5.large³ | 2 | 8 |
| RDS Aurora Postgres | 1 | db.t3.medium⁴ | 2 | 4 |
| RDS MariaDB | 1 | db.m7g.large⁴ | 2 | 8 |
| RDS MariaDB | 2 | db.t3.micro⁴ | 2 | 1 |

¹ The node is part of the EKS/Kubernetes cluster.
² For more information, refer to Supported CPU options for Amazon EC2 instance types.
³ For more information, refer to Amazon MQ for ActiveMQ broker instance types.
⁴ For more information, refer to Hardware specifications for DB instance classes.
Vertical / Horizontal Scaling
We assume that the infrastructure is scaled only vertically, up or down, by selecting different instance types. Serverless variants may technically use horizontal scaling, but their performance is generally less predictable. For that reason they were not used in the tests and, in turn, are not a recommended option for your Threat View deployment.
Test run
The test run was carried out under the following conditions:
- It started with a 2-minute warm-up run to give the Java pods time to settle after startup.
- Immediately afterwards, the test plan was run for 10 minutes. The test plan repeatedly sent an Application startup event.
This yielded the following results:
| Test | Samples | Failed | Error % | Average (ms) | Min (ms) | Max (ms) | Median (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms) | Transactions/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| collector-events | 63000 | 0 | 0.00% | 17.34 | 1 | 393 | 12 | 19 | 20 | 23 | 105.19 |
Result
Threat View can process 105 transactions per second, with 95% of requests processed within 20 ms.
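The headline figure can be verified directly from the results table, using only the sample count and the 10-minute test duration:

```python
# Sanity-check the reported throughput from the raw test figures:
# 63,000 samples collected over the 10-minute test plan.
samples = 63_000
duration_s = 10 * 60

throughput_rps = samples / duration_s
print(f"{throughput_rps:.0f} requests/s")  # 105 requests/s
```

This matches the 105.19 transactions per second reported by the test tool, which includes minor timing overhead at the run boundaries.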
Helm charts
The Threat View zip file included in the product package contains a Helm chart that automatically applies the following defaults:

- A minimum of 2 pods per service, for production stability.
- A maximum of 1 pod for services that do not support horizontal scaling. This applies to the Data Processor and Auditlogger Worker services.
- Requested CPU core-equivalents and memory capacities for each service, sized to match each service's performance profile:
  - The Data Processor service uses more CPU capacity than the Data Collector service and consequently requests more CPU capacity per pod.
  - The Threat View Administration Interface and the Identity Management, Tokens, and Visual Renderer services are not profiled and use a generic setting.
- Autoscaling enabled. This allows Threat View to scale up automatically to meet peak load and to scale down again during off-peak periods. The default CPU requests balance a trade-off: they keep processing of lower volumes economical while still giving the Java pods enough CPU request capacity to be compatible with standard autoscaling settings, because setting the requests too low could cause the autoscaler to trigger too often.
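As an illustration, these defaults could be expressed in a Helm values file roughly as follows. This is a hedged sketch: apart from `defaultReplicas`, which the Helm overrides section also uses, the key names are assumptions about the chart layout, not its documented schema.

```yaml
# Hypothetical sketch of the chart defaults described above.
# Only defaultReplicas is a key named elsewhere in this article;
# the remaining keys are illustrative assumptions.
defaultReplicas: 2        # minimum of 2 pods per service for production stability

dataProcessor:
  replicas: 1             # no horizontal scaling support, so capped at 1 pod
  resources:
    requests:
      cpu: "3"            # CPU-heavy service; see the capacity table below
      memory: 1024Mi

autoscaling:
  enabled: true           # scale up at peak load, back down at off-peak load
```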
For recommendations for a more cost-effective infrastructure, see Helm overrides.
Required compute and/or memory capacities
| Service | Replicas | CPU requests per pod | CPU requests × pods | Memory requests per pod (MiB) | Memory requests × pods (MiB) |
|---|---|---|---|---|---|
| Administration Interface | 2 | 0.5 | 1 | 128 | 256 |
| Audit Logger | 2 | 1 | 2 | 760 | 1520 |
| Audit Logger Worker | 1 | 1 | 1 | 760 | 760 |
| Data Collector | 2 | 1 | 2 | 760 | 1520 |
| Data Processor | 1 | 3 | 3 | 1024 | 1024 |
| Identity Management | 2 | 1 | 2 | 760 | 1520 |
| Tokens | 2 | 1 | 2 | 760 | 1520 |
| Visual Renderer | 2 | 1 | 2 | 760 | 1520 |
| Total | 14 | | 15 | | 9640 |
Accordingly, for Kubernetes you need a worker node with 15 CPU core-equivalents and 9.6 GB of memory.
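As a quick cross-check, the node-sizing totals can be reproduced by summing the per-service requests, with figures taken directly from the table above (memory in MiB):

```python
# Recompute the totals of the capacity table above.
# Per service: (replicas, CPU requests per pod, memory requests per pod in MiB)
services = {
    "Administration Interface": (2, 0.5, 128),
    "Audit Logger":             (2, 1,   760),
    "Audit Logger Worker":      (1, 1,   760),
    "Data Collector":           (2, 1,   760),
    "Data Processor":           (1, 3,   1024),
    "Identity Management":      (2, 1,   760),
    "Tokens":                   (2, 1,   760),
    "Visual Renderer":          (2, 1,   760),
}

total_cpu = sum(replicas * cpu for replicas, cpu, _ in services.values())
total_mem_mib = sum(replicas * mem for replicas, _, mem in services.values())
print(total_cpu, total_mem_mib)  # 15.0 9640
```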
Required storage
As Amazon AWS does not provide an option to set the storage limit when creating an Amazon MQ broker for ActiveMQ, no conclusive statements can be made about required storage for ActiveMQ.
MySQL Audit database
The storage test was run with a rate of 105 requests per second and via the Event Simulator to obtain a range of different payloads to store.
| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
|---|---|---|---|---|
| audit | audit events | ~17 GB | ~527 GB | ~1581 GB |
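For rough extrapolation to other load profiles: at 105 requests per second the test generates about 9.1 million audit records per day, so ~17 GB/day corresponds to roughly 2 KiB of storage per record. This is a back-of-the-envelope estimate derived from the table above, not a measured value:

```python
# Back-of-the-envelope per-record storage estimate from the table above.
rate_rps = 105                       # sustained request rate during the test
records_per_day = rate_rps * 86_400  # 86,400 seconds per day
gb_per_day = 17                      # ~17 GB/day from the audit table

bytes_per_record = gb_per_day * 1024**3 / records_per_day
print(f"{records_per_day:,} records/day, ~{bytes_per_record / 1024:.1f} KiB/record")
# 9,072,000 records/day, ~2.0 KiB/record
```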
Postgres Events database
The storage test was run with a rate of 105 requests per second and via the Event Simulator to obtain a range of different payloads to store.
| Schema | Table | Required storage per day | Required storage per month | Required storage for 3 months |
|---|---|---|---|---|
| events | events | ~8 GB | ~244 GB | ~733 GB |
| facts | facts | ~3 GB | ~21 GB | ~21 GB |
| facts | aggregate_staging | ~3 GB | ~3 GB | ~3 GB |
| facts | facts_to_process | ~2 GB | ~2 GB | ~2 GB |
| facts | events_by_country_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_hour | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_day | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_country_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_device_model_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | events_by_os_per_month | < 1 MB | < 1 MB | < 1 MB |
| facts | device_models_dim | < 1 MB | < 1 MB | < 1 MB |
| Total | | ~23 GB | ~270 GB | ~759 GB |
Archiving
If you wish to retain data before it is deleted by the configurable cleanup job, you can configure an archiving process that runs on a regular basis.
The cleanup job is disabled by default, so it must be enabled for the archiving workflow described here. The cleanup job drops old partitions according to your configuration. For instance, if you configure the lowest possible value, 1, the cleanup job drops any partitions that are older than 1 month. This means you need to configure your archiving process to match the settings of the cleanup job, so that your data is archived before the cleanup job deletes it.
Helm overrides
The standard Threat View setup exceeds a capacity of 105 transactions per second. If required, it can be cost-optimized, for example by using cheaper infrastructure. You can set Helm overrides if you wish to run Threat View more cost-effectively:
To set Helm overrides
1. Opt out of running a replica set with 2 pods and run 1 pod per service instead. To opt out, set the `defaultReplicas` value from `2` to `1` in your Helm overrides. Note that changing from the default replica set with 2 pods to a set with 1 pod can impact the availability of the service.
2. Opt out of auditing the Data Collector service. This shrinks the auditing requirements to a micro level. To opt out, set a blank value for `AUDITLOGGER_SERVICE_URL` on `data-collector`.
3. Opt out of autoscaling, then reconfigure the resource requests of individual services to lower values.
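Put together, a minimal overrides file for these steps might look like the following sketch. The `defaultReplicas` key and the `AUDITLOGGER_SERVICE_URL` variable come from the steps above; the surrounding structure (the `data-collector.env` and `autoscaling.enabled` paths) is an assumption about the chart layout, so verify the exact key paths against your chart.

```yaml
# overrides.yaml - hypothetical sketch; verify key paths against your chart.
defaultReplicas: 1              # step 1: one pod per service (can impact availability)

data-collector:
  env:
    AUDITLOGGER_SERVICE_URL: "" # step 2: blank value disables Data Collector auditing

autoscaling:
  enabled: false                # step 3: then lower individual resource requests
```

The overrides can then be applied with a standard Helm command, for example `helm upgrade --install <release> <chart> -f overrides.yaml`.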