At Hypernode, we constantly look for opportunities to improve our platform. To prepare for the coming holiday season, we’ve been benchmarking our platform to find possible improvements to our Hypernode setup. Here is a list of changes we’ve made that significantly improve response times and reduce error rates on your Hypernode.
PHP workers to CPU core ratio
Our ratio of PHP-FPM workers to CPU cores has always been 5 PHP-FPM workers to 1 CPU core.
We noticed a recurring pattern in our benchmarks where our ratio of PHP-FPM workers turned out to be suboptimal.
Based on our benchmark results (shown in the graph above), we found that a ratio of 4 PHP-FPM workers to 1 CPU core is the most performant of all tested ratios. On average, this led to a response time that is 5.9% faster than with the previous ratio of 5 PHP-FPM workers to 1 CPU core.
The most likely explanation for this improvement is that too many processes cause the server to spend its time context switching instead of actually serving requests.
This change in the PHP-FPM worker to CPU core ratio has been live on all Hypernodes since October 27th.
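As a rough sketch, on a 4-core Hypernode the new ratio translates to 16 PHP-FPM workers. In a standard PHP-FPM pool configuration that could look like the following (the file path and the static process manager are illustrative assumptions, not our exact configuration):

; /etc/php/7.0/fpm/pool.d/www.conf (illustrative path)
; 4 CPU cores x 4 workers per core = 16 workers
pm = static
pm.max_children = 16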
CPU priority for Nginx and Varnish
We’ve noticed during our benchmarks that when we applied a high load to the Hypernode, the load of the server went up so high that it was bottlenecked on the CPU. At that point, the CPU scheduler has to decide what to spend its time on.
When your Hypernode is experiencing a high CPU load, you want to ensure that most of the requests coming to your Hypernode will still succeed. In order to achieve this, we want to deliver as many cached and static requests as possible, because these don’t impact the CPU as much as dynamic requests.
Cached and static requests come directly from Varnish and Nginx, so we want to increase the priority of these processes in the CPU scheduler. We do this by changing the CPUWeight of both Varnish and Nginx, and by using nice to set the priority of all services on the Hypernode.
Increasing CPUWeight
We’ve increased the CPUWeight of Varnish to 200 and of Nginx to 400. The default value for all services is 100.
This change means that the available CPU time, which is split up amongst all service units in a slice, is divided with a larger share going to Nginx and Varnish. We applied this change to ensure that Nginx and Varnish get a larger share of system resources, so that more requests can be served.
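For illustration, such a change could be made with a systemd drop-in per service (the file paths below are assumptions for the sake of the example, not necessarily how we roll this out on Hypernode):

# /etc/systemd/system/nginx.service.d/cpu.conf (illustrative path)
[Service]
CPUWeight=400

# /etc/systemd/system/varnish.service.d/cpu.conf (illustrative path)
[Service]
CPUWeight=200

After a systemctl daemon-reload and a restart of the services, the new weights take effect.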
You can verify this change by checking the cpu.shares value on your Hypernode:
root@wifamk-testfarzad-magweb-cmbl ~ # cat /sys/fs/cgroup/cpu,cpuacct/limited.slice/php7.0-fpm.service/cpu.shares
1024
root@wifamk-testfarzad-magweb-cmbl ~ # cat /sys/fs/cgroup/cpu,cpuacct/limited.slice/nginx.service/cpu.shares
4096
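Note that these values are reported as cpu.shares rather than as a CPUWeight: on the legacy cgroup hierarchy, systemd translates the configured weight into cpu.shares (roughly weight × 1024 / 100), which is why the default weight of 100 shows up as 1024 and Nginx’s weight of 400 shows up as 4096.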
Use nice on all managed processes to manage CPU priority
The command nice can be used to specify what priority a process should get from the Linux kernel scheduler. Processes with a higher priority are scheduled to be executed before processes with a lower priority. You can specify a value from -20 to 19 with nice, where -20 is the highest priority and 19 is the lowest. All processes default to a nice value of 0.
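As a quick example of how this works in practice (the command name and PID below are just placeholders):

nice -n 10 some-command      # start "some-command" at a lower priority
renice -n -15 -p 1234        # give the process with PID 1234 a higher priority (requires root)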
We want to ensure that Nginx and Varnish get the most priority when there is a high load on the shop, so we specify the CPU priority with the following nice values:
- Nginx and Varnish get a nice value of -15
- Other important services such as PHP, MySQL, SSH, Postfix, etc. keep the default nice value of 0
- Observing services such as telegraf get a nice value of 10
- Log shipping services such as log-courier get a nice value of 15
- Services that are used irregularly, like sftpd, fail2ban, ntp and nginx-config-reloader, get a nice value of 19
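For managed services, a persistent nice value can be set in the service definition itself. A minimal sketch using a systemd drop-in (the path is illustrative, not necessarily how this is configured on Hypernode):

# /etc/systemd/system/nginx.service.d/priority.conf (illustrative path)
[Service]
Nice=-15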
You can see the nice value using either top or htop on your Hypernode; check the fourth column, NI, to see the corresponding value.
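Alternatively, a one-liner like the following shows the nice values of the relevant processes directly:

ps -eo pid,ni,comm | grep -E 'nginx|varnish|php-fpm'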
Caching Hypernode monitoring requests
Three of our heartbeat servers make a request to your Hypernode every 2 minutes. This runs a series of checks, which includes checking whether all critical services like Nginx are still running, checking the available space left on your root disk, and so on. Read this article for a detailed explanation of what our monitoring does.
This monitoring is critical for our platform to ensure our systems can act upon any issues that are found on your Hypernode. However, when your shop is in a critical state where all resources are needed to serve requests to your users, we don’t want to cause any more load on your Hypernode than necessary.
We have therefore started caching the results of these system checks. When our first heartbeat server checks the status of your Hypernode, it does a check-up on all system services as usual. The second and third heartbeats, however, receive a cached result that is valid for one minute. This prevents duplicate system checks within the span of a minute, while still verifying that Nginx and PHP are up and running as intended (because these services still have to respond to the monitoring requests).