In this release we have made it possible for our support to configure the innodb_buffer_pool_size on request in a way that will persist between plan upgrades and downgrades. We have also added cron jitter to dampen CPU steal spikes in public and private clouds, and made some other changes.

The innodb_buffer_pool_size

Modifying the innodb_buffer_pool_size is a request we get quite often. Increasing the innodb_buffer_pool_size can result in great performance improvements if your database is doing a lot of disk IO for data that could be served directly from memory if it were cached. However, on Hypernode we have defaults for these settings based on the size of the instance. While these defaults are good for most average shops, the ‘optimal’ innodb_buffer_pool_size is not so much bound to the size of the instance and the available memory as to the size of the dataset in your database.

For this reason it can make sense to increase (or decrease) the pool size to something that fits your application better. However, tweaking this setting should be done carefully, because doing it incorrectly might jeopardize the balance of memory with other services. For example, if the Hypernode instance also runs memory-hungry services like ElasticSearch, bumping this setting too high might result in unstable situations.

A good balance of memory on a Hypernode is core to the stability of the shop that runs on it. There are a lot of factors that influence what constitutes a ‘good’ and ‘safe’ value for the innodb_buffer_pool_size. On Hypernode, things that should be taken into consideration when choosing this buffer size include the fact that Hypernodes currently have swap disabled and use an elaborate cgroup configuration for selective OOM killing, so that servers can quickly kill and recover in situations of memory pressure instead of snowballing and grinding to a halt while paging the wrong memory to disk. Another consideration is that a lot of shops (still) use the MyISAM storage engine instead of only InnoDB, which complicates finding the best value for this setting.

“If you have fair amount of MyISAM, Archive, PBXT, Falcon or other storage engines then you will get into complex balancing game besides considering all these factors.”

A quote from this interesting blog post by Percona.

While taking all these factors into consideration, we realize that even though Hypernode is a standardized platform, there are (many) valid scenarios in which this value should be allowed to be tweaked. Many nodes have underutilized memory and a heavy database, and in those cases it makes perfect sense to increase the pool size. In the context of our highly automated hosting platform, the features of our automation come at the cost of having to put constraints on what is configurable by users and what is not. A lot of these settings might seem trivial (and they would be on any bespoke hosting setup), but allowing too much flexibility could have long-term ramifications for what features we can develop and support in the future.

A simple example: if we allow values that (over time) put the node in an impossible situation where a service is repeatedly killed or crashes due to memory pressure, our autohealing or upgrade/downgrade processes could fail and cause downtime. At the scale we’re operating at, it would be impossible to tell whether such downtime was caused by a user tweaking their settings or by a platform issue. For this reason this setting will now be configurable, but only by our support upon request instead of directly by the end user through the API like most other settings.

Note: if this value is set to more than 70% of the available system memory, it will be capped at 70%. This is to prevent plan downgrades from resulting in a buffer value higher than the memory available on the system.
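To illustrate the capping behaviour, here is a minimal sketch (in Python, purely illustrative; the function name and exact mechanics are assumptions, not our actual implementation):

def effective_buffer_pool_size(requested_bytes, system_memory_bytes):
    # Cap the requested innodb_buffer_pool_size at 70% of system memory,
    # so a plan downgrade can never leave the buffer larger than the node's memory.
    cap = int(system_memory_bytes * 0.70)
    return min(requested_bytes, cap)

For example, a request for 12 GB on a 16 GB instance would be reduced to roughly 11.2 GB, and after a downgrade to an 8 GB plan the effective value would drop to about 5.6 GB.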

Cron jitter

Hypernodes can be hosted both in public clouds (AWS / DigitalOcean) and in a private cloud like our Combell OpenStack. Cloud and VPS hosting offered the promise of an isolated environment where you wouldn’t be bothered by neighboring sites anymore, like on the shared hosting of the past. While this is true for most facets of the hosting experience, under all the abstraction there is still a big computer somewhere running all these (virtual) servers. We have noticed that, depending on the mix of applications hosted on individual hypervisors at cloud providers, with some bad luck it is very possible that Magento processes get caught in a thundering herd if too many of the VMs on the same hypervisor attempt to perform a task at the same time.

The biggest offender here is the minutely cron. Regardless of the application, a lot of VMs use some mechanism that starts a process on the minute or every five minutes using cron. So while on a public cloud a hypervisor might mix VMs of all types of applications, that doesn’t mean they will all be doing different types of work at different times. And especially if a large number of Magento VMs end up on the same hypervisor, trying to run the Magento cron exactly on :00 every minute can cause a slight performance impact during the first few seconds of every minute, depending on the workload.

To mitigate the effects of your cloud instance being impacted by other cloud instances requesting CPU time at exactly the same moment as your shop, we have developed an extension to our automated flockerizer cron deadlock protection. By default it will make sure that all crons you configured to run every minute won’t necessarily run ‘on the minute’, but ‘every minute’ at a constant offset.

It works as follows: our cron flockerizer previously would add a ‘flock’ around your cron, preventing the process from stacking up if it has not finished before it attempts to run again. For the types of shops we run this is a great default, as it prevents unwanted situations where crons stack up or processes end up deadlocking each other because they are not safe to run twice at the same time. Cron tasks for which this does not make sense can be explicitly excluded by adding a #noflock comment to the end of the cron. A cron that is often excluded like this is the Magento 2 event listener (queue consumer) for asynchronous operations:

* * * * * php /data/web/magento2/bin/magento cron:run --group=consumers #noflock


This cron will then run as written, instead of as a flocked cron like:

flock -n ~/.cron-foo_bar.lock php /data/web/magento2/bin/magento cron:run --group=foo_bar


The new jitter functionality works in the same way, but instead of only adding a flock it will now also add a sleep to make sure periodic cron tasks are deferred. Each cron will sleep for a static number of seconds between 0 and 30 before starting. This number is derived from a hash of the name of your app and will therefore remain consistent between plan upgrades and downgrades. For most crons it will not matter whether the task is performed ‘on the minute’ or just ‘every minute’; it will simply run every minute at, for example, :15 instead of :00. If this is an issue for your cron and it does need to run exactly ‘on’ the minute, it can be excluded by adding a #nosleep comment, just like #noflock before.
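To give an idea of how such a stable offset can be derived from the app name, here is a minimal Python sketch (the function name and the md5-based scheme are illustrative assumptions, not necessarily our exact implementation):

import hashlib

def jitter_offset(app_name, max_sleep=30):
    # Hash the app name so the same app always gets the same offset,
    # even after a plan upgrade or downgrade moves it to another node.
    digest = hashlib.md5(app_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % (max_sleep + 1)

Because the offset depends only on the app name, an app that sleeps for, say, 22 seconds today will still sleep for 22 seconds after moving to a bigger or smaller plan.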

If a cron is changed to include the sleep it will look something like:

sleep 22; /usr/bin/php7.2 /data/web/magento2/bin/magento cron:run 2>&1 | grep -v "Ran jobs by schedule" >> /data/web/magento2/var/log/magento.cron.log # noflock


We are going to roll out this functionality across the entire Hypernode platform over the course of the coming week(s). We expect this to have a slight positive impact on performance, especially regarding spikes in response time and potential CPU steal issues during busy seconds on the hypervisor, such as the few seconds following every five-minute :00 moment.

Other changes