In this release we add a command that can be used to start processes that will never be OOM-killed, even when the system is completely out of memory. When a Hypernode runs low on memory and a process lays claim to previously allocated pages while there is no more RAM available, the Linux kernel uses a heuristic to determine which process in the cgroup should be killed to free up memory. The new hypernode-oom-protect command allows processes to be spawned that will always be skipped by the OOM-killer, so you can prevent specified scripts from being killed before they complete when they would otherwise be Out Of Memory killed.

Prioritize important processes

On Hypernode we have a long history of implementing tweaks to fine-tune the memory allocation and prioritization between user scripts, web services like NGINX, PHP and MySQL, and system-level processes like sshd and cron. When deciding how to balance the RAM footprint of the server, we have always held webshop responsiveness and the capability to process orders in the highest regard.

Our policy is to attempt to ensure that no visitor pattern, server-side script, process or cron can cause the website to become inaccessible by reserving memory that would otherwise be used to serve requests from real customers. The second consideration is ensuring that our auto-healing systems and other root-level automation can always interact with the system, regardless of memory pressure in the app user space.

Doing so means we can always perform operations, patches and optimizations on all servers on our platform without having to worry about creating configuration drift because some of the nodes in the network were too memory-constrained for the maintenance, update or auto-healing logic to run.

But over time we have learned from various customers that in some cases they would in fact prefer to risk impairing the webshop’s ability to serve requests if that means the process they need to execute can complete without being killed to make room for otherwise mission-critical services.

Examples of such situations are memory-heavy product import scripts, inventory synchronization with external systems, and periodic server-side tasks that are important but disrupt the regular memory balance. A process that runs intermittently but consumes a large amount of memory risks being culled to make room for the services required to run the site if it oversteps a boundary and pushes the system into a memory-constrained state.

Today we add the functionality to deliberately run processes that are allowed to risk impacting otherwise ‘mission-critical’ services by marking them as exempt from any memory killing. This means, for example, that it is possible to run memory-heavy PHP command-line processes; when the system then runs out of memory, anything except the processes marked as protected can be killed, including the services required to serve the website.

Example

To protect a command from the OOM-killer and give it maximum priority over all other services, you can simply ‘wrap’ the command with the hypernode-oom-protect command. This will start the specified command as normal, except that its oom_score_adj will be set to -1000.


hypernode-oom-protect php /data/web/memory_hungry_script.php

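A quick way to verify from another shell that the protection is in place is to read the oom_score_adj of the running script from /proc. A minimal check, assuming the example script above is running and is the newest process matching that name:

cat /proc/$(pgrep -nf memory_hungry_script.php)/oom_score_adj

This should print -1000 for the protected php process.
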
Because the MySQL process is also marked as unkillable by default, this makes it possible to run scripts that interact directly with the database (for example using PDO) and are unhampered by the memory state of the system.

Mind the parent process

Keep in mind that while the process you start (and all its children) will be protected, any parent process will not be. For example, if you start the script from an SSH bash shell, the shell itself might still be killed, but the process will continue to run in the background. This is because a process tree like the one below is created, and only the started command at the end of the tree is protected.


\_ sshd: app@pts/0
    \_ -bash
        \_ hypernode-oom-protect /usr/lib/python3/dist-packages/kamikaze3/privileged/oom_protect.py php /data/web/memory_hungry_script.php
            \_ /usr/bin/sudo /usr/sbin/hypernode_oom_protect_wrapper
                \_ /bin/su --login -c /tmp/hypernode_oom_protect app
                    \_ /bin/bash /tmp/hypernode_oom_protect
                        \_ php /data/web/memory_hungry_script.php

So if the output of your script is important, make sure to write it to a log file, or start the command in something you can later reconnect to, like a screen session:

hypernode-oom-protect /usr/bin/screen -d -m php /data/web/memory_hungry_script.php

You can then attach to the screen later with screen -x.
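
If you just want to capture the output in a file instead of keeping an interactive session around, a minimal sketch could look like this (the log path is only an example):

nohup hypernode-oom-protect php /data/web/memory_hungry_script.php > /data/web/memory_hungry_script.log 2>&1 &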

There is still a finite amount of memory

Marking processes as unkillable does not magically add more memory, and it will also not prevent memory allocation errors like PHP Fatal error: Out of memory (allocated 1234) (tried to allocate 12345 bytes). It serves the nice purpose of taming the OOM-killer, but ultimately it only shifts the problem slightly. If you persistently have problems with memory (and they are not caused by a structural misallocation of resources), the real solution is still to upgrade to a bigger plan.

When the cgroup in which the app user processes run is completely out of memory and the only processes left are marked as unkillable, processes will start to hang in D-state until more memory becomes available.

If all the processes that remain are marked with OOM_SCORE_ADJ_MIN, the behavior of the cgroup is effectively the same as if the OOM-killer had been completely disabled. Luckily, unlike processes that enter uninterruptible sleep because of IO, memory-starved processes in D-state can still be killed (by the outer cgroup).
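
If you suspect processes are hanging like this, a quick way to spot them from a shell is to filter the process list for the D state flag, for example:

ps -eo pid,stat,cmd | awk '$2 ~ /^D/'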

How we maintain stability with auto healing

Because all these systems operate in a sub-slice of the VM, we have the guarantee of still being able to perform root-level tasks even when memory pressure reaches the point where the app user space services start grinding to a halt. With the new freedom to explicitly mark processes with the highest memory priority, it is possible to make conscious trade-offs in the stability of your site. But it also makes it possible to mess things up pretty badly.

To rein in the consequences of possibly unexpected or unintended service downtime from allowing users to give select processes access to unbounded memory use, our auto-healing mechanisms will still kill the protected app user processes if our monitoring detects a prolonged downtime of the web services.

For example, when you run a script with hypernode-oom-protect and the process attempts to consume all the memory on the system, and by doing so kills php-fpm, nginx and other mission-critical services, our monitoring will give it a couple of minutes before stepping in, because that might be what you intended (your script is presumably so important that it has priority over the availability of the webshop).

But if the downtime then lasts longer than a couple of minutes (perhaps your protected script got stuck or is blocked on allocating more memory than is physically available), the auto healing will kill the process and ensure that the services are started once again. If that window is not enough for the script to complete, consider running the task at night: the delay before the auto healing kicks in is longer at night than during the day, due to the common pattern of heavy nightly Magento crons.

Caveats

Keep in mind that this only protects processes spawned with hypernode-oom-protect. It is not possible to mark existing processes as protected. This means that if you run an un-OOM-killable process with the command and that process interacts with the php-fpm daemon (like POSTing to the Magento REST API), the php-fpm workers executing the web calls might still be killed to maintain stability under sufficient memory pressure. As mentioned in the Example, this should not be a problem when you interface with MySQL directly.

This new tool will be deployed on all Hypernodes over the course of this week.

Other changes

  • We made some tweaks to our logrotate policies to prevent root disk space from temporarily not being released in some cases after logs had already been deleted
  • We installed httpie on all Hypernodes
  • We’ll also install webp on all Hypernodes