Over the last couple of weeks we have mostly focused on improving our back-end logic for dealing with external API failures and on implementing extra tests for our automation. However, a few of the things we did might be interesting for Hypernode users, so here is a short summary.
- OpenSSL CVE-2016-6304
Early last week a new OpenSSL CVE was released. Because this advisory concerns a denial-of-service vulnerability, its potential impact on the Hypernode platform was very low: denial of service is less of an issue in isolated environments than on a shared hosting solution. Either way, with Hypernode’s continuous integration setup the patched packages were deployed as soon as they were made available.
Unfortunately, after deploying this patch our automated testing caught an error when installing the Magento 1 sample data shop with magerun. The OpenSSL patch was incomplete and caused openssl_x509_parse to segfault PHP when parsing certain SSL certificates. This crashed magerun like so:
$ /usr/local/bin/n98-magerun install --installSampleData='yes' ...
Magento Installation
- Installing byte-mag-mirror-latest (1.9)
Loading from cache
- Installing sample-data-1.9.1.0 (1.9.1.0)
Downloading: 0%Segmentation fault (core dumped)
It turned out that magerun downloads the sample data from SourceForge, which just happened to have a certificate affected by this problem. Luckily this regression was fixed very quickly upstream, and again we were able to deploy the fix as soon as the packages were released.
- SSH welcome message displays notification after OOM-kill
Some users pointed out that when they get kicked out of their shell session it is not always obvious that this was due to an out-of-memory emergency process cleanup. On Hypernode a daemon runs in the background and listens on a special control file descriptor provided by the cgroup notification API. This daemon, in combination with a rigorously tuned cgroup configuration, gives us very strict control over which processes to prioritize in low-memory situations.
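To give an idea of what such a listener can look like: under cgroup v1, a process creates an eventfd, opens a control file such as memory.oom_control, and writes both descriptor numbers to cgroup.event_control. The sketch below assumes that v1 memory controller interface and Python 3.10+ for os.eventfd. The function names and the choice of control file are illustrative assumptions, not the actual Hypernode daemon.

```python
import os
import struct


def registration_line(event_fd: int, control_fd: int, args: str = "") -> str:
    """Build the line written to cgroup.event_control (cgroup v1 format).

    The format is "<event_fd> <control_fd> [args]", where args depends on the
    control file (for example a level name for memory.pressure_level).
    """
    return f"{event_fd} {control_fd} {args}".strip()


def wait_for_oom_event(cgroup_dir: str) -> int:
    """Block until the kernel signals an OOM event for the given v1 cgroup.

    cgroup_dir is a hypothetical path to a memory cgroup directory.
    Returns the eventfd counter, i.e. the number of events since the last read.
    """
    event_fd = os.eventfd(0)  # Linux only, Python 3.10+
    control_fd = os.open(os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(registration_line(event_fd, control_fd))
        # Reading an eventfd blocks until its counter becomes non-zero.
        return struct.unpack("Q", os.read(event_fd, 8))[0]
    finally:
        os.close(control_fd)
        os.close(event_fd)
```

A daemon that wants to act before the kernel's own OOM killer would more likely register on a threshold or pressure file and pass the extra argument, but the registration mechanism is the same.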
Normally the kernel would simply start killing processes once a process touches a page of memory that it was previously allocated virtually and that can no longer be backed by physical memory. For high performance Magento hosting this is troublesome, because the heuristic that decides which processes to kill does not take into account anything besides memory usage. Ideally you want to gracefully degrade the services on the machine instead of killing running programs seemingly at random. Because some processes are less important than others, we preemptively kill various user PIDs, including SSH connections if necessary, in an attempt to save the system from going down.
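The difference between killing by memory usage alone and killing by importance can be illustrated with a small sketch. Everything in it is hypothetical: the Proc type, the priority numbers and the process names are made up for the example and do not reflect the real Hypernode configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    pid: int
    name: str
    rss_kb: int    # resident memory in KiB
    priority: int  # hypothetical scale: lower means less important


def pick_victims(procs, needed_kb):
    """Choose processes to kill until at least needed_kb would be freed.

    The least important processes go first; within the same priority the
    largest process goes first so that fewer processes have to die.
    """
    victims = []
    freed = 0
    for proc in sorted(procs, key=lambda p: (p.priority, -p.rss_kb)):
        if freed >= needed_kb:
            break
        victims.append(proc)
        freed += proc.rss_kb
    return victims


procs = [
    Proc(1, "mysqld", 900_000, priority=9),
    Proc(2, "ssh-session", 5_000, priority=1),
    Proc(3, "user-cron-job", 200_000, priority=2),
]
print([p.name for p in pick_victims(procs, needed_kb=100_000)])
```

A purely memory-based heuristic would pick the database first here, while the priority-based selection sacrifices the shell session and the cron job and leaves the shop's database running.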
This release includes a change that makes a best-effort attempt to touch the file /data/web/.last_oomkilled in the user's home directory. This will not always succeed due to the dire memory situation, but when it does, the user is told at their next login why they got kicked out of the shell.
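The marker-file approach can be sketched roughly as follows. Only the path /data/web/.last_oomkilled comes from the change described above. The function names, the wording of the notice and the choice to remove the marker once it has been shown are assumptions made for this example.

```python
import time
from pathlib import Path

MARKER = Path("/data/web/.last_oomkilled")


def mark_oom_kill(marker: Path = MARKER) -> bool:
    """Best-effort touch of the marker file; never raises on failure."""
    try:
        marker.touch()
        return True
    except (OSError, MemoryError):
        # Under severe memory pressure even this can fail; give up quietly.
        return False


def login_notice(marker: Path = MARKER):
    """Return a message for the login banner, or None if there is no marker."""
    try:
        killed_at = marker.stat().st_mtime
    except OSError:
        return None
    # Assumption: clear the marker so the notice is shown only once.
    marker.unlink(missing_ok=True)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(killed_at))
    return f"Your previous session was ended by an out-of-memory cleanup at {stamp}."
```

The writer side does as little work as possible because it runs at the moment memory is scarcest. All formatting is deferred to the next login, when the system has recovered.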