Prioritising customer changes in our backend system

In this release we have made a big change to our backend system to reduce the time it takes for customer changes to propagate throughout the platform while system jobs are running. Our automated systems periodically perform operations on all Hypernodes, such as installing updates, running backups and executing various eventual consistency assertions, for example verifying that all monitoring checks are in place. Until they complete, these system jobs can block customer actions like adding SSL certificates, adding SSH keys or switching the PHP version.

This week we deployed a new version of our worker system into production that takes job priorities into account. With this new feature some of those system operations (like software updates) should no longer block customer-initiated actions as often, which should decrease the time it takes for changes made in the service panel to land on the node when the system is performing background jobs.
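
To illustrate the general idea of priority-aware job processing, here is a minimal sketch in Python. It is not our actual worker implementation: the priority levels, the PriorityJobQueue class and the job names are all hypothetical and only show how customer-initiated jobs can overtake system jobs that were queued earlier.

```python
import heapq
import itertools

# Hypothetical priority levels for illustration; lower numbers are served first.
CUSTOMER = 0
SYSTEM = 10


class PriorityJobQueue:
    """Hands out the most urgent job first, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker to preserve insertion order.
        self._counter = itertools.count()

    def put(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def get(self):
        _priority, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return job names in the order a single worker would process them."""
    order = []
    while len(queue):
        order.append(queue.get())
    return order


queue = PriorityJobQueue()
queue.put(SYSTEM, "software update")
queue.put(SYSTEM, "backup")
queue.put(CUSTOMER, "switch PHP version")
queue.put(CUSTOMER, "add SSH key")

processing_order = drain(queue)
print(processing_order)
```

In this sketch the two customer jobs are processed before the system jobs even though they were queued later, while jobs of equal priority keep their original order.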

Whitelisting the Pingdom user agent

In other news, we no longer exclude Pingdom from the default rate limit on Hypernode. For our internal monitoring we have replaced Pingdom with an Elasticsearch-backed alerting system named ElastAlert. If you still use Pingdom yourself for application-level monitoring you can add the user agent back into the whitelist by following this article.

# before
$ grep pingdom /etc/nginx/nginx.conf | head -n 1
        ~*(google|bing|pingdom|uptimerobot|shoppimon|facebookexternal|monitis.com|Zend_Http_Client|magereport.com|SendCloud/|Adyen) '';

# after
$ grep heartbeat /etc/nginx/nginx.conf | head -n 1
        ~*(google|bing|heartbeat|uptimerobot|shoppimon|facebookexternal|monitis.com|Zend_Http_Client|magereport.com|SendCloud/|Adyen) '';
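
If you want to check which user agents the old and new patterns match, the following Python sketch may help. The alternations are copied from the nginx.conf lines above, and re.IGNORECASE stands in for the case-insensitive ~* operator in NGINX. The is_exempt helper and the Pingdom user agent string are assumptions for illustration, and the sketch assumes that a match on this line means the client is excluded from the rate limit.

```python
import re

# Alternations copied from the nginx.conf lines shown above.
# nginx's "~*" operator performs a case-insensitive regex match.
BEFORE = re.compile(
    r"(google|bing|pingdom|uptimerobot|shoppimon|facebookexternal|monitis.com"
    r"|Zend_Http_Client|magereport.com|SendCloud/|Adyen)",
    re.IGNORECASE,
)
AFTER = re.compile(
    r"(google|bing|heartbeat|uptimerobot|shoppimon|facebookexternal|monitis.com"
    r"|Zend_Http_Client|magereport.com|SendCloud/|Adyen)",
    re.IGNORECASE,
)

# Assumed example of a Pingdom user agent; check your own access logs
# for the exact string your checks send.
PINGDOM_UA = "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)"


def is_exempt(pattern, user_agent):
    """Hypothetical helper: True if the user agent matches the whitelist pattern."""
    return pattern.search(user_agent) is not None


print("before:", is_exempt(BEFORE, PINGDOM_UA))
print("after: ", is_exempt(AFTER, PINGDOM_UA))
```

With the old pattern the Pingdom user agent matches, and with the new pattern it does not, which is why you need to whitelist it yourself if you still rely on it.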

This week we noticed an uptick in brute force attacks originating from IPs based in China and Hong Kong. In this release we have also expanded our country-based automated bot blocking to include Hong Kong, so that it applies when requests that look like an attack are detected. If our automated systems detect such an attack you will receive an email, and an NGINX configuration file like this will be created to fend off the attack:

$ cat /data/web/nginx/server.block_hk 
#Placed by Hypernode automation on 2018-08-20 14:43
if ($geoip_country_code = HK) { return 403; }

Other changes

  • Improved auto restart for when MySQL becomes stuck in a CPU loop due to repeated crashes in low memory situations
  • OCSP cache warming will now happen in an asynchronous process to make NGINX restarting and reloading faster
  • OCSP cache warming will only happen when there is enough memory available on the system, so a Hypernode will degrade more gracefully in case of insufficient resources
  • We have increased the throughput of our job processing system tenfold by offloading more computation to worker processes. This will decrease the time it takes to perform platform wide updates.