Seeing high CPU usage on a Digital Ocean Droplet? Running a mail service? Check this first!
We're big fans of Digital Ocean and have been for years. They offer a bare-bones service that lets developers compete with the huge hosting corporations, and they let you make mistakes. DO expects you to learn your craft and troubleshoot things yourself. You can try it yourself, if you dare to wield ultimate power.
However, in 2022 we were starting to have serious doubts (even considering migrating to another provider) due to yet another in a series of high CPU events that plague one client every six months or so. As an aside, that's great for billable hours, but it's morally wrong. Grip Fast Information Services & Technology is your partner, not a third-party vampire-vendor that only seeks to suck the lifeblood from your hard-earned business. Your problem is our problem, and we take performance personally.
As you can see in the graph below, CPU usage began to rise dramatically around May 25th and kept climbing toward a peak around May 31st.
On May 31st, suspecting a "soft" DDoS attack, we dialed back the Bing search bot's crawl frequency (through Bing Webmaster Tools) and outright blacklisted the official Bing search bot IP ranges. Things were a little better, but not ideal. Investigations continued: we tweaked system logging and logrotate, and started reviewing all the services.
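For reference, blocking a crawler's address ranges at the firewall on an Ubuntu Droplet looks roughly like this. A minimal sketch using ufw; the CIDR ranges below are illustrative placeholders, so verify them against Bing's published bingbot IP list before blocking anything:

# Deny inbound traffic from a search bot's IP ranges (example ranges
# only; confirm against the current published bingbot list).
sudo ufw deny from 157.55.39.0/24 to any
sudo ufw deny from 207.46.13.0/24 to any

# Confirm the rules took effect.
sudo ufw status numbered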
During our research, we integrated a neat tool called New Relic. This helped prove that the CPU usage was not being caused by MySQL (which top and htop hinted at in the terminal) or by the Nginx-related issues most guides online discuss.
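If you want to run the same check yourself, a couple of stock commands will show which processes are actually eating the CPU (a generic sketch, not our exact session):

# List the top CPU consumers, highest first.
ps aux --sort=-%cpu | head -n 10

# Or grab a single non-interactive sample from top.
top -b -n 1 | head -n 20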
Come June 4, 2022, the Ubuntu server in question went super-critical. PHP-FPM slowed and some users started to see delays, but the server never actually stopped. Many services were restarted during this time as we debugged.
On June 6th, we discovered a correlation between the CPU load and our email service (Postfix), so we shut it down, mass-deleted the mail queue and observed. Things got better again, down to around 50%, but still no smoking gun. At the end of that day, the server lost the battle and hit 100% CPU usage. Again, the server never stopped functioning, though services were hit-and-miss for some users while we actively mitigated and managed processes.
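The shutdown-and-purge step used standard Postfix tooling; note that postsuper -d ALL deletes every queued message, so only run it when you're sure the queue is junk:

# Check how much mail is sitting in the queue (mailq works too).
postqueue -p | tail -n 1

# Stop Postfix and wipe the entire queue.
sudo systemctl stop postfix
sudo postsuper -d ALL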
On June 7th we continued to monitor all the logs, the mail queue, the bots and so on. It didn't really make sense: nothing in the logs pointed clearly at the problem.
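For anyone retracing these steps on Ubuntu, Postfix logs to /var/log/mail.log by default, so the monitoring boiled down to something like this sketch:

# Follow mail activity live.
sudo tail -f /var/log/mail.log

# Count deferred deliveries, a telltale sign of a backed-up queue.
sudo grep -c 'status=deferred' /var/log/mail.log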
Then we noticed something on line 67 of the Postfix configuration file main.cf, located in the /etc/postfix directory:
mydestination = localhost.yourdomain.com, localhost, yourdomain.com
If you are a Postfix expert, you'll notice that the ".com" should not be present at the end of that line, as it is already included in the first part of the line. The duplication caused a huge mail queue and a build-up of logs, which in turn caused the excessive CPU usage on the Droplet.
At that point, all that was needed was to remove the ".com", save the config file, restart the Postfix service and then clear the remaining mail queue. The server's CPU usage dropped back to normal and, as shown in the graph, has stayed at a normalized 20-25% utilization ever since.
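Put together, the cleanup was roughly the following (standard Postfix commands; the corrected line simply reflects the fix described above):

# /etc/postfix/main.cf, line 67, after removing the trailing ".com":
# mydestination = localhost.yourdomain.com, localhost, yourdomain

# Sanity-check the configuration and confirm the new value.
sudo postfix check
postconf mydestination

# Apply the change, then purge what's left of the queue.
sudo systemctl restart postfix
sudo postsuper -d ALL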
So the moral of the story is this: double-check your configuration properties and make minute, granular, documented changes. Failing that, give us a call at (559) 242-6647, send us an email or send an SMS now so we can help!