Is your server's CPU usage sky-high and you can't pin down the cause? Bots and/or spiders are probably eating your processing power. Here's how to [mostly] stop them.


Look at this image, taken from a live server hosted by our friends at Digital Ocean. The blue line indicates the total CPU usage on the virtual server. It is maxing out and at times exceeding capacity from about 7:50 am to 11:00 am. You can see during that same time period a very large dip around 9:00 am. This is most likely a reboot.


 

What does this mean ultimately? Your Nginx service will crash because too many (www-data) web requests are eating up all available resources, and your users will probably get an Nginx error message or timeout errors. Put simply, bots have crashed your site, DDoS style! It could also be unchecked SSH attempts (we'll cover that with Fail2Ban later)...

If you see this situation (and you are using Ubuntu), take a look at your Nginx access log file, located at /var/log/nginx/access.log. Look for repeated requests from the same IP with the same user agent. Hundreds of requests with the same time stamp are red flags. You know your typical hourly or daily usage (how many visitors to your site per day); a huge increase points to a script or automated attack. A good rule of thumb: for every 100 humans visiting your website, there are probably 2 to 3 times as many bots drawing on the same finite resource pool.
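If scrolling through the log by eye gets tedious, a short script can do the counting. The following is a minimal sketch, assuming the log uses Nginx's default "combined" format; the function names count_clients and report are made up for this example and are not part of any tool.

```python
import re
from collections import Counter

# Assumed layout (default "combined" format):
# IP - user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_clients(lines):
    """Count requests per (IP, user agent) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match["ip"], match["agent"])] += 1
    return counts


def report(path="/var/log/nginx/access.log", top=10):
    """Print the busiest (IP, user agent) pairs found in the log file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for (ip, agent), hits in count_clients(handle).most_common(top):
            print(f"{hits:>8}  {ip}  {agent}")
```

Calling report() on the server (with read access to the log) lists the ten noisiest clients. A single IP and user agent with thousands of hits stands out immediately.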

For our example, there were approximately 26,000 requests with this IP and header in just a few hours:

46.229.168.65 - - [01/Dec/2016:01:26:10 -0800] "GET /yourpublicdirectory/somecomponent/examplepage HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; SemrushBot/1.1~bl; +http://www.semrush.com/bot.html)"

Look closely and you can see the 403 status code (since we already blocked these a-holes before). If you are using a CMS WAF tool, be sure to block the offending IP. If you are not using a CMS, be sure to take the required steps to manually block the IP in a blacklist file that your Nginx configuration points to.
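For the manual route, the blacklist file is just a list of deny lines, one per address, using Nginx's deny directive. Here is a minimal sketch that turns a list of offending IPs into that text. The function name render_blocklist is invented for this example, and where you save the output and how your configuration includes it depends on your own setup.

```python
import ipaddress


def render_blocklist(addresses):
    """Return one nginx `deny` line per unique address or CIDR range.

    Raises ValueError if an entry is not a valid IP address or network.
    """
    networks = set()
    for raw in addresses:
        # strict=False accepts single IPs as well as ranges like 46.229.168.0/24
        networks.add(ipaddress.ip_network(raw.strip(), strict=False))

    lines = []
    # Sort by IP version first so IPv4 and IPv6 entries are never compared directly
    for net in sorted(networks, key=lambda n: (n.version, n)):
        # A full-length prefix means a single host; write it without the /32 or /128
        single_host = net.prefixlen == net.max_prefixlen
        target = net.network_address if single_host else net
        lines.append(f"deny {target};\n")
    return "".join(lines)


print(render_blocklist(["46.229.168.65"]), end="")
```

Validating entries with the ipaddress module first means a typo is caught in the script rather than by Nginx refusing to reload.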

There is a pretty good guide over at nixCraft that can help you get started, but be warned: IP addresses can change frequently, since most bad bots are compromised servers, insecure IoT devices, etc. Once widely discovered, they move on to the next IP and the next group of victims.

You should also try to block the bots by their header information (also somewhat porous, for the same reasons IP blocking isn't 100% effective). Add this to your Nginx configuration file in your main location / { ... } block:

# STOP BOTS AND SPIDERS WWW-DATA TRAFFIC
if ($http_user_agent ~* (libwww|Wget|LWP|damnBot|BBBike|java|spider|crawl|^Sem|SemrushBot|^spider|^crawler|^web|^BaiDuSpider|^Yandex|^Exabot|^MJ12bot|^Java|^Twiceler|^Baidu|^Ahrefs) ) {
    return 403;
}
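Before reloading Nginx, it can help to check which user agents the pattern above actually catches. The sketch below copies the same alternation into Python's re module, with re.IGNORECASE standing in for the ~* operator. It is only an approximation of how Nginx evaluates the rule, and the sample user agent strings are illustrative.

```python
import re

# Same alternation as the nginx rule; re.IGNORECASE mirrors the ~* operator
BLOCKED = re.compile(
    r"(libwww|Wget|LWP|damnBot|BBBike|java|spider|crawl|^Sem|SemrushBot|^spider"
    r"|^crawler|^web|^BaiDuSpider|^Yandex|^Exabot|^MJ12bot|^Java|^Twiceler"
    r"|^Baidu|^Ahrefs)",
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the pattern matches anywhere in the user agent string."""
    return BLOCKED.search(user_agent) is not None


# Illustrative user agent strings, not taken from a real log
SAMPLES = [
    "Mozilla/5.0 (compatible; SemrushBot/1.1~bl; +http://www.semrush.com/bot.html)",
    "Wget/1.21.2",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
]

for agent in SAMPLES:
    print("403" if is_blocked(agent) else "ok ", agent)
```

One thing this makes visible: entries that start with ^ only match when the user agent begins with that word. A bot that announces itself as "Mozilla/5.0 (compatible; YandexBot/..." slips past ^Yandex, so if you want to catch it you would likely need to drop the ^ from that entry.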

Now, some bots are OK, like Google, Bing and DuckDuckGo. Those bots come through, scrape public folders, obey your robots.txt rules and move on until the next time. (A few in this explicit list might be safe-ish bots, but why serve them if your target audience is only in Europe or only in the United States?) Others do the opposite: they ignore your rules, try to be as invasive as possible, look for weaknesses and steal your content for valuable data like email addresses, passwords and other confidential information.

Tired of managing these things? Does all this talk of configuration look like Greek to you? gripfastistech.com can help! Simply email us or leave a comment; we'd love to manage your project or website, and for those who are afraid of commitment, we also do consultations.

 

About the Writer
Author: Chris Lessley
A server admin, dev ops warrior and website designer since 2002, Chris is a lover of all things Linux and open-source! Each blog topic has been tested by fire in the real world and is shared in the hope of helping others. Need more help? Hire me! Chris' other interests include fine art and the humanities in the classical tradition, and he can be found writing for our friends over at gripfastart.works. If you like this content, kindly consider donating to keep this website free to all, without ads.

