Diagnosing and resolving issues with excessive bot access

Sometimes your site may run into performance issues because it is being crawled by too many bots at once. Many of these bots are useful and important for your site (search crawlers, for example). So how do you determine when bot traffic has become a problem, and what can you do about it?

Diagnosing the issue

If you get a report of a service interruption, there's a chance your site is experiencing increased traffic that is overwhelming its resources. It may help to profile your app and tune it for better performance, or to put a CDN in front of it.

Another solution is to upgrade your plan to dedicate more resources to your site. But it's also worth checking whether at least some of that traffic comes from bots you can limit.

Start by accessing your site via SSH.
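
The exact command depends on your hosting provider; a generic sketch, with a placeholder username and hostname, looks like this:

# Replace with the SSH user and host for your own site
$ ssh username@your-server.example.com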

Then you can get a quick count of how many requests came from bots on a given day (6 September in this example):


$ grep '06/Sep' /var/log/access.log | grep bot | wc -l

9765

This should give you an idea of whether the number of bot visits is high. (Note that logs are truncated to 100 MB but you can set up other services to get longer logs.)

You can also get a list of the names of the bots, such as those that hit your site today:


$ grep "$(date +'%d/%b/%Y:')" /var/log/access.log | awk '{print $(NF-1),"\t",$NF}' | grep bot | sort | uniq

Googlebot/2.1; +http://www.google.com/bot.html)"

bingbot/2.0; +http://www.bing.com/bingbot.htm)"

...

Alternatively, look more broadly at which IPs have been accessing your site, such as the most frequent visitors today:


$ tail -n 5000 /var/log/access.log | grep "$(date +'%d/%b/%Y:')" | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 15

This returns a list of the IPs that appear most often in today's recent log entries. You can then use a service like AbuseIPDB to check those IPs against reports of known malicious actors. If you find anything malicious, skip down to the section on restricting its access directly.
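
You can also query AbuseIPDB from the same SSH session via its HTTP API. This is a minimal sketch, assuming you have signed up for an API key and exported it as ABUSEIPDB_API_KEY; 203.0.113.45 is a placeholder address:

# Ask AbuseIPDB's check endpoint for abuse reports about a single IP
$ curl -G https://api.abuseipdb.com/api/v2/check \
    --data-urlencode "ipAddress=203.0.113.45" \
    -H "Key: $ABUSEIPDB_API_KEY" \
    -H "Accept: application/json"

The JSON response includes an abuse confidence score you can use to decide whether the address is worth blocking.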

Restricting useful bots

If you find that (non-malicious) bots are accessing your site too often, you can restrict their access with a robots.txt file (assuming your site is configured to serve the file). This file gives crawlers directions on how they should access your site so it doesn't get overwhelmed. Well-behaved crawlers generally follow these rules, but the file can't force malicious bots to comply.

In particular, you can define which parts of your site search crawlers shouldn’t access:


User-agent: *
Disallow: /private-stuff/

and also set a crawl delay to ask bots not to hit your site too often:


User-agent: *
Crawl-delay: 10

Note that some crawlers (Googlebot, for example) don't respect Crawl-delay, so you may need to take separate steps through the crawler's own tools to make the limit stick.

Search crawlers and other friendly bots should follow these settings. Once the robots.txt file has been deployed, recheck the logs to confirm that those bots are visiting less often. If they aren't respecting the rules, try restricting them directly.
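
One way to recheck is to count bot requests per day and watch whether the numbers drop after the change. This sketch assumes the standard combined log format, in which the timestamp is the fourth whitespace-separated field and starts with the day, month, and year:

# Group log lines mentioning "bot" by day and count them
$ grep bot /var/log/access.log | awk '{print substr($4, 2, 11)}' | sort | uniq -c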

Restricting malicious actors directly

If your access logs show traffic from a known malicious actor, it won't respect the restrictions in a robots.txt file; you have to restrict its access directly.

Set up HTTP access control rules to exclude the specific IP addresses of the bots in question. (See an example script to automate this process.)
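
The exact mechanism depends on how your site is served, so treat this as an illustration only: if your site is fronted by Apache and honors .htaccess files, a rule like the following (with 203.0.113.45 again as a placeholder) blocks one address while allowing everyone else:

# .htaccess for Apache 2.4+: deny a single IP, allow all other requests
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
</RequireAll>

Your hosting platform may instead expose its own access-control settings; if so, add the offending IPs there.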
