Bots

Chris Graham's picture

Hi all,

May I firstly echo the comments that others have made about Drupalcamp Derry. Even though I missed the first part of the day on Saturday (thanks hangover) it was a great weekend and the open space discussions on the Sunday were really interesting.

Now onto my problem. I know this isn't really a Drupal issue, but I'm hoping it might be of use to some other Drupalistas.

I've noticed in our Apache access logs that the majority of requests to our site are coming from bots, and I want to exclude a few. I modified my .htaccess file to look like this:

# Various rewrite rules.
<IfModule mod_rewrite.c>
  RewriteEngine on

  # Reject bot traffic
  RewriteCond %{HTTP_USER_AGENT} ^Baiduspider.*$
  RewriteCond %{HTTP_USER_AGENT} ^MJ12bot.*$
  RewriteCond %{HTTP_USER_AGENT} ^Yandex.*$
  RewriteCond %{HTTP_USER_AGENT} ^Speedy\ Spider.*$
  RewriteCond %{HTTP_USER_AGENT} ^Ezooms.*$
  RewriteCond %{HTTP_USER_AGENT} ^spbot.*$
  RewriteRule ^.* - [F,L]

Thing is, the Baiduspider one got blocked successfully with a 403 the first few times, but now it seems to be getting through again and the others are all still getting 200 responses.

Am I going about this the wrong way or is there no way to stop determined bots from getting at your site?

Comments

Gerard McGarry's picture

I think I've come across this before, but if you're writing multiple RewriteCond lines you have to put an [OR] flag at the end of each one (except the last). Without it the conditions are ANDed together, so every one of them has to match before the rule fires. Though it works on regex, doesn't it? Couldn't you write something like

Baiduspider|MJ12bot|Yandex|Etc...
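
A minimal sketch of that combined form, offered as an assumption rather than a tested fix for this server. It uses one RewriteCond with the bot names from the original post joined by pipes, so no [OR] flags are needed. It also drops the leading ^ anchor: many crawlers send a User-Agent along the lines of "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)", where the bot name is not at the start of the string, and an anchored pattern would never match those. The [NC] flag and the placement note are additions that are not in the original rules.

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on

  # Reject bot traffic.
  # One condition, unanchored, case-insensitive ([NC]): it matches if the
  # User-Agent contains any of the listed names anywhere in the string.
  RewriteCond %{HTTP_USER_AGENT} (Baiduspider|MJ12bot|Yandex|Speedy\ Spider|Ezooms|spbot) [NC]
  # "-" means no substitution; [F] returns 403 Forbidden and stops processing.
  RewriteRule ^ - [F]
</IfModule>
```

In a Drupal .htaccess this block would need to sit above the rule that passes requests to index.php, otherwise that rule handles the request first. It only stops bots that announce themselves honestly in the User-Agent header.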

other options

gavin.hughes's picture

Hi Chris

Why not try to ban these at a server level using something like fail2ban and/or firewall rules? I believe web spiders can change their user agent on the fly...
Just out of curiosity, what's the point of blocking these requests?

I'm wondering whether a "sandpit" might help in this situation.
For example, in robots.txt you specify a directory to disallow, e.g. "Disallow: /sandpit/".
Upload a blank page to this directory (only "bold" spiders that ignore the rule will land on this page). On the page, put a hidden link; you then catch requests for that hidden link by IP and ban their sorry ass.

Chris Graham's picture

Sorry, I realised the [OR] was missing from my example. Even with the [OR] it still fails after the first match. I know they can change the user agent, but I've been using tail to view the log and the UA is the same each time.

I might try the regex pipe version to see if that makes a difference.

Our web server keeps going down with memory issues, and given that the majority of requests are coming from bots I decided to take a little strain off. They aren't just searching for files, so every time they load a page that is one more Drupal bootstrap + Views call etc., and that must have some impact on the running of the sites.

Banning IP addresses is probably going to be just as tricky. They seem to be dynamically changing that, and are probably spoofing it anyway.

I might just have to give up and accept that they'll be able to get through no matter how many measures I put in place.

Alexander Ufimtsev's picture

I find the idea of banning spiders weird. What is the point of doing so if you run a public website? You just deprive yourself of potential traffic from these search engines. There are more than a billion people in this world who use search engines other than Google or Bing. Cutting them off does not make much sense.

That said, if you want to ban them, just do so via robots.txt: the majority of search engines honour the robots.txt standard and won't disturb your site anymore. http://www.robotstxt.org/ is a good site for basic questions and configuration examples.

Chris Graham's picture

Problem is that most of them are spam bots that spammers are using to find our sites and then spam the hell out of us.

Alexander Ufimtsev's picture

Illegitimate spambots rarely identify themselves in user-agent strings. Instead they pretend to be regular IE or FF users, or legitimate search engine bots like Googlebot. By the way, when you say 'spam the hell out of us', what do you mean? Do they make too many requests to your site, or do you have exposed email addresses on your pages?

If it is the former, you might benefit from the following directive in robots.txt:

User-agent: *
Crawl-delay: 10

It asks search engine bots not to make more than one request every 10 seconds. This will help with legitimate crawlers that honour the directive (Crawl-delay is a non-standard extension and not every crawler supports it), without cutting yourself off from their indexes completely. As for illegitimate ones, it does not really matter: they can keep crawling your site with different user-agent strings and IP addresses, so there is no real and effective way of stopping them. If they create too much load for your site, you might consider better hosting and/or better caching solutions for your Drupal installation.

Chris Graham's picture

It is more comment/registration spam. It might not make a difference to block them, but I'm taking the Tesco approach. Every little helps :)

Other Bottlenecks

gavin.hughes's picture

Are there any other bottlenecks on the server that you could look at: mail, DNS, MySQL, ...?
I find htop to be pretty good for looking at server processes live, and it's more interactive, but you may have a GUI on the box?

Also, you could decrease server load with tools like APC, memcache, Varnish etc., but you're probably already doing that... ;-) The same probably applies to Mollom and mod_evasive (which I'm not crazy about myself).

I also use something called logwatch; it can be great for identifying changes on the system. I used to get lots of failed login attempts on SSH at one point, and enabling fail2ban with a rule that banned offending IPs for a while pretty much solved that problem.
