Understand Drupal's logs better with regard to search engines

mgifford's picture

I have recently seen a number of references in the detailed logs of "Referrer http://www.google.com/" -- what's going on here?

I decided that this might be the most knowledgeable place to post a question about what the logs are telling me. I do watch my referrer logs, and with search_keywords in particular I can see how quickly a blog posting is picked up by Google in a way where people are actually clicking on it. What I'm not clear on is why Google is listed as a referrer when I view the track on a particular node (say node/344/track). If it's just the Googlebot traveling by, I'm unsure why it would have stopped by and visited this one page 11 times in one day. If it's actually a user clicking on a link, I would expect to see the full URL that they were using.

It would be nice if the logs simply indicated that this was the Googlebot zipping by, if that is the case. I'm also a bit surprised that the logs show so many anonymous users without any referring URL. I suppose the browser might simply not be passing the referrer along, but I wonder how often these might also be bots.

I've been quite successful with search engine optimization & Drupal, but there are a lot of assumptions to get a handle on. What other mysteries are there in tracking search engine traffic (and success in placing with the search engines) using Drupal?

Mike

Comments

personalized homepage

greggles's picture

I believe this is from the personalized homepage, though I'm not 100% sure. They change things around so often it's hard to tell.

--
Knaddisons Denver Life | mmm Free Range Burritos

Referrer spam? I think it's a bot.

mcurry's picture

From what I can see, it's referrer spam. Digging through the server logs shows that it's not a human - at least on my sites - it looks like a botnet.

The bot was using http://alti.asu.edu as the referrer, then it switched over to using http://www.google.com/ - same access pattern, different referrer.

Here is an excerpt of just one out of dozens of IP addresses, all claiming to be coming from http://www.google.com - um, I'm pretty sure I'm not getting a front-page link from Google - this is happening on most of my sites, some of them pretty obscure, so I'm fairly sure it's not human. Note the timing of the hits.... I can't see how this is a human - each of those hits would have to be from a click on a link from another site if the referrer field is to be believed...

133.86.232.101 - - [29/May/2007:16:53:28 -0400] "GET /node/10 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:29 -0400] "GET /node/10 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:31 -0400] "GET /node/111 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:32 -0400] "GET /node/111 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:34 -0400] "GET /node/159 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:36 -0400] "GET /node/159 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:37 -0400] "GET /node/176 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:38 -0400] "GET /node/176 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:40 -0400] "GET /node/180 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:41 -0400] "GET /node/180 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:43 -0400] "GET /node/200 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:44 -0400] "GET /node/200 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:45 -0400] "GET /node/21 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:46 -0400] "GET /node/21 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:48 -0400] "GET /node/273 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
133.86.232.101 - - [29/May/2007:16:53:49 -0400] "GET /node/273 HTTP/1.1" 403 58 "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

[hundreds of similar lines snipped - many hundreds of these hits per day on my most popular site with between 800 and 1200 legitimate, human visitors per day.]
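
For anyone who wants to check their own access logs for the same pattern, here is a minimal Python sketch that counts hits per IP address carrying one specific referrer. It assumes the Apache "combined" log format shown in the excerpt above; the names `LOG_PATTERN`, `parse_line` and `hits_per_ip` are made up for this illustration and are not part of Drupal or Apache.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: Apache "combined" log format, as in the excerpt above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 29/May/2007:16:53:28 -0400 (English month names assumed).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def hits_per_ip(lines, referrer="http://www.google.com/"):
    """Count hits per IP whose referrer is exactly the given string."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["referrer"] == referrer:
            counts[entry["ip"]] += 1
    return counts


# Two lines copied from the excerpt above.
SAMPLE = [
    '133.86.232.101 - - [29/May/2007:16:53:28 -0400] "GET /node/10 HTTP/1.1" 403 58 '
    '"http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"',
    '133.86.232.101 - - [29/May/2007:16:53:29 -0400] "GET /node/10 HTTP/1.1" 403 58 '
    '"http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"',
]

print(hits_per_ip(SAMPLE))
```

To run it against a real log, pass an open file instead of `SAMPLE`. A high count from a single IP within a few seconds, as in the excerpt, is a hint rather than proof of a bot.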

I'm blocking that exact referrer (the URL "http://www.google.com/" but NOT referrers like "http://www.google.com/search?q=foo") because it's clear these are not human visitors. That's why you see the 403 error - I don't want this bot eating my server resources, thank you very much.
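
The distinction between the bare homepage referrer and a real search referrer can be sketched in Python as below. This is only an illustration of the matching logic, not the rule actually used on the server: the function name `is_bare_homepage_referrer` is hypothetical, and the check is slightly broader than an exact string match because it also treats a missing trailing slash as the bare homepage.

```python
from urllib.parse import urlsplit


def is_bare_homepage_referrer(referrer, host="www.google.com"):
    """True if the referrer is just the host's homepage, with no query string.

    Assumption for this sketch: "http://www.google.com" (no trailing slash)
    counts as the bare homepage too.
    """
    parts = urlsplit(referrer)
    return (
        parts.hostname == host
        and parts.path in ("", "/")
        and not parts.query
        and not parts.fragment
    )


for candidate in (
    "http://www.google.com/",
    "http://www.google.com/search?q=foo",
    "",
):
    print(repr(candidate), is_bare_homepage_referrer(candidate))
```

Search referrers such as "http://www.google.com/search?q=foo" carry a path and a query string, so they pass through untouched, and an empty referrer is not matched either.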

See:
http://drupal.org/node/24302#comment-228958
http://drupal.org/node/27787 (patch to bootstrap.inc and related modules to allow drupal-based referrer banning so you don't have to hack .htaccess)
http://drupal.org/node/133914

No doubt for me

scor's picture

http://www.google.com/ as a referer is for sure a spam bot, at least in my case. 90% of the spam I got during the last few weeks used http://www.google.com/ as the referer. The reason you see so many hits is that the bots are looking for the right page and trying to post their rubbish anywhere they can.
How can a normal user arrive at your website from http://www.google.com/ without any keyword in the URL? As far as I know, Google doesn't advertise links on its homepage.

I wish!

mcurry's picture

I wish I could get a link from Google's homepage!

This is obviously a growing problem... I'd love to come up with a reliable solution - last time I tried the badbehavior module, it had way too many false positives on my sites. Ah well, back to playing whack-a-mole. At least this bot is not consuming many of my site resources these days, since I ban it early, before the full Drupal system is initialized.

Michael Curry
Exodus Development | Drupal and other developer info

I just checked my logs

scor's picture

I just checked my logs again, and found many dummy searches with the famous spam referer, such as
xxxxxxxx (Content).
and also loads of dummy login attempts like
Login attempt failed for qwrqwerwe.
There might be solutions to this ever-changing problem, but are they really reliable (given the risk of false positives you mention)?

Damn Bots

mgifford's picture

Yeah, there are an increasing number of bots out there that are traveling around looking to sabotage your site. Using captcha is a critical way to filter them out, but I've had fatal errors with the latest release of the module, so I have had to disable some elements of my site until it is fixed.

Would be good to be able to go through the logs by IP address and have known good/bad bots be indexed properly. We also should be able to eliminate bad behavior. Guess that would mostly be upgrading http://drupal.org/node/30501 to 5.x
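
As a rough idea of what labeling known bots in the logs could look like, here is a small Python sketch that tags a user-agent string with a crawler name. The token list is an illustrative assumption, not a complete or authoritative index, and the names `KNOWN_CRAWLER_TOKENS` and `label_agent` are hypothetical. User-agent strings are trivially spoofed (the bot in the excerpt above claims to be MSIE 6.0), so this can only label crawlers that identify themselves honestly.

```python
# Illustrative assumption: a few user-agent tokens that well-known crawlers
# are understood to send. A real list would need to be maintained.
KNOWN_CRAWLER_TOKENS = {
    "Googlebot": "Google",
    "Slurp": "Yahoo",
    "msnbot": "MSN",
}


def label_agent(user_agent):
    """Return a crawler name if the user agent contains a known token."""
    lowered = user_agent.lower()
    for token, name in KNOWN_CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return name
    return "unlabelled"


print(label_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(label_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"))
```

Anything left "unlabelled" is not necessarily human; it only means the agent string did not announce a crawler from the list.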

Anyways, thanks again for all of the feedback on this subject.

Mike

OpenConcept | WLP | FVC | OX | OO

Hi, I am observing similar

hpk's picture

Hi, I am observing similar behaviour on all my Drupal websites. I also sometimes see log entries like "comment/reply/51" or a lot of "/" page not found entries. Is this behaviour common? If not, what can I do to improve things?
Please help.

Vikram Singh
http://hpk.co.in/