I enabled the Refine by Taxonomy module (http://drupal.org/project/refine_by_taxo) a while back and didn't think much about it until I discovered in Google Webmaster Tools that it had produced some 50,000 additional pages, which of course were indexed by Googlebot!
My site currently has some 6,500 nodes covering politics in Denmark, with the option of 12 taxonomies on each Refine by Taxonomy page. I have some 500 taxonomies defined. Refine by Taxonomy is currently only available for Drupal 5.x.
Refine by Taxonomy lets users browse and filter taxonomies by selecting additional taxonomies to filter nodes by, thus narrowing the list of nodes shown.
I also have Pathauto (http://drupal.org/project/pathauto) installed, which delivers search-engine-friendly links for the taxonomy.
Before I installed Refine by Taxonomy my PageRank was 5; afterwards it dropped to 4. This has cut my traffic almost in half.
I made this change in robots.txt:
User-agent: Googlebot
# don't allow taxonomy directory to be indexed
Disallow: /taxonomy/
# useful additional blocking
Disallow: *from=
Disallow: *sort=
I also blocked the parameters from= and sort=. Somehow from= gets indexed, although I haven't noticed it in use in Drupal. sort= is used in the forums and adds 7 additional pages available for indexing by search engine robots.
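To see which URLs the wildcard rules above would catch, here is a minimal Python sketch of Google-style pattern matching, where * matches any run of characters and a trailing $ anchors the end of the URL. The names rule_matches and is_blocked and the sample paths are made up for illustration; this is not Googlebot's actual code, and crawlers that do not support wildcards may ignore such rules.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches the URL path.

    Assumes Google-style semantics: rules match from the start of the
    path, '*' matches any sequence of characters, and a trailing '$'
    anchors the match at the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, url_path) is not None

# The three Disallow rules from the robots.txt above.
rules = ["/taxonomy/", "*from=", "*sort="]

def is_blocked(url_path: str) -> bool:
    return any(rule_matches(rule, url_path) for rule in rules)

# Hypothetical paths, for illustration only.
for path in ["/taxonomy/term/12", "/forum/3?sort=asc&order=Topic", "/node/42"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Note that Pathauto aliases for taxonomy pages would not start with /taxonomy/, so under these assumptions they would need their own rule.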
Now I just sit back and wait :/ and cross my fingers that my PageRank of 5 returns after this hiccup :)
Comments
rel="nofollow" and flood control
Hi,
I also ran into this problem on a production site when I added Tagadelic-like blocks and faceted support to the sphinxsearch module (which offers functionality similar to refine_by_taxo).
The crawler was Google, so I added rel="nofollow" to the URLs of the terms in the tag clouds. No success yet: Google was still following the URLs. So I ended up implementing flood control, which makes it possible to restrict the number of requests through this kind of URL. The contact forms in core also implement flood control.
I'm not sure whether any contrib module implements flood control by URL, but I'm afraid that's the only way to block this kind of abuse.
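For readers unfamiliar with the idea, flood control by URL can be sketched roughly as below. This is a generic sliding-window limiter in Python, not Drupal's flood API and not the code used on the site; the class name FloodControl, the threshold, the window, and the client address are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class FloodControl:
    """Allow at most `threshold` hits per `window` seconds for each (client, event)."""

    def __init__(self, threshold: int, window: float):
        self.threshold = threshold
        self.window = window
        self._hits = defaultdict(deque)  # (client, event) -> timestamps of recent hits

    def is_allowed(self, client: str, event: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[(client, event)]
        # Forget hits that have fallen out of the time window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.threshold:
            return False  # over the limit: the caller should refuse the request
        hits.append(now)
        return True

# Hypothetical usage: at most 3 search-page requests per minute per client.
limiter = FloodControl(threshold=3, window=60.0)
for second in [0, 1, 2, 3, 61]:
    allowed = limiter.is_allowed("203.0.113.7", "search", now=second)
    print(second, "served" if allowed else "refused")
```

A real site would key the event on the URL prefix being protected (for example the search or tag-cloud paths) and answer refused requests with an error page instead of running the expensive query.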
nofollow
Nofollow won't remove URLs that are already indexed. You need robots.txt as well.
--
My Drupal Tutorials
/search was already in robots.txt
Which is the path used in the above mentioned example. In fact, /search/ and ?q=search/ are already present in Drupal's provided robots.txt, and I found Google was still following those URLs. :-o
The only way to stop it was flood control.
Drupal Google Indexed
Were there any links pointing at search results in the past? Or is it a 4.7 site that was upgraded to a Drupal 5 site? (4.7 didn't have a robots.txt file)
Were the results in Google's SERPs fully indexed pages, or just titles like this?
http://www.google.com/search?q=site:http://drupal.org/search/node&num=100&filter=0
Google will index pages blocked by robots.txt, but not crawl them (in theory). It's just the existence of the pages that is indexed. If they are nofollowed and robots.txt'd then they should eventually disappear.
--
My Drupal Tutorials
It's a D5 site
Here you can see a bunch of URLs indexed by Google:
http://www.google.com/search?q=site:http://blogs.gamefilia.com/search
The URLs use nofollow and /search/ is in robots.txt:
http://blogs.gamefilia.com/robots.txt
Not sure why, but I added flood control and the problem is no longer so big.
PS: I hope this is still relevant to the original post.
Yes
Markus,
Yes it is.
I think the reason Googlebot is not honoring the block in robots.txt is that Googlebot is not named in it, so it indexes everything on the site as though there were no robots.txt.
Try copying the whole of your robots.txt, pasting the copy above your current rules, and heading the copy with:
User-agent: Googlebot
instead of just User-agent: *
That should do it.
Odd thing is that when I check your robots.txt on
http://www.google.com/webmasters/tools
it indicates that Googlebot will indeed accept your current robots.txt. :O
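As far as I understand Google's documented behaviour, a crawler obeys only one group of rules: the group that names it, falling back to the * group when none does. That would fit what Webmaster Tools reports here, and it is also why a Googlebot-specific group needs a full copy of the general rules. A rough Python sketch, in which the function names and the sample file are invented for illustration and name matching is simplified to an exact, case-insensitive comparison:

```python
from typing import Dict, List

def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lower-cased User-agent name to the Disallow paths of its group."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []  # user agents of the group currently being read
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            for agent in current:
                groups[agent].append(value)
    return groups

def rules_for(robots_txt: str, crawler: str) -> List[str]:
    """Return the one group a crawler would obey: its own, else the '*' group."""
    groups = parse_groups(robots_txt)
    return groups.get(crawler.lower(), groups.get("*", []))

# Invented sample file, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /search/
Disallow: /admin/

User-agent: Googlebot
Disallow: /taxonomy/
"""

print(rules_for(SAMPLE, "Googlebot"))     # only the Googlebot group applies
print(rules_for(SAMPLE, "SomeOtherBot"))  # falls back to the '*' group
```

In this sample Googlebot would no longer be kept out of /search/, because its own group does not repeat that rule.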
J.Cohen;
Very nice robots.txt coding. I hadn't thought of that small, effective solution.
Even a turtle reaches its goal...
It looks like we both
It looks like we both are learning something here. :)
robots.txt
Your robots.txt file says:
Disallow: /search/
Your URLs are in the format:
/search?
Add this line to your robots.txt and it should be fixed:
Disallow: /search?
--
My Drupal Tutorials
Thanks for the heads up. :)
Thanks for the heads up. :)
I also added the other one:
Disallow: /?q=search&
In fact I was surprised that Google was not following robots.txt, silly me. lol
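The underlying point is that a plain Disallow rule is a prefix match, so /search/ never covered URLs of the form /search?. A short Python sketch of that comparison, where the function name and the sample URLs are hypothetical and wildcards are deliberately left out:

```python
from typing import List

def is_disallowed(url: str, rules: List[str]) -> bool:
    """Plain robots.txt prefix match: a rule blocks any URL that starts with it."""
    return any(url.startswith(rule) for rule in rules)

rules_before = ["/search/"]
rules_after = ["/search/", "/search?", "/?q=search&"]

# Hypothetical URLs in the three formats discussed above.
for url in ["/search/node/drupal", "/search?keys=drupal", "/?q=search&keys=drupal"]:
    print(url,
          "before:", is_disallowed(url, rules_before),
          "after:", is_disallowed(url, rules_after))
```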