Problem with thousands of pages made by refine by taxonomy and search engines

Posted by FlemmingLeer on September 9, 2008 at 9:58am

I enabled the refine by taxonomy module, http://drupal.org/project/refine_by_taxo a while back and didn't think much about it until I discovered in Google Webmaster Tools that it had produced some 50,000 additional pages, which of course were indexed by Googlebot!

My site currently has some 6,500 nodes covering politics in Denmark, with the option for 12 taxonomies on each refine by taxonomy page. I have some 500 taxonomies defined. Refine by taxonomy is currently only available for Drupal 5.x.

Refine by taxonomy lets users browse taxonomy listings and narrow the list of nodes shown by selecting additional taxonomies to filter by.

I also have pathauto installed http://drupal.org/project/pathauto which delivers search-engine-friendly links for the taxonomy.

Before I installed refine by taxonomy my PageRank was 5; afterwards it dropped to 4. This has cut my traffic almost in half.

I made this change in robots.txt:

User-agent: Googlebot  
# don't allow taxonomy directory to be indexed
Disallow: /taxonomy/
# useful additional blocking
Disallow: *from=
Disallow: *sort=

I also blocked the parameters from= and sort=. Somehow from= gets indexed, although I haven't noticed it in use in Drupal. sort= is used in forums and adds 7 additional pages available for indexing by search engine robots.
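To check which URLs patterns like *from= and *sort= would catch, here is a minimal Python sketch. It assumes Google-style wildcard handling, where * matches any run of characters and a trailing $ anchors the end of the URL; other crawlers may not support this. The function names and the sample URLs are made up for illustration.

```python
import re
from typing import Iterable


def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a regex.

    Assumes Google-style extensions: '*' matches any run of characters and
    a trailing '$' anchors the end of the URL. Everything else is literal.
    """
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def is_blocked(path_and_query: str, disallow_patterns: Iterable[str]) -> bool:
    """Return True if any Disallow pattern matches from the start of the URL."""
    return any(rule_to_regex(p).match(path_and_query) for p in disallow_patterns)


# Patterns taken from the robots.txt above.
RULES = ["/taxonomy/", "*from=", "*sort="]

# Hypothetical URLs, only to exercise the patterns.
for url in ["/taxonomy/term/12", "/forum/3?sort=asc", "/node?from=20", "/node/42"]:
    print(url, "blocked" if is_blocked(url, RULES) else "allowed")
```

The leading * is what lets from= and sort= match anywhere in the URL; without it the rule would only match URLs that begin with that text.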

Now I just sit back and wait :/ And cross my fingers that my PageRank 5 returns after this hiccup :)

Comments

.

Posted by Z2222 on March 25, 2022 at 1:05am

rel="nofollow" and flood control

Posted by markus_petrux on September 9, 2008 at 1:46pm

Hi,

I also found this problem on a production site when adding tagadelic-like blocks and faceted support to the sphinxsearch module (which offers functionality similar to refine_by_taxo).

The crawler was Google, so I added rel="nofollow" to the URLs of the terms in the tag clouds. No success: Google was still following the URLs. So I ended up implementing flood control, which makes it possible to restrict the number of requests to this kind of URL. Contact forms in core also implement flood control.

Not sure if there's any contrib module that implements flood control by URL, but I'm afraid that's the only way to block this kind of abuse.
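The flood control idea can be sketched roughly as follows. This is a generic Python illustration, not Drupal's flood API and not the code used on the site: the class name, the threshold, and the window are all assumptions. It counts recent hits per client on one URL prefix and refuses further requests once a threshold is reached.

```python
import time
from collections import defaultdict, deque


class UrlFloodControl:
    """Allow at most `threshold` hits per client on a path prefix within `window` seconds."""

    def __init__(self, prefix, threshold, window, clock=time.monotonic):
        self.prefix = prefix
        self.threshold = threshold
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested
        self.hits = defaultdict(deque)  # client id -> timestamps of recent hits

    def is_allowed(self, client, path):
        """Return True (and record the hit) while the client is under the threshold."""
        if not path.startswith(self.prefix):
            return True  # only the expensive URLs are rate-limited
        now = self.clock()
        recent = self.hits[client]
        # Drop hits that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            return False  # the caller would refuse the request here
        recent.append(now)
        return True


# Hypothetical settings: at most 30 search requests per client per minute.
search_limit = UrlFloodControl("/search", threshold=30, window=60)
```

Keying on the client (an IP address or user agent, for example) means a busy crawler is throttled without affecting ordinary visitors.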

nofollow

Posted by Z2222 on September 9, 2008 at 2:30pm

Nofollow won't remove URLs that are already indexed. You need robots.txt as well.

--
My Drupal Tutorials

/search was already in robots.txt

Posted by markus_petrux on September 9, 2008 at 2:37pm

Which is the path used in the above-mentioned example. In fact, /search/ and ?q=search/ are already present in the robots.txt that ships with Drupal, and I found Google was still following those URLs. :-o

The only way to stop it was flood control.

Drupal Google Indexed

Posted by Z2222 on September 9, 2008 at 3:58pm

Were there any links pointing at search results in the past? Or is it a 4.7 site that was upgraded to a Drupal 5 site? (4.7 didn't have a robots.txt file)

Were the results in Google's SERPs fully indexed pages, or just titles like this?
http://www.google.com/search?q=site:http://drupal.org/search/node&num=100&filter=0

Google will index pages blocked by robots.txt, but not crawl them (in theory). It's just the existence of the pages that is indexed. If they are nofollowed and robots.txt'd then they should eventually disappear.

--
My Drupal Tutorials

It's a D5 site

Posted by markus_petrux on September 9, 2008 at 4:08pm

Here you can see a bunch of URLs indexed by Google:

http://www.google.com/search?q=site:http://blogs.gamefilia.com/search

The URLs use nofollow and /search/ is in robots.txt:

http://blogs.gamefilia.com/robots.txt

Not sure why, but I added flood control and the problem is now much smaller.

PS: I hope this is still of interest to the original poster.

Yes

Posted by FlemmingLeer on September 9, 2008 at 4:23pm

Markus,

Yes it is.

I think the reason Googlebot is not accepting the block in robots.txt is that Googlebot is not named in it, so it indexes everything on the site as though no robots.txt were there.

Try copying all of the rules in your robots.txt,

pasting the copy above your current rules, and putting:
User-agent: Googlebot
above the copy instead of just User-agent: *

That should do it.

Odd thing is that when I check your robots.txt on
http://www.google.com/webmasters/tools
it indicates that Googlebot will indeed accept your current robots.txt. :O
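That result fits the way group selection is usually described: a crawler uses the group that names it, and otherwise falls back to User-agent: *. Here is a minimal Python sketch of that rule. It assumes exact, case-insensitive matching on the user-agent name, which is a simplification of what real crawlers do, and the function names and sample files are made up.

```python
def parse_groups(robots_txt):
    """Map each lower-cased user-agent name to its list of (directive, value) rules."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def rules_for(crawler, groups):
    """Pick the group naming the crawler, else fall back to the '*' group."""
    return groups.get(crawler.lower(), groups.get("*", []))


# Hypothetical robots.txt contents for illustration.
ONLY_STAR = "User-agent: *\nDisallow: /search/\n"
WITH_GOOGLEBOT = ONLY_STAR + "\nUser-agent: Googlebot\nDisallow: /taxonomy/\n"

print(rules_for("Googlebot", parse_groups(ONLY_STAR)))
print(rules_for("Googlebot", parse_groups(WITH_GOOGLEBOT)))
```

Under this rule, once a User-agent: Googlebot group exists, the * rules no longer apply to Googlebot, so any rules it should still obey have to be copied into its own group.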

J. Cohen,
Very nice robots.txt coding. I didn't think about that small, effective solution.

Even a turtle reaches its goal...

It looks like we both

Posted by markus_petrux on September 9, 2008 at 5:54pm

It looks like we both are learning something here. :)

robots.txt

Posted by Z2222 on September 9, 2008 at 5:10pm

Your robots.txt file says:

Disallow: /search/

Your URLs are in the format:
/search?

Add this line to your robots.txt and it should be fixed:

Disallow: /search?
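The mismatch is easy to see if you treat each Disallow value as a plain prefix of the URL path plus query string. A minimal Python sketch, assuming simple prefix matching with no wildcards; the function name and the query string in the sample URL are made up:

```python
def blocked_by(path_and_query, disallow_prefixes):
    """Return the first Disallow value that is a prefix of the URL, or None."""
    for prefix in disallow_prefixes:
        if prefix and path_and_query.startswith(prefix):
            return prefix
    return None


url = "/search?keys=drupal"  # hypothetical URL in the /search? format

print(blocked_by(url, ["/search/"]))              # no rule matches
print(blocked_by(url, ["/search/", "/search?"]))  # matched by the added rule
```

"/search/" ends in a slash, so it is not a prefix of a URL that continues with "?" right after "/search".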

--
My Drupal Tutorials

Thanks for the heads up. :)

Posted by markus_petrux on September 9, 2008 at 6:04pm

Thanks for the heads up. :)

I also added the other one:

Disallow: /?q=search&

In fact I was surprised that Google was not following robots.txt, silly me. lol

.

Posted by Z2222 on March 25, 2022 at 1:05am
