robots.txt
?q parameter
I just discovered an unfortunate behavior in Drupal 5.x (Drupal 5.20) that creates duplicate content in Google:
http://www.example.com/?q=Drupal
where Drupal is a URL alias.
http://www.example.com/Drupal
and
http://www.example.com/?q=Drupal
are of course the same page, but Google catches both and indexes them.
Adding Disallow: /?q= to robots.txt will block these duplicate URLs.
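As a minimal sketch, the rule could sit in robots.txt like this. It assumes clean URLs are enabled; with clean URLs turned off, every Drupal page is served through ?q= and this rule would block far more than the duplicates. The User-agent line is an assumption, since the post does not say which crawlers the rule should apply to.

```
# Assumption: apply to all crawlers
User-agent: *
# Block the non-clean form of aliased URLs, e.g. /?q=Drupal
Disallow: /?q=
```

Because robots.txt rules match from the start of the path, this blocks only URLs that begin with /?q= and leaves /Drupal itself crawlable.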
&from=1289 and node?page= produce multiple pages and fictional pages
Currently Drupal 5.10 serves the same content under multiple URLs:
domain/?page=16&from=1289
domain/?page=16&from=1357
Both are currently indexed by Googlebot, but Google Webmaster Tools shows them as duplicate content for the same page. In fact it displays the ?page=16
Similarly, ?page= produces fictional pages for the last page in tracker pages.
These pages are indexed by Google:
domain/node?page=565
domain/node?page=751
domain/node?page=759
domain/node?page=787&%24Version=0&%24Path=/&%24Domain=.domainname.xx
But currently the last page is:
domain/?page=568
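One possible way to keep these URLs out of the index is sketched below. It is not a tested fix: it relies on the * wildcard, which Googlebot honors but some other crawlers may not, and the second rule assumes the paginated /node listing does not need to be indexed at all. robots.txt cannot tell a real page number from a fictional one, so it blocks every node?page= URL.

```
# Assumption: apply to all crawlers
User-agent: *
# Block paginated URLs that carry the extra from= argument,
# e.g. /?page=16&from=1289
Disallow: /*&from=
# Block all pages of the /node listing, real and fictional alike
Disallow: /node?page=
```

The first rule leaves /?page=16 itself crawlable, so each page of the listing can still be indexed once.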
Problem with thousands of pages made by refine by taxonomy and search engines
I enabled the Refine by Taxonomy module (http://drupal.org/project/refine_by_taxo) a while back and didn't think much about it until I discovered in Google Webmaster Tools that it had produced some 50,000 additional pages, which of course were indexed by Googlebot!
My site currently has some 6,500 nodes covering politics in Denmark, with the option of 12 taxonomies on each refine by taxonomy page. I have some 500 taxonomies defined. Refine by taxonomy is currently only available for Drupal 5.x.
Drupal Robots.txt
The default robots.txt file in Drupal 5.* has some problems. Also, the more modules one adds, the more duplicate content and low-quality URLs are created.
What robots.txt issues have people come across? Here are a few of my common modifications:

