Posted by Z2222 on July 31, 2007 at 5:36pm
The default robots.txt file in Drupal 5.* has some problems. Also, the more modules one adds, the more duplicate content and low-quality URLs are created.
What robots.txt issues have people come across? Here are a few of my common modifications:
# Paths (clean URLs) -- modified from default
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/
Disallow: /contact
Disallow: /logout
Disallow: /search/

The /user paths should have the trailing slashes removed or they won't work:
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

# The following default line doesn't block all tracker pages
Disallow: /tracker/
# this blocks paginated tracker pages
Disallow: /tracker?

I block my RSS feeds (pathauto) from Googlebot with this line:
Disallow: /*/feed

That leaves the main RSS feeds still available to Google (/rss.xml). I use a custom front page to avoid duplicate content between the RSS feed and the home page.
These lines should go in the default Drupal robots.txt file because they are such common problems:
# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=
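Since the trailing-slash behavior above is easy to get wrong, one quick way to sanity-check the literal (non-wildcard) rules locally is Python's standard-library robots.txt parser. Note that urllib.robotparser follows the original specification and does not expand * or $ wildcards, so only the plain prefix rules can be verified this way; the example.com URLs below are placeholders.

```python
# Sanity-check literal (non-wildcard) Disallow rules with the
# standard-library robots.txt parser. urllib.robotparser does NOT
# expand * or $ wildcards, so wildcard rules can't be checked here.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Disallow: /tracker
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Trailing-slash pitfall: "Disallow: /user/register" blocks the page
# itself, while "Disallow: /user/register/" would only block URLs
# below that path, leaving /user/register crawlable.
print(rp.can_fetch("*", "http://example.com/user/register"))  # False
print(rp.can_fetch("*", "http://example.com/node/1"))         # True
```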
Comments
Why didn't you remove
Why didn't you remove trailing slashes from admin, search, etc.?
--
Websites: <a href="http://www.seo-expert-blog.com" title="SEO Expert Blog">SEO-Expert-Blog.com</a> | Torlaune.de
Search
If you remove the trailing slash from "search" and someone makes a post called "Searching for the Truth", it would not get indexed:
http://example.com/searching-for-the-truth
The only advantage to removing the trailing slash on "search" would be that it would tell bots not to request http://example.com/search which 302 redirects to http://example.com/search/node (which is blocked by robots.txt). You could optionally do this:
Disallow: /search$
Disallow: /search/
http://example.com/admin returns a 403 status, so it would not be indexed by search engines even if they found it. If you removed the trailing slash, you would also block pages like http://example.com/administering-linux-servers
That is why I left as many trailing slashes as possible...
Blocking ?theme
J - Thanks for the tips on the robots.txt file - solid stuff. I was wondering how to approach blocking pages that are duplicated across two themes. The URLs look like /events-festivals/live-theater.html?theme=winter and /events-festivals/live-theater.html?theme=summer. Obviously this is creating some serious duplicate-content issues. I am pretty rough with regex, so any insight into blocking the ?theme versions would be great. Thanks, J.
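A hedged guess at a rule for this, assuming a wildcard-aware crawler like Googlebot (the theme= parameter name is taken from the example URLs above):

Disallow: /*theme=

The * matches any characters, including the ?, so both the ?theme=winter and ?theme=summer versions would be blocked. If the parameter only ever appears first in the query string, the stricter Disallow: /*?theme= would also work.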
on the use of wildcards
Please note that the use of wildcards isn't allowed according to the 10+ year old specification. Google and Yahoo, however, do support it.
--
bert boerland
Google, Yahoo, and MSN "Live" Search all support wildcards (*) and end-of-string ($) characters, so I think they are OK to use. You can make specific sections of the robots.txt file for different robots if necessary -- just put the specific robots' rules above the wildcard rules like this:
User-agent: Googlebot
# rules for Googlebot
User-agent: ia_archiver
Disallow: /
User-agent: *
# put rules for all other robots here at the end
patch
Now's the time to get this into Drupal6 - have you provided a patch and an issue yet?
--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour
knaddison blog | Morris Animal Foundation
Robots.txt
When is the deadline? I haven't used CVS before. If someone could help me get it submitted, I could provide whatever is needed; I've already created an updated robots.txt file and a regular diff patch (not a CVS patch).
patch resources
See http://drupal.org/patch/create and also the patch-rolling video on drupaldojo.com, which is available via a torrent.
Otherwise, come on down to #drupal or #drupal-dojo and someone can help.
robots.txt
The submitted patch is here:
http://drupal.org/node/180379
Hope that it was done correctly...
It's not perfect--I think it's impossible to write a single robots.txt file for all Drupal sites--but I tried to write it so that it would be as generic as possible.
thanks for that!
I have just set up a new Drupal site. At first Google was doing a fantastic job of indexing the site, but has in the past few days really gotten bogged down with links like "/user/register?destination=comment/reply/5%2523comment-form". I've changed the following robots.txt lines:
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
to
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
as you suggest, which should do the trick. Thanks very much for posting this!