Drupal Robots.txt

J. Cohen

The default robots.txt file in Drupal 5.* has some problems. On top of that, each module you add tends to create more duplicate content and low-quality URLs.

What robots.txt issues have people come across? Here are a few of my common modifications:

# Paths (clean URLs) -- modified from default
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/
Disallow: /contact
Disallow: /logout
Disallow: /search/

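The default /user rules end in a trailing slash, so they never match the real URLs (such as /user/register, with or without a query string). Remove the trailing slashes or the rules won't work: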

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

# The following default line doesn't block all tracker pages
Disallow: /tracker/
# this blocks paginated tracker pages
Disallow: /tracker?

I block my RSS feeds (pathauto) from Googlebot with this line:

Disallow: /*/feed

That leaves the main RSS feeds still available to Google (/rss.xml). I use a custom front page to avoid duplicate content between the RSS feed and the home page.

These lines should go in the default Drupal robots.txt file because they are such common problems:

# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=
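To see what the wildcard rules above would catch, here is a minimal Python sketch of the extended matching that crawlers with wildcard support are generally understood to apply. The helper names (rule_to_regex, is_blocked) and the sample URLs are made up for illustration, and this is not any search engine's actual code:

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is a literal match
    from the start of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, rules):
    """Return True if any Disallow rule matches the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

RULES = ["/*/feed", "/*sort=", "/*size="]

# Hypothetical URLs, chosen only to exercise the rules.
print(is_blocked("/blog/feed", RULES))                      # True
print(is_blocked("/rss.xml", RULES))                        # False
print(is_blocked("/forum/12?sort=asc&order=Topic", RULES))  # True
print(is_blocked("/node/7?size=thumbnail", RULES))          # True
```

Note that /rss.xml stays crawlable because it contains no "/feed" segment, which matches the behavior described for the main RSS feed.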

There is the first part of a Drupal robots.txt tutorial here (update: newer version here), but it would be difficult to add all the possible robots.txt rules needed for all modules.


Why didn't you remove

yaph - Thu, 2007-08-02 21:19

Why didn't you remove trailing slashes from admin, search, etc.?

--
Websites: SEO-Expert-Blog.com (http://www.seo-expert-blog.com) | Torlaune.de


Search

J. Cohen - Thu, 2007-08-02 22:11

If you remove the trailing slash from "search" and someone makes a post called "Searching for the Truth", it would not get indexed:
http://example.com/searching-for-the-truth

The only advantage to removing the trailing slash on "search" would be that it would tell bots not to request http://example.com/search which 302 redirects to http://example.com/search/node (which is blocked by robots.txt). You could optionally do this:

Disallow: /search$
Disallow: /search/

http://example.com/admin sends a 403 header so would not be indexed by search engines even if they found it. If you remove the trailing slash you would block pages like http://example.com/administering-linux-servers

That is why I left as many trailing slashes as possible...
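The trailing-slash reasoning comes down to plain prefix matching, which a few lines of Python can illustrate. This is only a sketch: blocked_by_prefix is a hypothetical helper, and the /user/register query string is a shortened, made-up example:

```python
def blocked_by_prefix(path, disallow):
    """Basic robots.txt semantics: a rule blocks every path that starts with it."""
    return path.startswith(disallow)

checks = [
    ("/search",         "/searching-for-the-truth"),
    ("/search/",        "/searching-for-the-truth"),
    ("/search/",        "/search/node"),
    ("/admin",          "/administering-linux-servers"),
    ("/admin/",         "/administering-linux-servers"),
    ("/user/register/", "/user/register?destination=comment/reply/5"),
    ("/user/register",  "/user/register?destination=comment/reply/5"),
]

for rule, path in checks:
    print(f"Disallow: {rule:<16} {path:<45} blocked={blocked_by_prefix(path, rule)}")
```

With the slash, "/search/" and "/admin/" leave the article URLs alone. The /user/register rule needs the opposite treatment because the real URL never contains a slash after "register".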


Blocking ?theme

jgeorgerorg - Tue, 2008-01-15 22:38

J - Thanks for the tips on the robots.txt file - solid stuff. I was wondering how to approach blocking pages that are duplicated for two themes. The URLs look like /events-festivals/live-theater.html?theme=winter and /events-festivals/live-theater.html?theme=summer. Obviously this is creating some serious duplicate content issues. I am pretty rough with regex, so any insight on how to block the ?theme versions would be great. Thanks J.
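One possible approach, by analogy with the "Disallow: /*sort=" rule in the original post, would be a wildcard rule such as "Disallow: /*theme=". That rule is an assumption, not something confirmed in this thread, and it would only help with crawlers that honor wildcards. A quick Python check of what it would match under the usual wildcard semantics:

```python
import re

# Assumed rule, modeled on the "/*sort=" rule from the original post.
RULE = "/*theme="

# '*' becomes '.*'; the rest of the rule is matched literally from the start of the path.
pattern = re.compile(".*".join(re.escape(part) for part in RULE.split("*")))

urls = [
    "/events-festivals/live-theater.html?theme=winter",
    "/events-festivals/live-theater.html?theme=summer",
    "/events-festivals/live-theater.html",
]

blocked = {url: bool(pattern.match(url)) for url in urls}
for url, is_blocked in blocked.items():
    print(f"{url:<50} blocked={is_blocked}")
```

Under these assumptions the two ?theme= variants are blocked and the plain page is not. The rule would also catch any other URL containing "theme=", so it is worth checking for unrelated parameters that end the same way.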


on the use of wildcards

bertboerland - Fri, 2007-08-03 07:39

Please note that the use of wildcards isn't allowed according to the 10+ year old specification. Google and Yahoo, however, support it.

--

bert boerland


robots.txt drupal

J. Cohen - Fri, 2007-08-03 13:53

Google, Yahoo, and MSN "Live" Search all support wildcards (*) and end-of-string ($) characters, so I think they are OK to use. You can make specific sections of the robots.txt file for different robots if necessary -- just put the specific robots' rules above the wildcard rules like this:

User-agent: Googlebot
# rules for Googlebot

User-agent: ia_archiver
Disallow: /

User-agent: *
# put rules for all other robots here at the end
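How a robot picks its section can be tried out with Python's standard urllib.robotparser module. This is a rough sketch: the Disallow lines under Googlebot and "*" are placeholder rules added only so each group has something to test, and the parser, as far as I know, does plain prefix matching (no wildcard support), so it only illustrates group selection:

```python
from urllib.robotparser import RobotFileParser

# Same layout as above; the Googlebot and "*" rules are hypothetical fillers.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search/

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Ask the parser whether the given user agent may fetch the path."""
    return parser.can_fetch(agent, "http://example.com" + path)

print(allowed("Googlebot", "/search/node"))        # False
print(allowed("Googlebot", "/admin/settings"))     # True
print(allowed("ia_archiver", "/node/1"))           # False
print(allowed("SomeOtherBot", "/admin/settings"))  # False
print(allowed("SomeOtherBot", "/search/node"))     # True
```

The second result is the one to watch: a robot that finds its own section uses only that section, so any rule from the "*" group that should also apply to Googlebot has to be repeated under Googlebot.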


patch

greggles - Fri, 2007-09-28 02:22

Now's the time to get this into Drupal 6 - have you provided a patch and an issue yet?

--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour


Robots.txt

J. Cohen - Fri, 2007-09-28 05:30

When is the deadline? I haven't used CVS before. If someone could help me get it submitted, I could provide a file, a patch, or whatever is needed. I have already created an updated robots.txt file and a regular diff patch (not a CVS patch).


patch resources

greggles - Mon, 2007-10-01 20:34

See http://drupal.org/patch/create and see also the patch rolling video on drupaldojo.com which is available via a torrent.

Otherwise, come on down to #drupal or #drupal-dojo and someone can help.



robots.txt

J. Cohen - Wed, 2007-10-03 00:49

The submitted patch is here:
http://drupal.org/node/180379

Hope that it was done correctly...

It's not perfect--I think it's impossible to write a single robots.txt file for all Drupal sites--but I tried to write it so that it would be as generic as possible.


thanks for that!

Truett - Tue, 2007-12-11 19:48

I have just set up a new Drupal site. At first Google was doing a fantastic job of indexing the site, but in the past few days it has really gotten bogged down with links like "/user/register?destination=comment/reply/5%2523comment-form". I've changed the following robots.txt lines:

Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

to

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

as you suggest, which should do the trick. Thanks very much for posting this!