Drupal Robots.txt

Z2222's picture

The default robots.txt file in Drupal 5.* has some problems. Also, the more modules one adds, the more duplicate content and low-quality URLs are created.

What robots.txt issues have people come across? Here are a few of my common modifications:

# Paths (clean URLs) -- modified from default
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/
Disallow: /contact
Disallow: /logout
Disallow: /search/

The /user paths should have the trailing slashes removed or they won't work:

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
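The trailing-slash point is easy to verify with Python's standard `urllib.robotparser`, which, like the original robots.txt spec, does plain prefix matching with no wildcards (the hostname and paths below are just illustrative):

```python
from urllib import robotparser

def blocked(rules, url):
    """True if the given robots.txt text blocks `url` for all user agents."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

# A Disallow rule is a prefix match, so "/user/register/" (with the
# trailing slash) never matches the slash-less path Drupal serves.
with_slash = "User-agent: *\nDisallow: /user/register/\n"
no_slash = "User-agent: *\nDisallow: /user/register\n"

print(blocked(with_slash, "http://example.com/user/register"))  # False
print(blocked(no_slash, "http://example.com/user/register"))    # True
```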

# The following default line doesn't block all tracker pages
Disallow: /tracker/
# this blocks paginated tracker pages
Disallow: /tracker?
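The difference shows up if you model a Disallow rule as what the original spec says it is: a simple prefix match against everything after the hostname, query string included. A minimal sketch (the pager URL is an assumed example of what Drupal generates):

```python
# Under the original robots.txt spec, a Disallow rule is a plain
# prefix match against the URL (path plus any query string).
def is_blocked(rule, target):
    return target.startswith(rule)

print(is_blocked("/tracker/", "/tracker?page=1"))  # False: pager URLs slip through
print(is_blocked("/tracker?", "/tracker?page=1"))  # True
print(is_blocked("/tracker/", "/tracker/123"))     # True: per-user tracker pages
```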

I block my RSS feeds (pathauto) from Googlebot with this line:

Disallow: /*/feed

That leaves the main RSS feeds still available to Google (/rss.xml). I use a custom front page to avoid duplicate content between the RSS feed and the home page.

These lines should go in the default Drupal robots.txt file because they are such common problems:

# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=

Comments

Why didn't you remove

yaph's picture

Why didn't you remove trailing slashes from admin, search, etc.?

--
Websites: www.seo-expert-blog.com | Torlaune.de

Search

Z2222's picture

If you remove the trailing slash from "search" and someone makes a post called "Searching for the Truth", it would not get indexed:
http://example.com/searching-for-the-truth

The only advantage to removing the trailing slash on "search" would be that it would tell bots not to request http://example.com/search which 302 redirects to http://example.com/search/node (which is blocked by robots.txt). You could optionally do this:

Disallow: /search$
Disallow: /search/

http://example.com/admin returns a 403, so it would not be indexed by search engines even if they found it. If you removed the trailing slash, you would block pages like http://example.com/administering-linux-servers

That is why I left as many trailing slashes as possible...

Blocking ?theme

jgeorgerorg's picture

J - Thanks for the tips on the robots.txt file - solid stuff. I was wondering how to approach blocking pages that are duplicated across two themes. The URLs look like /events-festivals/live-theater.html?theme=winter and /events-festivals/live-theater.html?theme=summer. Obviously this is creating some serious duplicate content issues. I am pretty rough with regex, so any insight into blocking the ?theme versions would be great. Thanks, J.
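(The `/*sort=` rule shown earlier suggests one answer: with the Google-style wildcard syntax, a single rule such as `Disallow: /*theme=` would cover every themed variant. The rule itself is my assumption, not something from this thread; a quick illustrative check:)

```python
# Hypothetical rule: "Disallow: /*theme=" (Google-style wildcard).
# It matches any URL containing "theme=", so both seasonal variants
# of the page would be blocked by one line.
urls = [
    "/events-festivals/live-theater.html?theme=winter",
    "/events-festivals/live-theater.html?theme=summer",
]
print(all("theme=" in u for u in urls))  # True
```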

on the use of wildcards

bertboerland's picture

Please note that the use of wildcards isn't allowed according to the 10+ year old robots.txt specification. Google and Yahoo, however, do support it.

--

bert boerland


robots.txt drupal

Z2222's picture

Google, Yahoo, and MSN "Live" Search all support wildcards (*) and end-of-string ($) characters, so I think they are OK to use. You can make specific sections of the robots.txt file for different robots if necessary -- just put the specific robots' rules above the wildcard rules like this:

User-agent: Googlebot
# rules for Googlebot

User-agent: ia_archiver
Disallow: /

User-agent: *
# put rules for all other robots here at the end
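Since the stock spec has no wildcards, matching a rule like `/*/feed` or `/search$` the way these engines do amounts to translating `*` into "any run of characters" and a trailing `$` into an end anchor. A rough sketch (helper name and example paths are mine, and real crawlers also apply longest-match precedence between Allow and Disallow rules):

```python
import re

def extended_match(rule, path):
    """Match a robots.txt rule using the Google/Yahoo/MSN extensions:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    pattern = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in rule
    )
    return re.match(pattern, path) is not None

print(extended_match("/*/feed", "/my-story/feed"))             # True
print(extended_match("/search$", "/search"))                   # True
print(extended_match("/search$", "/searching-for-the-truth"))  # False
print(extended_match("/*sort=", "/forum/1?sort=asc"))          # True
```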

patch

greggles's picture

Now's the time to get this into Drupal 6 - have you provided a patch and an issue yet?

--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour

Robots.txt

Z2222's picture

When is the deadline? I haven't used CVS before. If someone could help me get it submitted, I could provide a file/patch or whatever is needed (I've already created an updated robots.txt file and a regular diff patch, not a CVS patch).

patch resources

greggles's picture

See http://drupal.org/patch/create and see also the patch rolling video on drupaldojo.com which is available via a torrent.

Otherwise, come on down to #drupal or #drupal-dojo and someone can help.

--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour

robots.txt

Z2222's picture

The submitted patch is here:
http://drupal.org/node/180379

Hope that it was done correctly...

It's not perfect--I think it's impossible to write a single robots.txt file for all Drupal sites--but I tried to write it so that it would be as generic as possible.

thanks for that!

Truett's picture

I have just set up a new Drupal site. At first Google was doing a fantastic job of indexing the site, but has in the past few days really gotten bogged down with links like "/user/register?destination=comment/reply/5%2523comment-form". I've changed the following robots.txt lines:

Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

to

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

as you suggest, which should do the trick. Thanks very much for posting this!

Search Engine Optimization (SEO)
