Drupal Robots.txt

Z2222's picture

The default robots.txt file in Drupal 5.* has some problems. Also, the more modules one adds, the more duplicate content and low-quality URLs are created.

What robots.txt issues have people come across? Here are a few of my common modifications:

# Paths (clean URLs) -- modified from default
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/
Disallow: /contact
Disallow: /logout
Disallow: /search/

The /user paths should have the trailing slashes removed or they won't work:

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
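The trailing-slash point is easy to verify with Python's standard `urllib.robotparser`, which, like the original robots.txt spec, does plain prefix matching with no wildcards (the hostname and paths below are just illustrative):

```python
from urllib import robotparser

def blocked(rules, url):
    """True if the given robots.txt text blocks `url` for all user agents."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

# A Disallow rule is a prefix match, so "/user/register/" (with the
# trailing slash) never matches the slash-less path Drupal serves.
with_slash = "User-agent: *\nDisallow: /user/register/\n"
no_slash = "User-agent: *\nDisallow: /user/register\n"

print(blocked(with_slash, "http://example.com/user/register"))  # False
print(blocked(no_slash, "http://example.com/user/register"))    # True
```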

# The following default line doesn't block all tracker pages
Disallow: /tracker/
# this blocks paginated tracker pages
Disallow: /tracker?
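The difference shows up if you model a Disallow rule as what the original spec says it is: a simple prefix match against everything after the hostname, query string included. A minimal sketch (the pager URL is an assumed example of what Drupal generates):

```python
# Under the original robots.txt spec, a Disallow rule is a plain
# prefix match against the URL (path plus any query string).
def is_blocked(rule, target):
    return target.startswith(rule)

print(is_blocked("/tracker/", "/tracker?page=1"))  # False: pager URLs slip through
print(is_blocked("/tracker?", "/tracker?page=1"))  # True
print(is_blocked("/tracker/", "/tracker/123"))     # True: per-user tracker pages
```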

I block my RSS feeds (pathauto) from Googlebot with this line:

Disallow: /*/feed

That leaves the main RSS feeds still available to Google (/rss.xml). I use a custom front page to avoid duplicate content between the RSS feed and the home page.

These lines should go in the default Drupal robots.txt file because they are such common problems:

# Views and Forum module problem:
Disallow: /*sort=
# Image module problem
Disallow: /*size=

Comments

Why didn't you remove

yaph's picture

Why didn't you remove trailing slashes from admin, search, etc.?

--
Websites: www.seo-expert-blog.com | Torlaune.de

Search

Z2222's picture

If you remove the trailing slash from "search" and someone makes a post called "Searching for the Truth", it would not get indexed:
http://example.com/searching-for-the-truth

The only advantage to removing the trailing slash on "search" would be that it would tell bots not to request http://example.com/search which 302 redirects to http://example.com/search/node (which is blocked by robots.txt). You could optionally do this:

Disallow: /search$
Disallow: /search/

http://example.com/admin returns a 403, so it would not be indexed by search engines even if they found it. If you removed the trailing slash, you would block pages like http://example.com/administering-linux-servers

That is why I left as many trailing slashes as possible...

Blocking ?theme

jgeorgerorg's picture

J - Thanks for the tips on the robots.txt file - solid stuff. I was wondering how to approach blocking pages that are duplicated across two themes. The URLs look like /events-festivals/live-theater.html?theme=winter and /events-festivals/live-theater.html?theme=summer. Obviously this is creating some serious duplicate content issues. I am pretty rough with regex, so any insight into blocking the ?theme versions would be great. Thanks, J.
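(The `/*sort=` rule shown earlier suggests one answer: with the Google-style wildcard syntax, a single rule such as `Disallow: /*theme=` would cover every themed variant. The rule itself is my assumption, not something from this thread; a quick illustrative check:)

```python
# Hypothetical rule: "Disallow: /*theme=" (Google-style wildcard).
# It matches any URL containing "theme=", so both seasonal variants
# of the page would be blocked by one line.
urls = [
    "/events-festivals/live-theater.html?theme=winter",
    "/events-festivals/live-theater.html?theme=summer",
]
print(all("theme=" in u for u in urls))  # True
```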

on the use of wildcards

bertboerland's picture

Please note that the use of wildcards isn't allowed according to the 10+ year old robots.txt specification. Google and Yahoo, however, do support it.

--

bert boerland


robots.txt drupal

Z2222's picture

Google, Yahoo, and MSN "Live" Search all support wildcards (*) and end-of-string ($) characters, so I think they are OK to use. You can make specific sections of the robots.txt file for different robots if necessary -- just put the specific robots' rules above the wildcard rules like this:

User-agent: Googlebot
# rules for Googlebot

User-agent: ia_archiver
Disallow: /

User-agent: *
# put rules for all other robots here at the end
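Since the stock spec has no wildcards, matching a rule like `/*/feed` or `/search$` the way these engines do amounts to translating `*` into "any run of characters" and a trailing `$` into an end anchor. A rough sketch (helper name and example paths are mine, and real crawlers also apply longest-match precedence between Allow and Disallow rules):

```python
import re

def extended_match(rule, path):
    """Match a robots.txt rule using the Google/Yahoo/MSN extensions:
    '*' matches any run of characters, a trailing '$' anchors the end."""
    pattern = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in rule
    )
    return re.match(pattern, path) is not None

print(extended_match("/*/feed", "/my-story/feed"))             # True
print(extended_match("/search$", "/search"))                   # True
print(extended_match("/search$", "/searching-for-the-truth"))  # False
print(extended_match("/*sort=", "/forum/1?sort=asc"))          # True
```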

patch

greggles's picture

Now's the time to get this into Drupal 6 - have you provided a patch and an issue yet?

--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour

Robots.txt

Z2222's picture

When is the deadline? I haven't used CVS before. If someone could help me get it submitted, I could provide a file/patch or whatever is needed (I've already created an updated robots.txt file and a regular diff patch, not a CVS patch).

patch resources

greggles's picture

See http://drupal.org/patch/create and see also the patch rolling video on drupaldojo.com which is available via a torrent.

Otherwise, come on down to #drupal or #drupal-dojo and someone can help.

--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour

robots.txt

Z2222's picture

The submitted patch is here:
http://drupal.org/node/180379

Hope that it was done correctly...

It's not perfect--I think it's impossible to write a single robots.txt file for all Drupal sites--but I tried to write it so that it would be as generic as possible.

thanks for that!

Truett's picture

I have just set up a new Drupal site. At first Google was doing a fantastic job of indexing the site, but has in the past few days really gotten bogged down with links like "/user/register?destination=comment/reply/5%2523comment-form". I've changed the following robots.txt lines:

Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

to

Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

as you suggest, which should do the trick. Thanks very much for posting this!

Search Engine Optimization (SEO)
