Dealing With Duplicate Content in Drupal: My Approach

JohnForsythe's picture

I see this has already been covered a bit, but I wanted to share my own approach to dealing with Drupal's duplicate content issues. I've written a new article called How Duplicate Content Hurts Drupal Sites. In this article I outline how to use .htaccess and robots.txt to redirect and hide duplicate content from search engines, and discuss a few related modules.
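For anyone who hasn't read the article yet, the general idea is along these lines (a rough sketch with placeholder host and paths, not the exact rules from the article; the robots.txt lines assume every public node has a path alias): a 301 in .htaccess collapses the www/non-www duplicate, and robots.txt keeps internal, unaliased paths out of the index.

RewriteEngine on
# Redirect the www duplicate onto the bare domain with a single 301
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

And in robots.txt:

# Keep internal paths that duplicate aliased pages out of the crawl
User-agent: *
Disallow: /node/
Disallow: /taxonomy/term/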

Comments

Thanks John...

NikLP's picture

This is most helpful for newcomers to SEO with Drupal. I was actually looking for an up-to-date article on the matter around the time you posted it!

I do have a couple of comments, though. It would be helpful to know how much overhead globalredirect.module creates over and above your suggested alternative. Obviously, if I manage a site for myself, I don't mind a bit of "pruning" work now and again; conversely, it would be nigh on impossible to have a brochureware site owner doing this kind of work, so a balance needs to be struck.

Furthermore, I have a new site for a client in the works, which uses Global Redirect, Pathauto, Page Title, and all the other SEO gubbins. His products sit in multiple categories, so I'm using taxonomy to manage this. The problem is that this gives me links like:

example.com/finishes/finish-level/gloss
example.com/finishes/finish-level/mid-sheen

There's not much content in there as yet, but these two links display essentially the same list of nodes (as all the finishes currently on the site are available as "gloss" and also "mid-sheen"). At what point does this get regarded as duplicate content? Is this affected if I switch between teasers on/off? Should I override the default "Views" provided by Drupal on these pages, and if so, how?

Sorry if this changes the scope of your initial post too much, but I was going to ask about this here anyway! :)

Thanks again.

It's hard to say for sure

JohnForsythe's picture

It's hard to say for sure exactly what will trigger the duplicate content detection. From what I've read, Google just takes its best guess based on an algorithm we'll never get to see. If it thinks your page is the same as, or similar enough to, another page, whoever is "more authoritative" gets listed, and the other disappears from the results. I've seen it happen myself with just one paragraph being the same. In some cases, both entries can disappear (I suspect this is a Google bug).

As far as the overhead of the Global Redirect module goes, I don't have any benchmarks. It's not going to be a huge problem, but basically you're waiting for Drupal to load and parse the URL on every page, and if it does find an alias or trailing slash, you get redirected and then have to wait for Drupal to load and check the URL again. If you're concerned about performance, it's definitely worth using .htaccess instead, though on a low-traffic site you might not notice any difference.
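For reference, the .htaccess alternative for trailing slashes looks roughly like this (a sketch of the general technique, not the exact rules from the article; it assumes the standard Drupal .htaccess with mod_rewrite already enabled):

# Issue a single 301 to strip a trailing slash before Drupal ever bootstraps,
# leaving real directories alone
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} ^(.+)/$
RewriteRule ^ %1 [R=301,L]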

On second thoughts...

NikLP's picture

I would be grateful for a little more info here; I've had a bit more thinking time on this and realise I've not fully explained what I'm doing.

I used Xenu's Link Sleuth to scan my site, and I don't have any duplicate URLs as such. The URLs listed in my previous post point to what could, in a certain light, be considered "duplicate" content, insofar as they both can contain teasers which point to the same node. However, therein lies the point; the node is not actually duplicated in full, so is it safe to assume that Google sees the teaser as "irrelevant", because it views the teaser as part of the actual node rather than as content that is separate in its own right?

Bit of a grey area, I think - it would be useful to know; if this isn't true, I'm in a spot of trouble...! :)

I think it would be an issue...

jonnyp's picture

I think it would be an issue, as the pages would be practically identical except for the title tags and heading tags, and the additional pages on the site mean a watering down of how much link juice each page receives. This post from Aaron Wall was very helpful to me, suggesting that fewer, longer pages are better than chopping everything into very small pieces. He did a follow-up I can't find right now that went into more detail.

I'm grappling with this problem now, deciding how best to work with refine_by_taxonomy. I am converting an immense shopping site into Drupal and want to add more detail on product features, allow users to filter results by those features (e.g. megapixels for digital cameras, brand, model number), and am trying to do it with taxonomy terms as the filters. The problem is there are many different routes to finding a product, which, while useful for the user, could generate many duplicate results pages that would just end up in the supplemental index. I think the best plan is to nofollow most if not all of them, and leave key areas such as brand indexable.
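As a rough illustration (the paths here are hypothetical, not from the real site), the crawl side of that plan could be expressed in robots.txt, with rel="nofollow" on the filter links themselves doing the rest. Google honours Allow, though it isn't part of the original robots.txt standard.

# Hypothetical paths: block the combinatorial filter pages,
# but leave the brand listings crawlable
User-agent: *
Disallow: /cameras/filter/
Allow: /cameras/brand/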

Great article John!

chadj@drupal.org's picture

Although I think the real problem is not so much having duplicate content as wasting precious link mojo on links that point to duplicate content which Google is going to eliminate anyway. That problem is not solved with a robots.txt file at all!

To illustrate, link to a nonexistent page (say dummy.html) from another high-PR page. Even block the non-existent page in robots.txt. Now wait a few months for PageRank to recalculate and then create the dummy page, browse to it and, like magic, it has Google PageRank.

The point is that robots.txt only prevents pages from going into the Google index -- it does not prevent rank from being wasted on those nonexistent or duplicate pages.

The Global Redirect module, by automatically 301 redirecting to the correct URL, actually solves a much bigger problem than it sets out to. This is because, as you know, 301 redirects actually redirect and focus link juice on the target page.
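To make the contrast concrete, this is the kind of thing a 301 does that a robots.txt Disallow never will (the paths below are made up for illustration): any link pointing at the duplicate URL now passes its weight to the canonical one.

# .htaccess sketch: fold a known duplicate path onto its canonical alias
RewriteRule ^node/42$ /about-us [R=301,L]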

Again, excellent article!

ChadJ


Ecommerce SEO Checklist
Experimental Website Monitor

What about Multisite

stuartgoff's picture

I am running Drupal multisite with gatewayreservations.com, durangoreservations.org and more. So is there a way to implement the .htaccess method? By your example this would only work with the one site, or would you just repeat that for all sites?

Stuart

great articles

bharris-gdo's picture

Very well done. Very clear, and to the point.

I have one question: is there a way to modify the Apache redirects so that instead of removing the trailing slash, it adds a trailing slash? I know that in some cases Drupal core requires no trailing slashes, such as logging in and searching, but if there was a way to add the trailing slash except in those situations, that would be ideal. Here's my reasoning.

example.com/products
example.com/products/product-name.html
example.com/products/product-name2.html

It's very clear that /products/ is a directory of products which contains interior pages, whereas /products looks like there would be no interior pages. Maybe this is just a personal preference, but to me it seems logical.

Almost forgot: I am also running multisite using a subdirectory:

example.com - site 1
example.com/site-two/ - site two

example.com/site-two (without the trailing slash it looks like a page, not a full directory/site)

May have answered my own question

bharris-gdo's picture

So I got it to work, though I haven't fully tested it. I am not an Apache wizard, so this could probably be written better.

# Add a trailing slash with a single 301, skipping login and search paths,
# paths that contain a dot (files), and paths that already end in a slash
RewriteCond %{REQUEST_URI} !((login)|(search)|\.|/$)
RewriteRule ^(.+)$ $1/ [R=301,L]

Menu Items have content...

TrinitySEM's picture

Since menu items can have content, why not do:

example.com/products.html
example.com/products/product-name.html
example.com/products/product-name2.html

Just like in a static site. ;-)

301 from example.com/products to example.com/products.html if you do. Of course, new menu items wouldn't require that.
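If you go that route, the redirect could be a single anchored rule in .htaccess (a sketch; substitute your own paths):

# Send the old bare path to the .html menu item with a 301
RewriteRule ^products$ /products.html [R=301,L]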

Duplicate content

kadimi's picture

Sorry for bumping this topic.

Please check this Google SERP: site:groups.drupal.org/node "Dealing With Duplicate Content in Drupal"

As you can see, Google returns 3 results; the 2nd and 3rd are not visible unless you repeat the search with the omitted results included. Anyway, here are the 3 URLs of the results; all 3 pages have the exact same content:

http://groups.drupal.org/node/3333
http://groups.drupal.org/node/3333?page=2
http://groups.drupal.org/node/3333?page=3

As you can see there is no trailing slash, just ?page=X... How can we deal with those URLs?!
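One option for a site you control (a sketch, not something covered earlier in the thread): Googlebot honours wildcard patterns in robots.txt, so the pager variants can be kept out of the index. The .htaccess rules discussed above can't catch these on their own, because mod_rewrite patterns don't see the query string unless you test %{QUERY_STRING} explicitly.

# robots.txt sketch using Google's wildcard extension
User-agent: Googlebot
Disallow: /*?page=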

Things you can do

Rodgey's picture

Hi,

I haven't seen a comment suggesting 'webmaster tools' from Google.

With this service you can place a sitemap on your server with a maximum of 200 URLs.

I would suggest you to:

  • install the Global Redirect module (and enable some of its options to prevent duplicate content)
  • sign up for Google Webmaster Tools
  • send Webmaster Tools your sitemap.txt file and your robots.txt file
  • you might also tell Google to skip all URLs containing 'node' or some other string you can find in the URLs you don't want crawled, and make sure all your 'real' URLs don't contain it (a minimal robots.txt sketch follows below)
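A minimal robots.txt sketch of that last point, assuming every real page has a path alias and only the internal paths contain "node":

# Keep unaliased node paths out of the crawl; real pages use their aliases
User-agent: *
Disallow: /node/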

Good luck!

you don't have a limit of 200

rogerpfaff's picture

You don't have a limit of 200 URLs in a Webmaster Tools sitemap if you submit it as XML. On our website it works with 2000+ URLs.
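For anyone unfamiliar with the format, an XML sitemap is just a list of <url> entries following the sitemaps.org schema (a minimal sketch with placeholder URLs); the XML Sitemap module can generate one for the whole site automatically.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/products/product-name.html</loc>
  </url>
  <url>
    <loc>http://example.com/products/product-name2.html</loc>
  </url>
</urlset>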


Remember: I compute you!

thanks

Rodgey's picture

@rogerpfaff
Good job, I need that every now and then :)