Comment Permalinks and Duplicate Content

greggles

I'm working on backporting the comment permalinks from Drupal 7 to Drupal 6 and am curious what the SEO folks think about the way it works.

The code was added to Drupal 7 in this commit as a result of this issue which is really about making the "recent comments" links work properly regardless of which page comments are on.

What this code does is:

  1. Create a new path - "comment/CID" which points to comment_permalink
  2. comment_permalink figures out which page the comment is currently on, then does some fakery in the menu system to pretend the request is for that page
  3. It then adds a canonical link in the header which points to the "real" version of the page
  4. And then it runs along letting the normal Drupal node/comment mechanisms render the rest of the page based on the fakery from step 2
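As a rough illustration of steps 2 and 3, the page lookup and canonical URL construction might look like this (a Python sketch, not Drupal's actual code; the function names, the 50-comments-per-page default, and the URL format are assumptions):

```python
def comment_page_number(position, comments_per_page=50):
    # Zero-based page index for a comment, given its 1-based
    # position in the thread's comment ordering.
    return (position - 1) // comments_per_page

def permalink_target(nid, cid, position, comments_per_page=50):
    # Build the "real" URL that comment/CID should declare as
    # canonical: the node page the comment currently lives on,
    # plus a fragment pointing at the comment itself.
    page = comment_page_number(position, comments_per_page)
    query = "" if page == 0 else "?page=%d" % page
    return "node/%d%s#comment-%d" % (nid, query, cid)
```

So, with 50 comments per page, the 51st comment in a thread on node 123 would resolve to `node/123?page=1#comment-CID`, and that is the URL the canonical link in step 3 would point at.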

My question is around the canonical link entry.

Is the canonical link enough to avoid duplicate content with this page? In the issue where it was added Dries suggests that additions to robots.txt wouldn't help. Anyone have thoughts on that?

Duplicate content seems to be a really murky area of SEO where some people say it doesn't matter, or that it matters but not much, or that certain ways to "fix" it don't matter. So, I'm interested in hearing all points of view.

Comments

Touchy Subject

binary basketball

I'm fighting this issue currently on my vBulletin forum where I have the canonical URL defined but I still have multiple versions of the page being displayed which have been disallowed in robots.txt. Currently I have about 19,000 pages that are being prohibited by robots.txt on a community with 10k threads and 190k posts. I haven't taken a major hit yet but I can't see that it's helping.
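For reference, robots.txt exclusions of that sort look something like the following (hypothetical vBulletin-style patterns; the actual paths depend on the forum's URL scheme):

```
User-agent: *
# Block printable and post-level duplicates of thread pages
Disallow: /printthread.php
Disallow: /showpost.php
```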

I also kind of question the usage of canonical reference since technically the comment page isn't going to risk being considered an exact duplicate...

Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.

source - my link to Google's blog triggered the spam filter so you'll just have to take my word for that quote... :-)

Obviously it's going to be an exact duplicate of "some" of the content on the page but I don't think this usage is exactly what they had in mind since it's not a different version of a particular page. I'm interested to see what others have to say about this though.

As far as indexing goes, though, I haven't noticed anyone indexing any of the pages I don't want indexed, given my use of canonical links and robots.txt exclusions.

FlemmingLeer

Hi Greggles,

I have two questions.

What happens when the canonical link for a comment is followed after the original node, along with all its comments, has been deleted?

I use Comment Page, and in the Drupal 5.x version it currently produces a PHP error when a deleted comment's page is requested.
Comment page is here:
http://drupal.org/project/comment_page

Comment Page also uses comment/CID paths.

Would you consider updating the Drupal 5.x version of Comment Page to match your Drupal 6.x work? Many sites will most likely move to Drupal 6.x once Drupal 7.x is released.

Even a turtle reaches its goal...

it gives a 404

greggles

Right now both 7.x and 6.x give a 404 error if the comment ID is "invalid" in some way (deleted, or nonsense data).

I think comment_page is not exactly the same feature as the permalink code I added (which, by the way, has now been committed).

Agree but also Disagree

jasonrwd

While I agree with J. Cohen that creating a different path for each comment is a bad idea, I cannot agree with the comment on Canonical links.

The canonical link tag is not just a Google tag; it has been adopted by all three major search engines. I have used it extensively, and it does what it was intended to do on pages with similar-to-exact content.

Bing adopts Canonical Link tag:
http://www.bing.com/community/blogs/webmaster/archive/2009/02/12/partner...

Yahoo adopts Canonical Link tag:
http://www.ysearchblog.com/2009/02/12/fighting-duplication-adding-more-a...

By nature, search engines want to index more useful content. URLs with query parameters and fragments used to cause problems with indexing that deep content. Using robots.txt (which is also recognized by all three major SEs) instead of a canonical link works against what they are trying to accomplish by adopting this tag across all three search engines.
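For what it's worth, the tag all three engines agreed on is a single link element in the document head. A hedged example (the URL is illustrative):

```html
<!-- In the <head> of the duplicate page, e.g. comment/CID -->
<link rel="canonical" href="http://example.com/node/123?page=1" />
```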

Jason

Rapid Waters Development
http://www.rapidwatersdev.com

The SEs treat robots.txt as a "suggestion"

TrinitySEM

The SEs treat robots.txt as a "suggestion". It is a good idea to disallow duplicate content in the file, but it is best handled with a meta robots noindex, follow. The rel canonical is a good approach to dup. content issues. They are paying more and more attention to this because, in my opinion, their computing power will be strained by real-time search. They don't want to waste it on duplicate URLs.
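The per-page directive being recommended here is a single meta tag, which keeps the page out of the index while still letting its links be followed (an illustrative snippet):

```html
<!-- In the <head> of the duplicate page: exclude it from the
     index, but let crawlers follow its outbound links -->
<meta name="robots" content="noindex, follow" />
```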

Matt Cutts also indicated that meta robots noindex will, for the most part, exclude the page from the index and is the best method for doing so.

Using nofollow links to sculpt PR is no longer productive. Over a year ago, if a page had two links, each would pass 50% of the PR of the page, and nofollowing one of them would channel 100% of the PR through the other. That has changed. They now assign 50% to each link, and if one is nofollowed its share is simply not passed; it vanishes. It is better to pass the PR to a page and then point that page at another page. There is some bleed-off, but you get the idea. This also explains my reasoning behind using the "follow" attribute vs. "nofollow".
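The arithmetic described above can be made concrete with a toy model (this ignores damping and iteration, and the function name is mine, not anything from Google):

```python
def pr_passed_per_link(page_pr, total_links, nofollowed_links, old_model=False):
    # PR each *followed* link receives under the two models described.
    # Old model: nofollowed links drop out of the denominator, so the
    # remaining links split the full amount (classic PR sculpting).
    # New model: PR is divided across all links, and the share assigned
    # to nofollowed links simply evaporates.
    followed = total_links - nofollowed_links
    if followed == 0:
        return 0.0
    if old_model:
        return page_pr / followed
    return page_pr / total_links
```

Under the old model, a PR-10 page with two links, one nofollowed, passed 10 through the followed link; under the new model it passes only 5, and the other 5 is lost.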

"they will index the existence of a page blocked by robots.txt if it has links to it"

TrinitySEM

"they will index the existence of a page blocked by robots.txt if it has links to it."

Hence the "suggestion". They will still index pages disallowed from robots.txt. Not so likely with meta robots.

"I think that using noindex, follow as a metatag creates unpredictability"

Can you elaborate?

"If you don't want a page to appear in search engines, but you want Google to follow it just create your website structure so that fewer links point to those pages than to those pages you want to appear in the index."

1 link can do it and you can't control third parties.

"They won't index the pages, just the existence of the pages"

TrinitySEM

"They won't index the pages, just the existence of the pages."

They will list the page in the SERPs which provides an entry point.
http://www.youtube.com/watch?v=TkopkUPF-M8

Matt Cutts: "Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results."

and...

Eric Enge: "Can a NoIndex page accumulate PageRank?"
Matt Cutts: "A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page."
Eric Enge: "So, it can accumulate and pass PageRank."
Matt Cutts: "Right, and it will still accumulate PageRank, but it won't be showing in our Index. So, I wouldn't make a NoIndex page that itself is a dead end. You can make a NoIndex page that has links to lots of other pages."

Citation: http://www.stonetemple.com/articles/interview-matt-cutts.shtml

"If someone links to the page, it usually means it has something worth reading :)"
The way I read greggles's original question, I thought the objective was to block the page. If that's not the case, then it's another matter.

"That's what I was saying about robots.txt"

TrinitySEM

"That's what I was saying about robots.txt -- the existence of the page is indexed, but not the page."
We both agree on that point. The part that is missing is that the SERPs can display a link to the page, thereby creating an entry point.

"not everything Matt Cutts says is necessarily good advice."
I wouldn't argue that advice can be slanted towards G.

"IMHO, it's better to use site architecture and robots.txt to control robots than to use noindex,follow."
I agree that it is best to not have the dup. content issue in the first place. I also agree that robots.txt should be respected. In reality, it isn't infallible.

So, Greg, hopefully we've provided enough information for you to make a decision and J. Cohen and I can pee on another tree. ;-)

Search Engine Optimization (SEO)
