Fighting duplicate nodes

crea's picture

I am currently building a site where community members will fill in a directory. Each directory entry should be unique. I am looking for modules to deal with the duplicates problem: as with tags, users will occasionally add duplicate nodes with slightly different titles. There are two problems:

  1. Finding a dupe.
     This is not as easy as it may seem. Titles, fields, and other data won't match exactly; they will only be similar to some degree (otherwise I would just use the Unique Field module). OK, it may be easy at the start because dupes will be found manually, and later it will be possible to use the power of the community to help find dupes (e.g. the Flag module to flag an entry as "duplicate").

  2. Dealing with the dupes found.
     This is the most difficult part (in terms of both programming and UI). If nodes are merged, which node stays alive? Or will the result be a completely new node? How to merge fields? How to merge efficiently? How to deal efficiently with foreign keys (e.g. nodereferences) pointing to the nodes being merged?
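To make the "similar but not identical" idea in problem 1 concrete, here is a minimal sketch in Python rather than Drupal code. It scores title similarity with the standard library's difflib and lists likely dupes for a new entry. The function names, the 0.8 threshold, and the dict of node id to title are assumptions made for illustration only, not part of any existing module.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(title.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_candidates(new_title: str, existing: dict, threshold: float = 0.8) -> list:
    """Return (nid, title, score) for existing titles that resemble new_title.

    `existing` maps a hypothetical node id to its title. Results are sorted
    with the most similar title first. The threshold is an arbitrary default.
    """
    scored = [
        (nid, title, similarity(new_title, title))
        for nid, title in existing.items()
    ]
    hits = [row for row in scored if row[2] >= threshold]
    return sorted(hits, key=lambda row: row[2], reverse=True)
```

A check like this could run when a node is submitted and show the candidates to the author, or feed a review queue. Comparing every new title against every existing one gets slow on a large directory, so a real site would probably narrow the candidates first (for example with a search index).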

Any ideas? Is there a module I missed that does the job?

Comments

uniqueness module

greggles's picture

For the first problem, consider the uniqueness module: http://drupal.org/project/uniqueness

For the second...that's a bit tougher to do and I'm not sure of anything that can help with it.

Thanks for the uniqueness

crea's picture

Thanks for the uniqueness module reference.

Also looking

angheloko's picture

Hi,

I'm also looking for this functionality.

What I could think of right now is to:

1) Create a link 'Report duplicate', which will redirect the user to a page. This page will automatically search for nodes using the reported node's details, such as the title.

(I am thinking of using views for this one or maybe using the Search module's hooks...)

2) The page will also contain a Search form for the user to be able to do a manual search.

(We can redirect user to the 'Search' page and maybe pre-populate some values and trigger the search...)

3) Results will have a checkbox, which will allow user to select the nodes that he deems as dups.

(If it's going to be done using Views, we can add a checkbox in the theming part.... If it's going to be done using the Search module, maybe some hooks or again theming part...)

4) Once submitted, the report can be reviewed by the site admin, who can then do whatever administration is needed. In my case, I want approved reports to be displayed on the reported node's page. Each dup node's page will show a message saying that the node has been reported as a dup of the original node. Comments and any other interaction with the dup node are disabled.

(I can sense a lot of altering here... or maybe Rules module...)

merging and deduping

xenophyle's picture

I have been trying to write a simple form to allow removing duplicate nodes (or merging two nodes that should have been one). The user would have to enter two node ids, and then clicking a button would show a preview of all the fields after merging. The user could edit these fields before submitting the merge. It gets complicated when I consider how to do this exactly: do I use hook_form? Do I create a custom form?
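Leaving the form question aside, the merge step itself could look roughly like the sketch below. It is plain Python, not Drupal code, and it makes several assumptions: nodes are simple dicts of field name to value, the node being kept wins whenever both have a value, list fields are combined, and references live in a hypothetical "refs" list field. The preview returns a new dict, so the user could still edit it before anything is saved.

```python
def merge_preview(keep: dict, drop: dict) -> dict:
    """Build the merged field values without modifying either input.

    Values from `keep` win; fields that are empty in `keep` are filled
    from `drop`; list fields are combined with order preserved.
    """
    merged = dict(keep)
    for field, value in drop.items():
        current = merged.get(field)
        if isinstance(current, list) and isinstance(value, list):
            # New list, so the original `keep` list is left untouched.
            merged[field] = current + [v for v in value if v not in current]
        elif current in (None, "", []):
            merged[field] = value
    return merged


def repoint_references(nodes: dict, old_nid: int, new_nid: int,
                       ref_field: str = "refs") -> int:
    """Make references to old_nid point at new_nid; return nodes changed."""
    changed = 0
    for node in nodes.values():
        refs = node.get(ref_field, [])
        if old_nid in refs:
            updated = []
            for ref in refs:
                ref = new_nid if ref == old_nid else ref
                # Skip the duplicate created when a node referenced both.
                if ref not in updated:
                    updated.append(ref)
            node[ref_field] = updated
            changed += 1
    return changed
```

The second function addresses the foreign key question from the original post: after the merge is saved, anything that referenced the removed node has to be pointed at the surviving one before the dupe is deleted.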

This sounds similar to what

wxman's picture

This sounds similar to what I've been puzzling about.
I've been redesigning our book site from scratch, and we thought we were just about ready to go live when a fatal flaw dawned on me. We could end up with the same book entered many times, as different nodes.

I set it up so people can enter their own published book info, back lists, book reviews, favorite books, bestsellers, etc. Visitors could see books listed by the node type they were entered under, or broken down by what kind of writing they are (fiction, poetry, etc.). How I didn't see that this was going to run into problems is beyond me.
A person can enter their new fiction book as a node type "Book Release"; meanwhile, the same book is added as a "bestseller" by another, and still another adds it to their "Favorites" list. Now if you click the 'Fiction' tag, you see a bunch of copies of the same book, all from different nodes. The system sees nothing wrong because they all came from different nodes.
I'm trying out some ideas to see if I can't find a solution, but I haven't had much luck so far.

ISBN?

TheUnseen's picture

If you mean real books (not Drupal's book type), what about requiring the ISBN or something similar?
http://en.wikipedia.org/wiki/International_Standard_Book_Number

It should make finding duplicates easier, or even make duplicates impossible in many cases.
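One catch is that the same book can be typed with or without hyphens, or as an ISBN-10 in one node and an ISBN-13 in another. A minimal Python sketch of normalizing both forms to one 13-digit key before comparing is below. The function names are made up for illustration, and it assumes every ISBN-10 maps to the 978 prefix.

```python
import re


def isbn13_check_digit(first12: str) -> str:
    """Compute the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn10(isbn: str) -> bool:
    """Check format and the mod-11 checksum (weights 10 down to 1, X = 10)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(isbn))
    return total % 11 == 0


def canonical_isbn(raw: str):
    """Return the ISBN as 13 digits, or None if the input is not a valid ISBN."""
    isbn = raw.replace("-", "").replace(" ", "").upper()
    if is_valid_isbn10(isbn):
        first12 = "978" + isbn[:9]
        return first12 + isbn13_check_digit(first12)
    if re.fullmatch(r"[0-9]{13}", isbn) and isbn13_check_digit(isbn[:12]) == isbn[12]:
        return isbn
    return None


print(canonical_isbn("0-306-40615-2"))      # 9780306406157
print(canonical_isbn("978-0-306-40615-7"))  # 9780306406157
```

Storing the canonical value in the field that must be unique would let both spellings of the same ISBN collide, which is the point.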

Use an Amazon ASIN field with ISBN

rfay's picture

This is actually off-topic I think but might help with your specific problem. Amazon module's ASIN CCK field accepts ISBN numbers and can be used as a field in your content type. It might be useful for your specific approach. Lots of people use it for similar review-type sites.

Unique field

kristen pol's picture

You can set up your content types so that certain fields or combinations of fields must be unique using this module:

http://drupal.org/project/unique_field

Very handy!
Kristen
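For anyone wondering what "combinations of fields must be unique" means in practice, here is a rough illustration in Python. It is not how the Unique Field module is implemented; the function names and the case-insensitive, whitespace-trimmed comparison are assumptions made for this sketch.

```python
def unique_key(entry: dict, fields: tuple) -> tuple:
    """Build a comparison key from the chosen fields (trimmed, lowercased)."""
    return tuple(str(entry.get(f, "")).strip().lower() for f in fields)


def find_conflict(new_entry: dict, existing: list, fields: tuple):
    """Return the first existing entry with the same key, or None."""
    wanted = unique_key(new_entry, fields)
    for entry in existing:
        if unique_key(entry, fields) == wanted:
            return entry
    return None
```

With fields set to ("title", "author"), two entries conflict only when both values match, so two different books that share a title can still coexist.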
