DRIOMETRICS: An Idea for Automating Identification of Duplicate Modules


I have an idea for dealing with duplicate modules at a more (but not fully) automatic level.

As I understand the problem, duplicates are caused by one of the following situations:

  1. The Unknown Comic Module: This is where a coder has a "novel" idea, unaware that module #3,247 already does the same thing. Of course, all coders are supposed to be intimately familiar with all 6000+ modules, so this should rarely happen. When a coder makes this kind of mistake, obviously we should put a paper bag over their HEAD and sign them up for the next comedy club amateur night where they can really embarrass themselves. ;)

  2. Close But No Cigar: This is like #1, except that here the "unknown" module already does what the coder wants; it just does it for a different content type than the one the coder is trying to manipulate, and adapting the older module to handle the new type of content would be trivial. Since there are only a very few content types, this should not be a serious problem as far as I can tell. ;)

  3. Fork You Very Much: This is where a coder says, "I want to take this idea in a different direction." This can occur for numerous reasons, from something as pure as exploring a new approach to an old problem to petty politics between members of a team of coders. In the end, the reasons are not as important as the fact that these "duplicates" are created by conscious free choice, which is often the genesis of a trivial little thing called "innovation".

I could be wrong, but it seems to me the vast majority of duplicates fall into Category #1 or #2, and that is where I hope my idea can help. I understand that users are asked to check for duplicates before starting projects, but as Drupal grows this is going to become harder and harder to do. Frankly, it is nearly impossible right now. HOWEVER ...

Suppose some of the brighter coders in our flock created some simple text-scanning code to identify just the cross-references of each module. This does not have to be fancy; all that is needed is a simple index listing showing a count for each API call and external identifier used in a given module file. A big plus, if possible, would be somehow distinguishing READs from WRITEs for external variables and data tables.
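To make the idea concrete, here is a minimal sketch of such a scanner in Python. Everything in it is an assumption for illustration only: the function name `fingerprint`, the regular expressions, and the keyword list are all made up. It treats any identifier followed by "(" as a call and drops functions the module defines itself, so that only external references are counted. It does not skip comments or strings, and it makes no attempt at the READ versus WRITE distinction.

```python
import re
from collections import Counter

# Assumption: an identifier directly followed by "(" is treated as a call.
CALL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")
# Assumption: "function name" marks a function defined inside this file.
DEF_RE = re.compile(r"\bfunction\s+&?\s*([A-Za-z_][A-Za-z0-9_]*)")
# Language constructs that look like calls but are not API references.
NOT_CALLS = {"if", "elseif", "for", "foreach", "while", "switch", "catch",
             "function", "array", "list", "isset", "empty", "unset",
             "echo", "print", "return"}


def fingerprint(source: str) -> Counter:
    """Count references to functions that the file does not define itself."""
    defined = {m.group(1) for m in DEF_RE.finditer(source)}
    counts = Counter()
    for match in CALL_RE.finditer(source):
        name = match.group(1)
        if name.lower() in NOT_CALLS or name in defined:
            continue
        counts[name] += 1
    return counts


sample = """
function mymodule_block_view() {
  $node = node_load(1);
  if ($node) {
    $title = check_plain($node->title);
    variable_set('mymodule_last', $title);
  }
  return t('Hello') . t('World');
}
"""

fp = fingerprint(sample)
for name, count in sorted(fp.items()):
    print(name, count)
```

A real scanner would probably want to use a proper tokenizer rather than regular expressions, but even this crude count is enough to produce something fingerprint-like.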

This index listing would be sort of like a crude "fingerprint" of the file (later we can enhance this process with more advanced biometric analogs like retinal scans and DNA analysis, but for now let's just start with the easy stuff). The existing library of code would be scanned first (this could be done offline or in the background) so that each module file has an associated "fingerprint". After that, whenever a new commit is made, the committed file is scanned and a new fingerprint file is created just for it.
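The two phases described above (one bulk scan of the library, then one small update per commit) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `build_index`, `save_index`, and `add_commit` are hypothetical, the fingerprints are stored in a single JSON file rather than one file per module, and the scanner is passed in as a parameter so that any counting routine can be plugged in.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Callable, Dict

Scanner = Callable[[str], Counter]


def build_index(root: Path, scan: Scanner, pattern: str = "*.module") -> Dict[str, Dict[str, int]]:
    """Bulk phase: fingerprint every matching file under root."""
    index = {}
    for path in sorted(root.rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        index[path.relative_to(root).as_posix()] = dict(scan(text))
    return index


def save_index(index: Dict[str, Dict[str, int]], index_file: Path) -> None:
    """Write the whole fingerprint library to one JSON file."""
    index_file.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")


def add_commit(index_file: Path, name: str, fp: Counter) -> Dict[str, Dict[str, int]]:
    """Commit phase: add or replace the fingerprint of one committed file."""
    index = {}
    if index_file.exists():
        index = json.loads(index_file.read_text(encoding="utf-8"))
    index[name] = dict(fp)
    save_index(index, index_file)
    return index
```

Keeping the scanner separate from the storage means the library can be re-fingerprinted later with an improved scanner without touching the rest of the process.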

Now comes the good part: Since the fingerprint is essentially an XREF count and is not concerned with flowcharting, the new fingerprint could be scored against all the other fingerprints, and the committer would be sent an email identifying the 10 fingerprints most similar to the file just committed. The committer would be asked to use the list to take a look at those files and determine whether there is overlap and/or perhaps even useful ideas in those other projects.
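The post does not say how the scoring should work, so the following is just one possible approach, not a settled design: treat each fingerprint as a vector of counts and rank the library by cosine similarity. The function names and the module names and identifiers in the toy library are all made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count vectors (0.0 if either is empty)."""
    dot = sum(count * b[name] for name, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(new_fp: Counter, library: Dict[str, Counter],
                 top_n: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_n (module, score) pairs, best match first."""
    scored = [(name, cosine(new_fp, fp)) for name, fp in library.items()]
    # Sort by descending score; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy library with hypothetical modules and identifiers.
library = {
    "alpha_gallery": Counter(node_load=3, image_resize=2, t=4),
    "beta_mailer": Counter(mail_send=5, t=2),
    "gamma_gallery": Counter(node_load=2, image_resize=3, t=3),
}
committed = Counter(node_load=3, image_resize=2, t=5)

ranking = most_similar(committed, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data the two gallery modules rank above the mailer, which is the kind of list the committer would receive by email. Whether cosine similarity is the right measure for real module code is an open question.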

In the end, human judgment would be the deciding factor but at least in this way the coders would have a fighting chance to see where their code fits in with the thousands of other modules in the Drupal repository.

As for Category #3, there is not and probably should not be a "solution", since to interfere in another coder's free choices utterly spits in the face of the principles of the Drupal Project. The one thing we can do is require that, when a module is intentionally forked, the coder creating the fork clearly label it as such and highlight the technical reasons for the fork (but not the political reasons, if any). A footnote should be placed on the project description of both the forked and the original projects, thus creating an easy-to-see relationship. Perhaps a special field could even be added to the repository exclusively for notes about forking details. In this way users can be informed of the variances, compare the modules, and make their own free & informed choices.

NOTES: The fingerprint algorithms will likely need to be tweaked at first, so expect some re-fingerprinting of the library in the early stages. Some references might need to be weighted (such as security-related identifiers), and the print-comparison algorithm is likely to need some work as well. Some of the ideas in the UML world might also be useful to consider. This is going to be a long-term project and should not be given any kind of arbitrary target date, since it is for internal utilitarian purposes and not market-driven in any way.
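As a rough illustration of the weighting idea, a per-identifier multiplier could be applied to a fingerprint before it is compared. The identifiers chosen and the weight values below are pure assumptions for the example, as is the function name `apply_weights`; picking real weights would be part of the tweaking mentioned above.

```python
from collections import Counter
from typing import Dict, Mapping

# Hypothetical weights: security-related identifiers count three times as much.
WEIGHTS = {"check_plain": 3.0, "user_access": 3.0}


def apply_weights(fp: Mapping[str, int], weights: Mapping[str, float],
                  default: float = 1.0) -> Dict[str, float]:
    """Scale each count by its weight; unlisted identifiers use the default."""
    return {name: count * weights.get(name, default) for name, count in fp.items()}


weighted = apply_weights(Counter(check_plain=2, t=4), WEIGHTS)
print(weighted)
```

Because the weights live in a separate table, they can be changed without re-scanning any files; only the comparison step has to be run again.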

Comments

Project Name


This is just trivia, but it occurred to me that I should clarify that the project name I propose, DrioMetrics, is based on the word "biometrics", so it should be pronounced "dry-o-metrics" (as in eye) and not "dree-o-metrics" (as in trio). :)