Another Approach to Trees

Events happening in the community are now at Drupal community events on www.drupal.org.
valante's picture

Hi all,

In the last 6 years or so I've been working with Dr. Tzachy Yuli on a brilliant custom ERP that has a strong resemblance to Drupal. Our ERP included three components only: Entities, Trees, and Collections. I won't get into a full discussion of these elements (though I think Drupal can benefit a lot from this approach to data-basing -- if anyone's interested, I'd be glad to elaborate), but I do want to share here our experience with trees.

Trees are a major, major part of our ERP. They define the language of the organization, and so we use them all the time. One thing we've noticed about trees, is that you very often read them, but much more rarely edit them. This distinction helped us choose an optimal data representation format which I think Drupal will enjoy greatly.

Instead of the (ID, Parent-ID) format, we chose to represent each tree-node as (ID, Node-String). The Node-String is a series of period-separated numbers (much like an IP address) that records the exact position of the tree-node in the tree, beginning at the root.

For example, with the Node-String in parenthesis:

Root-level term (1)
-- 1st level child (1.1)
---- 2nd level child (1.1.1)
---- Another 2nd level child (1.1.2)
-- Another 1st level child (1.2)
Another root-level term (2)
-- Another 1st level child (2.1)

And so on.

Notice that just by looking at a Node-String, you can immediately know the tree-node's depth and its order among its siblings. This is one big advantage of Node-Strings.

But it only scratches the surface.

No need to tell you how complicated, time- and space-consuming is the recursive tree function Taxonomy currently uses. Forget all about it. With Node-String, you can retrieve an entire tree with a single SQL command--simply select the entire tree table and sort it by the Node-String (more on that below).

Even better, you can just as easily get any sub-tree. Given the requested tree-node's ID, check its Node-String, then retrieve all tree-nodes that begin with that Node-String. Voila.

Want depth-limited sub-trees? Retrieve all tree-nodes that begin with the requested Node-String and only have that many periods in their own Node-String.

Need to find the parent of a tree-node? Lop off the last segment of its Node-String, and you have the parent's Node-String. Need to find its next sibling? Increment the last Node-String segment and see if such a tree-node exists.

Finally, Views will be able to decently sort by taxonomy. Book navigation? Piece of cake. Menus and sub-menus? Peanuts.

I'm sure you see the possibilities.

Of course, there's no free lunch, so here come the downsides.

For one, it's much harder to edit a tree with Node-Strings. Move, delete, or insert one tree-node, and you have to update the Node-Strings of everything. But then, I started this post by stating that trees are much more often read than written. I consider this a fair trade-off, especially since the end user who expects the smoothest experience almost always reads the trees but does not write them.

Keeping a Node-String takes more DB space than simply storing the parent's ID. But Drupal often sacrifices space in favor of smooth running, and rightly so. This one's negligible in my eyes.

Last but not least, the question of sorting trees has a minor complication. Namely, it hinges upon the alphabetical sorting of the Node-String, in which "1.10" unfortunately comes before "1.2". There are two solutions for this. One is to create a special comparison function that converts the string into an array of numbers and evaluates them as such. (This is possible with SQL Server 2005/8, I'm not sure about MySQL).

The other, much simpler solution is to write each Node-String with placeholders for a predetermined number of tree-nodes; for example, "000" instead of "0", "000.001" instead of "0.1", and so on. This gives you instant string sorting at the cost of two things: one, more space (which again I find is negligible); and the other, a limit on how many siblings may reside under each node (in the above example, 999). Here I would plea that trees with more than a thousand siblings per tree-node are more flat lists than trees, since they no longer make tree-sense to the human brain. I find this constraint inelegant but sufferable.

In my 3+ years of working with Drupal (still a novice, I know), I found this advanced approach to trees the principal thing missing in an otherwise extremely powerful CMS. I know it's quite an about-face for the Taxonomy module, but I believe it would be worth the effort across the board.

I'd also like to state again that the ideas presented here are mostly the brainchildren of Dr. Tzachy Yuli--I have no personal claim on them, except for the privilege of working alongside him.

So. Your thoughts, show me them :)

Tal

Comments

C'mon, no takers?

valante's picture

C'mon, no takers?

An amazing concept...

DeeZone's picture

An amazing concept that makes perfect sense based on my very, very past experience on the same topic. I'm following this thread in the hopes that those closer to the core taxonomy module will take note. I have no idea who you should poke specifically but I suspect this post might get lost in the chatter.

This is very similar to

catch's picture

This is very similar to what's used in core for comment threading. The main limitation with it and other materialized path implementations is it's not possible to have the same item at two different points in a tree - which makes it hard to apply to taxonomy.

There's some discussion of various approaches to this in this issue http://drupal.org/node/344019 (has not been updated recently).

chx also posted a new module recently which is trying to tackle the same issue but with a different encoding http://drupal.org/project/entity_tree

So overall, there's been some work on this but it's very thorny and nothing's really made it into core.

Thanks, DeeZone and

valante's picture

Thanks, DeeZone and catch!

catch, I see what you mean. Let's put aside the fact I don't think a single tree-node should physically belong in two places in a tree. The issue you bring up can be easily solved by using a one-to-many relationship (called in our ERP a Collection). A tree-node will have a collection of Node-Strings:

Tree table: ID, Name, Whatever (ID is primary key)
Tree structure table: ID, Node-String (ID & Node-String together are primary key)

(Edited to clarify that the ID in the tree structure table is a pointer to the tree table ID.)

Everything should work the same and just as easily once you combine the two tables with INNER JOIN.

I'll go read those other discussions you mentioned . . .

Taxonomy

Group organizers

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds: