Views query optimization

bjaspan's picture

I am working on "fields in core" this week and recently found myself pondering the problem of very inefficient CCK-related Views queries. Here's a brief summary of the situation:

In D6 (and, through overwhelming though unjustified demand, probably in D7), CCK fields can exist in "per content type" tables that share many single-value fields or in "per-field" tables that contain a single field that is either shared among multiple content types, is multi-valued, or both. Also, in D6, the admin can change a CCK field from one to the other form of storage, perhaps without realizing it (not all admins are RDBMS wizards). If CCK exported information about fields to Views in a way that was tied to the actual database layout, then when an admin changed the database layout of a field, all Views using that field would break and require manual intervention to fix.

Therefore, when CCK exports its fields to Views, it always tells Views that every field is in its own table, even if it is not. Suppose that you define an Event type with a date and venue nodereference field:

Table content_type_event: nid, vid, field_date_value, field_venue_nid

CCK tells views that two tables exist, named something like "node_data_field_date_value" and "node_data_field_venue_nid." They both happen to be aliased to content_type_event, but Views treats them as separate tables. So, if you add both Date and Venue as fields to a View, you get a query something like this:

SELECT node_data_field_date_value.field_date_value, node_data_field_venue_nid.field_venue_nid
FROM node n
LEFT JOIN content_type_event node_data_field_date_value ON node_data_field_date_value.vid = n.vid
LEFT JOIN content_type_event node_data_field_venue_nid ON node_data_field_venue_nid.vid = n.vid

In other words, you get two joins to the same table (content_type_event) with the same criteria.

Here's my question: Can't this query be dynamically optimized to eliminate the redundant join? If I'm right, Views can do it, or a contrib module could do it with hook_views_query_alter().

My thinking is that when you have two joins to the same table using the same "ON" criteria, they can be merged. So:

SELECT t1.field_a, t2.field_b
FROM node n
JOIN table t1 ON t1.col = n.col
JOIN table t2 ON t2.col = n.col

can be rewritten as:

SELECT t.field_a, t.field_b
FROM node n
JOIN table t ON t.col = n.col

Am I missing something? When is this not a valid transformation? I'm sure that I am going down a well-worn path here and am either re-inventing something from the 1970s or missing the obvious problem with this approach. If I am missing something and this is only sometimes valid, CCK could certainly add a setting to the table definition it exports to Views saying that for this table it is a safe transformation.



In fact, now that I think

bjaspan's picture

In fact, now that I think about "re-inventing something from the 1970s," I bet if this is a safe transformation that most RDBMS systems already implement this in their internal query optimizer. Mr. Strauss?

I believe that if using

merlinofchaos's picture

I believe that if using $query->ensure_table() it would be safe to make this optimization. However, if using $query->add_table() it would not be safe to make this optimization.

My guess is that CCK never uses $query->add_table() but my belief may be suspect.

We discussed this issue with

yched's picture

We discussed this issue with Earl and Karen on IRC a few weeks ago, the conclusion was : 'another good reason why field data should not move from one table to the other / one storage schema to the other after the field got created'. But I don't think we considered rewriting the query... Probably is a solution in a 'movable fields' world...

CCK field handlers do not implement the ->query() method, and thus defer to the regular views_handler_field_node::query()
(except for multiple fields, where it simply doesn't add anything to the views-genereted query, and retrieves data in a pre_render separate query)

I think the D7 solutions we

KarenS's picture

I think the D7 solutions we discussed was that we would not move fields from one table to another and could therefore get rid of our aliased table names (which is the main reason why you have a join for every field). The aliases have been used because the table names could change at any time so they weren't safe to use. If we don't move fields around and re-write the Views tables to use the real table name, I believe Views will create the proper joins of one join for every real table (not tested).

This all depends on getting our field structure nailed down and following up on the plan to lock fields in place once they have data in them.

And Earl is right that CCK

KarenS's picture

And Earl is right that CCK core is not using $query->add_table() anywhere. The one place we talked about it was for nodereference so we could join in other nodereference fields, but that is handled in Views 2 with relationships so is not needed any more.

Don't know about the field modules, though.

Ok, Barry: look into

merlinofchaos's picture

Ok, Barry:

look into views_query::ensure_table() and have a look. I think you only need to basically go through the list of tables and compare the join information. You must compare relationships as well, since the same table can be linked onto relationships which makes it a very different join even though it will look the same. This might make it a little bit difficult to be sure, but I believe this can be done and would be a good performance enhancement for CCK.

Oh FYI a contrib module

merlinofchaos's picture

Oh FYI a contrib module could not do it, because $query::ensure_table() needs to return the proper alias for the table.

Patch underway

Not quite

David Strauss's picture

"My thinking is that when you have two joins to the same table using the same 'ON' criteria, they can be merged."

If I modify your statement to read "two joins of the same type (LEFT, RIGHT, or INNER)," I'd say it's correct at least for the purposes of Views.


bjaspan's picture

Yes, of course the join type must be the same, but thank you for pointing it out explicitly. As it turns out my patch already checks that along with all the other details of the join data structure.

Performance testing?

mikeryan's picture

Just curious - has anyone compared the performance between the optimized and non-optimized versions of the query? As your first comment suggests, I would think the database engine would be capable of optimizing this on its own.

Not that it doesn't have value, even if the db performance savings is minimal - I've been debugging some hairy Views queries, and this optimization would certainly make it a lot easier to see what's going on...

Mike Ryan

Mike Ryan

Even if it doesn't improve

merlinofchaos's picture

Even if it doesn't improve performance, there's still a maximum number of joins and this could seriously impact that. That said, it would be worth running some simple tests.