Posted by z.stolar on November 1, 2007 at 8:08am
Hello all,
I'm using the Taxonomy: term ID argument as a view argument, and I'm passing to it a list of about 20 term IDs (it changes from page to page).
The result is a huge query that chokes the MySQL server, and drops the site down. The query is pasted further down this mail.
After manually changing the query to use IN(tid, tid2, tid3...), and getting rid of many LEFT JOINS, the query ran without any problem, and quite fast.
I'd appreciate pointers to issues regarding this problem, if you know of any...
And for the curious ones, take a look at the query:
BRFORE
SELECT DISTINCT(node.nid), node.title AS node_title, node.changed AS node_changed FROM node node
LEFT JOIN content_type_library_item node_data_field_in_library ON node.vid = node_data_field_in_library.vid
LEFT JOIN term_node term_node ON node.nid = term_node.nid LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid LEFT JOIN term_node term_node2 ON node.nid = term_node2.nid LEFT JOIN term_hierarchy term_hierarchy2 ON term_node2.tid = term_hierarchy2.tid LEFT JOIN term_node term_node3 ON node.nid = term_node3.nid LEFT JOIN term_hierarchy term_hierarchy3 ON term_node3.tid = term_hierarchy3.tid LEFT JOIN term_node term_node4 ON node.nid = term_node4.nid LEFT JOIN term_hierarchy term_hierarchy4 ON term_node4.tid = term_hierarchy4.tid LEFT JOIN term_node term_node5 ON node.nid = term_node5.nid LEFT JOIN term_hierarchy term_hierarchy5 ON term_node5.tid = term_hierarchy5.tid LEFT JOIN term_node term_node6 ON node.nid = term_node6.nid LEFT JOIN term_hierarchy term_hierarchy6 ON term_node6.tid = term_hierarchy6.tid LEFT JOIN term_node term_node7 ON node.nid = term_node7.nid LEFT JOIN term_hierarchy term_hierarchy7 ON term_node7.tid = term_hierarchy7.tid LEFT JOIN term_node term_node8 ON node.nid = term_node8.nid LEFT JOIN term_hierarchy term_hierarchy8 ON term_node8.tid = term_hierarchy8.tid LEFT JOIN term_node term_node9 ON node.nid = term_node9.nid LEFT JOIN term_hierarchy term_hierarchy9 ON term_node9.tid = term_hierarchy9.tid LEFT JOIN term_node term_node10 ON node.nid = term_node10.nid LEFT JOIN term_hierarchy term_hierarchy10 ON term_node10.tid = term_hierarchy10.tid LEFT JOIN term_node term_node11 ON node.nid = term_node11.nid LEFT JOIN term_hierarchy term_hierarchy11 ON term_node11.tid = term_hierarchy11.tid LEFT JOIN term_node term_node12 ON node.nid = term_node12.nid LEFT JOIN term_hierarchy term_hierarchy12 ON term_node12.tid = term_hierarchy12.tid LEFT JOIN term_node term_node13 ON node.nid = term_node13.nid LEFT JOIN term_hierarchy term_hierarchy13 ON term_node13.tid = term_hierarchy13.tid LEFT JOIN term_node term_node14 ON node.nid = term_node14.nid LEFT JOIN term_hierarchy term_hierarchy14 ON term_node14.tid = term_hierarchy14.tid LEFT JOIN term_node term_node15 ON node.nid = term_node15.nid LEFT JOIN term_hierarchy term_hierarchy15 ON term_node15.tid = term_hierarchy15.tid LEFT JOIN term_node term_node16 ON node.nid = term_node16.nid LEFT JOIN term_hierarchy term_hierarchy16 ON term_node16.tid = term_hierarchy16.tid LEFT JOIN term_node term_node17 ON node.nid = term_node17.nid LEFT JOIN term_hierarchy term_hierarchy17 ON term_node17.tid = term_hierarchy17.tid LEFT JOIN term_node term_node18 ON node.nid = term_node18.nid LEFT JOIN term_hierarchy term_hierarchy18 ON term_node18.tid = term_hierarchy18.tid LEFT JOIN term_node term_node19 ON node.nid = term_node19.nid LEFT JOIN term_hierarchy term_hierarchy19 ON term_node19.tid = term_hierarchy19.tid
WHERE (node_data_field_in_library.field_in_library_nid = 2045) AND (term_node.tid = '216') AND (term_node2.tid = '24') AND (term_node3.tid =
'10') AND (term_node4.tid = '148') AND (term_node5.tid = '25') AND (term_node6.tid = '139') AND (term_node7.tid = '11') AND (term_node8.tid = '14') AND (term_node9.tid = '175') AND (term_node10.tid = '212') AND (term_node11.tid = '213') AND (term_node12.tid = '214') AND (term_node13.tid = '30') AND (term_node14.tid = '33') AND (term_node15.tid = '35') AND (term_node16.tid = '18') AND (term_node17.tid = '20') AND (term_node18.tid = '3') AND (term_node19.tid = '2') GROUP BY node.nid, node_title ORDER BY node_title ASC LIMIT 0, 50;AFTER
SELECT DISTINCT(node.nid), node.title AS node_title, node.changed AS node_changed FROM node node
LEFT JOIN content_type_library_item node_data_field_in_library ON node.vid = node_data_field_in_library.vid
LEFT JOIN term_node term_node ON node.nid = term_node.nid
LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid
WHERE (node_data_field_in_library.field_in_library_nid = 2045)
AND term_node.tid IN ('216','24','10','148','25','139','11','14','175','212','213','214','30','33','35','18','20','3','2')
GROUP BY node.nid, node_title ORDER BY node_title ASC LIMIT 0, 50;
Comments
How many times can a table be JOINed to itself?
Seems that I had a small mistake, but it made me understand a critical limitation of views or perhaps of the way taxonomy is built:
In the above rewriting of the SQL, I actually created a completely different query, so the fix dosen't really fix - I've made the query an OR query instead of AND.
Now, digging a bit more inside (with the kind help of yhager), I now understand that the problem lies in the way taxonomy tables are built, more exact, the term_node table: When we want to choose a node that has term1 AND term2, we JOIN the term_node table to itself. The question is - how scalable is that method, and could there possibly be another method? Another sophisticated SQL syntax?
core problem
this is how care taxonomy has worked since 2002 when it was born. this is not related to Views. You get the same query if you do taxonomy/term/3,4,5 for example (tids might be separated by +, i can't remember which is which)
I certainly don't know of
I certainly don't know of another way to do AND with terms. It gets even worse when you add depth, too.
New table in the database?
Are the terms you want hard-coded, or can the user select from a whole bunch of selects? There might be a solution to this, but it's a bit fussy.
Basically what this does is spreads out the hit of the query in question a bit, it's like doing the joining when you are entering the data. It's a bit of a hackish solution that I haven't actually tried out.
Actually having posted this solution, I've found some reasons why it won't work (in general).
We can't store the information as an integer because MYSQL maximum integer size is 18446744073709551615 (which is 264) which would limit you to 64 terms maximum. If this limitation doesn't affect you, then store the sum of the powers of 2 of the individual terms.
If we store the information as a string, then whenever we add new taxonomy terms, we'll have to update the string already in the database so that they show up in our search, which sounds query heavy to me.
More simply, you could create this table and store a bunch of 1's and 0's and save yourself the joins later on. Not as efficient as I was hoping for but, oh well.
Sorry, I don't see a general solution that works completely either, but I'm going to post this partial solution so we all can have a good laugh over it.
Dave
Will views2 make it better?
It was funny to see the huge query.
Will views2 make it better?
Amnon
-
Professional: Drupal Israel | Drupal Development & Consulting | Effective Hosting Strategies | בניית אתרים
Personal: Hitech Dolphin: Regain Simple Joy :)
Can hand-optimized Views queries be run thru Views theming?
We have some very complex views queries running on www.urbanministry.org (on our volunteer & organization search pages - /volunteer & /org) that we'd like to optimize, as we have for our other site, www.christianvolunteering.org.
z.stolar, you said that you "manually changed the query" & it ran faster. Do you mean that you actually got Views to run the optimized version of the query or just that you tested it in MySQL? I would love to leverage the Views theming framework to display the GMaps of our volunteer & organization opportunities, while getting the speed improvements of hand-optimized MySQL code.
But maybe that's just a pipe dream :)
My apologies if this is a dead issue.
Just for kicks, here's one of the queries in question:
SELECT node.nid,node.title AS node_title,
node.changed AS node_changed,
location.latitude AS location_latitude,
location.longitude AS location_longitude,
node_data_field_volopp_org_name.field_volopp_org_name_value AS
node_data_field_volopp_org_name_field_volopp_org_name_value,
location.name AS location_name,
location.street AS location_street,
location.additional AS location_additional,
location.city AS location_city,
location.province AS location_province,
location.country AS location_country,
location.postal_code AS location_postal_code,
node_data_field_volopp_num_vol_opp.field_volopp_num_vol_opp_value AS
node_data_field_volopp_num_vol_opp_field_volopp_num_vol_opp_value,
node.type AS gmap_node_type,
gmap_taxonomy_node.marker AS gmap_taxonomy_marker
FROM {node} node
LEFT JOIN {location} location
ON node.vid = location.eid
LEFT JOIN {content_type_volunteer_opportunity} node_data_field_volopp_org_name
ON node.vid = node_data_field_volopp_org_name.vid
LEFT JOIN {content_type_volunteer_opportunity} node_data_field_volopp_num_vol_opp
ON node.vid = node_data_field_volopp_num_vol_opp.vid
LEFT JOIN {gmap_taxonomy_node} gmap_taxonomy_node
ON node.vid = gmap_taxonomy_node.vid
WHERE (node.type IN ('volunteer_opportunity'))
This has nothing to do with views
Hi,
views cannot make this any better.
The current taxonomy setup cannot manage hierarchical taxonomy deeper than two (parent-child) relationships efficiently.
We had a 5 level hierarchy and had to implement a separate table like this
Nid Level1 Level2 Level3 Level4 Level5
Nid column is nid and all the level columns are actual term ids. This way I see the entire hierarchy in one view and without any joins!!!!!!
Sweeet.
This shaved off approximately 70 000 milliseconds according to devel :DD
****Me and Drupal :)****
Clickbank IPN - Sell online or create a membership site with the largest affiliate network!
Review Critical - One of my sites
Bah, wrong focus.
Running into this, too, albeit with different queries, and I identify it as developers focusing on the wrong point.
What the query does, is check for terms in a hierarchical table (every record had a 'parent' ID that refers to a record in the same table) - and relational databases simply weren't designed for that. Even Oracle with it's CONNECT BY clause isn't all that performant.
Giorgio's solution maps better onto relational databases, but limits the depth you can go to the number of tables you create. It's also quite cluttersome.
The correct way to deal with this, is to do it in code.
Our original query:
mysql> SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created FROM node nodeLEFT JOIN term_node term_node ON node.nid = term_node.nid
LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid
LEFT JOIN term_hierarchy term_hierarchy2 ON term_hierarchy.parent = term_hierarchy2.tid
LEFT JOIN term_hierarchy term_hierarchy3 ON term_hierarchy2.parent = term_hierarchy3.tid
LEFT JOIN term_hierarchy term_hierarchy4 ON term_hierarchy3.parent = term_hierarchy4.tid
LEFT JOIN term_hierarchy term_hierarchy5 ON term_hierarchy4.parent = term_hierarchy5.tid
WHERE (node.status = '1')
AND (node.type IN ('artikel'))
AND (term_node.tid = '9505'
OR term_hierarchy2.tid = '9505'
OR term_hierarchy3.tid = '9505'
OR term_hierarchy4.tid = '9505'
OR term_hierarchy5.tid = '9505')
ORDER BY node_sticky DESC, node_created_created DESC
LIMIT 0, 5;
+--------+-------------+----------------------+
| nid | node_sticky | node_created_created |
+--------+-------------+----------------------+
| 185733 | 0 | 1264932120 |
| 158914 | 0 | 1240059939 |
+--------+-------------+----------------------+
2 rows in set (0.48 sec)
There's a near-identical one that does a count(), too. Notice this takes about three quarters of a second on our current dataset - quite a lot for a busy site. All timings are of course on an unloaded host and with SQL_NO_CACHE.
This query searches term_node and the depth of term_hierarchy for a specific term. The more levels of depth you take, the harder the query becomes - a join is basically a cartesian product, so every extra level makes the intermediate dataset grow exponentially.
So, how do you do this in code ? Simple.
First, you resolve the tree structure:
mysql> SELECT SQL_NO_CACHE tid, parentFROM term_hierarchy
WHERE tid = 9505;
+------+--------+
| tid | parent |
+------+--------+
| 9505 | 0 |
+------+--------+
1 row in set (0.00 sec)
If 'parent' is not 0, repeat with the new tid, and again and again until you're at the top of the tree.
Now you have what you were looking for in the hierarchy. Use this in a straight join of node and term_node. If you found nothing in the hierarchy, or found a different one as top-level, you must also include all the others you've found - I found that entries can exist in term_node that don't exist in term_hierarchy. In this example, I found only the same, so my IN has only one entry. If anyone has more precise insight into what can or cannot exist in those two tables, please correct.
mysql> SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_createdFROM node join term_node on node.nid = term_node.nid
WHERE status = 1
AND type in ('artikel')
AND tid in (9505)
ORDER BY node_sticky desc, node_created_created desc
LIMIT 0,5;
+--------+-------------+----------------------+
| nid | node_sticky | node_created_created |
+--------+-------------+----------------------+
| 185733 | 0 | 1264932120 |
| 158914 | 0 | 1240059939 |
+--------+-------------+----------------------+
2 rows in set (0.00 sec)
Ooo, a 0.48 second query replace by two 0.00 ones.
I passed this on to our devs, but no clue if they'll be allowed to feed the module back here. I suggest someone else do this. Don't forget to remove the SQL_NO_CACHE :-p
Thanks for passing along this
Thanks for passing along this approach. I was facing the same issue. Here's the function I wrote to get all descendant tids of a tid:
/*** Given a term id, return it and all descendant tids.
*/
function hci_tracksonomy_descendants($tid) {
$tids = array();
if ($tid) {
$resource = db_query('SELECT tid FROM {term_hierarchy} WHERE parent = %d', $tid);
while ($row = db_fetch_object($resource)) {
$tids[] = $row->tid;
$tids = array_merge($tids, hci_tracksonomy_descendants($row->tid));
}
}
return $tids;
}
Note that this does not return the original argument. If you want it (you probably do), add it, as in:
$tids = hci_tracksonomy_descendants($tid);$tids[] = $tid;
And here is how it is used to cook up pieces of SQL to be integrated into a node query:
/**
* Given a tid, return SQL to filter for nodes tagged with this term or any
* descendant term ids (if any).
*
* @param
* $tid - the parent term id
* $node_alias - alias of {node} table we're joining on
*
* @return
* $term_table - INNER JOIN SQL (or empty string)
* $term_where - WHERE CLAUSE SQL (or empty string)
* $term_args - array of substitution args (or empty array)
*/
function hci_tracksonomy_tid_filter($tid, $node_alias = 'n') {
$term_table = '';
$term_where = '';
$term_args = array();
if ($tid) {
$term_table = "INNER JOIN {term_node} tn ON tn.vid = $node_alias.vid";
$tids = hci_tracksonomy_descendants($tid);
$tids[] = $tid;
$ph = db_placeholders($tids, 'int');
$term_where = "AND tn.tid in ($ph)";
$term_args = array_merge($term_args, $tids);
}
return array($term_table, $term_where, $term_args);
}
Taxonomy Edge
The Taxonomy Edge implements a new data model for the terms making it easier and more efficient to retrieve terms with children and at arbitrary level.
It can turn this query:
SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created FROM node nodeLEFT JOIN term_node term_node ON node.nid = term_node.nid
LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid
LEFT JOIN term_hierarchy term_hierarchy2 ON term_hierarchy.parent = term_hierarchy2.tid
LEFT JOIN term_hierarchy term_hierarchy3 ON term_hierarchy2.parent = term_hierarchy3.tid
LEFT JOIN term_hierarchy term_hierarchy4 ON term_hierarchy3.parent = term_hierarchy4.tid
LEFT JOIN term_hierarchy term_hierarchy5 ON term_hierarchy4.parent = term_hierarchy5.tid
WHERE (node.status = '1')
AND (node.type IN ('artikel'))
AND (term_node.tid = '9505'
OR term_hierarchy2.tid = '9505'
OR term_hierarchy3.tid = '9505'
OR term_hierarchy4.tid = '9505'
OR term_hierarchy5.tid = '9505')
ORDER BY node_sticky DESC, node_created_created DESC
LIMIT 0, 5;
Into this:
SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created FROM node nodeJOIN term_node term_node ON node.nid = term_node.nid
JOIN term_edge term_edge ON term_edge.tid = term_node.tid
WHERE (node.status = '1')
AND (node.type IN ('artikel'))
AND (term_edge.parent = '9505'
ORDER BY node_sticky DESC, node_created_created DESC
LIMIT 0, 5;
The extra data in term_edge could be considered a "cache" as it does contain some redundant information from term_hierarchy. However, I like to think of it as an extension of the hierarchical data model.
It also integrates with Views
I've seen a lot of attempts to fix this fundamental issue from recursive function calls (code or sql) to nested sets (modified pre-order tree traversel). To me the best balance between READs/WRITEs is the technique used Taxonomy Edge.
Added to the wiki
Thanks for the heads up!
http://groups.drupal.org/node/187209