Huge views query killed the server

Events happening in the community are now at Drupal community events on www.drupal.org.
z.stolar's picture

Hello all,

I'm using the Taxonomy: term ID argument as a view argument, and I'm passing to it a list of about 20 term IDs (it changes from page to page).
The result is a huge query that chokes the MySQL server, and drops the site down. The query is pasted further down this mail.
After manually changing the query to use IN(tid, tid2, tid3...), and getting rid of many LEFT JOINS, the query ran without any problem, and quite fast.

I'd appreciate pointers to issues regarding this problem, if you know of any...

And for the curious ones, take a look at the query:

BRFORE

SELECT DISTINCT(node.nid), node.title AS node_title, node.changed AS node_changed FROM node node

LEFT JOIN content_type_library_item node_data_field_in_library ON node.vid = node_data_field_in_library.vid

LEFT JOIN term_node term_node ON node.nid = term_node.nid LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid LEFT JOIN term_node term_node2 ON node.nid = term_node2.nid LEFT JOIN term_hierarchy term_hierarchy2 ON term_node2.tid = term_hierarchy2.tid LEFT JOIN term_node term_node3 ON node.nid = term_node3.nid LEFT JOIN term_hierarchy term_hierarchy3 ON term_node3.tid = term_hierarchy3.tid LEFT JOIN term_node term_node4 ON node.nid = term_node4.nid LEFT JOIN term_hierarchy term_hierarchy4 ON term_node4.tid = term_hierarchy4.tid LEFT JOIN term_node term_node5 ON node.nid = term_node5.nid LEFT JOIN term_hierarchy term_hierarchy5 ON term_node5.tid = term_hierarchy5.tid LEFT JOIN term_node term_node6 ON node.nid = term_node6.nid LEFT JOIN term_hierarchy term_hierarchy6 ON term_node6.tid = term_hierarchy6.tid LEFT JOIN term_node term_node7 ON node.nid = term_node7.nid LEFT JOIN term_hierarchy term_hierarchy7 ON term_node7.tid = term_hierarchy7.tid LEFT JOIN term_node term_node8 ON node.nid = term_node8.nid LEFT JOIN term_hierarchy term_hierarchy8 ON term_node8.tid = term_hierarchy8.tid LEFT JOIN term_node term_node9 ON node.nid = term_node9.nid LEFT JOIN term_hierarchy term_hierarchy9 ON term_node9.tid = term_hierarchy9.tid  LEFT JOIN term_node term_node10 ON node.nid = term_node10.nid LEFT JOIN term_hierarchy term_hierarchy10 ON term_node10.tid = term_hierarchy10.tid LEFT JOIN term_node term_node11 ON node.nid = term_node11.nid LEFT JOIN term_hierarchy term_hierarchy11 ON term_node11.tid = term_hierarchy11.tid LEFT JOIN term_node term_node12 ON node.nid = term_node12.nid LEFT JOIN term_hierarchy term_hierarchy12 ON term_node12.tid = term_hierarchy12.tid LEFT JOIN term_node term_node13 ON node.nid = term_node13.nid LEFT JOIN term_hierarchy term_hierarchy13 ON term_node13.tid = term_hierarchy13.tid LEFT JOIN term_node term_node14 ON node.nid = term_node14.nid LEFT JOIN term_hierarchy term_hierarchy14 ON term_node14.tid = term_hierarchy14.tid LEFT JOIN term_node term_node15 ON node.nid = term_node15.nid LEFT JOIN term_hierarchy term_hierarchy15 ON term_node15.tid = term_hierarchy15.tid LEFT JOIN term_node term_node16 ON node.nid = term_node16.nid LEFT JOIN term_hierarchy term_hierarchy16 ON term_node16.tid = term_hierarchy16.tid LEFT JOIN term_node term_node17 ON node.nid = term_node17.nid LEFT JOIN term_hierarchy term_hierarchy17 ON term_node17.tid = term_hierarchy17.tid LEFT JOIN term_node term_node18 ON node.nid = term_node18.nid LEFT JOIN term_hierarchy term_hierarchy18 ON term_node18.tid = term_hierarchy18.tid LEFT JOIN term_node term_node19 ON node.nid = term_node19.nid LEFT JOIN term_hierarchy term_hierarchy19 ON term_node19.tid = term_hierarchy19.tid

WHERE (node_data_field_in_library.field_in_library_nid = 2045) AND (term_node.tid = '216') AND (term_node2.tid = '24') AND (term_node3.tid =
'10') AND (term_node4.tid = '148') AND (term_node5.tid = '25') AND (term_node6.tid = '139') AND (term_node7.tid = '11') AND (term_node8.tid = '14') AND (term_node9.tid = '175') AND (term_node10.tid = '212') AND (term_node11.tid = '213') AND (term_node12.tid = '214') AND (term_node13.tid = '30') AND (term_node14.tid = '33') AND (term_node15.tid = '35') AND (term_node16.tid = '18') AND (term_node17.tid = '20') AND (term_node18.tid = '3') AND (term_node19.tid = '2') GROUP BY node.nid, node_title ORDER BY node_title ASC LIMIT 0, 50;

AFTER

SELECT DISTINCT(node.nid), node.title AS node_title, node.changed AS node_changed FROM node node
  LEFT JOIN content_type_library_item node_data_field_in_library ON node.vid = node_data_field_in_library.vid

  LEFT JOIN term_node term_node ON node.nid = term_node.nid
  LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid

  WHERE (node_data_field_in_library.field_in_library_nid = 2045)

  AND term_node.tid IN ('216','24','10','148','25','139','11','14','175','212','213','214','30','33','35','18','20','3','2')

  GROUP BY node.nid, node_title ORDER BY node_title ASC LIMIT 0, 50;

Comments

How many times can a table be JOINed to itself?

z.stolar's picture

Seems that I had a small mistake, but it made me understand a critical limitation of views or perhaps of the way taxonomy is built:

In the above rewriting of the SQL, I actually created a completely different query, so the fix dosen't really fix - I've made the query an OR query instead of AND.

Now, digging a bit more inside (with the kind help of yhager), I now understand that the problem lies in the way taxonomy tables are built, more exact, the term_node table: When we want to choose a node that has term1 AND term2, we JOIN the term_node table to itself. The question is - how scalable is that method, and could there possibly be another method? Another sophisticated SQL syntax?

core problem

moshe weitzman's picture

this is how care taxonomy has worked since 2002 when it was born. this is not related to Views. You get the same query if you do taxonomy/term/3,4,5 for example (tids might be separated by +, i can't remember which is which)

I certainly don't know of

merlinofchaos's picture

I certainly don't know of another way to do AND with terms. It gets even worse when you add depth, too.

New table in the database?

dwees's picture

Are the terms you want hard-coded, or can the user select from a whole bunch of selects? There might be a solution to this, but it's a bit fussy.

  1. Create a table where the columns are vid (as int(10)), and tids_as_string (varchar(however many terms)) you want to include.
  2. Create a hook_form_alter where you modify the submit handler for taxonomy terms.
  3. In this hook_form_alter, associate each tid with a position in a string, and assign a 0 to that string if the term is chosen and a 1 otherwise. Store the final string in the database. Alternatively, you can assign a power of 2 to each term, and store the sum of the chosen powers of 2.
  4. When you want to query for the particular nodes in question, do a search for all vid that have the correct string associated with them in the tids_as_string column (see step 3) and inner join it to the node table (and your other custom table).

Basically what this does is spreads out the hit of the query in question a bit, it's like doing the joining when you are entering the data. It's a bit of a hackish solution that I haven't actually tried out.

Actually having posted this solution, I've found some reasons why it won't work (in general).

  1. We can't store the information as an integer because MYSQL maximum integer size is 18446744073709551615 (which is 264) which would limit you to 64 terms maximum. If this limitation doesn't affect you, then store the sum of the powers of 2 of the individual terms.

  2. If we store the information as a string, then whenever we add new taxonomy terms, we'll have to update the string already in the database so that they show up in our search, which sounds query heavy to me.

More simply, you could create this table and store a bunch of 1's and 0's and save yourself the joins later on. Not as efficient as I was hoping for but, oh well.

Sorry, I don't see a general solution that works completely either, but I'm going to post this partial solution so we all can have a good laugh over it.

Dave

Will views2 make it better?

druvision's picture

It was funny to see the huge query.
Will views2 make it better?

Amnon
-
Professional: Drupal Israel | Drupal Development & Consulting | Effective Hosting Strategies | בניית אתרים
Personal: Hitech Dolphin: Regain Simple Joy :)

EvanDonovan's picture

We have some very complex views queries running on www.urbanministry.org (on our volunteer & organization search pages - /volunteer & /org) that we'd like to optimize, as we have for our other site, www.christianvolunteering.org.

z.stolar, you said that you "manually changed the query" & it ran faster. Do you mean that you actually got Views to run the optimized version of the query or just that you tested it in MySQL? I would love to leverage the Views theming framework to display the GMaps of our volunteer & organization opportunities, while getting the speed improvements of hand-optimized MySQL code.

But maybe that's just a pipe dream :)

My apologies if this is a dead issue.

Just for kicks, here's one of the queries in question:

SELECT node.nid,
node.title AS node_title,
node.changed AS node_changed,
location.latitude AS location_latitude,
location.longitude AS location_longitude,
node_data_field_volopp_org_name.field_volopp_org_name_value AS
node_data_field_volopp_org_name_field_volopp_org_name_value,
location.name AS location_name,
location.street AS location_street,
location.additional AS location_additional,
location.city AS location_city,
location.province AS location_province,
location.country AS location_country,
location.postal_code AS location_postal_code,
node_data_field_volopp_num_vol_opp.field_volopp_num_vol_opp_value AS
node_data_field_volopp_num_vol_opp_field_volopp_num_vol_opp_value,
node.type AS gmap_node_type,
gmap_taxonomy_node.marker AS gmap_taxonomy_marker
FROM {node} node
LEFT JOIN {location} location
   ON node.vid = location.eid
LEFT JOIN {content_type_volunteer_opportunity} node_data_field_volopp_org_name
   ON node.vid = node_data_field_volopp_org_name.vid
LEFT JOIN {content_type_volunteer_opportunity} node_data_field_volopp_num_vol_opp
   ON node.vid = node_data_field_volopp_num_vol_opp.vid
LEFT JOIN {gmap_taxonomy_node} gmap_taxonomy_node
   ON node.vid = gmap_taxonomy_node.vid
     WHERE (node.type IN ('volunteer_opportunity'))

This has nothing to do with views

giorgio79's picture

Hi,

views cannot make this any better.

The current taxonomy setup cannot manage hierarchical taxonomy deeper than two (parent-child) relationships efficiently.

We had a 5 level hierarchy and had to implement a separate table like this

Nid Level1 Level2 Level3 Level4 Level5

Nid column is nid and all the level columns are actual term ids. This way I see the entire hierarchy in one view and without any joins!!!!!!

Sweeet.

This shaved off approximately 70 000 milliseconds according to devel :DD

****Me and Drupal :)****
Clickbank IPN - Sell online or create a membership site with the largest affiliate network!
Review Critical - One of my sites

Bah, wrong focus.

vegivamp's picture

Running into this, too, albeit with different queries, and I identify it as developers focusing on the wrong point.

What the query does, is check for terms in a hierarchical table (every record had a 'parent' ID that refers to a record in the same table) - and relational databases simply weren't designed for that. Even Oracle with it's CONNECT BY clause isn't all that performant.

Giorgio's solution maps better onto relational databases, but limits the depth you can go to the number of tables you create. It's also quite cluttersome.

The correct way to deal with this, is to do it in code.

Our original query:

mysql> SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created   FROM node node
     LEFT JOIN term_node term_node ON node.nid = term_node.nid
     LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid
     LEFT JOIN term_hierarchy term_hierarchy2 ON term_hierarchy.parent = term_hierarchy2.tid
     LEFT JOIN term_hierarchy term_hierarchy3 ON term_hierarchy2.parent = term_hierarchy3.tid
     LEFT JOIN term_hierarchy term_hierarchy4 ON term_hierarchy3.parent = term_hierarchy4.tid
     LEFT JOIN term_hierarchy term_hierarchy5 ON term_hierarchy4.parent = term_hierarchy5.tid
   WHERE (node.status = '1')
    AND (node.type IN ('artikel'))
    AND (term_node.tid = '9505'
      OR term_hierarchy2.tid = '9505'
      OR term_hierarchy3.tid = '9505'
      OR term_hierarchy4.tid = '9505'
      OR term_hierarchy5.tid = '9505')
   ORDER BY node_sticky DESC, node_created_created DESC
   LIMIT 0, 5;
+--------+-------------+----------------------+
| nid    | node_sticky | node_created_created |
+--------+-------------+----------------------+
| 185733 |           0 |           1264932120 |
| 158914 |           0 |           1240059939 |
+--------+-------------+----------------------+
2 rows in set (0.48 sec)

There's a near-identical one that does a count(), too. Notice this takes about three quarters of a second on our current dataset - quite a lot for a busy site. All timings are of course on an unloaded host and with SQL_NO_CACHE.

This query searches term_node and the depth of term_hierarchy for a specific term. The more levels of depth you take, the harder the query becomes - a join is basically a cartesian product, so every extra level makes the intermediate dataset grow exponentially.

So, how do you do this in code ? Simple.

First, you resolve the tree structure:

mysql> SELECT SQL_NO_CACHE tid, parent
  FROM term_hierarchy
  WHERE tid = 9505;
+------+--------+
| tid  | parent |
+------+--------+
| 9505 |      0 |
+------+--------+
1 row in set (0.00 sec)

If 'parent' is not 0, repeat with the new tid, and again and again until you're at the top of the tree.

Now you have what you were looking for in the hierarchy. Use this in a straight join of node and term_node. If you found nothing in the hierarchy, or found a different one as top-level, you must also include all the others you've found - I found that entries can exist in term_node that don't exist in term_hierarchy. In this example, I found only the same, so my IN has only one entry. If anyone has more precise insight into what can or cannot exist in those two tables, please correct.

mysql> SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created
  FROM node join term_node on node.nid = term_node.nid
  WHERE status = 1
    AND type in ('artikel')
    AND tid in (9505)
  ORDER BY node_sticky desc, node_created_created desc
  LIMIT 0,5;
+--------+-------------+----------------------+
| nid    | node_sticky | node_created_created |
+--------+-------------+----------------------+
| 185733 |           0 |           1264932120 |
| 158914 |           0 |           1240059939 |
+--------+-------------+----------------------+
2 rows in set (0.00 sec)

Ooo, a 0.48 second query replace by two 0.00 ones.

I passed this on to our devs, but no clue if they'll be allowed to feed the module back here. I suggest someone else do this. Don't forget to remove the SQL_NO_CACHE :-p

Thanks for passing along this

dwightaspinwall's picture

Thanks for passing along this approach. I was facing the same issue. Here's the function I wrote to get all descendant tids of a tid:

/**
* Given a term id, return it and all descendant tids.
*/
function hci_tracksonomy_descendants($tid) {
  $tids = array();
  if ($tid) {
    $resource = db_query('SELECT tid FROM {term_hierarchy} WHERE parent = %d', $tid);
    while ($row = db_fetch_object($resource)) {
      $tids[] = $row->tid;
      $tids = array_merge($tids, hci_tracksonomy_descendants($row->tid));
    }   
  }
  return $tids;
}

Note that this does not return the original argument. If you want it (you probably do), add it, as in:

$tids = hci_tracksonomy_descendants($tid);
$tids[] = $tid;

And here is how it is used to cook up pieces of SQL to be integrated into a node query:

/**
* Given a tid, return SQL to filter for nodes tagged with this term or any
* descendant term ids (if any).
*
* @param
*  $tid - the parent term id
*  $node_alias - alias of {node} table we're joining on
*
* @return
*  $term_table - INNER JOIN SQL (or empty string)
*  $term_where - WHERE CLAUSE SQL (or empty string)
*  $term_args - array of substitution args (or empty array)
*/
function hci_tracksonomy_tid_filter($tid, $node_alias = 'n') {
  $term_table = '';
  $term_where = '';
  $term_args = array();
 
  if ($tid) {
    $term_table = "INNER JOIN {term_node} tn ON tn.vid = $node_alias.vid";
    $tids = hci_tracksonomy_descendants($tid);
    $tids[] = $tid;
    $ph = db_placeholders($tids, 'int');
    $term_where = "AND tn.tid in ($ph)";
    $term_args = array_merge($term_args, $tids);
  }

  return array($term_table, $term_where, $term_args);
}

Taxonomy Edge

gielfeldt's picture

The Taxonomy Edge implements a new data model for the terms making it easier and more efficient to retrieve terms with children and at arbitrary level.

It can turn this query:

SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created   FROM node node
     LEFT JOIN term_node term_node ON node.nid = term_node.nid
     LEFT JOIN term_hierarchy term_hierarchy ON term_node.tid = term_hierarchy.tid
     LEFT JOIN term_hierarchy term_hierarchy2 ON term_hierarchy.parent = term_hierarchy2.tid
     LEFT JOIN term_hierarchy term_hierarchy3 ON term_hierarchy2.parent = term_hierarchy3.tid
     LEFT JOIN term_hierarchy term_hierarchy4 ON term_hierarchy3.parent = term_hierarchy4.tid
     LEFT JOIN term_hierarchy term_hierarchy5 ON term_hierarchy4.parent = term_hierarchy5.tid
   WHERE (node.status = '1')
    AND (node.type IN ('artikel'))
    AND (term_node.tid = '9505'
      OR term_hierarchy2.tid = '9505'
      OR term_hierarchy3.tid = '9505'
      OR term_hierarchy4.tid = '9505'
      OR term_hierarchy5.tid = '9505')
   ORDER BY node_sticky DESC, node_created_created DESC
   LIMIT 0, 5;

Into this:

SELECT SQL_NO_CACHE node.nid, node.sticky AS node_sticky, node.created AS node_created_created   FROM node node
     JOIN term_node term_node ON node.nid = term_node.nid
     JOIN term_edge term_edge ON term_edge.tid = term_node.tid
   WHERE (node.status = '1')
    AND (node.type IN ('artikel'))
    AND (term_edge.parent = '9505'
   ORDER BY node_sticky DESC, node_created_created DESC
   LIMIT 0, 5;

The extra data in term_edge could be considered a "cache" as it does contain some redundant information from term_hierarchy. However, I like to think of it as an extension of the hierarchical data model.

It also integrates with Views

I've seen a lot of attempts to fix this fundamental issue from recursive function calls (code or sql) to nested sets (modified pre-order tree traversel). To me the best balance between READs/WRITEs is the technique used Taxonomy Edge.

Added to the wiki

mikeytown2's picture

Thanks for the heads up!
http://groups.drupal.org/node/187209

Views Developers

Group organizers

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds: