Using database schema to load nodes in fewer queries

Posted by bjaspan on April 17, 2007 at 4:23am

One thing that becomes possible when Drupal can access its own database schema is that queries which currently are performed separately can instead be performed together. In particularly, I think it might be possible to load an entire node with a single SELECT query or, at least, in far fewer than Drupal currently uses. Working proof of concept code is already in the Schema module; it loads all CCK fields, comments, and author info in one query. Loading more things only requires adding a join entry to the schema data structure (see below).

Note that I do not yet know if it is faster to load nodes in one query, only that it seems possible. I'd like to use this thread to explore the idea and see what people think. To get started, I'm reposting here (with some updates) a followup I posted in http://drupal.org/node/136171.

Suppose I define a CCK type called "type_1" with a single field, "field_1". In Schema's data structure, the table looks like this:

$schema['content_type_type_1'] = array(
    'cols' => array(
         'vid' => array('type' => 'int', ... 'join' => array('node', 'vid', 'one')),
         'nid' => array('type' => 'int', ...),
         'field_field_1_value' => array('type' => 'longtext', ...)),
    'keys' => array(
         'PRIMARY' => array('vid')),
);

(Actually, that isn't exactly how it currently looks; the data structure layout is a moving target.)

So, the table has two columns, vid and nid, and its primary key is vid. The vid field is a 1-to-1 join to node.vid.

When a node is loaded, Schema's hook_nodeapi is called. It looks through the total schema (the sum of all structures like above returned by all modules) to find all joins to node (because this is nodeapi, we know we're loading from table node) and builds up a SELECT/JOIN list. Suppose the schema contains type_1 as above and another module with a table m2 that joins to node on nid and has columns 'foo' and 'bar'. schema_nodeapi builds this query:

SELECT content_type_type_1.nid as content_type_type_1_nid, m2.vid as m2_vid, m2.foo as m2_foo, m2.bar as m2_bar
FROM node n
LEFT JOIN content_type_type_1 ON content_type_type_1.vid = v.nid
LEFT JOIN m2 ON m2.nid = n.nid
WHERE n.nid = %d

It executes this query for the nid being loaded and saves the results of this query in $node->schema_data. schema_data['content_type_type_1'] contains the array of matching rows from the content_type_type_1 table and schema_data['m2'] contains the array of matching rows from the m2 table. So now, CCK's nodeapi and m2's nodeapi do not need to execute their own SELECT query. They can both just use $node->schema_data which is waiting for them. So, we've turned three queries into one.

It gets a little more complex if one or more of the joins are 1-to-many (e.g. comments, multi-valued CCK fields, taxonomy terms, etc.). However, so long as the joined tables have a primary key, it is possible to identify all of the unique rows of data returned from each table and save them in $node->schema_data, ignoring the duplicates that come as the result of multiple multi-value joins. (This code is written and working too but not yet committed as it needs some cleanup first.)

Currently, this happens in schema_nodeapi() but it obviously could be moved directly into node_load() as part of the initial SELECT into the node table, eliminating another query.

Schema exports $schema structures for all CCK types and fields so this automatically works for all CCK per-content-type fields in the content_type_* tables and multi-valued or shared per-field fields in content_field_* tables. Depending on the relative performance of this one query with a bunch of LEFT JOINs, we might be able to eliminate CCK's field cache table, cutting the data storage requirements for CCK nodes in half without a performance penalty.

And this isn't limited to CCK. As I mentioned above, everything that connects to node could be loaded at once: comments & comment statistics, taxonomy terms, the author's user record, and all the add-on data currently loaded by many modules (book, event, etc.) that is keyed on nid.

This does result in a lot of LEFT JOINs. One optimization is to use node types to eliminate unnecessary joins. For any given node, only a subset of tables that join to node.nid/vid are relevant: content_type_page is only relevant for page nodes, and the book table is only relevant for those node types that book.module is configured to book-enable. Each module could encode in its schema the node types the join is relevant for. For example, if book.module is only enabled for 'page' nodes, it could say:

function book_schema() {
$schema['book'] = array(
'cols' => array(..., 'join' => array('node', 'nid', 'one'), 'node-types' => array('page'))
);
return $schema;
}

Schema's nodeapi knows the type of the node being loaded because node_load already got it, so it can simply skip any join that specifies a node-type list and doesn't match the current node's type. To move this one-query approach into node_load and use this optimization, we will need to keep a cache mapping nid => node type (note that a node, once created, never changes its type). Since node types are write-only, this could be done as simply as having cron output a file defining a array containing the mapping, so no query is even needed to get the nid => node type cache, just a file load. Schema already implements the node-type optimization but I have not yet implemented the nid=>type cache.

So, this is interesting and all, but the most important question is: Is it faster? I honestly don't know. Maybe one big query with many LEFT JOINs is slower than 10-15 separate queries, each against one or two tables. There is also some overhead in processing the results of the big query, ignoring duplicate rows, etc. Performance testing will be necessary.

I'm also quite sure there are bugs to be worked out, improvements to to made, etc.

Questions? Feedback?

Comments

One problem that has already

Posted by karens on April 17, 2007 at 10:54am

One problem that has already come up in Views is that lots of joins of CCK fields, with their long table and field names, creates queries that are too large to handle in some environments. In a shared hosting environment where there are limits on how much configuration can be done, that could be a problem. Just food for thought.

Do you mean the queries are

Posted by bjaspan on April 17, 2007 at 1:09pm

Do you mean the queries are too large in their textual representation or that there is an imposed limit on the number of joins?

The actual size of the query

Posted by karens on April 17, 2007 at 5:11pm

The actual size of the query text can be a problem. See http://drupal.org/node/121390.

not a problem here

Posted by moshe weitzman on April 17, 2007 at 5:21pm

i thin we are OK. the issue there is indeed that the SQL is too large but thats because it contains values. It is an UPDATE query.

A node_load SQL contains only field names and joins and such. No way are we gonna hit 1MB this way.

Haven't we been down this road before?

Posted by adrian on April 17, 2007 at 1:42pm

Specifically with Flexinode, which used this design.

IIRC there's a limit of something like 32 joins that mysql can actually handle in a query.

--
The future is so Bryght, I have to wear shades.

not quite

Posted by moshe weitzman on July 5, 2007 at 1:08am

flexinode had only one very tall table for all fields. cck has a much better db design, so we only have dedicated tables for fields that are shared across content types. i don't think we will hit 32 join limit. that would likely happen after 100+ fields. where 1/3 of them are shared across content types.

Node Scaffolding

Posted by dldege on November 20, 2007 at 9:31pm

I've been looking at D6 schema a the schema module and very interested in the idea of providing automated node scaffolding for custom node types. The idea is that a single module can provide the db scaffolding for node insert, update, delete, delete revision, etc. by simply inspecting the node's schema. As a test I added this code to schema module - its not complete - but it provided automatic db handling for my "scaffoldtest" node type which was simple module with a .install file and a module with hook_form_alter for buiding the add/edit form - no db handling was needed in the scaffoldtest module. Something like this could potentially make it possible to make a custom node type without having to do any manual db manipulation and without needing cck for custom data fields. I'd be interested in comments and other ideas in this area.

<?php
function schema_scaffold($nodetype) {
  //TODO - figure out when to scaffold - hard code for now
  if ($nodetype == 'scaffoldtest')
    return true;
    
  return false;
}

function schema_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if (schema_scaffold($node->type)) {
  $schema = drupal_get_schema($node->type, TRUE);
  $op = str_replace(' ', '<em>', $op); //$op = delete revision, rss item, etc.
    $function = "schema_nodeapi</em>$op";
    //drupal_set_message($function);
    if (function_exists($function))
      return $function($node, $a3, $a4);
  }
}

function schema_nodeapi_load(&$node, $a3, $a4) {
  $schema = drupal_get_schema($node->type, TRUE);
  
  $sql = "SELECT * FROM {" . $schema['name'] . "} WHERE nid = %d and vid = %d";
  $add = db_fetch_array(db_query($sql, $node->nid, $node->vid));  
  return $add;
}

function schema_nodeapi_insert(&$node, $a3, $a4) {
  $schema = drupal_get_schema($node->type, FALSE);
  $sql = "INSERT into {" . $schema['name'] . "} (";
  foreach ($schema['fields'] as $name => $field) {
    $names[] = $name;
    switch ($field['type']) {
      case 'int' :
        $formats[] = '%d';
        break;
      case 'varchar' :
        $formats[] = "'%s'";
        break;
    }
    $values[] = $node->$name;
  }
  
  $sql .= implode($names, ',') . ") VALUES (" . implode($formats, ',') . ")";
  db_query($sql, $values);  
}

function schema_nodeapi_update(&$node, $a3, $a4) {
  if ($node->revision) {
    schema_nodeapi_insert($node, $a3, $a4);
  }
  else {
    $schema = drupal_get_schema($node->type, FALSE);
    
    $sql = "UPDATE {" . $schema['name'] . "} SET ";
    foreach ($schema['fields'] as $name => $field) {
      $format = $name . ' = ';
      switch ($field['type']) {
        case 'int' :
          $format .= '%d';
          break;
        case 'varchar' :
          $format .= "'%s'";
          break;
      }
      $formats[] = $format;
      $values[] = $node->$name;
    }
    
    $values[] = $node->nid;
    $values[] = $node->vid;

    $sql .= implode($formats, ',') . ' WHERE nid = %d and vid = %d';
    db_query($sql, $values);  
  }
}

function schema_nodeapi_delete(&$node, $a3, $a4) {
  $schema = drupal_get_schema($node->type, TRUE);  
  $sql = "DELETE FROM {" . $schema['name'] . "} WHERE nid = %d";
  db_query($sql, $node->nid);  
}

function schema_nodeapi_delete_revision(&$node, $a3, $a4) {
  $schema = drupal_get_schema($node->type, TRUE);  
  $sql = "DELETE FROM {" . $schema['name'] . "} WHERE nid = %d AND vid = %d";
  db_query($sql, $node->nid, $node->vid);  
}
?>

Dan DeGeest
Software Developer
Somewhere or Another

Using database schema to load nodes in fewer queries

Comments

One problem that has already

Do you mean the queries are

The actual size of the query

not a problem here

Haven't we been down this road before?

not quite

Node Scaffolding

Database

Group organizers

New groups

Group notifications

Hot content this week