Posted by bjaspan on June 1, 2007 at 5:31pm
Now that we have four more weeks before D6 code freeze, maybe it is possible to add join relationships/foreign key support into Schema API 1 after all. Otherwise, it will be another year before this info is in core. Note that this info does not really have to be in core---it can be added by a contrib module hook_schema_alter. But a lot of people would be pretty psyched to have referential integrity for pgsql built in to D6. So:
- Make a list of all the key relationships in the core tables.
- Decide how to encode the info in the schema data structure.
- Add foreign key constraints in database.pgsql.inc.
- ...
- Profit!
I have a pretty good idea for all the steps (at least the first three :-) but, for right now, am just creating this post as a placeholder. Jump in!

Comments
Joins != foreign keys
I want to record some info while I'm thinking about it:
There are some joins in Drupal that cannot be represented as foreign keys. For example, the votingapi_vote table has columns content_type (node, comment, etc.) and content_id (nid, cid, etc.). We cannot make content_id a foreign key reference to another table because it might be a reference to one of many tables. Perhaps this represents a design flaw in votingapi's schema but that is how it currently exists.
However, we would still like to be able represent the join connection between votingapi_vote columns of content_type 'node' to node.nid (for Views, for a single-query node_load(), etc). Off the top of my head, we might specify it like this:
<?php$schema['votingapi_vote'] = array(
'fields' => ...,
'joins' => array(
'content_id' => array('node', 'nid', /<em>WHERE</em>/ 'content_type=\'node\''),
));
?>
Without the third element (marked /WHERE/), this would represent a foreign key from votingapi_vote.content_id to node.nid. With the WHERE, it can no longer be a foreign key, but it can still be a join:
FROM node n LEFT JOIN votingapi_vote v ON n.nid = v.content_id AND v.content_type='node'(One gotcha is that the schema structure says "content_type='node'" and does not know what the table will be called in any particular query, but in the query the column references need to be fully canonicalized. Perhaps the WHERE clause in the schema structure should be "THIS.content_type", with THIS replace by the join builder at runtime. I'm guessing Earl already knows the right way to do this.)
Incidentally, Moshe points out that the core node_access table has the same multi-use column nature, so it isn't just a contrib problem.
:-(
I generally frown on schemas that represent relationships with some kind of type/id pair. There are many alternative methods for structuring the data that allow proper foreign keys.
strtr on a SQL snippet
I would recommend allowing the user to have a snippet of SQL, using %notation to represent table names. As David points out, key value pairs cannot represent the full range of SQL JOINS. Using tokens for table names allows tools to generate table aliases for the joins (e.g. views).
Consider the case of the bio module. It needs to join the node table against itself using the following SQL:
node n LEFT JOIN node bio ON n.uid = b.uid AND b.type = "bio"However, what if we want the bio node of the author of a nodereference? Then when need to join against an alias of node, like this:
node n INNER JOIN node_data_field_nodereference nr ON n.vid = nr.vid LEFT JOIN node bio ON nr.uid = bio.uid AND bio.type = 'bio'Using tokens for tables I could create a schema definition like:
<?php$schema['node']['foreign keys'] = array (
'node' => array (
'join' => 'left',
'on' => "%left_table.uid = %right_table.uid AND %right_table.type = 'bio'",
),
);
?>
This should be cross DB compatible and would allow for very expressive foreign keys.
I like the idea of using
I like the idea of using %left and %right in the 'on' clause, that is better than my two suggestions of THIS and {}'s to allow the same thing (referencing the aliased name of the current table).
However, I think we need to clarify the distinction between foreign keys and joins. A foreign key, at least as PostgresQL defines it, is a list of columns from a local table and a foreign table such that the local table columns are required to exist exactly in the foreign table with no other specified conditions (also, a foreign key must reference columns that either are a primary key or form a unique constraint). As an example:
CREATE TABLE t1 (a integer PRIMARY KEY,
b integer,
c integer,
FOREIGN KEY (b, c) REFERENCES other_table (c1, c2)
);
A join, as I think of it, is any join that does not meet the foreign key conditions. The bio example Mark provided is one; it cannot be specified as a foreign key.
This is why I proposed the foreign key structure I did. We cannot easily specify them per-field because a foreign key can include multiple fields, like an index. We have to list the columns involved in the key explicitly because we do not want to be in the business of parsing an 'ON' clause to figure out what the keys are.
I included the ability to specify an 'ON' clause (I called it a 'WHERE' clause even though, yes, the join syntax uses 'ON') precisely for cases like bio as Mark mentioned. However, as I said in my first post, as soon as you use the ON clause, you are not specifying a foreign key that will enforce referential integrity (it cannot be turned into a FOREIGN KEY SQL constraint). You are just specifying an arbitrary join.
Maybe other database engines allow foreign key constraints to include arbitrary SQL expressions, but if support that we will not be able to use it with PostgresQL which is, after all, the only database we currently support that can enforce referential integrity.
ahah
thanks for clarifying joins versus FK. i didn't quite grasp that.
Such as?
Re: "There are many alternative methods for structuring the data that allow proper foreign keys."
Such as?
any progress?
anyone working on this? it would be nice if this got into the schema files, even if we didn't implement the foreign key constraints in core. see chx's comment in the recent devel mail list on this topic. he pointed out that the order of deletes must be carefully managed and that can get complicated with two arbitrary tables relate to each other (i.e. not a nid or uid).
Proposed foreign key representation
Here's a proposal for representing foreign key information in the schema structure:
<?php$schema['T1'] = array(
'fields' => array('F1' => ..., 'F2' => ..., 'F3' => ..., 'F4' => ..., 'F5' => ...),
'foreign keys' => array(
'T2' => array('F1' => 't2f1'),
'T3' => array('F2' => 't3f2', 'F3' => 't3f3'),
'T4' => array('F4' => 't4f4', 'WHERE' => "{F5} = 'node'"),
);
?>
So, the column T1.F1 is a foreign key to T2.t2f1, and the columns (T1.F2, T1.F3) are a foreign key to the columns (T3.t3f2, T3.f3f3).
Column T1.F4 is joined to T4.t4f4 given the addition WHERE clause of T1.F5 = 'node'. F5 is inside curly brackets so that schema.module knows to prefix it with whatever alias T1 has in the query.
Two forms of syntactic sugar are possible. Suppose that tables TA and TB both have columns named x and y and they are foreign keys from TB to TA. We could say:
<?php$schema['TB'] = array(
'fields' => ...,
'foreign keys' => array(
'TA' => array('x', 'y'),
));
?>
We don't have to say
'x' => 'x'and'y' => 'y'. Even simpler, if TA and TB both have a single column x that is the foreign key, we could just say:<?php$schema['TB'] = array(
'fields' => ...,
'foreign keys' => array(
'TA' => 'x'
));
?>
Since there is only one column in the foreign key and it has the same name in both tables, we don't even have to specify an array.
Comments?
more example?
would be great to see a couple real world examples. the notation is a bit dry.
Fair point.
Sure. The simple case is a table using a single column as a foreign key to another table. For example, here's how node_revisions announces it wants to join to node on the nid column:
<?php$schema['node_revisions'] = array(
'fields' => ...,
'foreign keys' => array(
'node' => array('nid' => 'nid'),
));
?>
That's the full-specification form. In the ['foreign keys']['node'] array, the key 'nid' refers to the local table and the value 'nid' refers to the foreign table. I also propose two simplifications. First, whenever the local and foreign column names are the same the key can be omitted; in this case we would have
'node' => array('nid'). Second, whenever the local and foreign column names are the same and there is exactly one column involved, the array can be omitted; in this case we would have just'node' => 'nid'.Of course, there can be (a) multiple elements in the 'foreign keys' array, one for each foreign table that this table has a foreign key for and (b) multiple elements in each table's entry in the foreign key array, one for each column in the foreign key.
How foreign keys should be represented
Because foreign keys correspond directly to fields in a table, they should be maintained within the 'fields' arrays in schema API. Foreign keys should probably just be named after the field; I don't see an advantage to naming them more descriptively. Indexes are maintained separately from fields because they can contain multiple fields each.
I started off thinking that
I started off thinking that foreign keys should be in the fields array, too. But foreign keys can contain multiple fields each, just like indexes can; e.g. http://www.postgresql.org/docs/8.2/interactive/ddl-constraints.html#DDL-....
It is not clear to me how to represent this cleanly in the fields array, which is why I put them as a new top-level entry in the table definition. Do you have a suggestion?
Rails Active Records approach
Hello! I've been researching on how Rails Active Records does object/schema relationships. It does object relationships on the Model layer, not direct on database by using foreign keys and such. However it has interesting approaches we can copy/adapt to our database schema layer.
I've came up with an approach for table reference definitions, based on Rails Active Records and allowing modules to extend each other schema definitions (not sure if it is currently possible, but should be easy to implement).
Here are some stripped down schema definitions just to illustrate how 'references' would work:
<?php
/**
* users.schema
*/
$schema['users'] = array(
'fields' => array(
'uid' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
'name' => array('type' => 'varchar', 'length' => 60, 'not null' => TRUE, 'default' => ''),
),
'references' => array(
'roles' => array(
'type' => 'multiple',
'join' => array('role' => array('rid' => 'rid')),
'through' => array('users_roles' => array('uid' => 'uid')),
),
),
);
$schema['role'] = array(
'fields' => array(
'rid' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
'name' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE, 'default' => ''),
),
'references' => array(
'users' => array(
'type' => 'multiple',
'join' => array('users' => array('uid' => 'uid')),
'through' => array('users_roles' => array('rid' => 'rid')),
),
),
);
$schema['users_roles'] = array(
'fields' => array(
'uid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
'rid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
),
'references' => array(
'users' => array(
'type' => 'belongs_to',
'join' => array('users' => array('uid' => 'uid')),
),
'roles' => array(
'type' => 'belongs_to',
'join' => array('role' => array('rid' => 'rid')),
),
),
);
?>
<?php
/**
* node.schema
*/
$schema['node'] = array(
'fields' => array(
'nid' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
'vid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
'title' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
'uid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
),
'references' => array(
'user' => array(
'type' => 'belongs_to',
'join' => array('users' => array('uid' => 'uid')),
),
'revision' => array(
'type' => 'single',
'join' => array('node_revisions' => array('nid' => 'nid', 'vid' => 'vid'))
),
'revisions' => array(
'type' => 'multiple',
'join' => array('node_revisions' => array('nid' => 'nid'))
),
),
);
// Extend 'users' table schema describing nodes ownership.
$schema['users']['references']['nodes'] = array(
'type' => 'multiple',
'join' => array('node' => array('uid' => 'uid'))
);
$schema['node_revisions'] = array(
'fields' => array(
'nid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
'vid' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
'uid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
'title' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
),
'references' => array(
'node' => array(
'type' => 'belongs_to',
'join' => array('node' => array('nid' => 'nid'))
),
'user' => array(
'type' => 'belongs_to',
'join' => array('users' => array('uid' => 'uid'))
),
),
);
?>
<?php
/**
* comment.schema
*/
$schema['comments'] = array(
'fields' => array(
'cid' => array('type' => 'serial', 'not null' => TRUE),
'pid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
'nid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
'uid' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
'subject' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE, 'default' => ''),
),
'references' => array(
'parent' => array(
'type' => 'single',
'join' => array('comments' => array('pid' => 'cid'))
),
'node' => array(
'type' => 'belongs_to',
'join' => array('node' => array('nid' => 'cid'))
),
'user' => array(
'type' => 'belongs_to',
'join' => array('users' => array('uid' => 'uid'))
),
),
);
// Extend 'node' table schema attaching comments comments.
$schema['node']['references']['comments'] = array(
'type' => 'multiple',
'join' => array('comments' => array('nid' => 'nid'))
);
// Extend 'users' table schema describing comments ownership.
$schema['users']['references']['comments'] = array(
'type' => 'multiple',
'join' => array('comments' => array('uid' => 'uid'))
);
?>
Reference types:
single: describe one-to-one relationships, e.g. a comment has none or one parent comment, a node has one current revision (could be called 'one', 'has_one', etc too).
multiple: describe one-to-many relationships, e.g. a user has many nodes/comments (could be also called 'collection', 'has_many', 'has_multiple').
belongs_to: define a ownership relationship (one-to-one/many-to-one?), e.g. a node is owned by an user, a comment belongs to a node (also could be called...).
Reference properties:
through: allow declaring an intermediate table for joining objects, usually used in many-to-many relationships, where a intermediate table is needed for data normalization/modeling.
join: set the referenced table mapping fields relationships.
As our main goal is not reference integrity. I'm eager to see use cases where this schema relationship definition doesn't suffice.
Reference:
http://wiki.rubyonrails.org/rails/pages/has_one
http://wiki.rubyonrails.org/rails/pages/has_many
http://wiki.rubyonrails.org/rails/pages/belongs_to
http://blog.nanorails.com/pages/rails-1-1-cheat-sheet
Look at Views, and data structure considerations
Chatted yesterday with chx about this topic. Some thoughts based on what we discussed:
We should look at the table relationship mapping in Views. It's something that's already in use and many developers are familiar with. At the least, it's a valuable point of reference.
We should think through the link between how we map relationships and how we structure our data objects/arrays. Take the user object (or nested array). A user has various roles. How do we want to map these in a user object/array? The only direct relationship is with the users_roles table. Do we want that in our object structure? Probably not. It's the relationship between objects that we're interested in. Likely we'll want something roughly like:
<?php$user['roles'] = array(
array(
'rid' => 3,
'name' => 'subscriber',
),
array(
'rid' => 4,
'name' => 'member',
),
);
?>
So, perhaps, in the schema we'll want to map object relationships--not just the direct foreign key links. In this light, Recidive's suggested approach seems worthy of some careful thought and consideration. There may be more there than we need, the terminology may need work, but the basic idea is something we should be considering.
New terminology
Hi, in our draft record module we used
hook_schema_alter()to extend schema information adding references with a different terminology:<?php
function node_schema_alter(&$schema) {
// Add reference information to node table.
// This would go direct on node_schema() in Drupal 7.
$schema['node']['references']['user'] = array(
'type' => 'one',
'join' => array('users' => array('uid' => 'uid')),
);
// Extend 'users' table schema describing nodes ownership.
$schema['users']['references']['nodes'] = array(
'type' => 'many',
'join' => array('node' => array('uid' => 'uid')),
);
}
?>
Note: I've striped it down (e.g. removed node_revisions references) to avoid an even longer comment.
Having two way references, i.e. node table referencing users table and vice-versa, is good because we can detect the reference cardinality, e.g. node (table) -> user (reference) is many to one users (table) -> nodes (reference) is one to many (note the plurals on the reference names).
For the example Nedjo provided above (users -> roles relationship) we could use this terminology:
<?php
function user_schema_alter(&$schema) {
// Add reference information to users table.
// This would go direct on user_schema() in Drupal 7.
$schema['users']['references']['roles'] = array(
'type' => 'many',
'join' => array(
'users_roles' => array('uid' => 'uid'),
'role' => array('rid' => 'rid'),
),
);
// Add reference information to role table.
// This would go direct on user_schema() in Drupal 7.
$schema['role']['references']['users'] = array(
'type' => 'many',
'join' => array(
'users_roles' => array('rid' => 'rid'),
'users' => array('uid' => 'uid')
),
);
// Add reference information to users_roles table.
// This would go direct on user_schema() in Drupal 7.
$schema['users_roles']['references']['users'] = array(
'type' => 'one',
'join' => array('users' => array('uid' => 'uid')),
);
$schema['users_roles']['references']['roles'] = array(
'type' => 'one',
'join' => array('role' => array('rid' => 'rid')),
);
}
?>
Note that users -> roles relationship has a auxiliar table 'user_roles' and this is the first on the 'join' property. We can use this notation to be able to construct the user object 'roles' property as
$user->rolesinstead of$user->users_roles->roles.Some additional points
$node->user->nodesfor performance reasons. Also this can cause infinite loopsThese points possibly is going beyond the scope of this thread, but I think they are relevant since we need to know how this reference information could be used, or how it could help us before choosing a route.
We have progress?
Hiya! It's been a year since this subject has come up, and I'm wondering if there's any consensus as to how foreign key notation should proceed.
Table relations instead of foreign key
Today I thought a lot about this issue, because I wanted to add some intelligent datafetching to objects of one of my modules (ResultsAPI).
In my opionion, schema should not explicitly define foreign keys, because this seems to be a special feature of psql. What we want to know, is how the tables are related to each other, so that we can -at the end- get some complex relation structure for specific elements.
Relations
So therefore we should define the complete relation between two tables.
To keep the structure simple I think we only have to define direct(!) relations of tables. By combination of this direct relations we can get complex relations.
<?php
$schema['node']['relations'] = array(
'author'=> array('table'=>'users','join'=>'%1.uid = %2.uid'),
'revision'=> array('table'=>'node_revision','join'=>'%1.nid = %2.nid'),
'votingapi_vote_single'=> array('table'=>'votingapi_vote','join'=>"%1.nid = %2.content_id AND %2.content_type = 'node' "),
);
$schema['node_revision']['relations'] = array(
'author'=> array('table'=>'users','join'=>'%1.uid = %2.uid'),
);
//I used the syntax of http://groups.drupal.org/node/4328#comment-13486 with %1 for left table and %2 for right table (specified with 'table')
?>
Relations will be explicitly named, so they needn't be called like the tables they are applied to. This is necessary, because one table could have two different relations to another table. For example in a {content_type_xyz}, there could be defined two or more user_reference fields. So we had two relations to table {users}. We'd call them 'ur_field_a' and 'ur_field_b' or something like that.
Join attribute
Using a string for join could be some barrier for database abstraction, because then you could use for example some explicit MySQL functions.
Or we'd provide an "either string or array"-join, like:
string: 'join'=>'string...' ---- valid for all databasesarray : 'join'=>array('mysql'=>'...','mysqli'=>'....')
to provide abstraction functionality.
Most of my thoughts were about querying data, so I'did not think about relation types (1-1, 1-many, many-1, many-many...).
Combining relations
To get complex relations we have to combine the direct relations.
Some examples:
names of relations will be prefixed by #
we start from a table (not a relation)
{node} #author --> calls table {users} - the author of the node{node} #revision --> {node_revisions} - all revisions
{node} #revision #author --> {users} - Node revisions' authors
{node} #author #usernodes --> {node} - All nodes of the node's creator
{node} #revision #author #usernodes --> {node} - All nodes of the specific node's revisions' users'
{node} #revision #author #userrevisions --> {node} - All noderevisions of the specific node's revisions' users'
{node} #field_a --> {content_type_xy} or {content_field_a} - all values for field_a of the node
{node} #field_a #referenced_user --> {users} - all referenced users in field_a of the node
{node} #author #userroles --> {user_roles} - role id's for the author's roles
{node} #author #userroles #role --> {roles} - role objects for the author's roles
By using %1 and %2, multiple instances of the same table can be used.
Querying data
With having the definitions of relations, we can query (SELECT, INSERT, UPDATE, DELETE) complex datastructures, by simply defining the relations that shall be applied.
Example:
User with all role relations (not objects), nodes and their revisions
<?php$tables = array(
'#table' => 'users',
'userroles'=>1,
'usernodes'=>array('node_revisions'),
);
?>
To apply some values for the WHERE part in the query we could use a $vars-array
example: only users with role-id 1 and nodes of node-type 'xy'
<?php$vars = array(
'users#userroles'=>array('rid'=>1),
'users#usernodes'=>array('type'=>'xy'),
);
?>
As keys there could be used different types:
* without prefix: (users#userroles) - absolute structure
* prefix # : (#userroles) all instances of relation #userroles
* prefix * : (*user_roles) all instacnes of table {user_roles}
For values there also could be added some '#op'-attribute, like =, !=, LIKE
Or there will be used again strings like in 'join'-strings:
array('%users#userroles.rid = 1', "%users#usernodes.type = 'xy'");
with sprintf syntax:
array('%users#userroles.rid = %d', "%users#usernodes.type = '%s'"), array(array(1),array('xy'));
Role of views module
when we could realize such a structure, the only role views have to pay attention is presenting the data, I guess.
foreign keys in Drupal 7
I know this is an old thread, but maybe somebody else will come across it like I did while trying to find out Drupal's support for foreign keys.
While at this point there's still some controversy about what to do with them,
http://drupal.org/node/111011
the Schema API in D7 does support foreign keys,
http://api.drupal.org/api/group/schemaapi/7
http://educoder.org/
http://openeducationresearch.org/
Referential Integrity - Foreign Keys in the database
Hi,
I read this post about adding foreign keys support into the Drupal 7 core: http://drupal.org/node/111011, it seems they finally added it in the Schema API, and in this other post http://drupal.org/node/224333#foreign-keys-added they explain how modules have to change, according this new feature.
For instance, the following was added to
forum_schema()inforum.installto let Drupal know that the nid and vid fields in this table are related to the same fields in the node table:<?php'foreign keys' => array(
'nid' => array('node' => 'nid'),
'vid' => array('node' => 'vid'),
),
?>
So I wanted to check with my own eyes, and I downloaded the latest development version of Drupal 7, then I unpacked it, and I saw that many modules have foreign keys in their schema, like the code above. I installed it, and I checked the physical database, to see if there were foreign keys constraints in it, well I have been disappointed to not find them there...
So, the physical database of the current development version of Drupal 7 still hasn't foreign keys constraints, even if they are in their modules schema.
At this point, my question is: will the production release of Drupal 7 bring the referential integrity physically into the database?
regards
Michele
Useless?
Yes, adding this 'foreign keys' parameter is useless IMO. Some of the reasons why this is useless is note here.
False
Just because we can't use ON DELETE CASCADE in the database itself doesn't make foreign key metadata useless. Rather, we focus on user-space stuff we can do with that extra knowledge, such as improved Views automation. Have a look at the table wizard module.
Ok, I meant to say 'foreign
Ok, I meant to say 'foreign keys' info is not enough. We need extra data to make this really usefull, such as cardinality. You can't use all 'foreign keys' as LEFT JOIN for performance reasons, the same way you can't use them all as INNER JOIN as well since you risk not getting all data you want.
By having a quick look at table wizard module I see it relies on 'left' and 'right' information to do it's trick.