Field Structure

KarenS's picture

The way that fields are structured in CCK now is that any field that is multiple or shared has its own separate 'per field' table, and all other fields are grouped together in a 'per content type' table. Querying this data to create a node can be expensive, so the serialized node is cached and the cached data is used during node_load().

That structure requires a lot of complex data manipulation when fields are shared or unshared or changed from single to multiple or multiple to single. When those things happen, the schema must be altered and data must be migrated from one table to another. This makes the code quite complex and introduces the potential for data loss and other errors.

Our conversations at the Design Sprint resulted in an initial idea of simplifying this so that all fields are stored in 'per field' tables. This will greatly simplify the code, making it easier to get it into core. The hope was that the caching model would alleviate most of the performance problems that might result. But there is still a performance concern with this model.

Another approach to this would be to go ahead and store the basic data in simple 'per field' tables, as we discussed, but instead of storing a serialized array in the cache, create multiple tables at an intermediate level that can be queried and filtered easily without any need for joining data. Those intermediate tables will contain duplicates of the basic data stored in the 'per field' tables, but the cost of keeping that data up to date is probably less than the cost of complex joined queries.

If we did that we would no longer store the serialized data in the cache, so the cache_content table would not be needed. Once fields are moved to core, these intermediate tables would probably also eliminate the need for the revisions table.

The new data structure might look like the following. The idea is that each node type would have a table with all its fields and each node might have multiple rows in the table to represent all the delta values of all its fields. Any field could potentially contain multiple values, so any node could potentially return multiple rows and node handling would need to take that into account. But these tables would be easily queriable for any of their values without any joins.

Note that the idea of adding an intermediate level of tables is Dries' idea, this implementation of this idea may or may not be anything like what he was thinking :)

INTERMEDIATE LEVEL

table: node_page

nid vid delta uid body teaser field_phone_value field_file_fid
1 1 0 1 xxx xxx 555-5555 34
1 1 1 NULL NULL NULL 777-7777 NULL
2 2 0 1 xxx xxx 555-5555 46

table: node_story

nid vid delta uid body teaser field_name_value field_nodereference_nid
1 1 0 1 xxx xxx xxx 34
1 1 1 NULL NULL NULL xxx NULL
2 2 0 1 xxx xxx xxx 46

BASIC LEVEL

table: field_name

nid vid delta field_name_value
1 1 0 xxx
1 1 0 xxx

table: field_phone

nid vid delta field_phone_value
1 1 0 xxx
1 1 0 xxx

table: field_city

nid vid delta field_city_value
1 1 0 xxx
1 1 0 xxx

table: field_file

nid vid delta field_file_fid
1 1 0 34
1 1 0 46

table: field_nodereference

nid vid delta field_nodereference_nid
1 1 0 46
1 1 0 45

Comments

Interesting. Would Views use

moshe weitzman's picture

Interesting. Would Views use these tables? I think the optimizations here and the "all fields in are in own table" are pretty lousy for Views. We have to consider the Views use case just as much as the node load use case. I actually think the D5 way of doing these is the best performing for node listings (fewest joins).

Views could use the

KarenS's picture

Views could use the intermediate tables I think to join in a single table and get any fields it needs from that table. The only place that would need special handling would be for fields that are shared between content types, which is most likely the exception rather than the rule. Even then, it would just require an additional join for every content type the field lives in, or perhaps adding some logic in the query handler to figure out whether joining in the field table or the intermediate table would be most efficient.

Or maybe each view creates its own intermediate table that can be synched periodically from the lower level tables in the same way the node tables are.

The case where a field is

moshe weitzman's picture

The case where a field is used in multiple content types gets very hard for Views if we say that "Views uses the intermediate tables". It now has same CCK field in multiple database columns. So a solution would be for CCK Views integration to declare the the authorititative table for these shared fields but now we are in the dodgy pattern we are trying to get out of. I will say we are confining to the dodginess to Views integration which is a more manageable problem than the current dodginess in CCK CRUD.

Intermediate tables for each View is a fascinating optimization. Hmmm.

Fields shared between

catch's picture

Fields shared between content types - I don't see this being an exception when title, body, teaser etc. are all in this data model.

Not a big fan

Crell's picture

I have to say I'm still not a fan of a variable schema, even if the variable schema is more predictable. I end up with edge cases where I need to do a manual query against CCK fields all the time (to avoid the expense of a node_load(), which is a lot more than just CCK). There's probably a more elegant way, but I haven't figured out one that doesn't take forever to implement and even longer to understand. (And such edge cases always come up at the very end of a project, when I have no time to figure out a proper, elegant solution, of course.) If the schema has any potential to change on me, that gets very brittle. Duplication of data is just as much of a problem in a different way; The queries I'd need to write against the intermediate table would also be irritatingly complex to handle some other field, besides the one I'm interested in, being multi-value.

Any manual attempts to edit data are also then twice as complex, because there's twice as many tables to keep in sync.

There's also the question of storage space. All of this duplicate data could very easily double the size of the database. That has all sorts of negative implications, including disk space (shared hosts are still the majority of web sites, remember), backup speed/space, index size (with performance implications there as well), and probably others I'm forgetting.

Well, you've pretty much

KarenS's picture

Well, you've pretty much thrown out every possibility :)

I don't think we can get good performance and use minimal storage space and have simplified data handling.

My guess is storage is the cheapest and most available resource. I doubt anyone on a shared host has a huge amount of data. If they do they're likely getting too big for shared hosting anyway. Even the current complicated data management has lots of joins, hence performance issues.

But we're all just guessing. The question is exactly how much of hit there is from numerous joins (especially in Views) vs the cost of duplicating data to avoid those joins. And to what extent those results change as you move from smaller to larger to extremely large installations. I don't know how to test that.

Also, keep in mind that data is not edited nearly as often as it is viewed, and some data is never altered, so some extra cost during editing is probably an acceptable trade-off to things that consume resources when viewed (like most Views views).

Here, here!

zoon_unit's picture

As an old grizzled data guy from the days when data storage efficiency was king, I can freely say that times have changed. Data storage is now cheap, and we have ways to store our backup data in compressed form, which eliminates much of the "empty" space.

My experience with shared hosts is that they usually give you far more space than you need. It's bandwidth and processing strain that limits your account. And these are also the issues that impact performance, the thing that the viewer sees. Viewers don't see wasted data space, and could care less.

I would MUCH rather have a fast performing Drupal site that could run on a low cost shared site with a de-normalized database, than the other way around. This, after all, is the whole reason for the data warehousing business, to make data more accessible in exchange for lack of data normalcy and excessive size.

I don't understand the particulars of Drupal data structures that much, but it would seem that caching a CCK node and serving up the fully assembled and cached version to users would provide the biggest performance advantage. In most websites, data is ESVM: "edit seldom, view many"

A variation of an idea

bjaspan's picture

I have some experience with this idea; it is related to the "loading nodes in a single query" path I went down a year ago when I first created the Schema API (http://drupal.org/node/136171#comment-541282). In my case, I was building up a single query with many inner or left joins to load all CCK fields and everything else related to a node id, so I could not make the simplification of merging rows based on 'delta' (not-CCK data does not have a delta) and I doubt I actually handled vids right. Anyway, I ended up abandoning the work in part because I realized the PHP CPU overhead of deciding which columns of which rows to keep in the case that I had multiple multiple-value fields in a single query was really substantial and it was not at all clear I would come out ahead in the end. After all, all the joins are based on primary key and are very fast.

In this case, you are effectively pre-computing the single-query SELECT (that I was performing dynamically) into a table. In principal I have no objection to this. It is certainly no worse than performing the SELECT and storing the results as serialized data, and could have benefits because the data is much higher fidelity in a table structure than as serialized data.

I observe that you'll still have some PHP CPU cycles to interpret the results of querying that table. It is not clear that you can do that work more efficiently than the database could via a normal SELECT with joins and since this is an optimization you have to consider that overhead very carefully.

I'm not sure I agree with Larry's concern about a dynamic schema being hard to deal with w.r.t. writing queries, loading data, etc., because you should never HAVE to use these intermediate tables; they are just a cache. You can work in terms of the base tables and, if you change them, call magic_update_intermediate_tables_for_node($nid) or whatever (e.g. "cache_clear(...)"). I'm also unconvinced by the "doubles data size" problem because we already have a CCK field cache (in addition to the page cache) so we'd just be replacing that with something of about the same size (well, there would be all those NULL values...).

However, I do still stand by all the reasons we wanted to eliminate a dynamic schema from CCK in the first place. It is fragile, complex, etc., etc. We certainly could consider it but, as it is an optimization, we need to know (a) that we had a performance problem and (b) that this idea helps with it. At the moment we do not know either.

In short, I'll say it's a definite "maybe" but this probably constitutes putting the cart before the horse.

Well, as we already stated,

yched's picture

Well, as we already stated, the performance concern about the 'one table per field' approach mainly involves Views-style direct queries. The cached node field data does a fine job in hiding the complexity of the request for node_loads.

So, a few remarks about the Views side of things :

  • If we want those intermediate tables to be the ones queried by Views, then they can't be considered as another form of cache (may or may not have the data), but as an exhaustive copy of the primary tables.

  • In order to avoid duplicate nodes, multiple fields are currently (D5 and D6) retrieved in a separate query (one per field) after the Views-generated query has selected the nids, - that's the 'group multiple values' option, most common use case IMO (That's also a point to consider when making every field 'potentially multiple'). This largely exceeds the cost of a join, and I'm not sure how the proposed structure would enhance that.

This 'post query' is currently not done for CCK fields used in filters, sorts, etc, but those are precisely known to be problematic wrt duplicate results.

  • For Views as it currently stands in D5 and D6, grouping fields inside a common table makes no difference anyway, since the generated query will still join against one separate instance of the table for each field. I'm afraid Views D6 will even make things slightly worse on this aspect, since every different filter proposed by a field has to live under a different views alias (one alias holds 1 (view-)field + 1 filter + 1 arg + 1 sort).
    Maybe the PDO layer in D7 and the subsequent work on a Views query builder in core will make things different, but until then I agree with Barry that this is probably prematurate. For now I mainly see this as a way to move from a serialized cache to a structured one.

Denormalisation on demand

catch's picture

I discussed this briefly with KarenS and Crell and it didn't get shot down (although I did get some funny looks) so I'll try to explain more here.

As far as I can see the main issue with "every field in it's own table" (which personally I like a lot for simplicity) is when it's being used for WHERE or ORDER BY queries against different tables - because then indexes become useless, MySQL can't cache the query etc.. Just straight selects there might be a bit of a hit, but I can't see it being such a big problem

So Ithought that if it's possible to track if a field is being used in those queries, we could denormalise the data into the node table (or wherever) and run the query against there instead. The denormalised data would be a copy - we keep the original table, just don't query against it, and use triggers or something to maintain data integrity. Could also track for when the field drops out of a query so that column could be cleaned up on demand as well.

Quick example:

I create a date field, which is attached to user profiles and stores their birthday. For a year, all I do is show user's birthdays on their profile pages - not such a big deal. Then I make a view using that field to order my users by age.

For the first case, there doesn't seem to be so much of a use case for duplicating that field into a meta-table just for views - the redundant storage, indexes, inserts won't buy us anything. For the second it could potentially buy quite a lot (node_comment_statistics and the tracker and forums queries being particularly nasty examples of normalisation causing performance issues at the moment).

Of course I don't have suggestions on how to actually do this, but it seemed like a compromise between some of the other issues (and could maybe be handled in contrib somewhere?).

If we assume that we are

neoliminal's picture

If we assume that we are going to end up with triplets (truples, 3-tuple), then we can assume that at the very least we will have to combine three columns of data. Subject / Predicate / Object. To produce a "fact" we need to combine these.

Now let's break things down a bit.

I create a field called First_Name. At this point the best I can do is make a 2-tuple. If I enter John as the first value my "fact" is John is a First Name... which is not likely what I was attempting to say.

When I connect a user ID to this field I've now created an interesting fact. (more than one, actually, but I digress)

[uid 3] [has the first name] [John]

This is all well and good. We can create these "facts" on the fly because of views or another module. The stickler. The part that appears to always be a problem, is the Predicate! It's not enough to simply call the field First_Name. You have to find a way to show the Predicate properties.

What looks dangerous to me is the idea that it will be easy to define these predicates, without the need for predefinition, based on some external database. Since we want Predicate universality we have to rely on the fact that everyone who ever creates a First_Name field means the same thing. That's an assumption. So it's an open question to me how to assign properties to the Predicate.

This field First_Name can now be associated with a lot of information when connected to a uid. Associations between "facts" become possible where another field called Email_Address is connected to uid and allows for

[uid 3] [has email address] [jklewis@example.com] AND [has the first name] [John]

So is what we are looking for really reworking the current fields system? I question that. Assigning actions that Predicates can perform. I think we already have subjects and objects taken care of with CCK. It's the Predicates that appear to need attention.

N'est Pas?

--
John Kipling Lewis

I might be on crack...

webchick's picture

...but what if we used Views (not Views module but this kind: http://dev.mysql.com/tech-resources/articles/mysql-views.pdf) to build an aggregate table for each content type, consisting of the individual fields, and queried against that to avoid the joining problem? Are we requiring MySQL 5 in D7? If so, this may be an option.

Anyone know enough stuff about databases to know if this:
a) solves the problem, or if I've totally misunderstood what this is for
b) is cross-database compliant? It looks like PostgreSQL, SQLite, and Oracle

EDIT Yes, I was indeed on crack. Thanks folks for straightening me out. :D A View is nothing more than a "short cut" to just running the queries directly; the join performance overhead is still present. Nevermind.

Good question

recidive's picture

The CREATE VIEW AS SELECT... syntax seems to be compilant. In MySQL, views are derived tables, so in some cases they achieve good performance, in others cases they performs very bad. I did some testing myself some time ago. This article served to me as an eye-opener: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance....

// Are we requiring MySQL 5

catch's picture

// Are we requiring MySQL 5 in D7?

We certainly are: http://drupal.org/node/220329

Maybe

Crell's picture

SQL Views are a possibility. However, they do have performance issues on MySQL still, and even more importantly for us they require a separate permission in order to create them. We have no idea at this point how widespread support for that permission is on shared hosts, so I don't know if we can even use them yet. :-(

MySQL views offer no performance gain

David Strauss's picture

MySQL views are not a solution. They simply abstract away the query below. MySQL internally expands any referenced view to what it represents beneath. If we had something like Oracle's materialized views, we would have something worthwhile to play with.

True

alex_b's picture

I can second that from personal experience.

My feedback

David Strauss's picture

The basic/intermediate table combination is what I was going to propose at Drupalcon for RDF storage. (My RDF storage architecture BoF session never materialized.) The idea was to have a combination of:
(1) A rigid, higher-performance storage location, used as often as necessary. This is basically the "intermediate" model shown above. These tables would be superior to (2) in every way except lacking the ability to store arbitrary fields.
(2) A maximally flexible, lower-performance storage location for anything not defined in (1). Here, I would simply use a table for each major type (int, varchar, etc.) with one column being the field name. With the right indexes, there's no real performance loss over having a separate table for every field.

The data API would know whether a rigid, high-performance storage location was defined for any given data. (2) would only be used if (1) cannot support the storage of data to be stored.

Can you expand on this idea?

KarenS's picture

Can you expand on this idea? Would you duplicate all data in the intermediate tables and how would you keep them current?

No duplication

David Strauss's picture

I would either keep data in the rigid, defined part of the schema or the "dynamic" part. There would have to be a tool to migrate the data back and forth.

Hrm

joshk's picture

Does that put us back at square one, though? E.g. how does this move forward from the existing paradigm of CCK's (awesomely functional, but hard to move towards core) on-the-fly normalization/denormalization? I'm likely not "getting it," but maybe you could expand a bit on the nature of your rigid/dynamic parts and how they'd be used?

http://www.chapterthreellc.com | http://www.outlandishjosh.com

Well, the alternative to

David Strauss's picture

Well, the alternative to putting data in flight between a pseudo triple-store and normal tables is duplication of data. This would put everything in the triple-store but also copy some data to normal tables for efficient indexing. This approach would be very viable with triggers, which are available to us in Drupal 7.

Possible to make an example..?

webchick's picture

Is it possible for you to make an example like in Karen's post above w/ HTML tables representing data? I'm having a hard time understanding your suggested approach. :(

@davidstrauss - i was also

moshe weitzman's picture

@davidstrauss - i was also thinking that triggers would be very useful here.

i wonder if we could just accept the solution where we put every field in own table. for those that want it and have mysql5, a contrib module creates intermediate tables that are populated via triggers. that module also does an alter on the views field/filter/sort/argument definitions (or creates its own parallel universe).

FYI, bechmarking is not that hard in and of itself. You just need the 'ab' program which comes with apachde. See docs at http://drupal.org/node/79237

If I got what you said

recidive's picture

If I got what you said correctly. In our scenario, node title and other fields that are rigid (i.e. doesn't have multiple values and cannot be shared) would be kept on the rigid table, e.g. our current node/node_revisions table, and fields that can't be stored in that model would be on per field type basis, in other words, a table per field type. So to illustrate:

node_revision
nid vid title
  1   1 'Some page'
  2   1 'Some page with 2 attachments'

field_text
nid vid fiid delta text
  1    1    1       0 'Some text data'
  2    1    1       0 'Some text data for the page with 2 attachments'

field_file
nid vid fiid delta fid
  2    1    2       0   1
  2    1    2       1   2

(Note: fiid would be a 'field instance id', so we wouldn't need to duplicate node_type and field_name values)

This approach looks closely to what we have for node -> files tables relationship. So maybe the the field_file table wouldn't be necessary.

The pros of this approach are a) no variable schema, b) possible less tables to join to construct the node object (doing a SELECT * FROM field_text where nid = 1 and vid = 1 would return the data of all fields instances of the type text for a node), c) normalization.

One table per field group?

webchick's picture

This was something interesting Allie suggested... which I'm going to invariably mess up, so hopefully she'll post back and correct me. ;)

Basically, she discussed moving the concept of field groups to be more central and creating a table per field group. For example, the field_address table could consist of columns like address1, address2, city, state, postalcode, and country. And the field_profile table could consist of first name, last name, age, gender... then the "Customer" content type would join on the field_address and field_profile tables tables rather than a table for each individual field.

I think the idea has merit, because that would mean for a typical content type, you might only have 2-3 joins, rather than potentially a dozen or more. I'm not sure implementation-wise how that would work, though (nor UI-wise).

Storing the group in a table

KarenS's picture

Storing the group in a table is fine and would reduce the number of joins, but will only be an improvement if we don't allow you to move fields from one group to another. If you move fields around, we're back to doing what we do now, trying to change the schema and migrate data without losing anything :(

Storing the group in a table

Allie Micka's picture

Thanks to webchick for posting this, 'cause I was at serious risk of forgetting this one on my own :)

Anyway, I've always been a little put off by the complexity - and duplication of effort - for my most common field use cases.

Generally, I'll want to create a "contact information" field type, which would include First Name, Last Name, Organization ( e.g. node reference ), Addresss, Phone, Driving Directions, and so-on.

Once I have this collection of fields, I find myself wanting to re-use them on various content types: Profiles, Organizations, Events, etc. When I cited this example, Angie piped in with "Album" or "Track" information, which might also be used across content types.

From a management perspective, it's much easier to work in this manner than to create, assign, and re-order a disparate collection of fields from content type to content type. Also, if this "complex field" were a first-order field in itself, you could actually re-use the same collection of fields on one node type ( think "Event sponsor contact" vs. "Event location contact" using the above field list on the same node. ).

I think we've all had to hack together an intermediate node type using node references, when the only driving intention was to reuse a set of fields. For example, if I want a "track listing" on an album, each track might have an artist, a duration, and a title. To accomplish this on today's terms, I have 3 options:

1) Add the artist, duration, and title fields to the "album" node type, and then try to line up delta's to align each track's info. Nobody does this, because it would be insane and nigh impossible to group and theme.

2) Write a field module that produces and stores these 3 values. This requires some serious developer chops and a little messy, because e.g. you'd really like to use nodereference for "artist", decimal for "duration" and text for "title".

3) Set up a node reference field to a "track" node type. Maybe this is useful because I want ratings, comments/etc on each track. But for the purposes of this argument let's say it's not useful at all ( in many cases, it's not ). So now we have this whole system built around finding tracks, creating them on the fly using some kind of subform, setting up Viewfield/etc. to list the albums and tracks on both ends, etc.

Meanwhile, you lose out on the introspection that's useful in Views or other queries. If I want to query for "all of the albums with tracks by David Bowie", I must noderef from "album" to "track", then noderef from track to Artist, and THEN limit the query by artist's name. This is presently impossible in the Views UI, so I'm back to writing custom code or queries.

What I'm proposing is a 4th option:

4) Create a complex field type ("track") containing "artist", "duration", and "title" fields, and add this as a multi-valued field "album tracks" on an album content type. Already, I've saved myself a dozen node_loads (in contrast to #3), and, in "one table per field" land, I've saved myself at least 2 joins. Plus, I have a much more themeable and sortable collection of fields to work with. Advanced themers can write just 1 display formatter for the whole group. If I add my track field to other node types, I can also save myself the headache of re-propagating additional sub-fields ( e.g. "track rating" ), should I choose to add them later.

This doesn't save me from the "David Bowie problem", but it does save me from at least one nodereference hop, and one could more easily conceive of building a sensible UI.

Rather than wandering too far off base ( and possibly into "what should/shouldn't be a node" no-man's land), I'm going to stop here and suggest that a complex field type can probably do a lot more good than harm.

@Karen, I totally agree that you shouldn't be able to deconstruct a complex field and move its parts around. That would be a major headache and a data disaster. But what do we give up by disallowing that?

From my perspective, making complex fields a first-order field type makes the "one table per field" a practical solution for more cases. Basically, it changes nothing you've already decided - you can still create 1 table for each "simple" field type. But for more complex models, and cases where you were always going to join against all of the component fields anyway, it's both a management and a performance win.

Has potential

Crell's picture

I think the "fieldset" is a red herring here, as that's strictly a UI question. The take-away I'm seeing from Allie's post is that if we make "compound fields" really easy, then a given node is likely to not have more than 6-7 "fields" in the worst case, rather than 15-20 in the worst case. That removes the main barrier to table-per-field, which is proliferation of tables. As a bonus, easy compound fields is a regular feature request anyway. :-)

I've been wanting a combo

KarenS's picture

I've been wanting a combo field for a long time and tried to write the D6 widget code in a way that would make it possible to do, so this just adds another reason to do it :)

To use this as Allie suggests, the combo field would need to be sharable between content types, so data stored in combo fields will be most efficiently stored in a separate per-combo-field table.

But we still will have the one-off fields to deal with.

how far away? How can we help?

empee584's picture

Hey Karen, just tripped over this.. what is the state of things regarding this? i'd like to help but don't know how..

We already sort of have complex single-value fields

nedjo's picture

Complex fields (collections of different fields) is basically what we have now in our content type tables for non-multiple fields. We have a collection of fields stored together in a single table. The difference would be that we could attach these complex fields to more than one content type rather than only to a single one. Which is a worthwhile idea. Definitely there's a strong use case for collections of fields/attributes that can be applied to more than one content type.

But it doesn't address the basic question of how to handle single-value vs. multi-value fields.

Take Allie's example of a contact record. One field is an "organization" nodereference. Say this started out single-value field, but now we want to be able to link to more than one organization.

We're back where we started.

So it looks like the questions of complex fields and optimal data storage approaches are both priorities, but they're distinct.

Not nested?

Crell's picture

I think it introduces way too much complexity to have a compound field allow some of its subfields to be multi-value within it.

If we insist that within a given compound field each sub-field is unique, then it becomes easy. the "Address" compound field is a combo of "street" (text), "city" (text), "state" (text with select box), "ZIP" (integer). "Address" gets its own SQL table, always. But then it's one table, not 3, and can be shared across multiple node types.

Of course, as I was writing this I realized that multiple node references is a valid use case in Allie's example. Each track has "artist" (noderference), "genre" (text), "number" (integer). That's fine, right up until you get to tracks by multiple artists, which happens a lot for soundtracks, for instance.

Drat. :-(

Drat indeed

Allie Micka's picture

We would have found out about this eventually. Props to Nedjo for catching it first!

I still think that this field group thing has some merit, because it still has the potential to reduce the overall number of joins. Quick recap for my own sanity:

By relying on the "one table per field" rule, we eliminate the complexity of single-content-type/multiple-content-type schlepping (YAY!). And we can also address the single-value/multiple-value schlepping (YAY!). However, we decrease efficiency by mandating lots of joins that weren't there before. (BOO)

With the compound field suggestion in place, we are still addressing single-content-type/multiple-content-type schlepping (YAY). We are also addressing single-value/multiple-value schlepping for complex fields (YAY). And we can potentially get back some of the performance of all that joiney-join-join (not to mention the 64-join limitations on mysql et all, should you create a big content type) (YAY!). However, multiple-valued fields within a compound field still require a separate join table (BOO?)

My first reaction to this epiphany was to consider mandating single values for component fields. However, as Larry suggests, this is only going to end in tears. My second reaction is, "why is this a problem?". If I hadn't ever heard of Drupal, and was tasked with storing this type of data, this type of schema is how I would proceed, and it seems most intuitive from an all-purpose web developer's perspective.

As I understand it, the problem is with data-schlepping, not necessarily with table creation. Why not ask about the single value/multiple value thing on component fields, and then lock in the decision? Create the compound field's table with the single-value fields, and a join table for the multiple-value entries; and don't look back.

CCK - or another contrib module - could task itself with changing the single/multipe setting and rearranging the schema accordingly.

Schlepping

Crell's picture

Is a great word to use here. :-)

That aside, it's not just the performance cost of schlepping data around from table to table that is the problem. It's also the unpredictability of where the data will live. Unless you do absolutely everything through node_load() (which you can't as long as node_load() is ridiculously expensive like it is now), you have to periodically hit the actual database directly. To do that, you need to know what tables you need to query against. If those change out from under you, you've got a nice icky bug. CCK has code internally to figure that sort of thing out, but we're trying to kill it for a reason. It's icky, and even ickier if you have to use it outside of CCK. :-)

I'm still of the mind that separate tables and lazy-loading is the way to go here. node_load() becomes a cheap operation, node_view() is a wash, but you can then use the node as an interface to the data in a reasonably cheap fashion. I worry that we're getting into too much premature optimization by trying to build our tables around a "join everything" logic that won't even work with multi-value fields properly, which only makes node_load() still expensive.

Thumb's down on "fooled you" tables

Allie Micka's picture

I'm open-minded, but fairly opposed to intermediate tables and lazy-loading by default. I'd think that a site of any size, this duplication of data at such a low level would get kind of expensive/messy on its own. Also, in the interest of interoperability, I like the idea that your tables "mean something" - and that you could ( whether or not you should ), query or update data independently of Drupal is more appealing than multi-layered obfuscation.

Last, while working with developers and users on getting up to speed with Drupal, one of the biggest barriers/frustrations/rationales for why "Drupal sucks" is that things aren't how an experienced web developer / database architect would expect to find them. FAPI is a prime example where there's a lot of pain but a respectable cost/benefit rationale. Unless there's a seriously-solid argument ( which there may well be ), we're only alienating ourselves from experienced developers whose assumptions lead to temper tantrums.

As others in this thread, I'm advocating that we decide early on to commit to single/multiple on the first pass, and then bake it into the schema. It could be contrib's job to handle changes to this setup and dealing with the resulting fallout/complexity.

RE: Thumb's down on "fooled you" tables

mikeschinkel's picture

+1

Interesting. Not sure how

yched's picture

Interesting. Not sure how this changes the question of multiple / shared fields storage, though.

I don't see a particular

neoliminal's picture

I don't see a particular advantage for this over simply sorting the fields the way CCK does now. The information created by CCK is (normally) already related information.

--
John Kipling Lewis

Correct me if I'm wrong...

eaton's picture

...But isn't the real problem here the fact that CCK allows you to transparently change the schema at any time, after load of data has been entered and the site is running live, without really telling you that anything will change?

The conversation about per-field-tables and per-type tables is the same conversation originally ironed out years ago. Either approach has serious drawbacks, and results in a less efficient system. Why not just force administrators to make the choice about whether a field is sharable, multi-value, etc up front? Now, with UI support for an explicit number of entries in a multivalue field, we could even have several entries for a field in a single table (i.e., a multivalue 'rating' field limited to three entries could be stored as 'rating_1', 'rating_2', and 'rating_3' on the type-specific field).

Don't expose the options for multivalue or shared once a user has created the field, and migrate the profoundly complex schema-conversion-data-migration functions to contrib for the time being. Allow the code to be used for a separate 'cck_field_upgrade' module or somesuch, that handles field alteration as a separate action.

I've maintained for a while that the ability to 'accidentally' radically alter your schema is one of the most dangerous things about the CCK UI; if it's to move into core, why not take the opportunity to strip that dangerous capability from the UI even if it's technically possible in the UI?

I wouldn't mind getting rid

KarenS's picture

I wouldn't mind getting rid of the dynamic changes completely but this would mean that once you have a content type constructed you can't later decide to add a new field to it, or change anything about an existing field. Is that really going to be acceptable?

This also brings up the question of what values can affect the schema if they're changed. These are the values that would have to be locked down and made unchangable:

1) field name
2) multiple value option
3) ability to share the field on another content type
4) anything that affects the columns (like whether the text module has a format column or not)
5) sizes -- text field sizes and number field sizes -- they affect the schema and shortening a size could truncate data

I feel like I may be missing something, those are the ones I can think of right now.

Add to that : - Max length

yched's picture

Add to that :
- Max length of text fields - VARCHAR(n) / BIGTEXT (nice to adjust after creation, esp since the current UI makes it easily overlooked, and thus have all text fields created as BIGTEXTS)
- scale / precision on numeric fields
- I think fields like Fullname lets you specify whether you want a 'middle name' column. You lose the ability to change that sort of things.

This being said, I find this idea of deferring those kind of changes to a contrib module rather appealing. In the current scenario, the CCK UI would remain in contrib anyway.

"Defer to a contrib module"

moshe weitzman's picture

"Defer to a contrib module" is not a real solution. We've only changed responsibility for who has to come up with the solution.

I encourage folks to look again at the DabbleDB video. That shows the power of making "schema" changes on the fly. I really think this ought to be a core competency of ours. I get that it is hard, and I get that the 2 man superteam that is CCK can't do it on its own. But the more we accept that xyz can't change once it is created, the further we get from DabbleDB.

I don't really like it when we say that making field changes on a live DB is "dangerous". It is only dangerous if code and queries are insufficiently tested.

Two things...

joshk's picture

Two things jump out at me here. On the one hand, that this isn't a question just about CCK. If all we were doing was loading and showing individual nodes, the answers would be a lot simpler. This is really about CCK+Views, the killer 1/2 punch superstars of Drupal contrib. Maybe earl has some thoughts about this?

Secondly, on the subject of what's "dangerous," I think a lot of that comes from the fact that developers like to build on top of CCK in part by writing queries directly. It's a lot faster (and less buggy, at first) than taking the time to define and write all your own CRUD. But the schema changes there mean things can get broken. Perhaps that's something to address more directly with documentation and best-practices.

I'm not familiar enough with the SchemaAPI, but perhaps this can address this by moving Drupal away from direct SQL queries with a thin abstraction layer to support different SQL backends, and towards the kind of real data/object abstraction that many MVC frameworks support. This would mean doing a bunch of things in core to let developers explore the data model (rather than just DESCRIBE TABLE, etc), but that might ultimately be more helpful.

http://www.chapterthreellc.com | http://www.outlandishjosh.com

"Defer to a contrib module"

yched's picture

"Defer to a contrib module" is not a real solution. We've only changed responsibility for who has to come up with the solution.

Current scenario for 'fields in core' is : Core gets Fields CRUD API and support for (programmatic) module-added fields. User defined fields (CCK UI) stay in contrib.
Then, the cases for 'alter field settings after they're created' only appear if you have that 'CCK UI' contrib module enabled, so it's ok if the code that handles nasty db change stays in CCK UI contrib, it's the only one that'll use it.
Edge-case is : foo.module defines some fields, and wants different settings for them in a subsequent release. The concensus in Chicago was 'foo.module should handle that in its own update'.

It's a non negligeable amount of not-that-cool code that we don't need to bring to core-level - which doesn't mean it's doomed to be 'unreliable' 2nd zone code, for that matter (I agree on avoiding the word 'dangerous' - I mean, it currently works).

The 'CCK UI' module can then be targetted towards more DabbleDb-like features.

Import/export

zoon_unit's picture

This problem could be partially reduced by allowing the user to export existing data and then import it into a new, updated schema. This also requires that a robust import/export capability be maintained. Which it needs to be anyway. CCK turns Drupal into a database management system, and therefore it desperately needs a robust import/export capability. The two should go hand in hand.

Not easily possible

David Strauss's picture

Most people running Drupal have PHP timeouts that kill their scripts after 30-60 seconds. We would either have to limit the capability to small sites or offer ways to export/import in small batches, like phpMyAdmin does.

Starting in D6 we have a

KarenS's picture

Starting in D6 we have a batching engine, which should take care of the timeout problem. But I'm not sure if exporting/importing provides any benefit over what we're doing now -- it's just another way of dynamically changing the database from one schema to another. You will still lose data if the new schema doesn't have room for it, and you are still migrating data around.

I have no objections. ;-)

eaton's picture

1) field name
2) multiple value option
3) ability to share the field on another content type
4) anything that affects the columns (like whether the text module has a format column or not)
5) sizes -- text field sizes and number field sizes -- they affect the schema and shortening a size could truncate data

I've long lobbied that 'multivalue' and 'shared' are two options that should be explicit choices made at the moment the field is created, not preferences to be toggled. They're as fundamental as choosing between an integer and a text field, in terms of the DB structure... at least to my mind.

Agreed, so do I, but that

yched's picture

Agreed, so do I, but that stands from the point of view of someone who knows how CCK internally works. From a user's perspective, it's just seen as a usability issue. There's plenty of that in CCK issue queue...
Doesn't invalidate the 'stricter core features' / 'extended contrib handling' approach, though.

CCK user ideas

sin@drupal.org's picture

I add/delete fields and alter their params frequently during development, but do not remember the case when I need to convert single-multiple value field or share-unshare it after site goes to production. So, based on my own experience, it would be acceptable and safe to include field modification resulting in single ALTER TABLE query into core CCK UI, but retain complex data migration cases to contribs or manual db altering if it helps bring fields into core. I think CCK GUI for ALTER TABLE (drop, change, add, modify column) is 20% efforts resulting in 80% most used features.

As for data structures, I like an idea to have more control on how multiple value fields are stored, indexed and queried, assuming CCK is more developer's tool and not admin's. Possible options include:
0. table per field;
1. rows with NULLs and deltas;
2. eaton's 'rating_1', 'rating_2', and 'rating_3' columns;
3. using integer as bit masks for multiple checkboxes;
4. storing multiple values in one column in serialized form as string.

Benchmarking

KarenS's picture

We still don't have anyone doing any real benchmarking, just lots of opinions. I don't have the time or ability to set up a way to benchmark the alternatives to test exactly how much of a performance hit we're really talking about.

Can anyone provide any actual statistics for some of the alternatives?

CCK Changes by LP coming soon.

gordon's picture

Hi,

Corine and I have done a huge change to CCK for Drupal 5 which allows us to created grouped fields. This included constructing the table of the grouped fields so they are in their own tables.

Also I have included a huge performance increase by build the queries for all all the fields, and was able to reduce the number of queries from 1 per field to 1 per table. This includes the multiple fields which are now only 1 delete and 1 insert for no matter how many rows there are it is still only 2 database interactions.

I am going to be posting this patch in the next week.

I have been trying to track down karens and now webchick to talk more about this.

Gordon Heydon
Lead Drupal Developer - Lonely Planet

--
Gordon Heydon

So we do store

chx's picture

So we do store field-per-table and we have a denormalized "intermediate" table to help. The latter could be kept up to date with triggers. If you need to write a complex query which involes the intermediate tables, I am sure we can build a helper based on Database: TNG which will hand you the necessary parts of the query. Any other problems?

A thousand times no

eaton's picture

So we do store field-per-table and we have a denormalized "intermediate" table to help. The latter could be kept up to date with triggers. If you need to write a complex query which involes the intermediate tables, I am sure we can build a helper based on Database: TNG which will hand you the necessary parts of the query. Any other problems?

If we go down this route, we have surrendered once and for all to unmaintainable architectural complexity. Making our default database structure a textbook case of hyper-normalization, then using triggers to maintain a separate set of tables that denormalize that data (which isn't even POSSIBLE in some cases)... I'm sorry, but I think this is a big, big mistake. It makes things look simpler on paper, but the complexity it adds to the systems and the additional performance implications (triggers don't come for free) are monstrous.

I'm hesitant to go down the

moshe weitzman's picture

I'm hesitant to go down the triggers path too. But I spoke with chx and he credits triggers and sphinx search (i.e. not using drupal's search queries) as "saving" performance for NowPublic. You can turn some very nasty queries into easy ones by denormalizing via triggers. So, I expect the triggers technique to become commonplace for large sites (along with our other tricks like memcache, boost, etc.). We'll start seeing them pepperred around Contrib as well.

That's one solution...

eaton's picture

...But it's also a solution to search, which is a space already crowded with curious solutions. Solr, for example, does crazy and interesting things by maintaining its own search system and effectively doing the time consuming work at index time. If we go all out, and make triggers the mechanism by which the normalized tables are kept in sync, we become absolutely tied to the performance of triggers on every DB platform and configuration -- the consequences aren't just an out of date search index, but what amounts to a corrupted node cache that can't be flushed. Extend the paradigm to users, comments, works of art, and all the other entities being discussed, and we've essentially doubled the work done on every insert, haven't we?

I'm willing to be convinced, but I'm extremely concerned about the implications. Are we officially abandoning support for all systems that don't support triggers?

The good news is, I wanted

KarenS's picture

The good news is, I wanted to stimulate some discussion to make sure we were carefully considering all realistic possibilities, and we have that.

The bad news is, there is no clear answer here :(

Every possible answer has its own set of disadvantages and it's not clear where disadvantages are more acceptable than others.

I also want to reiterate that I don't think we have a big problem with node_load() -- fresh, non-cached node_loads() aren't needed that often and we have a solution (caching the node) that works acceptably for that. The problem is for one-off queries -- Views queries and custom queries that might contain any number of possible combinations of fields.

Anyway, here's where I'm leaning, based on the above conversation:

  • The current UI and API can be used to create a field schema that will be optimized based on the field structure (fields that are marked 'unlimited' or are shared get their own tables, others are stored in a single table (perhaps with a column for each value, as noted by Eaton in an earlier comment). That structure will look like the structure most of us would create if we were creating a custom content type and writing our own code. [addresses the issue of creating the most logical, efficient database structure]

  • That structure is changeable until data is added, and then is locked. [addresses the issue of the predictability of the data architecture and the problems of migrating data]

  • A separate module, which will need to start out as a contrib module but which might be added to core later, will allow you to make changes to the database structure and do DabbleDB-like on-the-fly changes to the structure. That module should include some extra protection, like suggesting (or even performing) database backups before making changes. [addresses likelihood that changes will be needed or desired, sooner or later, and the desire for DabbleDB capabilities, and by being separate will also make it easier to realize when you're doing something that could hose your data.]

That sounds reasonable,

moshe weitzman's picture

That sounds reasonable, Karen. It would be good if we allow field changes after some data has been added but after that data has been deleted. So if I really want to make changes after having started using a content type, I can delete the relevant nodes. Unfortunately that would leave some saved Views in a non working state but thats what we have now. Feasible?

Actually, that's exactly

KarenS's picture

Actually, that's exactly what I was thinking, so long as there is no data you can change the structure, so if you're so unhappy with the structure that you're willing to wipe your tables or you want to handle your own save and replace operation by copying your data elsewhere and returning it to the right place, you could make a change. And you'd be on your own regarding what that does to Views, I guess, unless we come up with a good method of cleaning fields out of Vews.

....And....

eaton's picture

If someone in contrib wants to pick up the task (again) of converting an entire database full of live data to a new schema, they can take on the task themselves via their own upgrade hooks or their own internally tested functionality.

yched, are you around? How

KarenS's picture

yched, are you around? How does this sound?

This would allow you to create a new content type, add a few fields to it to be sure it behaves the way you want, wipe those fields out and make more changes, etc, so it seems flexible enough to allow you to play around and reconsider design ideas.

And I'm not sure I'm interested in maintaining a data migration module, so I'm not suggesting that the contrib version of CCK is where that functionality would lie. I'd like to see it in its own module and move CCK for D6 to this model (no changes once data exists) to make core CCK more reliable and easier to maintain.

OK, one more question which is not about the field structure per se but which affects it. What happens if you disable one of the field modules for fields that are currently in use. This happens sometimes and your field data is then in limbo and potentially subject to loss since the field module is not available to handle its data. The field is not available so the field's data won't show up on forms and won't get saved when they are saved. Do we protect that data or let it be wiped out? Do we alter the schema to remove those fields? Do we lock that content type so no changes can be made until all its fields are enabled? Very messy.

A Plan

KarenS's picture

Yched, Earl and I had a long IRC conversation about this and I am going to try to document our thoughts:

Yched's biggest concern about locking fields is the difficulty if you need to later share a field that you had not originally planned to share, like in a version 2 of your site. We discussed that doing that kind of thing would be the role of a contrib module.

Once we agreed to the concept of locking field structure down, he thought about all the other Things That Will Become Fields (tm), like taxonomy and files, where you need to be able to retain the ability to share them later with other content types without knowing ahead of time that you will do that.

So we came up with the idea that some fields will always have a separate table, whether or not they're shared, like vocabularies and files. Plus we'll want to have a way for you to keep open an option for a field to be shared in the future, with some sort of checkbox in the UI and some related option in the API. Those kinds of fields will always get their own separate field tables, even when they are not multiple and not currently shared, a little different than the way we do things now, but it would allow you to do those things in the future without migrating data.

So we'd end up with a structure like we have now, a mixture of a content type table for non-shared and non-multiple fields, and separate tables for shared, multiple, and 'potentially shareable' fields. There would be some joins, but they could be limited by not opting to share fields.

Another thing we discussed is whether or not this will help us with Views -- will we have a separate join for each field even if they are in the same table. If we do things the way we do now, they will, because we give each field an alias table, and each alias gets joined in as a separate table. But it sounds like if we lock the structure down we can get rid of the alias and use the real table names, and that should save us some joins.

The last thing we discussed is whether to move to this new structure now or wait for D7. We tried to figure out what needed to be done and think about whether it could be done in the D6 cycle, and concluded we can't do it before our phase I release because it will greatly slow things down, but we could try to do a D6 version 2 later that would incorporate this new structure, and that would give people who need it a way to move to that kind of structure.

Yves and Earl can post here if I forgot anything important, hopefully I caught it all :)

Looks pretty accurate to me

yched's picture

Looks pretty accurate to me :-) Thanks for summing this up, Karen.

View query optimization

bjaspan's picture

One of the arguments against allowing fields to move from per-content-type to per-field tables is that it requires that the fields be reported to Views in a way that requires a separate join for every field even for a per-content-type table where all the fields are on a single row.

This requirement seems to now be overcome. See http://groups.drupal.org/node/11288 and http://drupal.org/node/256381. CCK can report per-content-type fields as being in separate tables but Views can deduce the truth and optimize the query to eliminate the additional joins.

Therefore, Views queries are no longer an argument in favor of forcing fields to live in one place forever.

Barry, that was a very nice

KarenS's picture

Barry, that was a very nice patch and a useful addition.

It's nice that Views will optimize the query, but I'm not sure that was the most important reason for not moving fields from one place to another. The main argument against that was the difficulty of getting all the data migrated reliably and the unpredictability of knowing where they are.

And it won't help the alternative structure where we keep each field in its own table since you can't optimize that query if the fields really are each in their own table.

Maybe I'm missing something, but I'm not sure how much it's going to help this issue.

And I'd love to get this moving forward again, so I'm definitely not trying to obstruct any attempt to make things work better.

The optimization has only

bjaspan's picture

The optimization has only one direct outcome: "Views queries are no longer an argument in favor of forcing fields to live in one place forever." That doesn't mean fields should be moveable, merely that this one argument that they should not is no longer valid.

In the short/immediate term, it means that Views involving CCK per-content-type tables will be more efficient. That justifies its value all by itself. However, I believe it also bears on fields in core.

In Chicago, I was a proponent (possibly even the proposer, I do not recall) of "all fields should be stored in per-field tables." That may still be the right choice (though I am moving away from it), but it is not the whole story. Here is my current thinking:

People cannot predict the future. From a developer's and Drupal site builder's/admin's point of view, fields cannot be static. Someone is going to create a site with a plain-text field, launch it, and later decide they want the field to be formatted. Or someone is going to create a site where Authors can have only one photo, launch it, and later decide they want Authors to be able to have multiple photos. Or someone will think that a News Article should have "Author" as a text field but later decide they want Author to be a nodereference to an Author node (e.g. "DabbleDB like functionality").

The first example maybe could be handled by "always create all the columns you might need" (e.g. always create the format column, even if unused). The second example would be addressed by always storing data per-field with a delta column. The third example is unrelated to per-type or per-field storage. It represents a data transformation that CCK cannot do today ("create a new node type for Author, find all unique Author values, create a node for them, change the Author field from text to a nodereference, and perform the correct update queries), but that we would really like to support.

For the third example, we are going to need some kind of "Transformation" or "Migration" functionality. I'm sure this will live in contrib. The Migration functionality will be able to handle some subset of the N^2 type to type conversions (e.g. certainly Text to Nodereference, perhaps "(these 6 fields) to a nodeference", and probably some others). It will know how to alter content types, create and delete tables, move existing data around, etc. etc. etc.

In other words, the Migration functionality will need to know how to do all the kinds things that CCK does now for changing fields from per-content to per-field, or adding/removing columns. So, I'm thinking we should move responsibility for those kind of changes to Migration. As far as "fields in core" is concerned, fields cannot be changed once created; but the Migration functionality will know how to reach under the covers and change things around. It will provide an API and UI for doing so.

I figure the Migration suite will start off with CCK's current code for handling field count and simple type changes (e.g. add/remove the "format" column), then we'll add more functionality (e.g. text to nodereference) as time passes. I'm imagining that Migration will provide a key API function that looks something like:

migration_migrate($old_field_spec, $new_field_spec)

and it will do whatever is needed to convert old to new. For things like single/multiple changes keeping the same field type, this function can handle it directly. For things like Text to Nodereference, it will call some other custom function for that particular transform. For things like Image to Address, it will fail because it has no custom function for that particular transform.

SO.... coming back to the topic at hand. The big advantage of per-content-type tables is improved efficiency; (we are assuming that) it is more efficient to load many single-value fields with a single join instead of multiple joins. And we've already written the code handle the per-type and per-field distinction. So why not refactor it a bit and keep it instead of throwing it out? We are going to need Migration functionality anyway, and that is a natural place to put the existing code.

Prior to my Views query optimization patch, there was no reason to think per-content-type tables offered any performance improvement over per-field tables because the major expensive/non-cacheable use case, Views queries, required joining to all the fields separately anyway. Now, Views can actually benefit from the per-content-type structure without the penalty of being unable to change the database layout without killing the Views. So there is a bona-fide advantage to preserving per-content-type tables.

Place clarify the race condition.

bjaspan's picture

Karen, you've said that the way CCK currently converts from per-type to per-field tables contains race conditions and is not really reliable. Please explain why that is. Thanks!

Fields in Core

Group organizers

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds:

Hot content this week