Musings on a Data API

Posted by Crell on January 6, 2008 at 3:38am

I have been pondering the question of a data API for a while, as have a lot of people. Much of the recent discussion has focused on an Active Record approach to a data API. Now, Active Record is a very powerful architectural pattern. It maps nicely from storage to interface, it can be fairly self-documenting, and it is conceptually simple and approachable.

It is also, I believe, insufficient.

Active Record

The basic concept behind Active Record is a one to one mapping from a database record to a user-space object. That object can then be manipulated as any other object, and then saved back to the database with all SELECT, INSERT, and UPDATE statements built dynamically. Additional features, like automatic foreign key lookup or multi-load capability, are also possible, particularly in PHP 5.

Active Record, however, has a number of limitations. First is the raw one to one mapping of database fields to object fields. Very often, you want to work with data at a higher level than its database representation. That can be partially solved by adding additional utility methods to a class representing a given table or entity, but that still assumes that there is a close binding between a table and a first class entity in our data model. Take a typical Drupal node structure. While every node has a record in the node table, it frequently has multiple. Nearly every node also joins against other tables depending on the value of the type column. That's not information we can know before we load the record in the first place.

The second issue comes with multi-value fields. Multi-value fields, in SQL, are typically handled with a dependent table keyed by the primary table's primary key and either a local surrogate key or the value itself. Have a look at CCK's dependent tables for a textbook example. They are keyed by nid, vid, and a local delta value.

Classic Active Record doesn't handle multi-value fields well. Primarily, it is not possible to efficiently load an entire object/record with its dependent fields in one query. The primary table should return a single record, while the dependent tables could return any number of records. While it is possible to run a join on the two tables, that results in a lot of duplicate data being retrieved on each load; the primary table's fields repeat to match each entry in the dependent table. Now add in multiple multi-value fields and it becomes painfully clear that a single-query load operation for any non-trivial data structure is not possible.

That leads us to a very important conclusion: Loading a rich data object from the database will always require multiple queries per load, whether they're loaded all at once or via a lazy-loading mechanism to pull in dependent fields only when needed. Lazy loading also requires a layer of abstraction behind which the lazy loading may occur.
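As a rough illustration of that abstraction layer, here is a minimal sketch of a field that defers its query until first access. The LazyField class and the loader callback are hypothetical names invented for this example, and the callback merely stands in for a real query against a dependent table.

<?php
// Hypothetical sketch: a field that defers its query until first access.
class LazyField {
  protected $loader;
  protected $loaded = FALSE;
  protected $values = array();

  // $loader is any callable that returns the field's values as an array.
  public function __construct($loader) {
    $this->loader = $loader;
  }

  public function values() {
    if (!$this->loaded) {
      // The dependent-table query would run here, once.
      $this->values = call_user_func($this->loader);
      $this->loaded = TRUE;
    }
    return $this->values;
  }
}

$queries = 0;
$tags = new LazyField(function () use (&$queries) {
  $queries++;  // Stands in for a SELECT against a dependent table.
  return array('php', 'drupal');
});

$first = $tags->values();   // Triggers the load.
$second = $tags->values();  // Served from the values already loaded.
?>

The caller only ever sees values(); whether the data was fetched up front or on demand is hidden behind that method.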

The third problem of Active Record is that it presumes the data lives in a local SQL database. It is, by definition, a relational-database-centric design. Data, however, is not always database-centric. Files rarely live in a relational database, and practically never in Drupal. As I write this I am working on my second project recently where the bulk of the data that I wanted exposed as a node does not live in Drupal but in a 3rd party system. In one case it was a foreign (non-Drupal) system with its own API and database connection. In the other, the data is coming in from a SOAP connection. Active Record would not have supported either scenario.

Now, that's not to say that Active Record is a bad design pattern. It is quite useful in a variety of cases, and I've written a number of implementations of it myself (to varying degrees of quality). Drupal 6 now contains the beginnings of an Active Record-style utility suite in the form of drupal_write_record(). However, for a system with the flexibility and potential of Drupal it is too limiting.

Object-Relational Mapping

If Active Record is a database-centric view of data storage, an Object-Relational Mapper, or ORM, is a user object-centric view of the world. In an ORM, the storage of an object is independent of its consumer-facing structure and interface. If it happens to map 1:1 to a database record, that is coincidental. That gives it a lot more flexibility, but also a lot more complexity.

Because an ORM is more robust, it can do a lot more optimization under the hood. Values can be moved around between tables (as CCK does) without changing its interface. Complex relationships can be modeled in a database-independent way, even tying into non-SQL systems like the file system, SOAP, etc.

The flip side, however, is that an ORM can be quite complicated for complex queries, even more so than an Active Record setup. Finding all "event nodes created by Bob in March that take place between May and July and have at least 5 attendees" is reasonably straightforward in SQL (albeit a bit verbose). In Active Record it is more difficult, but can still be done if you can access the SQL layer underneath it. In an ORM, such a query becomes very... involved. Some ORMs, like Doctrine, develop their own query language to address that problem, which in turn means parsing and interpreting a query language in user-space. PHP is not known for its string parsing and interpreting-fu. That's why SQL databases are so popular; they are.

Drupal as of version 6, however, does implement an ORM. Nodes are not built Active Record style. Hooks can add data to nodes from anywhere they want, whether it maps to a database table or not (although it frequently does). However, the ORM is "naked". That is, there's no abstraction on data access. Everything is loaded at once into a single blob-o-raw-data that gets returned from node_load(), and then it's up to modules to not bump into each other and module authors to know what to do with that raw data.

Nowhere is that more painful, in my experience, than in creating your own node form. Drupal saves the submitted form data back to the node object based on the form structure, not the node structure. Matching those up is entirely the module author's responsibility, despite the fact that the form structure and node structure frequently have no right to be the same. If they don't match up, you end up with a hybrid $form_values and $node object. That's just weird.

Thingies

Despite its additional complexity, I believe an ORM approach to be better long-term for Drupal than an Active Record approach. However, that does not preclude a hybrid approach.

What would a hybrid approach be? A hybrid approach would involve a Thingie object (that was the working name decided on in Barcelona for first-class entities like nodes, users, files, etc.) that defines multiple fields. Every Thingie is restricted to a list of fields and properties. Each field can lazy-load itself on demand. Each field is also its own object, and can either be implemented Active Record style (tightly bound to the database with lots of automation) or not (to support SOAP or XML-RPC or flat file or even calculate-on-demand fields). That is, everything becomes a self-contained CCK field.
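To make the idea of interchangeable field objects concrete, here is a small sketch. Every name in it (Field, StoredField, ComputedField, addField()) is an assumption made up for illustration, and the array passed to StoredField stands in for a database row.

<?php
// Hypothetical names throughout; this is not an existing Drupal API.
interface Field {
  public function value();
}

// A field bound to a stored record (the array stands in for a database row).
class StoredField implements Field {
  protected $row;
  protected $column;

  public function __construct(array $row, $column) {
    $this->row = $row;
    $this->column = $column;
  }

  public function value() {
    return $this->row[$this->column];
  }
}

// A calculate-on-demand field with no storage of its own.
class ComputedField implements Field {
  protected $compute;

  public function __construct($compute) {
    $this->compute = $compute;
  }

  public function value() {
    return call_user_func($this->compute);
  }
}

class Thingie {
  protected $fields = array();

  public function addField($name, Field $field) {
    $this->fields[$name] = $field;
  }

  // Property access is routed to the matching field object.
  public function __get($name) {
    if (!isset($this->fields[$name])) {
      throw new InvalidArgumentException("No such field: $name");
    }
    return $this->fields[$name]->value();
  }
}

$node = new Thingie();
$node->addField('title', new StoredField(array('title' => 'Hello'), 'title'));
$node->addField('title_length', new ComputedField(function () use ($node) {
  return strlen($node->title);
}));
?>

Code that reads $node->title or $node->title_length cannot tell which field is backed by storage and which is computed, which is the point of the hybrid.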

For nodes, that would mean non-CCK fields largely cease to exist. Each field, however, would have much more flexibility and could potentially return a processed value or array. For users, the profile module as we know it gets replaced by the same structure as nodes would have, potentially even with the same field objects. Files would finally become first-class Thingies, although would be unlikely to have very complex fields. (But then, who knows?)

The ability to return complex values is important. Some modules add fields conditionally. Ideally, they should do so under a module namespaced array. That is, $node->mymodule['some_field']. With an "all CCK" model, $node->mymodule becomes an object of its own which could have its own single ->value, multiple properties, even its own methods if necessary. Some standard methods (which are more flexible than properties) would be a good idea as well for standardized access, especially for theming.

For multi-value fields, there are two options. One, all fields could be multi-value. That's how CCK works now. CCK fields have a structure of $node->field_foo[x], where x is a numeric delta. After the node has been rendered, each delta has a ['view'] property that has the rendered value of the field and then some combination of other properties for the raw data field(s). That makes handling multiple values more straightforward but can be annoying to work with when most of your fields are single-value. Two, multiple fields could be listed as separate fields. That is, if a node has a single Foo field and two Bar fields then they would be accessed as $node->foo_0, $node->bar_0, $node->bar_1 rather than as $node->bar[0] and $node->bar[1]. That would make treating each field instance separately easier, but referencing a set of fields together more difficult.

I believe CCK got this one right. The flexibility and consistency of treating all fields as multi-value is worth the more complex syntax. However, the syntax could potentially be simplified. PHP 5's Iterator interfaces could be useful here.
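One way the iterator interfaces might soften the syntax is sketched below. FieldValues and its first() helper are hypothetical, and the return type declarations assume PHP 7 or later rather than the PHP 5.2 discussed in this post.

<?php
// Hypothetical sketch; the return types require PHP 7 or later.
class FieldValues implements IteratorAggregate, Countable {
  protected $values;

  public function __construct(array $values) {
    // Re-index so deltas are always 0, 1, 2, ...
    $this->values = array_values($values);
  }

  // Lets callers foreach() over the field as delta => value.
  public function getIterator(): Traversable {
    return new ArrayIterator($this->values);
  }

  public function count(): int {
    return count($this->values);
  }

  // Convenience for the common single-value case.
  public function first() {
    return $this->values ? $this->values[0] : NULL;
  }
}

$bar = new FieldValues(array('one', 'two'));
$seen = array();
foreach ($bar as $delta => $value) {
  $seen[$delta] = $value;
}
?>

Every field stays multi-value internally, while single-value callers can use first() instead of indexing by delta.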

Objects

So far I have assumed that we are talking about a system built on classic objects (that is, objects of a class other than stdClass), not on giant arrays. Certainly, everything discussed here could be implemented using no classes at all. Any problem in programming can be solved either by procedural or OOP programming (or functional, if PHP supported that). However, certain problems are far better suited to one style or the other, where "better" could mean any of less code, more performance, easier to understand, more secure, more flexible, etc. A complex data model, such as the one Drupal now supports, is, I believe, a case where Object-Oriented principles using well-factored classes and methods are a "better" fit. As I've said before, the right tool for the right job. In this case, objects are the right tool.

That does not preclude function wrappers where a function would be more convenient. For example:

<?php
function node_load($nid) {
  return Node::load($nid);
}
?>

Until now there have been very good reasons to not implement heavy OO in Drupal; primarily, PHP 4's object model didn't offer any compelling reason to, as it was too primitive and limiting for the sort of functionality we would need out of it. That is no longer the case in PHP 5.2.

The other major factor is that Drupal's current architects come from procedural backgrounds, so Drupal is mostly procedural, so it attracts procedural thinkers, etc. That is not inherently bad, of course, but at the same time we should not allow that legacy style to hold Drupal back any more than we allow legacy code to hold us back.

Another advantage of a good (emphasis on good) unified Thingie system is that it would make it easier to create new Thingie types. Sometimes a one-off project could benefit from a readily-available data API that is not nodes.

Variants

In Barcelona, Jeff Eaton and I spent a fair bit of time bouncing around concepts for an OO Data API. The main idea that we struck upon was the concept of "variants". A "variant" was an alternate version of a given Thingie. For instance, Revisions of a node are variants of it. For a file, at a trivial level different resolutions of an image are different variants. However, that could easily be generalized. Picture a single File Thingie that is a YouTube-esque video. One variant of it is a high-res version. Another variant is a low-res version. Then there are small, medium, and large thumbnails of it in JPG format. Then there's an English, Spanish, and Chinese transcript, as PDFs. And so on.

For nodes, revisions and translations would be the obvious variants. Variants would also need to stack. That is, one could have node 12, revision 2 of the English version and revision 4 of the German version, but all with the same node id. That would eliminate one of the largest annoyances of dealing with content translation currently: keeping track of nodes that are translations of each other.

Each variant would have to have a default, especially since ideally we want variants to be definable by modules rather than hard-coded. So for instance, one could load a node like:

<?php
Node::load(12, array('lang' => 'es', 'revision' => 2));
?>

That is, load node with id 12, the Spanish variant, the revision 2 variant, and default for all other variants. If lang were not specified, it would mean the English variant. If revision were not specified, it would mean whatever the most recent revision is.
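The default-filling rule just described could be sketched roughly as follows. The resolve_variants() helper and the 'latest' placeholder are assumptions for illustration only; in the proposal the defaults would come from whichever modules define the variants.

<?php
// Hypothetical helper: fill in defaults for any variant not requested.
function resolve_variants(array $requested, array $defaults) {
  // Ignore variant names that no module has defined.
  $known = array_intersect_key($requested, $defaults);
  // Requested values win; defaults fill the gaps.
  return $known + $defaults;
}

// Assumed defaults; 'latest' is a placeholder for "most recent revision".
$defaults = array('lang' => 'en', 'revision' => 'latest');

$variants = resolve_variants(array('lang' => 'es', 'revision' => 2), $defaults);
$fallback = resolve_variants(array('lang' => 'es'), $defaults);
?>

Here $variants keeps both requested values, while $fallback picks up the default revision because none was specified.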

However, that quickly runs into a problem of storage. While a nice API concept, it makes the SQL model more difficult because you then have potentially variable primary keys. That gets ugly, fast.

Perhaps those should be hard-coded...

Rendering

Although one always wants to keep data logic and display logic separated, rendering a Thingie for display is perhaps its most common use. That means we need to consider how such an API would be used for rendering. The natural knee-jerk reaction would be something like this:

<?php
$node->render();
?>

It would also be very wrong. For one thing, it's very brittle. What happens when you want to render as a full node vs. teaser? As an RSS feed item? As a JSON response? Render different node types differently? Very quickly you end up with a whole bunch of flags on the render() method, and we're right back where we started.

More fundamentally, however, it is pulling rendering into the data object. That means there is poor separation between data access and the presentation layer.

Rather, we should take advantage of the fact that, in PHP 5, objects are cheap and passing objects is cheap. The data access object itself should remain pristine, but can then be passed around to other functions and objects at nearly no cost. We can also take advantage of the strict, rigid structure of the Thingie object as a collection of fields. That is, we don't need to render the Thingie! We render its fields and aggregate them for the theming layer.

<?php
function theme_node($node, $style = 'html', $mode = 'full') {
  // Collect each field's rendered output, keyed by field name.
  $return = array();
  foreach ($node->fields() as $field) {
    $return[$field->name()] = theme('field_'. $field->type(), $field, $style, $mode);
  }
  return $return;
}
?>

Grossly over-simplified, but you get the idea. ($style would be things like "print", "html", or "rss" while $mode would be "teaser", "full", etc.)

Ready, Set, Go

One of the big drawbacks of an ORM, particularly one built on lazy-loading, is the risk of the "SELECT 1+N Problem". That is, imagine the following code:

<?php
$books = find_some_books();
foreach ($books as $book) {
  print $book->title .', '. $book->getAuthor()->name();
}
?>

Assuming that Author and Book are two separate entity types, we will execute one query to find the books we want (with its title), and then for each book we will execute another query to load the corresponding author object with the author's name. That is a performance killer. Generally, there are two alternative approaches:

Greedy-Loading: The process of loading a Book or Books will also load all related data, period, with whatever joins or additional queries are needed. That means it can write more efficient queries, but it's also loading far more data than it probably needs. Do we really need to also load all of the Publisher information? Doubtful. This is actually what Drupal does currently, although it doesn't even optimize multi-object loads (unless you're using a List View, which then acts as a very optimized but complex ActiveRecord-ish loader, not an ORM).
Semi-lazy loading: If you know in advance that you're going to want the Author object as well, you can specify that the ORM loader should also pull that data. That is, you write a JOIN, but you don't use SQL but some ORM-specific directive, usually a very verbose one. As I see it, that loses much (though admittedly not all) of the advantages of a lazy-loading ORM: Don't make me think. Then if you need to pre-load against multiple other fields, well, pretty soon you have code so complex that you may as well just write your own SQL and be done with it.

And neither handles multi-value fields, of course.

I would propose a third, more flexible approach; let's take a cue from both jQuery and CCK. A jQuery object (the $() thing) is never a single entity. It is always a set of DOM nodes, sometimes a set of one DOM node. In CCK, a single-value field is, structurally, the same as a multi-value field that only happens to have one value. Views are all about multi-object queries, although those can also frequently have only one entry. They're still treated as multiple entries.

Set operations are both extremely powerful and very natural. SQL itself is built on set operations. Many RPC requests either assume or allow set operations. There is, in fact, no reason why a single-entity operation cannot be implemented as a set operation.

OK, not entirely true. If you have a naked data structure, then you never have an entity but an array of entities, which you then need to foreach(); that can be clumsy if, in practice, you rarely use multi-value fields. That's another place where having an OO interface is "better" (in this case easier) than a procedural one. You can encapsulate the loop, much the way jQuery plugins do. That makes the module-developer-facing API cleaner. This is a good use-case for Interfaces:

<?php
interface Something {
  function doSomething();
}

class Thingie implements Something {
  function doSomething() {
    // Act on this single Thingie.
  }
}

class ThingieSet implements Something {
  protected $thingies = array();
  function doSomething() {
    // Apply the same operation to every Thingie in the set.
    foreach ($this->thingies as $thingie) {
      $thingie->doSomething();
    }
  }
}

$t = new Thingie();
$t->doSomething();
$s = new ThingieSet();
$s->doSomething(); // Does something to every thingie, even if there's just one of them.
?>

But why do we even need a loop? Moving a loop inside a function call makes the API cleaner, but especially when dealing with SQL databases doesn't necessarily help performance. Remember, though, that we're demanding a rigid Thingie structure. A Thingie is composed of Fields. Let those Fields handle their own set operations, especially for lazy loading. That means each field needs to know what its "load set" is. That is, suppose we have a Set of Node objects. When we access, say, the taxonomy field on one node, it should lazy-load all taxonomy data for the nodes in that Set in a single query and then assign out the data to the appropriate node's field. Update operations then can also be grouped, making it possible to load a set of Thingies, set one field on all of them, and then save, with the load and update operations both taking only one query each no matter how many Thingies we're modifying, and without writing any SQL ourselves.
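A stripped-down sketch of a field that knows its load set might look like the following. SetField, valuesFor(), and the fetch callback are hypothetical names, and the callback stands in for a single "WHERE nid IN (...)" style query; this is not the implementation mentioned below, only an illustration of the idea.

<?php
// Hypothetical sketch of a field that loads for a whole set at once.
class SetField {
  protected $ids;
  protected $fetch;
  protected $data = NULL;

  // $fetch receives every id in the set and returns values keyed by id.
  public function __construct(array $ids, $fetch) {
    $this->ids = $ids;
    $this->fetch = $fetch;
  }

  // Returns the values for one Thingie, loading the whole set on first use.
  public function valuesFor($id) {
    if ($this->data === NULL) {
      $this->data = call_user_func($this->fetch, $this->ids);
    }
    return isset($this->data[$id]) ? $this->data[$id] : array();
  }
}

$queries = 0;
$terms_by_nid = array(1 => array('news'), 2 => array('php', 'drupal'));

$taxonomy = new SetField(array(1, 2, 3), function (array $ids) use (&$queries, $terms_by_nid) {
  $queries++;  // One query, regardless of how many ids are in the set.
  return array_intersect_key($terms_by_nid, array_flip($ids));
});

$a = $taxonomy->valuesFor(2);  // Loads taxonomy data for nodes 1, 2 and 3.
$b = $taxonomy->valuesFor(1);  // No further query.
$c = $taxonomy->valuesFor(3);  // Node 3 has no terms.
?>

The query counter stays at one however many nodes in the set are touched, which is what flattens the cost per field.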

I've implemented such a system recently for a 3rd-party integration project, and it is actually not difficult to implement. (Well, the read part, anyway. The write part is more complicated, but I believe it is solvable.)

While not as efficient as explicitly specifying the fields we want, it does flatten the cost. That is, the Book and Author example above would have a cost of 2: one to mass-load books, and one to mass-load authors. The worst-case scenario for such an approach is that every possible field gets called anyway, or 1 + F. That is exactly the cost of a node_load() in Drupal now... except that node_load() is actually N*(1+F) because it doesn't do set operations.

Searching

The other major drawback of an ORM is that non-trivial searching is frequently very verbose. We need to specify a variety of weird search parameters in a completely data-agnostic fashion. That means lots of methods, or lots of objects passed into a long method signature with lots of arguments, such as in Qcodo. Or it means a custom query language like Doctrine. Both are icky.

To be honest, I see the best course of action here to be "punt". Separate loading from searching entirely. A set-load operation should take simply an array of IDs. How those IDs are found is a separate question. They could be from a simple API call, or looked up from a field in another object, or from a Views-like builder, or from a direct SQL query builder, or a hand-rolled 15-line super-complex SQL query (which would be about 1/10th the size of trying to write that query abstractly with fully-OO directives), or retrieved from a remote server via SOAP. By keeping searching and loading separate, we maximize flexibility while minimizing complexity. While there will be, in some cases, a cost in terms of additional queries, that cost is relatively flat and, I would argue, a good value.
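The separation can be sketched in a few lines. Both functions and the $storage array are hypothetical stand-ins: the search step could just as well be a hand-written SQL query or a SOAP call, as long as it ends in a list of IDs.

<?php
// Hypothetical names; $storage stands in for whatever backs the Thingies.
$storage = array(
  10 => array('type' => 'event', 'author' => 'bob'),
  11 => array('type' => 'page', 'author' => 'bob'),
  12 => array('type' => 'event', 'author' => 'alice'),
);

// Searching: any strategy at all, as long as it ends in a list of IDs.
function find_event_ids_by_author(array $storage, $author) {
  $ids = array();
  foreach ($storage as $id => $record) {
    if ($record['type'] == 'event' && $record['author'] == $author) {
      $ids[] = $id;
    }
  }
  return $ids;
}

// Loading: knows nothing about how the IDs were found.
function thingie_load_multiple(array $storage, array $ids) {
  return array_intersect_key($storage, array_flip($ids));
}

$ids = find_event_ids_by_author($storage, 'bob');
$events = thingie_load_multiple($storage, $ids);
?>

Swapping the search function for a different one requires no change to the loader.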

Now, combine flexible searching with everything-as-a-set, and you have a very large portion of Views not just in core, but being core itself.

What not to do

There are, of course, a lot of potential pitfalls in any such endeavor. There are two I will mention in particular.

Deep inheritance. Deep inheritance hierarchies are well-understood as being dangerously brittle. Adding functionality requires editing existing code. You can't do things "sideways", as Drupal does nearly everything. If you want a certain piece of functionality to be part of two classes that otherwise have no reason to be logically children of each other, you're SOL. That's not to say that inheritance is always bad, but it is very easy to get carried away. Let's avoid that.
For example, it may seem natural to have a Node class, and then child classes of Node for Page, Story, Event, etc. However, that's not how nodes are actually used in Drupal. A node is a node is a node, but it has some combination of fields on it (even now). Rather, a Node should continue to have a ->type, which in turn indicates what field objects should be added to a generalized field container.
Code generation. I have never understood how the "Don't Repeat Yourself" and code generation design philosophies can coexist. Code generation (speaking of logic generation here, not function skeleton generation) is repeating yourself. In fact, it's repeating yourself so much that you make the computer repeat yourself for you. That's a sure sign that you are doing something wrong, and need to refactor your code to actually use inheritance, or composition, or something other than typing out the same code over and over again. Of course, there is one catch here: PHP 5.2 and earlier do not support late static binding, which means, in essence, static methods can't be usefully inherited because references to other static methods will always be to the defining class, not the actual class. Fixing that is a hotly debated topic for PHP 5.3, but that's still well-off as far as we're concerned.
Tight-coupling. (Nobody expects the Spanish Inquisition!) A truly pristine data API will require fully and totally divorcing it from the rest of the system, particularly forms. The node edit form's save routines, for instance, should not in any way shape or form know anything about SQL. They should all simply set values on the $node object, and then eventually $node->save() gets called and fields internally save themselves. Similar separation is also necessary for theming. (Like, can we get comments out of node_view(), please?)
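The first and third points can be illustrated together with a rough sketch: a single Node class whose ->type selects its field objects, and a save() that delegates to the fields so callers never touch SQL. TextField, the type-to-field mapping, and the field() accessor are all assumptions made up for this example.

<?php
// Hypothetical sketch: one Node class, with fields chosen by ->type.
class TextField {
  public $value;
  public $saved = FALSE;

  public function save() {
    // A real field would write to its own storage here.
    $this->saved = TRUE;
  }
}

class Node {
  public $type;
  protected $fields = array();

  // Assumed mapping of node type to field names, for illustration only.
  protected static $fieldsByType = array(
    'page' => array('body'),
    'event' => array('body', 'location'),
  );

  public function __construct($type) {
    $this->type = $type;
    foreach (self::$fieldsByType[$type] as $name) {
      $this->fields[$name] = new TextField();
    }
  }

  public function field($name) {
    return $this->fields[$name];
  }

  public function fieldNames() {
    return array_keys($this->fields);
  }

  // The caller never sees SQL; each field saves itself.
  public function save() {
    foreach ($this->fields as $field) {
      $field->save();
    }
  }
}

$event = new Node('event');
$event->field('location')->value = 'Barcelona';
$event->save();
?>

There is no Event subclass anywhere; the difference between a page and an event is only which fields end up in the container.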

The future

What I have described here is, of course, ambitious. Very ambitious. It represents a fundamental shift in the way Drupal handles data. However, I believe that shift is necessary for Drupal's long-term success.

Dries has set some rather lofty goals recently. Web services. Drupal as the "Linux of the web". He doesn't think small, and neither should we. A robust, powerful, and above-all decoupled data API moves Drupal from being a CMS or application framework to being something else: A data server. The menu system acts as a simple router to handle RPC/REST requests, which the data server responds to. It just so happens that the most common REST request is for an HTML page; at least for now. In 5-10 years, though? Will we even be talking about SQL and HTML? A good data API should be agnostic of both.

Implementation

So what would such a system look like in practice? How would we use hooks with it? I'm glad you asked! I'm not entirely sure yet, aside from the samples above. I have some ideas, however, most of which owe thanks to other members of the Drupal community. For instance, courtesy of Jeff Eaton (who may credit it elsewhere himself) is hook_nodeapi() going away in favor of hook_node_info() defining what field types a given node type has, and then those field types (objects) internally do whatever loading and saving they need. Adding fields to a node type is accomplished by a simple hook_node_info_alter(). hook_user() would likely suffer a similar fate.
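Purely as a guess at the shape this could take, here is a sketch of the two hooks. The 'fields' key, the field type names, and the module names (event, signup) are assumptions; Drupal 6's hook_node_info() has no such key, and hook_node_info_alter() is only the hook proposed above.

<?php
// Hypothetical hook implementations; the 'fields' key is an assumption.
function event_node_info() {
  return array(
    'event' => array(
      'name' => 'Event',
      'fields' => array('body' => 'text', 'date' => 'datetime'),
    ),
  );
}

// Another module adds a field to the event type.
function signup_node_info_alter(&$info) {
  $info['event']['fields']['attendees'] = 'user_reference';
}

// Stand-in for the hook invocation Drupal itself would perform.
$info = event_node_info();
signup_node_info_alter($info);
?>

After the alter hook runs, the event type carries three fields, and nothing in either module had to implement load or save logic.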

One of the possibilities discussed in Barcelona was to implement just files using a new Thingie-based system, as a trial balloon. Despite recent improvements, Drupal's file handling is still a weak point. That means the risks of experimentation are lower than if we were to break, say, nodes. If it works out, we can then extend it to the rest of the system in the near future, and we'll have a good API behind which to implement things like, say, pluggable storage engines.

At this point, I welcome feedback from all and sundry on the concepts presented here. Truthfully, I do not expect it to all happen in Drupal 7. If it all happens by Drupal 8, it will be aggressive. However, we do need to have a clear picture of where we are going and how to get there, even if it will take a few versions.

Comments

Very good ideas

Posted by chx on January 6, 2008 at 2:12pm

One of my biggest concerns was that we always needed more than one release to get stuff right. Having a trial with something that's already more or less broken is a very good idea.

Also, having 1+F queries instead of having N*(1+F) seems like a good and serious boost. The search part, however, needs a bit more explanation.

Can't wait to see you and code this.

Not covering searching

Posted by Crell on January 6, 2008 at 6:32pm

The search part is deliberately separate. That is, the ORM itself has only the following:

<?php
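class Thingie {
  public static function loadMultiple(array $ids) {
    // return a ThingieSet object
  }

  public static function load($id) {
    // Assumes ThingieSet supports array-style access to its members.
    $set = self::loadMultiple(array($id));
    return $set[0];
  }
}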
?>

How one builds that array to feed into loadMultiple() is an entirely separate question. Trying to cover every possible use case with dedicated methods and constants and parameters is way too hard, and would only result in ridiculously complex-looking code anyway. So, punt. (Reference to American football where the team with the ball kicks it far down field to the other team so that the situation becomes the other team's problem and they can change strategy.)

In tune with the future

Posted by ximo on January 8, 2008 at 10:46pm

I've read the whole text and understood some of it :)
All in all, I'm positive about the direction Drupal seems to be taking.

A robust, powerful, and above-all decoupled data API moves Drupal from being a CMS or application framework to being something else: A data server. The menu system acts as a simple router to handle RPC/REST requests, which the data server responds to. It just so happens that the most common REST request is for an HTML page; at least for now. In 5-10 years, though? Will we even be talking about SQL and HTML? A good data API should be agnostic of both.

I couldn't agree more! The future of the web is semantic, so this change is not only ambitious but spot-on. This is what may set Drupal apart from its competition in the future, and where others may fail.

grin

Posted by Boris Mann on January 9, 2008 at 12:25am

We are hitting Book + Author right now...and it sucks. I was digging into some of Eaton's old noodlings and code on relationship API. Regardless, the noderef stuff needs revamping and connecting into Views.

Very good to see this thinking. But, aargh!, sitting here on 5, and it's not in 6....how do we get there in interim steps? Apply it as a drop in replacement for files in 6? Views 3 for 6? CCK 2.0 for 6?

xD

Posted by lunaris on January 9, 2008 at 4:28pm

I found this really interesting and have thought along your lines for a while (though I would be lying if I said I had considered it in this depth :) Thinking about it though, do you think this would entail an almost complete rewrite of Drupal? It seems to me like a lot of things couldn't coexist in their current form; for example I am worried that after implementing this beautiful, clean ORM layer, hooks would look "ugly."

Excuse me if I am talking complete crap - but I would very much like to contribute to this effort, both through discussion and coding :)

Peaceful coexistence

Posted by Crell on January 10, 2008 at 5:55am

Well... It would involve rewriting any data-storage parts of Drupal, which is a sizeable chunk of it. It would indeed be a FAPI-sized undertaking at least, which is why I said spreading it out over multiple versions is likely.

That doesn't mean many things couldn't coexist, however. The menu and page callback system would be almost completely unaffected, for instance (although they'll likely evolve at the same time in their own ways). There's nothing wrong with hooks, and in fact toward the end I did propose using hooks to define how the ORM layer will work. I wouldn't say hooks are ugly, just not always the right tool.

The fundamental problem is that right now, Drupal is horizontally very modular, that is, modules can add new functionality anywhere. However, it is not very vertically modular. That is, data storage, business logic, and rendering are far more intertwined than they should be. The goal here is to separate out the data storage layer entirely (or as much as possible) so that we can free it from its current business logic intertwinings and vice versa.

And Drupal is always being completely rewritten, just not all at once. :-)

Similar System?

Posted by mfer on January 10, 2008 at 8:19pm

Recently I've been digging around in other systems' ORM layers. None of what I've seen would be quite right for Drupal. Do you know of any systems that have an ORM that's similar to what you are envisioning this would be?

I'm a big fan of examples, even if they are only similar in a particular way that needs to be showcased.

Matt Farina
www.innovatingtomorrow.net
www.geeksandgod.com
www.mattfarina.com

Writing it

Posted by Crell on January 16, 2008 at 9:32pm

I don't know of any 3rd party ORMs along the lines I'm describing. The closest model is the ORM I designed for a site I'm working on right now, which I alluded to in the article. I'll see if I can get that to a demoable point. (It has some rough edges and is read-only, so it's not a complete model.) No guarantees at the moment.

interesting!

Posted by fago on January 17, 2008 at 11:25am

that's really interesting, thanks for the write-up! I like your ideas, in particular the "ThingieSet".

I think that another important point would be to unify the available hooks per "thingie".

Currently, each module has to support each thingie explicitly. E.g. CCK does a great job for nodes, but why isn't it available for comments and users? Because the comment and user systems are way too different and supporting them would require a lot of extra code.

I'd love to see CCK providing its features in a generic way suitable for all kinds of thingies. To make this possible I think a concept like "Interfaces" from the OO world could be useful. Each "Thingie" defines which interfaces it implements. Then modules like CCK build upon thingies that implement the required interfaces - so it could seamlessly support users and comments too.
Of course this would also require an overhaul of the hook system, as e.g. something like hook_nodeapi shouldn't be limited to nodes, but to thingies that implement e.g. "Customizable".

So modules could go and implement their stuff in a generic way, which I think is really important. So it would be a final and imo good solution for the "everything should be a node" debate.
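
The interface idea above can be sketched roughly as follows. This is only an illustration in Python rather than Drupal's PHP; the names Customizable, FieldModule, thingie_id and the classes below are all hypothetical and not part of any existing Drupal API.

```python
from abc import ABC, abstractmethod


class Customizable(ABC):
    """Hypothetical interface: a thingie that can carry extra fields."""

    @abstractmethod
    def thingie_id(self) -> str:
        """Return an identifier that is unique across all thingie types."""


class Node(Customizable):
    def __init__(self, nid: int) -> None:
        self.nid = nid

    def thingie_id(self) -> str:
        return f"node:{self.nid}"


class Comment(Customizable):
    def __init__(self, cid: int) -> None:
        self.cid = cid

    def thingie_id(self) -> str:
        return f"comment:{self.cid}"


class Session:
    """A thingie that does NOT declare the Customizable interface."""


class FieldModule:
    """Hypothetical CCK-like module written against the interface only."""

    def __init__(self) -> None:
        self._fields = {}  # thingie id -> {field name: value}

    def attach(self, thingie, name, value) -> None:
        # The module never asks "is this a node?", only "is it Customizable?".
        if not isinstance(thingie, Customizable):
            raise TypeError("thingie does not implement Customizable")
        self._fields.setdefault(thingie.thingie_id(), {})[name] = value

    def fields_for(self, thingie) -> dict:
        return dict(self._fields.get(thingie.thingie_id(), {}))
```

With this shape, supporting a new kind of thingie means implementing the interface once, not patching every module.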

Everything as nodes

Posted by mfer on January 17, 2008 at 12:36pm

You are essentially opening up the door on the everything as nodes conversation. Or, at least, on the ability to do things to them in the same way as nodes. Wouldn't it be nice if comments and users could have all the same things happen to them as nodes? If only the conversation were that simple.

While it would be great to have these abilities, there is a rather large cost. Nodes are expensive when it comes to resources. If comments were nodes, my personal blog would need to use a lot more resources to serve a post. The cost has been great enough that the Drupal leadership has opted not to give other objects the same abilities as nodes.

If someone can come up with a way to lighten the load on nodes quite a bit I'm sure we could change things up. But, I'd study up on this conversation before joining in. There is a lot of background to learn.

Though, a consistent data API would make things easier.

Matt Farina
www.innovatingtomorrow.net
www.geeksandgod.com
www.mattfarina.com

where's the beef?

Posted by moshe weitzman on January 17, 2008 at 2:46pm

with all due respect, your post adds little to the conversation.

no

Posted by fago on January 18, 2008 at 7:53pm

no, I'm not talking about everything as nodes.
I'm talking about making it possible to code for all "compatible" Thingies at once.

Missing the point

Posted by mikey_p on January 19, 2008 at 6:31am

I'm with Moshe.....

What this whole subject isn't about is making everything into nodes. At least, that isn't the reason this is all being done. Rather, it's more an issue of standardizing the method of handling data, whether it be a node, comment, file, or user. The whole point is to make the back-end handling of data more agnostic of the type of data, so that the same code can be used to return nodes, comments, users, files, whatever.... This does not inherently introduce extra 'load.' Rather, the examples Crell has outlined show how this could be used to reduce unnecessary database queries by standardizing on data handling.

The main point is that this A) totally changes the way you should think about anything in terms of 'load', comments, nodes, users, whatever and B) actually makes the Everything-is-a-node vs. No-its-not arguments a moot point. Nodes, comments, and users will have no more or less in terms of data handling than any other node, comment, or user. It really levels the playing field between them. So if a user wanted to turn comments into a node-like data structure, they probably could, and do the whole thing as efficiently as it is done now, if not more efficiently.

Missed the point

Posted by mfer on January 22, 2008 at 4:32pm

Point taken... I obviously missed it. :-)

Happens sometimes.

Matt Farina
www.innovatingtomorrow.net
www.geeksandgod.com
www.superaveragepodcast.com
www.mattfarina.com

Doctrine

Posted by jscheel on January 31, 2008 at 3:20pm

One of my co-workers (actually, the guy right next to me) is the #2 man in the Doctrine project. I must say, seeing how powerful Doctrine is from his word, and using it myself in a few projects, I am a firm believer that integrating Doctrine into Drupal could be a huge leap forward. I, as well as Jon (the Doctrine dev), would be fully committed to placing Doctrine in as the dbal/orm for Drupal 7.

Data definition, Query API & Caching

Posted by miro_dietiker on August 11, 2008 at 10:30am

Very interesting thoughts about the future evolution of Drupal's data handling. Thanks for sharing them with us.
I'm always concerned when I see plain SQL in the code, the current Drupal way...

However, there are three major additions I'd like to bring in.

Data definition
As long as modules are playing in the local database, the data API should also provide a formal definition of data structures.
If you get it right, the API will be able to create the tables without a line of code itself. It will also be able to check table integrity and do upgrades. The internal Drupal module upgrade implementation could even handle field alteration and column renaming.
Of course, upgrades that need an update function because of content changes must still be possible through interface calls.
I'd provide something like an XML data definition per module to be rendered into the system's data definition on install.
This would also be a huge improvement. Query builders like the Views project could then simply build on this data relation definition, extending it where needed.
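
As a rough illustration of the idea, here is a minimal sketch (in Python rather than PHP) that turns a per-module XML definition into a CREATE TABLE statement. The XML element and attribute names, the type mapping, and the function name create_table_sql are all made up for this example; they are not an existing Drupal format.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-module data definition; the schema is invented for this sketch.
DEFINITION = """
<table name="example_note">
  <field name="nid" type="int" not_null="true"/>
  <field name="body" type="text"/>
  <primary_key>nid</primary_key>
</table>
"""

# Assumed mapping from abstract field types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "text": "TEXT", "varchar": "VARCHAR(255)"}


def create_table_sql(xml_text: str) -> str:
    """Render a CREATE TABLE statement from the XML definition."""
    table = ET.fromstring(xml_text)
    columns = []
    for field in table.findall("field"):
        column = field.get("name") + " " + SQL_TYPES[field.get("type")]
        if field.get("not_null") == "true":
            column += " NOT NULL"
        columns.append(column)
    primary_key = table.findtext("primary_key")
    if primary_key:
        columns.append("PRIMARY KEY (" + primary_key.strip() + ")")
    return "CREATE TABLE " + table.get("name") + " (" + ", ".join(columns) + ")"
```

The same parsed definition could also feed integrity checks or a query builder, since the table structure is available as data and not only as SQL.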

Search / Query API
I'm not sure if I understood you right, but the "list of ID thing" is the wrong way for a loader, in my opinion.
I'd suggest introducing a loader API that does joins on the fly before rendering the final SQL. You don't need very complex capabilities, and dynamically rendering SQL from an object structure that represents the search is very neat.
Thus, when PHP needs to walk through many nodes, we would need memory only for a rule definition, not for thousands of IDs.
The search object (before rendering into SQL) could therefore be passed to any hook or any module to be modified.
E.g. an author join in a node query could simply be done by any module/hook or rendering participant and therefore reduce cost.
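
A minimal sketch of such a search object, written in Python for illustration only. The class SelectQuery, its methods, and the hook function author_alter are hypothetical names; the table and column names are borrowed from the usual Drupal node and users tables purely as an example.

```python
from dataclasses import dataclass, field


@dataclass
class SelectQuery:
    """A search described as data; SQL is rendered only at the very end."""
    table: str
    alias: str
    fields: list = field(default_factory=list)
    joins: list = field(default_factory=list)
    conditions: list = field(default_factory=list)
    args: list = field(default_factory=list)

    def join(self, table, alias, on):
        self.joins.append(f"INNER JOIN {table} {alias} ON {on}")
        return self

    def where(self, condition, *args):
        self.conditions.append(condition)
        self.args.extend(args)
        return self

    def render(self):
        """Return the SQL string and its placeholder arguments."""
        parts = [f"SELECT {', '.join(self.fields)} FROM {self.table} {self.alias}"]
        parts.extend(self.joins)
        if self.conditions:
            parts.append("WHERE " + " AND ".join(self.conditions))
        return " ".join(parts), list(self.args)


def author_alter(query):
    """Hypothetical hook: any module may add its join before rendering."""
    query.join("users", "u", "u.uid = n.uid")
    query.fields.append("u.name")
```

A caller would build the query, pass it through every registered hook, and render it once, so the author data arrives in the same query instead of a second lookup per node.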

Caching
I've found no words about caching in your post. Caching is a very complex thing: the more complexity and modularity you add to the wrapper and data access, the lower the probability of finding a caching implementation that speeds things up.
A solution building on plain abstraction and code-reduction would always result in SQL fetching.
The earlier you introduce caching in the design (lookups, object representation), the more feasible it becomes and the greater the speedup.
I'm not yet sure how to do this perfectly, but there's a lot of potential. Possibly not in the data API itself but at the rendering level. Even when the data API has no caching inside, it needs to support cache invalidation to get rid of outdated cache elements.
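
To make the invalidation point concrete, here is a small sketch in Python (not Drupal code). CachingLoader and the two callables it takes are assumptions made for the example; the only idea shown is that every write goes through the API, so the API knows which cached entry to drop.

```python
class CachingLoader:
    """Caches loaded records and drops an entry whenever it is saved."""

    def __init__(self, load_from_db, write_to_db):
        # Both callables are supplied by the caller; they stand in for real storage.
        self._load_from_db = load_from_db
        self._write_to_db = write_to_db
        self._cache = {}

    def load(self, kind, ident):
        key = (kind, ident)
        if key not in self._cache:
            self._cache[key] = self._load_from_db(kind, ident)
        return self._cache[key]

    def save(self, kind, ident, record):
        self._write_to_db(kind, ident, record)
        # Invalidate instead of updating, so the next load sees stored data.
        self._cache.pop((kind, ident), None)
```

A rendering-level cache could subscribe to the same save path to discard its own outdated output.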

I hope my words were clear enough, and I really hope some of these advanced techniques will find their way into Drupal.

Been thinking about this

Posted by snelson on September 30, 2009 at 11:32pm

Been thinking about this again lately. Searching around for any recent developments regarding Drupal and ORM. Besides Database API, has this discussion gone anywhere since this post? Any info or links much appreciated.

I have recently put up a new

Posted by brendoncrawford on November 29, 2009 at 5:24am

I have recently put up a new project which provides an ORM-like interface for Drupal Nodes and their corresponding CCK fields. It is working well in D6, and I have a D7 in progress. The project is still young, but it is in fact working. If interested, you can see it at http://drupal.org/project/orm/.

-brendon

Wow! Very interesting module!

Posted by denisanokhin on December 26, 2009 at 12:03am

It seems to be a very useful module. I am going to test it right now. Thank you for it. Drupal really needs to have a powerful ORM.