Node storage back-ends


Posted by David Strauss on December 10, 2008 at 2:00pm
Last updated by catch on Fri, 2009-01-09 00:54

Edited to add: this was posted by David Strauss.

My primary interest in having fields in code is the ability to retarget our node storage back-end. (And when I say "nodes," I really mean any object using the fields API.) Relational databases -- at least the open-source ones -- make a poor storage engine for nodes.

(NOTE: Some comments on this proposal were made on the wrong node. I think moving those comments to this node will require whacking the database directly.)

Some of the problems with relational databases for nodes are:

They don't provide a way to natively view nodes directly in the database without tons of JOINs.
They can't index across tables, which is a problem when your node storage spans many tables.
Because the database doesn't understand nodes as objects, sharding, exporting and importing, replicating, and efficiently archiving old revisions require lots of application complexity.
Full-text search has to be bolted on.

Obviously, we still need to support node storage in relational databases to preserve broad compatibility, but it should be one of several back-ends. We can make this possible with pluggable field storage engines. It should be possible to enable multiple engines at once. With multiple engines enabled, nodes would be stored in all enabled engines. If a new engine is enabled, the fields system should "index" all existing nodes into the new engine.

Initially, we might make the relational back-end required to have a stable, consistent foundation. Eventually (think Drupal 8), it would be great to be able to disable any engine as long as at least one engine with the full set of nodes remains enabled.
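
A minimal sketch of the enable/disable behavior described above, in Python for brevity rather than PHP. Every name here (InMemoryEngine, FieldStorage, enable, disable) is hypothetical and not part of any existing fields API; the engines are plain in-memory dicts standing in for real back-ends.

```python
class InMemoryEngine:
    """Hypothetical engine: keeps nodes in a dict keyed by node ID."""

    def __init__(self, name):
        self.name = name
        self.nodes = {}

    def save(self, node_id, fields):
        self.nodes[node_id] = dict(fields)

    def load(self, node_id):
        return self.nodes.get(node_id)

    def all_ids(self):
        return list(self.nodes)


class FieldStorage:
    """Hypothetical dispatcher that mirrors every node into all enabled engines."""

    def __init__(self):
        self.engines = []

    def enable(self, engine):
        # "Index" all existing nodes into the newly enabled engine.
        if self.engines:
            source = self.engines[0]
            for node_id in source.all_ids():
                engine.save(node_id, source.load(node_id))
        self.engines.append(engine)

    def disable(self, engine):
        # Every enabled engine holds the full set, so only the last one is protected.
        if len(self.engines) == 1 and engine in self.engines:
            raise RuntimeError("at least one engine must remain enabled")
        self.engines.remove(engine)

    def save(self, node_id, fields):
        for engine in self.engines:
            engine.save(node_id, fields)


storage = FieldStorage()
relational = InMemoryEngine("relational")
storage.enable(relational)
storage.save(1, {"title": "Hello"})

couch = InMemoryEngine("couchdb")
storage.enable(couch)        # node 1 is copied into the new engine
storage.save(2, {"title": "World"})
storage.disable(relational)  # allowed: couch still holds every node
```

Because each engine is backfilled when it is enabled, any single remaining engine holds the full set of nodes, which is what makes disabling the others safe in this sketch.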

A storage engine would need to support certain operations:
1. Create, update, and delete nodes.
2. Load nodes by a unique identifier.
3. Querying. We would want to give storage back-ends free rein here so they can implement innovative methods. We might standardize some querying interfaces (to use the OO terminology) that engines can implement. This would allow modules to find "an engine implementing basic full-text searching" or "an engine implementing similarity-based sort." For the relational engine, the query interface would combine the Views query builder (or the DB-TNG query builder) with CCK's current "publishing" of fields to the query builder.
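
The first two operations in the list above could be captured in a small contract like the following. This is only a sketch under assumptions: the class and method names (StorageEngine, save, delete, load) are invented for illustration, create and update are folded into a single save, and querying is left out.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical minimum contract for a field storage engine."""

    @abstractmethod
    def save(self, node_id, fields):
        """Create the node, or update it if node_id already exists."""

    @abstractmethod
    def delete(self, node_id):
        """Remove the node if present."""

    @abstractmethod
    def load(self, node_id):
        """Return the node's fields, or None if the ID is unknown."""


class DictEngine(StorageEngine):
    """Toy engine backed by a dict, standing in for a real back-end."""

    def __init__(self):
        self._nodes = {}

    def save(self, node_id, fields):
        self._nodes[node_id] = dict(fields)

    def delete(self, node_id):
        self._nodes.pop(node_id, None)

    def load(self, node_id):
        return self._nodes.get(node_id)


engine = DictEngine()
engine.save(42, {"title": "Draft"})
engine.save(42, {"title": "Final"})   # same ID: update in place
loaded = engine.load(42)
engine.delete(42)
```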

Three non-relational storage engines I have in mind are CouchDB, LDAP, and Solr. All provide powerful replication and high-performance querying with unique capabilities over relational databases. A relational shard-based engine would also be interesting and easy to implement. The guts of search in Drupal would become a field storage engine that implements full-text querying, so we could ditch all of the specialized "indexing" and "search" APIs and hooks.

My goal for the sprint would be to implement the relational engine (including the querying interface), and a simple non-relational one, like CouchDB (possibly without the querying engine).

Querying

chx asked me for additional clarification on querying interfaces. Specifically, he expressed concern about having consistent ways to use the various field storage back-ends. So, I'll explain how querying interfaces would work.

Field storage engines would provide querying through classes with interfaces. Two interfaces come immediately to mind:

Full-text querying
Filtering/ordering

Every class implementing a query interface would allow you to build a query (of some sort), run it, and get back an iterator to walk through the results, providing maximum flexibility and support for lazy loading of results and fields. As PHP allows in its OO model, engines would implement any combination of non-conflicting interfaces on their query-building class(es).

A full-text interface would allow you to specify, at a minimum, a set of keywords. Running the query on the engine would return an iterator that walks through the results from the full-text search.

A filtering/ordering interface would allow you to specify criteria for inclusion in your results (as in SQL's WHERE), and, for the items included in your results, an ordering (as in SQL's ORDER BY). As with all query interfaces, you would get an iterator back to walk through your results.

It's possible that an engine like Solr might implement both of these interfaces, allowing you to query by a combination of full-text terms and explicit, field-based criteria.
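
To make the two interfaces concrete, here is a rough sketch of one query class that implements both, again in Python rather than PHP. The names (FullTextQuery, FilterOrderQuery, ToyQuery, keywords, where, order_by, run) are assumptions made up for this example, and the matching logic is a naive substring check, not how Solr or any real engine works.

```python
from abc import ABC, abstractmethod


class FullTextQuery(ABC):
    @abstractmethod
    def keywords(self, *words):
        """Require every word to appear in the node."""


class FilterOrderQuery(ABC):
    @abstractmethod
    def where(self, field, value):
        """Keep only nodes whose field equals value."""

    @abstractmethod
    def order_by(self, field):
        """Sort results by field."""


class ToyQuery(FullTextQuery, FilterOrderQuery):
    """Hypothetical query class implementing both interfaces over a dict of nodes."""

    def __init__(self, nodes):
        self._nodes = nodes
        self._words = []
        self._filters = []
        self._order = None

    def keywords(self, *words):
        self._words.extend(w.lower() for w in words)
        return self

    def where(self, field, value):
        self._filters.append((field, value))
        return self

    def order_by(self, field):
        self._order = field
        return self

    def run(self):
        """Return an iterator over (node_id, fields) pairs."""
        def matches(fields):
            text = " ".join(str(v) for v in fields.values()).lower()
            return (all(w in text for w in self._words)
                    and all(fields.get(f) == v for f, v in self._filters))

        hits = ((nid, f) for nid, f in self._nodes.items() if matches(f))
        if self._order is not None:
            # Ordering needs the full result set, so laziness ends here.
            # Assumes every hit has the ordering field.
            hits = iter(sorted(hits, key=lambda item: item[1][self._order]))
        return hits


nodes = {
    1: {"title": "CouchDB notes", "type": "page"},
    2: {"title": "Solr and CouchDB", "type": "story"},
    3: {"title": "Apache Solr", "type": "story"},
}
results = (ToyQuery(nodes)
           .keywords("solr")
           .where("type", "story")
           .order_by("title")
           .run())
ids = [nid for nid, _ in results]
```

Without an ordering, run() hands back a generator, so results are only computed as the caller walks through them; that is the lazy-loading property the iterator approach is meant to allow.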

A module like Views would require having an engine that supports the filtering/ordering interface. A module like search would require an engine that supports the full-text querying interface.
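
One way a module might express such a requirement is to ask for the first enabled engine that implements a given interface. The sketch below is illustrative only: the marker classes and require_engine are hypothetical names, and the engines are empty stand-ins.

```python
class FullTextQuerying:
    """Marker for engines that can run keyword searches (hypothetical)."""


class FilterOrderQuerying:
    """Marker for engines that can filter and order results (hypothetical)."""


class RelationalEngine(FilterOrderQuerying):
    pass


class SolrEngine(FullTextQuerying, FilterOrderQuerying):
    pass


def require_engine(engines, interface):
    """Return the first enabled engine that implements interface, or fail loudly."""
    for engine in engines:
        if isinstance(engine, interface):
            return engine
    raise LookupError(f"no enabled engine implements {interface.__name__}")


enabled = [RelationalEngine(), SolrEngine()]
views_engine = require_engine(enabled, FilterOrderQuerying)   # relational comes first
search_engine = require_engine(enabled, FullTextQuerying)     # only Solr qualifies
```

Failing with an error when no engine qualifies would let a module refuse to enable itself instead of breaking at query time.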

My main motivation for suggesting CouchDB as a proof-of-concept engine was simply to have a second, non-relational backend built while we're solidifying the storage API. If we don't co-develop a non-relational engine, we're at high risk of unintentionally relying on relational functionality. We could write an "engine" that stores nodes as XML files on disk, and that would be sufficiently non-relational to serve the purpose of forcing API abstraction.
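
As a rough illustration of how little such an XML-on-disk "engine" needs, here is a sketch using only Python's standard library. The class name, file layout (one <node_id>.xml file per node), and element names are all assumptions for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class XmlFileEngine:
    """Hypothetical engine that stores each node as <directory>/<node_id>.xml."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, node_id):
        return os.path.join(self.directory, f"{node_id}.xml")

    def save(self, node_id, fields):
        root = ET.Element("node", id=str(node_id))
        for name, value in fields.items():
            ET.SubElement(root, "field", name=name).text = str(value)
        ET.ElementTree(root).write(self._path(node_id), encoding="utf-8")

    def load(self, node_id):
        if not os.path.exists(self._path(node_id)):
            return None
        root = ET.parse(self._path(node_id)).getroot()
        # Values come back as strings; a real engine would track field types.
        return {f.get("name"): f.text for f in root.findall("field")}

    def delete(self, node_id):
        if os.path.exists(self._path(node_id)):
            os.remove(self._path(node_id))


with tempfile.TemporaryDirectory() as tmp:
    engine = XmlFileEngine(tmp)
    engine.save(7, {"title": "Hello", "body": "Stored as XML"})
    loaded = engine.load(7)
    engine.delete(7)
    after_delete = engine.load(7)
```

Nothing here involves tables, JOINs, or SQL, so any API assumption that only a relational back-end can satisfy would show up as soon as this engine is plugged in.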

My other motivation is the ability for CouchDB to counter some of our largest relational weaknesses through CouchDB views. These CouchDB views can allow us to encode our slowest operations: Views (from the module), aggregated information like tag clouds, and anything else that map/reduce approaches can support through distributed computing.
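
The tag cloud case can be sketched as a map step and a reduce step. CouchDB views are normally written as JavaScript map and reduce functions and are maintained incrementally by the database; the plain-Python version below only shows the shape of the computation in memory, with hypothetical function names and made-up sample nodes.

```python
from collections import defaultdict


def map_tags(node):
    """Map step: emit one (tag, 1) pair per tag on the node."""
    for tag in node.get("tags", []):
        yield tag, 1


def reduce_counts(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def build_view(nodes, map_fn, reduce_fn):
    """Group mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for node in nodes:
        for key, value in map_fn(node):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


nodes = [
    {"title": "A", "tags": ["drupal", "couchdb"]},
    {"title": "B", "tags": ["drupal"]},
    {"title": "C", "tags": []},
]
tag_cloud = build_view(nodes, map_tags, reduce_counts)
```

Since the map step looks at one node at a time and the reduce step only combines values per key, the work can be split across machines, which is where the distributed-computing benefit would come from.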

That's enough on CouchDB for the sake of the fields API. I'm sure I'll talk plenty next week about the necessity of asynchronous convergence on consistency and such.

Comments

Remember: CODE SPRINT

Posted by bjaspan on December 10, 2008 at 2:18pm

This sounds like a potentially good idea, but it also sounds to me like it is still in the early conceptual stages. Remember that we are having a code sprint, not a design sprint. If there is not a viable, core-worthy design in place, we are not going to do it at the sprint, and we are certainly not going to use up our limited sprint days theorizing on a design that might not be implemented or implementable next week.

Remember: If we do not end the week with working, useful code ready to commit, we suck.

Who? Too soon.

Posted by Chris Johnson on January 9, 2009 at 5:04am

The posting is written in the first person, but I have no idea who wrote it.

Relational databases might seem all old-hat to object-oriented thinking, but the hard fact is that relational DBs absolutely spank every object store out there when it comes to performance. (One reason might be that most object stores are relational under the covers, just as our CPUs actually use "gotos," not our high-level elegant flow-control statements and instantiations.)

Making fields data-store technology neutral is a noble goal, but is probably premature in 2009 and 2010.

Edit: now that I learn the post was written by David Strauss, I begin to worry I'm all wet. If anyone knows RDBMSs, it'd be him! :-)