Drupal Data API Design Sprint, Day 1


Posted by nedjo on February 5, 2008 at 5:49am
Last updated by bjaspan on Sun, 2008-02-10 03:32

Day 1 focused on scoping the challenge of introducing an abstracted set of data APIs into Drupal core.

Participants:

Yves Chedemois
Larry Garfield
Barry Jaspan
Karoly Negyesi
Nedjo Rogers
Karen Stevenson

Regrets:

Moshe Weitzman (ill)

The overall challenge was defined as: to fully convert Drupal into a web application development platform in which a powerful CMS is one built-in implementation. CCK fields provides a very promising set of solutions on which to base our work, yet also presents several important challenges. In our discussion, we:

Reviewed the current state and key features of CCK fields (see attached PDF slides - Yves)
Identified key features of the new data API
Reviewed a sample OOP implementation, drawing implementation details and implications
Generated a provisional set of terminology definitions
Generated a list of some of the thornier questions we’ll need to tackle to implement CCK fields in core as a first step toward meeting the key identified features, and reviewed several items on this list.

1. Introductions

We began with a round of introductions, giving some background on how each of us started using Drupal, our experience and current involvement.

2. Initial scoping: What are we trying to accomplish?

To get going, we had a broad discussion of challenges and aims. This discussion sketched out some ideas and problems for later analysis and refinement. Here are some of the observations that individual participants made:

Many past improvements evolved as separate modules that were simplified and then bolted onto core, without going back and changing the way the rest of core works. (For example, the actions module adds lots of new functionality that could maybe replace hooks.)
We can't add this as yet another way to do things; we should get rid of the other ways of adding data to nodes and have just one way.
We need to separate the data access system from other parts of the system. Currently, for example, forms and nodes are far too tightly bound.
The community would love us if we delivered fields in core. But it’s clear we don’t know what that means. We will have to answer a lot of other questions first. In implementing CCK fields in core, we’re going to hit a lot of other things in the context (data storage, etc.).
In looking at CCK, a lot of what’s there exemplifies the broader data API aims we’re talking about. E.g. separation of widgets and fields. We have separation of storage and input handling. The content module really is a layer over the whole thing. CCK is a flawed model of what we’re talking about.
Our task is finishing the transition of Drupal from a CMS to an application framework or development platform.
Core is very uneven, and as we proceed it may become even more uneven. There is a CMS layer. If we want to deliver this, certain things should be possible from a user interface. A lot of things don’t fit with this. For example, a recent problem is that the menu system can handle any size of a menu, but we can’t feed this into a drag-and-drop interface. What we’re doing now is going to drive this even further. Certain things are not going to be possible without disabling certain parts of core. If you want a very big menu currently, you need to disable the menu module. So in the past we have suggested to Dries to remove this or that from core but he is reluctant to do that. We need to design in a way that Drupal remains a CMS, but if certain things are impossible we need to remove them.
The direction Drupal should go in is being a web application development platform. Without that, Drupal has no future. If what you want is a blog, you’d be an idiot to use Drupal. In 30 seconds you could use WordPress. The CMS should be the example implementation of the framework.
Node API and Forms API are so intertwined that many things are difficult or impossible.
CCK originally was developed to answer a limited need--fields defined through an admin UI. However, this aim has imposed on CCK a set of constraints and design requirements that require a higher level of abstraction than the standard nodeapi approach, in which nodes can be extended through arbitrary SQL. The result is an ironically unbalanced situation, in which admin users have this rich set of tools (CCK fields, views, etc.), that are not available to developers. With CCK, users have better tools than developers. They can easily add and alter their database, plug in form elements, and format the results. Developers still have to build most of that by hand. For Drupal to survive and thrive, developers need better tools.
Much Drupal development has been done through a "brilliant individual" model. In this model, a clever person gets an idea, often develops it outside of core, maybe with two or three others collaborating, and then tries to strip it down to bolt it onto the existing core. At its best, this approach produces a set of solutions superior to what existed. But often it fails to basically rewrite what already exists. A current example might be Actions. Brought into core, actions provides a new and arguably superior way of implementing much of what previously was done through hook implementations. Yet all those hooks still exist. They weren't replaced, they were added to.
Drupal is at a point in its development at which we need to find group- or team-led models that enable more fundamental changes.
In bringing CCK fields into core, we cannot afford merely to bolt an additional option onto what we already have. Rather, we need to provide a new, higher solution that becomes the basis for, eventually, totally replacing what we already have.

3. Presentation: Current state of CCK field development (Yves and Karen)

[See attached slides below]
CCK is a huge amount of code. What are the key parts? If we move some parts of this into core, what do the users expect? What features are critical?

Currently in CCK

Field definition, update, deletion, plus db storage.
Fields can be stored in separate per field tables.
There is code for db normalization. Issues in this code include race conditions and schemas that have changed.
Widget construct: Major reworking from D5 to D6. There were pre-FAPI remnants.
CCK widgets need to be aware of data structure. e.g., a text field needs a value and a format.
Previously field modules needed to handle their own multiple values. Now content module handles this, which has enabled e.g. an improved approach to ‘add more’.
Admin UI to reflect the field crud operations plus some display settings depending on the rendering context.
Allow field modules to opt in or opt out of default behaviours.
Fieldgroup module, originally outside CCK core. Some of the code here is aging.
Views, some early Panels 2 integration.
An initial set of unit tests.
Widget code is now reusable. E.g., node reference widgets draw on option widgets.
Various contrib field types. They are of various qualities. Some of these are really textfields with special validation rules, and at present there is not an ability to add just the validation rules to existing field types.

CCK in core, what could it mean?

Some potential conflicts between the UI and e.g. hook access. What happens if a module defines a field instance? It may be that the field itself is fixed, but the instance data can be tweaked.
The CCK field definition UI has been seen in the past as a barrier to bringing CCK into core. If this is left in contrib, that problem disappears.

CCK database storage

A thornier issue is the code needed to handle the migration between different storage models. This code is messy and is also seen as problematic because it introduces uncertainty as to which data storage model will be in place at a given time. The fact that we have a dynamic schema sets CCK apart from core and the rest of contrib.
We may need to choose one model, and store every field in its own table. Each field therefore would have its own table with an object id, version id, and delta (even though the delta is not effectively used in non-multiple fields). Later we might find an improved db optimization approach.
A problem with a separate table for each field is the increase of joins or separate queries. Lazy load may reduce the need for these queries.
We don't know for sure that it is less efficient to store each field in its own table, and so far CCK has not gotten any efficiency out of it anyway because every field was loaded separately (that changes in the D6 version).
Simplifying this by always creating a table for each field would make it easier to get into core.
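
The "one table per field" model above can be sketched as follows. This is only an illustration under stated assumptions, not CCK's actual schema: the table naming (`field_<name>`), the single `value` column, and the helper names are all hypothetical. It uses Python's sqlite3 so that it runs standalone. Each field table carries the same key columns (object id, version id, delta), and a non-multiple field simply always uses delta 0.

```python
import sqlite3


def field_table_name(field_name: str) -> str:
    """Hypothetical naming convention: one table per field."""
    if not field_name.isidentifier():
        raise ValueError(f"unsafe field name: {field_name!r}")
    return f"field_{field_name}"


def create_field_table(db: sqlite3.Connection, field_name: str) -> None:
    # Every field table has the same key: object id, version id, delta.
    table = field_table_name(field_name)
    db.execute(
        f"CREATE TABLE {table} ("
        " object_id INTEGER NOT NULL,"
        " version_id INTEGER NOT NULL,"
        " delta INTEGER NOT NULL DEFAULT 0,"
        " value TEXT,"
        " PRIMARY KEY (object_id, version_id, delta))"
    )


def save_field(db, field_name, object_id, version_id, values):
    # Replace all stored values of this field for one object version.
    table = field_table_name(field_name)
    db.execute(
        f"DELETE FROM {table} WHERE object_id = ? AND version_id = ?",
        (object_id, version_id),
    )
    db.executemany(
        f"INSERT INTO {table} (object_id, version_id, delta, value)"
        " VALUES (?, ?, ?, ?)",
        [(object_id, version_id, delta, v) for delta, v in enumerate(values)],
    )


def load_field(db, field_name, object_id, version_id):
    # One query per field; values come back in delta order.
    table = field_table_name(field_name)
    rows = db.execute(
        f"SELECT value FROM {table}"
        " WHERE object_id = ? AND version_id = ? ORDER BY delta",
        (object_id, version_id),
    ).fetchall()
    return [row[0] for row in rows]


db = sqlite3.connect(":memory:")
create_field_table(db, "tags")      # multiple-value field
create_field_table(db, "subtitle")  # single-value field, delta is always 0
save_field(db, "tags", 1, 1, ["drupal", "cck"])
save_field(db, "subtitle", 1, 1, ["Day 1 notes"])
print(load_field(db, "tags", 1, 1))
print(load_field(db, "subtitle", 1, 1))
```

The sketch also makes the cost visible: loading an object with N fields takes N queries (or N joins), which is the concern raised above and the reason lazy loading was mentioned.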

4. Presentation: Sample implementations (Larry)

Larry reviewed sample applications, including an OOP implementation of active records without code generation.

If we’re going to be using lazy loading, the magic get methods are useful.
Set operations may be a useful way of approaching things when we need a set of nodes.
-- We may be able to write more efficient code if we allow for the possibility of loading more than one node at a time instead of always loading a single node.
-- How many nodes will we need at once? It can't be too many if we use 'WHERE IN ()' as our SQL. The logical places to do this would be on lists and in views, and there we would probably be using paging, so we would not be loading more than 20-50 at most.
Search is a separate operation that returns IDs. Then they are loaded by ID.
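
A rough sketch of the three ideas in this list (lazy loading through a magic getter, loading a set of nodes with one `WHERE ... IN ()` query, and search returning only IDs) might look like this. It is not Larry's sample code: the table layout, the `Node` class, and the function names are assumptions made for illustration, and Python's `__getattr__` stands in for PHP's magic `__get`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE node_body (nid INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO node VALUES (?, ?)",
               [(1, "One"), (2, "Two"), (3, "Three")])
db.executemany("INSERT INTO node_body VALUES (?, ?)",
               [(1, "Body one"), (2, "Body two"), (3, "Body three")])

body_query_log = []  # records each lazy fetch so the behaviour is observable


class Node:
    def __init__(self, nid, title):
        self.nid = nid
        self.title = title

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. the field is not loaded.
        if name != "body":
            raise AttributeError(name)
        body_query_log.append(self.nid)
        row = db.execute("SELECT body FROM node_body WHERE nid = ?",
                         (self.nid,)).fetchone()
        value = row[0] if row else None
        self.body = value  # cache, so later reads never reach __getattr__
        return value


def search(keyword):
    # Search is a separate operation that returns IDs only.
    rows = db.execute("SELECT nid FROM node WHERE title LIKE ? ORDER BY nid",
                      (f"%{keyword}%",))
    return [row[0] for row in rows]


def load_nodes(nids):
    # One query for the whole set instead of one query per node.
    if not nids:
        return []
    placeholders = ", ".join("?" for _ in nids)
    rows = db.execute(
        f"SELECT nid, title FROM node WHERE nid IN ({placeholders})"
        " ORDER BY nid",
        list(nids),
    )
    return [Node(nid, title) for nid, title in rows]


ids = search("T")          # IDs of nodes whose title contains "t"
nodes = load_nodes(ids)    # a single IN () query
print([n.title for n in nodes], body_query_log)
print(nodes[0].body, body_query_log)
```

With paging keeping the set at 20-50 nodes, the placeholder list stays short, which is the constraint noted above.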

5. Key features of a reworked Data API

Functionality is API driven.

Data access layer

storage agnostic
bidirectional (CRUD)
single set of methods for all entity types
is not tied to any other subsystem (form, node, user)
default implementation is SQL

Rendering layer can render any data in any way (HTML, JSON, XML-RPC etc)

Input layer can be inferred from knowledge of data structure.
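
To make these features concrete, here is a minimal sketch of a data access layer with a single set of CRUD methods for all entity types, two interchangeable storage backends, and two renderers. Everything in it is an assumption for illustration: the class and function names, the JSON-blob SQL table, and the method signatures are not part of any agreed design.

```python
import html
import json
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class Storage(ABC):
    """One set of methods for every entity type; no form/node/user coupling."""

    @abstractmethod
    def save(self, entity_type: str, entity_id: int, data: dict) -> None: ...

    @abstractmethod
    def load(self, entity_type: str, entity_id: int) -> Optional[dict]: ...

    @abstractmethod
    def delete(self, entity_type: str, entity_id: int) -> None: ...


class MemoryStorage(Storage):
    def __init__(self):
        self._rows = {}

    def save(self, entity_type, entity_id, data):
        self._rows[(entity_type, entity_id)] = dict(data)

    def load(self, entity_type, entity_id):
        row = self._rows.get((entity_type, entity_id))
        return dict(row) if row is not None else None

    def delete(self, entity_type, entity_id):
        self._rows.pop((entity_type, entity_id), None)


class SqlStorage(Storage):
    """SQL as the default backend (sqlite3 here, with a JSON blob per entity)."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE entity (type TEXT, id INTEGER,"
                         " data TEXT, PRIMARY KEY (type, id))")

    def save(self, entity_type, entity_id, data):
        self._db.execute("INSERT OR REPLACE INTO entity (type, id, data)"
                         " VALUES (?, ?, ?)",
                         (entity_type, entity_id, json.dumps(data)))

    def load(self, entity_type, entity_id):
        row = self._db.execute("SELECT data FROM entity"
                               " WHERE type = ? AND id = ?",
                               (entity_type, entity_id)).fetchone()
        return json.loads(row[0]) if row else None

    def delete(self, entity_type, entity_id):
        self._db.execute("DELETE FROM entity WHERE type = ? AND id = ?",
                         (entity_type, entity_id))


def render_json(data: dict) -> str:
    return json.dumps(data, sort_keys=True)


def render_html(data: dict) -> str:
    items = "".join(f"<dt>{html.escape(k)}</dt><dd>{html.escape(str(v))}</dd>"
                    for k, v in sorted(data.items()))
    return f"<dl>{items}</dl>"


def roundtrip(storage: Storage):
    # The calling code is identical for every backend and entity type.
    storage.save("node", 1, {"title": "Hello"})
    storage.save("user", 1, {"name": "admin"})
    loaded = (storage.load("node", 1), storage.load("user", 1))
    storage.delete("user", 1)
    return loaded, storage.load("user", 1)


print(roundtrip(MemoryStorage()) == roundtrip(SqlStorage()))
print(render_json({"title": "Hello"}))
print(render_html({"title": "Hello"}))
```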

6. Working definitions

Field/Datum: An individual piece of variable information.

A widget to edit
Validation function
Storage function (not yet discussed)
Load function (not yet discussed)
Formatter

Property: An individual piece of metadata about an Entity. It is the same across multiple variants, whereas a field varies between subtypes.
Entity: A collection of multiple Data and related Properties that can exist in multiple variants.
User: An Entity representing a user with access to the system.
Node: An Entity representing an abstract piece of content.
File: An Entity representing a binary object.
Data API: A unified storage and retrieval interface for Entities.
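
As a way of reading these definitions, the sketch below models them as plain data structures. It is an interpretation, not an agreed design: the names `FieldType`, `Entity`, and `SUBTYPE_FIELDS` are hypothetical, and the storage and load functions are left out because they were not yet discussed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FieldType:
    name: str
    widget: str                       # which form element edits the value
    validate: Callable[[Any], bool]   # validation function
    format: Callable[[Any], str]      # formatter used for display


@dataclass
class Entity:
    entity_type: str                  # "node", "user", "file", ...
    subtype: str                      # the variant, e.g. a node type
    properties: Dict[str, Any]        # same for every variant of the type
    data: Dict[str, List[Any]] = field(default_factory=dict)  # multi-valued


text = FieldType("text", "textfield",
                 lambda v: isinstance(v, str), lambda v: v.strip())
integer = FieldType("integer", "number",
                    lambda v: isinstance(v, int), lambda v: f"{v:,}")

# Fields vary by subtype; properties do not.
SUBTYPE_FIELDS = {
    ("node", "story"): {"subtitle": text, "rating": integer},
    ("node", "page"): {"subtitle": text},
}


def render(entity: Entity) -> List[str]:
    definitions = SUBTYPE_FIELDS[(entity.entity_type, entity.subtype)]
    lines = [f"{key}: {value}" for key, value in entity.properties.items()]
    for name, values in entity.data.items():
        formatter = definitions[name].format
        lines.append(f"{name}: " + ", ".join(formatter(v) for v in values))
    return lines


story = Entity("node", "story",
               properties={"id": 7, "created": "2008-02-05"},
               data={"subtitle": [" Day 1 "], "rating": [1200]})
print(render(story))
```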

7. Key problems in planning a CCK-field-based data API

what 'entity' types should e.g. receive fields?
sub-types vs. not: need to attach to entity types
global IDs?
field vs. properties
field vs. nodeapi(‘load’)
multi-source data
what’s in node module?
characterize node load/save behaviours

What object types should e.g. receive fields?

Problem statement: We

Nodes and some other entities (users? files? comments?) should be top-level entities with a consistent way to manipulate them.

Sub-types vs. not: need to attach to entity types

Problem statement: Currently, CCK fields are attached to entity subtypes (node types) rather than directly to entity types (nodes, users). If fields can e.g. replace user profiles, we need to decide: can fields attach directly to an entity type? Can non-node entity types have subtypes (like node types)?

Can all top-level entities have subtypes?
-- Nodes can, we already have that
-- Comments can; for example, YouTube now has both text and video comment types.
-- Users may not have multiple types

In our current model, subtypes (a) are the thing modules use to set variables that decide whether to operate on a node, e.g., fivestar applies to stories but not to pages; (b) define the set of fields that e.g. CCK is going to apply; (c) define access options.

Are content types still needed? Are they a legacy carryover since they used to tell you what module created them? Are they really just different templates of the same type of entity? They are now a shorthand for the collection of fields that makes up the subtype. Presenting the entity either for edit or for view can be themed per subtype. While it would not be impossible to store the structure per entity and then load/validate/save etc. based on that, it seems like unnecessary complexity. [chx]

Global IDs?

Problem statement: given uniform methods, e.g., entity_load(), and the ability to attach fields to different entity types, for storage methods we need some way to globally identify an entity instance. How?

Four options:

Global object id: a single table that generates keys for all objects.
- presents severe legacy data issues, as all existing nodes, users, etc. would need to get new IDs
Table naming convention that identifies the entity type. Tables are type_field_foo
Two-field key: type, id
Three-field key: type, id, vid (assuming everything is potentially versioned)

Of these, the first seems the least desirable. The second might be best, while the third and fourth (really versions of the same idea) are also options.
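
The non-global options can be compared in a few lines. This is a sketch only: `EntityKey`, `field_table`, and `where_clause` are hypothetical names, and the column names `type`, `id`, and `vid` are taken from the option list above rather than from any existing schema. The point is that existing per-type IDs stay untouched, because the type is carried either in the table name or in the key.

```python
from typing import NamedTuple, Optional, Tuple


class EntityKey(NamedTuple):
    type: str                   # "node", "user", ...
    id: int                     # the existing per-type ID, unchanged
    vid: Optional[int] = None   # only used by the three-field variant


def field_table(key: EntityKey, field_name: str) -> str:
    # Option 2: the entity type is encoded in the table name.
    return f"{key.type}_field_{field_name}"


def where_clause(key: EntityKey) -> Tuple[str, tuple]:
    # Options 3 and 4: the type (and optionally vid) are key columns.
    if key.vid is None:
        return "type = ? AND id = ?", (key.type, key.id)
    return "type = ? AND id = ? AND vid = ?", (key.type, key.id, key.vid)


node_1 = EntityKey("node", 1)
user_1 = EntityKey("user", 1)

# Node 1 and user 1 share a numeric ID but never collide.
print(node_1 != user_1)
print(field_table(user_1, "foo"))
print(where_clause(EntityKey("node", 1, vid=3)))
```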

Fields vs. properties

Problem statement: CCK fields have the potential to replace much of what currently is handled as hard-coded form elements and SQL statements. Are there however some properties of entities that are properties rather than properly fields? If so, what is the distinction between properties and fields?

What about comments -- could there be multiple comment 'fields' on a node (which would be true if they're fields)?
Should fields go only on content? Users aren't really 'content', although they may have content associated with them. Users are an access control mechanism.
All fields must potentially be multiple, CCK got that right.
Body field vs CCK textarea, some fields have special meaning -- other modules may do various things with the body field.
What is a property vs what is a field?
-- Fields can have revisions
-- Fields can vary from one content type to another, properties would always be the same.

Fields vs nodeapi($op = 'load')

Problem statement: Data objects currently can be arbitrarily extended through hook implementations. CCK fields introduces a more structured approach, in which e.g. fields must be introduced into a registry. Should the two coexist? Or should we move to a single system?

General feeling is, eventually, we should focus on only one method of adding data to a node, and it should be the new fields method.
Need to address: What would be implications of getting rid of nodeapi 'load'? What else is it used for?
Validation and error checking should happen in field not form because this must work the same for module-based transactions as form-based transactions. Core is not doing this consistently now -- the check for required fields happens as a form element validation step.
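
The last point can be illustrated with a small sketch in which the required-field check lives in the data layer, so a form submission and a module-based (programmatic) save hit exactly the same validation. The function names and the `REQUIRED_FIELDS` registry are assumptions for illustration and do not correspond to existing Drupal functions.

```python
class ValidationError(Exception):
    """Raised by the data layer, regardless of who called it."""


REQUIRED_FIELDS = {"title"}   # hypothetical field registry
SAVED = {}                    # stands in for real storage


def validate_fields(values: dict) -> list:
    # Field-level validation: knows nothing about forms.
    return [f"{name} is required"
            for name in sorted(REQUIRED_FIELDS) if not values.get(name)]


def entity_save(entity_id: int, values: dict) -> None:
    errors = validate_fields(values)
    if errors:
        raise ValidationError("; ".join(errors))
    SAVED[entity_id] = dict(values)


def form_submit(entity_id: int, posted: dict) -> str:
    # The form layer only passes input along and reports errors.
    try:
        entity_save(entity_id, posted)
    except ValidationError as error:
        return f"form error: {error}"
    return "saved"


def import_from_feed(entity_id: int, record: dict) -> None:
    # A module-based transaction: no form involved, same validation.
    entity_save(entity_id, record)


print(form_submit(1, {"title": ""}))
print(form_submit(2, {"title": "Hello"}))
```

If the required check lived only in form element validation, `import_from_feed` would be able to store a record without a title.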

Attachment	Size
CCK overview.pdf	39.93 KB

Comments

nice writeup

Posted by moshe weitzman on February 5, 2008 at 3:41pm

Thanks so much for writing up all these thoughts.

i have a problem with 2. What are we trying to accomplish: defining the problems. In summary, it lacks a can-do attitude. It defines the problems as so deep and so broad that only a fool would choose to undertake this effort. Fields in core does not depend on convincing the world that Drupal is an application development framework instead of a CMS. Further, it seems to say that fields in core is such a massive task that it requires every other subsystem to be rewritten. I really don't like that approach. More specifically, I think it is not pragmatic. You can't just submit a patch which changes all of drupal all at once, even if it is a good patch. Our minds can't grasp the implications and folks will be uncomfortable and worried and that's enough to stall the patch. We've all seen this many times. Actions module rightfully didn't change the nodeapi/user/comment/term hooks. Those hooks are good. They let us experiment with alternate models of reacting to events. There is no reason why a nodeapi_load op can't coexist with fields, especially for a release or two. We'll need to refine fields once they get. Further, "fields in core" has nothing to do with Drupal switching to a team based development model (whatever that means). I have no idea what the point of that bullet is. When you introduce dependencies like this you reduce the fun factor. Who wants to join an effort that is fraught with non-technical, poorly specified, unsolvable problems? This paragraph reads like a manifesto.

I suspect that the attitude of What are we trying to accomplish: defining the problems reflects the author's mood at the time of writing, and not really the attitude of the team (or even the author). Let's see if we can focus on the positive during the rest of the sprint. We can do this, as long as this is narrowly defined.

So far, I have just contributed these couple of paragraphs of criticism to the effort. Not so helpful. I'll contribute something more positive today.

And again, I congratulate the team on all you are doing, and wish I was there. I have confidence in you.

I added some clarification

Posted by nedjo on February 5, 2008 at 4:29pm

Thanks for the note Moshe. This whole post is more minutes or notes than a digested public presentation. It's "what was said" rather than "here is the plan we've agreed on and want to present". The 2nd section is notes taken directly from discussion. I think you're right, read as considered conclusions it didn't work. I've changed that section's title and given it a brief introduction, pls see if that helps. We're not sure yet, I think, if this is a useful public post.

thanks nedjo

Posted by moshe weitzman on February 5, 2008 at 6:55pm

the intro helps. i think this is fine as a public post as long as the following days have more positive reports.

lots of little patches

Posted by catch on February 5, 2008 at 7:25pm

Although the main bulk of the API would have to be one big patch, I don't see why converting much of core to it couldn't be done in a lot of follow-up patches. Table drag and drop got to nearly everywhere relevant in core in the same way. Same with quite a few schema API patches.

Separate validation

Posted by rolodmonkey on June 5, 2008 at 5:08pm

Some of these are really textfields with special validation rules, and at present there is not an ability to add just the validation rules to existing field types.

This is a really important point. Modules like CCK and Webform often have plug-ins that duplicate a lot of work. Validation really should be a separate module with plugins for the different types of validation you might want to do on a field. That way, I could choose my widget: text, select, whatever and then add one or more validation rules to it, for instance: required and integer and range. Most validation plug-ins wouldn't be more than a single regular expression.
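
A minimal sketch of that idea, with validation rules as small plug-ins that can be stacked on any widget, could look like the following. The rule names and the convention of returning an error message or `None` are assumptions for illustration; nothing here is an existing CCK or Webform API.

```python
import re
from typing import Callable, List, Optional

# A validator takes the raw input and returns an error message or None.
Validator = Callable[[str], Optional[str]]


def required(value: str) -> Optional[str]:
    return None if value.strip() else "value is required"


def matches(pattern: str, message: str) -> Validator:
    # Most rules are just one regular expression.
    compiled = re.compile(pattern)

    def check(value: str) -> Optional[str]:
        return None if compiled.fullmatch(value) else message

    return check


integer = matches(r"-?\d+", "must be an integer")


def in_range(low: int, high: int) -> Validator:
    def check(value: str) -> Optional[str]:
        try:
            number = int(value)
        except ValueError:
            return None  # non-numeric input is reported by the integer rule
        if low <= number <= high:
            return None
        return f"must be between {low} and {high}"

    return check


def run_validators(value: str, validators: List[Validator]) -> List[str]:
    messages = (validator(value) for validator in validators)
    return [message for message in messages if message is not None]


# Rules are chosen independently of the widget (text, select, ...).
age_rules = [required, integer, in_range(0, 150)]

print(run_validators("42", age_rules))
print(run_validators("abc", age_rules))
print(run_validators("200", age_rules))
```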

Learn more at iRolo.net.

excellent starting point

Posted by liquidcms on June 13, 2008 at 7:02pm

Great writeup. A couple points from some of my past posts that are somewhat related:

more powerful widgets:
When i first started with Drupal a couple years ago i was surprised to see the lack of validation/manipulation widgets for CCK that i had become accustomed to when working with some other form-building PHP frameworks. I tried to explain this on the forums but i don't think i was able to get my point across very well.

Basically an admin defining a field should be able to add from a set of validation rules and manipulation actions to that field. And module designers should be able to extend this widget set (there should also be a php code box to do your own).

Example: "upper case first letters" - you should be able to assign this as a manipulation to a field - so that even if the user enters "peter", "Peter" gets placed in the db. The framework i had used in the past had numerous rules/actions available to assign to fields when building forms.
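
A small sketch of such "manipulation actions", applied in order before a value is stored, is shown below. The function names are made up for illustration and do not refer to any existing CCK feature.

```python
from typing import Callable, List

Manipulator = Callable[[str], str]


def trim(value: str) -> str:
    return value.strip()


def capitalize_words(value: str) -> str:
    # "upper case first letters": only the first letter of each word changes.
    return " ".join(word[:1].upper() + word[1:] for word in value.split(" "))


def apply_manipulators(value: str, manipulators: List[Manipulator]) -> str:
    # Each action receives the output of the previous one.
    for manipulate in manipulators:
        value = manipulate(value)
    return value


name_actions = [trim, capitalize_words]
print(apply_manipulators("peter", name_actions))
print(apply_manipulators("  peter lindstrom ", name_actions))
```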

As powerful a tool as CCK is, it is really only capable of creating fairly simple forms - unless the user codes their own. The answer i got a couple yrs ago was: "of course you can do validation on a field - you just need to code a new field with validation code".. hmmm.. nope.. that would be wrong.

everything's a node:
pretty sure i read that somewhere when i started working with Drupal.. incredibly powerful concept that i am continually explaining to clients how important it is to their site - BUT.. then.. users aren't nodes, comments aren't nodes, taxonomy isn't nodes, etc, etc

Of course there are individual modules to convert all these to nodes - comments, taxonomy, users - but wouldn't it be great if these things were native to core??

I think this work is going to help with both these ideas and move towards a much stronger application framework!!!

Peter Lindstrom
LiquidCMS - Content Management Solution Experts

See

Posted by yched on June 13, 2008 at 8:41pm

See http://drupal.org/node/212665 for the proposed outline of a 'custom validators' feature. This was proposed as a GHOP task, but is still open for any taker :-)
It could possibly be extended to include your idea of 'manipulators'.