Entities in Drupal 8 - Notes from BOF


Notes from the Entities in Drupal 8 BOF at Drupalcon Chicago

Participants: pwolanin, yched, fago, catch, frando, plach, scor

Entity objects, OO in general

  • Keep the Entity controllers, add RUD methods (as in entity API module)

  • Entities will be actual classes (not just stdClass) that implement an EntityInterface, with methods that are mostly routers onto functions or the methods of the EntityController. So $node->save() calls the node controller's save() method with $node as argument.
    Note from chx: this of course carries the "classes are closed" trap. If I come up with something else that can be done in uniform ways to an entity, whatever it is, say a new way to serialize (thrift? bson? etc.), then how can I add that to all entities? With hooks I can add a hook, implement it for core entities and then bug contrib entities for it. I am not against classes here, as inheritance/interfaces make a ton o' sense, but I am wary.
    (mikey_p) 1) I think $node->save() would either be, or invoke, some sort of factory for finding the correct controller for saving a node (or other entity). I imagine this would be alterable and include some hooks to determine which factory is used. 2) I don't think anyone is proposing removing or changing the logic around node_save or the current node hooks; rather, this would be a way to completely replace node_save with an entirely different implementation, should you need to. 3) The pipeline notes from catch's core conversation may apply here as well, i.e. have a list of controllers that are each invoked for CRAP/CRUD operations, and each one could either return or pass to the next controller. For example, the entity_cache controller injects itself before the main controller and can thus clear the cache for that specific entity before passing it on to the main controller to save it to disk. Or, when loading, the entity_cache controller could just return the entity (if it's cached) and thereby bypass the default load method in the default controller that would fetch from the DB.
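
A minimal sketch of the "methods as routers onto the controller" idea discussed above. All names here (EntityInterface, EntityControllerInterface, entity_get_controller()) are illustrative only, not a committed API:

```php
<?php
// Hypothetical sketch, not a committed design: the entity class stays
// thin, and its methods delegate to a per-entity-type controller.

interface EntityControllerInterface {
  public function save($entity);
  public function delete($entity);
}

interface EntityInterface {
  public function save();
  public function delete();
}

class Node implements EntityInterface {
  public function save() {
    // The method is just a router onto the node controller.
    return entity_get_controller('node')->save($this);
  }
  public function delete() {
    return entity_get_controller('node')->delete($this);
  }
}
```

Under mikey_p's point 1, entity_get_controller() would be the alterable factory, so a module could swap in a different controller (or a pipeline of controllers, such as an entity_cache controller in front of the default storage controller) without changing the entity class.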

CRUD is CRAP

Internally, we want CRAP (create, read, archive, purge), not CRUD. The API will be exposed as CRUD, though, because anything else isn't straightforward enough. We didn't talk much about this in the BOF, though.

Components of entities

  • Currently we have a mess: Fields, properties (i.e. columns in the node table) and random stuff that other modules stick onto an entity in a load hook.

  • Many parts of Field API can be separated from the field storage system. We do this partially already with extra fields on the manage field and manage display forms. To do it properly, formatters and widgets will have to be decoupled from field storage.

  • Formatters and widgets want a value. A value has a type. A value can be a simple value (just one literal) or a component value.

What we envisioned would look like this:

  • hook_field_info() becomes hook_data_type_info(). In hook_field_type_info() you then set the data type and any field-specific properties, if there are still any.

  • Widget and formatter plugins declare what data types they support. They don't care whether the value of a certain data type comes from a field or somewhere else.

  • The data types declared in hook_data_type_info are used for fields, for properties and can also be used for other "random" stuff, giving it a much cleaner way to integrate.

  • (mikey_p) Where would the properties (and their data types) for entities be defined, hook_entity_info()?

  • The structure of a full entity object will remain very similar to what we have now (so $node->created and $node->body['und'][0]['value']). The entity objects may get accessor methods to make the components (fields, properties) easily accessible (e.g. $node->get('body') giving the first value of the body field in the default language). There will also be a method to retrieve the abstract information about each property and field (data type, label, ...).

  • With this in place we don't need a full data-wrapper layer to consistently access all components of an entity and retrieve information about the data types used. We considered the wrappers API in the Entity API module too complex for core.
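
The hooks envisioned above could look roughly like this. This is a hypothetical sketch only; the hook names follow the notes, and all keys and type names are made up for illustration:

```php
<?php
// Hypothetical: data types are declared independently of field storage.
function mymodule_data_type_info() {
  return array(
    // A simple value: just one literal.
    'text' => array(
      'label' => t('Text'),
    ),
    // A compound value with named components.
    'text_formatted' => array(
      'label' => t('Formatted text'),
      'components' => array('value', 'format'),
    ),
  );
}

// Hypothetical: a formatter declares which data types it supports. It
// does not care whether the value comes from a field, a property, or
// other "random" stuff attached to the entity.
function mymodule_field_formatter_info() {
  return array(
    'plain_text' => array(
      'label' => t('Plain text'),
      'data types' => array('text', 'text_formatted'),
    ),
  );
}
```

With such declarations, the same formatter could render a node title (a property) and a text field value, since both would carry the same data type.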

Volatile fields

  • We currently store volatile data in SQL. This includes node_history, node_comment_count, etc. Volatile data means data that is updated a lot.

  • We currently have no standards for this at all. We want to standardize.

Plan:

  • Add high-frequency-update fields.

  • Make it possible for field types to decide whether to lazy-load by default. Lazy loading then works via the __get() magic method on the entity object.

  • Make it possible to update an entity by just updating changed parts. With data types as described above this should not be too hard:

    • Field types can declare whether they always need a full entity for update or are OK with just id, type and the value for the field.

    • entity_update_partial($entity_type, $entity_id, $field_name, $field_value) or something similar

    • try to make this work without a full bootstrap
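
The partial-update idea above could be sketched as follows. The function name comes from the notes; everything else (entity_get_storage(), the method names on the storage object) is a hypothetical placeholder:

```php
<?php
// Hypothetical sketch of a partial entity update for high-frequency
// data such as counters, avoiding a full entity load/save cycle.
function entity_update_partial($entity_type, $entity_id, $field_name, $field_value) {
  // entity_get_storage() is an assumed helper returning the storage
  // engine for this entity type.
  $storage = entity_get_storage($entity_type);

  if ($storage->fieldSupportsPartialUpdate($field_name)) {
    // The field type declared it is OK with just id, type and value.
    $storage->updateField($entity_id, $field_name, $field_value);
  }
  else {
    // Fall back to a full load/save for field types that need the
    // whole entity on update.
    $entity = entity_load_single($entity_type, $entity_id);
    $entity->{$field_name} = $field_value;
    $entity->save();
  }
}
```

The goal stated above, making this work without a full bootstrap, would additionally require that the storage engine and the field-type declarations be available early in the request.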

Render cache entities

We want to be able to render cache both full entities and individual components of an entity. This keeps lists of node titles fast even once node titles become fields.

Workflow would look like this:

  • load IDs with EntityFieldQuery

  • Construct a render array without any data in it, but #cache set up and a #pre_render callback. We need a helper function for this.

  • If #cache misses, in the #pre_render callback, the full entity is loaded and the needed part is built and added as a child to the current element
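
The three-step workflow above could be sketched with the existing render API conventions. The helper and the #pre_render callback names here are hypothetical; #cache and #pre_render behave as in Drupal 7, where a cache hit skips #pre_render entirely:

```php
<?php
// Hypothetical sketch: build a render array per entity ID with no data
// in it, only cache metadata and a #pre_render callback.
$ids = array_keys($query->execute()); // IDs from an EntityFieldQuery.

$build = array();
foreach ($ids as $id) {
  $build[$id] = array(
    '#cache' => array(
      // Cache per entity and per component ('title' here).
      'keys' => array('entity', 'node', $id, 'title'),
    ),
    // Only invoked on a cache miss: loads the full entity and builds
    // the needed component as a child of this element.
    '#pre_render' => array('entity_component_pre_render'),
  );
}
```

On a cache hit, drupal_render() returns the cached markup without ever loading the entity, which is what makes title lists cheap again.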

Everything an entity

Content types: yes.
Menu links: yes.

Basically, the idea is that the Entity API is our primary data API. No real reasons not to do this. Performance shouldn't be an issue, as many entities get cached anyway and the overhead for non-fieldable entities should be really small.

update by fago: related discussion over here: http://groups.drupal.org/node/134569

Storage

  • field_storage becomes storage_engine with field storage and entity storage. In entity_save() the entity is passed to the storage engine for actual saving. This means that with mongo or something the node table could be skipped completely.

  • The entity tables would be created by the storage engines themselves. No more node and users tables in hook_schema! Instead, a list of properties + their data types in hook_entity_info() (or a separate hook if entity info becomes too big with that). Then, storage_engine_sql creates the node table based on that information.

  • Pipeline for loading of entities. (Notes still missing.)

  • Conditions in entity_load_multiple() will be gone.
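
Declaring entity properties with data types in hook_entity_info(), as proposed above, could look like this. This is a hypothetical sketch; the 'storage' and 'properties' keys and the data type names are illustrative, with storage_engine_sql then deriving the base table schema from the declarations instead of hook_schema():

```php
<?php
// Hypothetical: properties and their data types declared in
// hook_entity_info(), replacing the node table in hook_schema().
function node_entity_info() {
  return array(
    'node' => array(
      'label' => t('Node'),
      'storage' => 'storage_engine_sql',
      'properties' => array(
        'nid' => array(
          'data type' => 'integer',
          'label' => t('Node ID'),
        ),
        'title' => array(
          'data type' => 'text',
          'label' => t('Title'),
        ),
        'created' => array(
          'data type' => 'timestamp',
          'label' => t('Created'),
        ),
      ),
    ),
  );
}
```

A different storage engine (e.g. one backed by MongoDB) could consume the same declarations and skip the SQL table entirely, which is the point of moving this out of hook_schema().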

Update by fago: I wrote some of my thoughts down here: http://wolfgangziegler.net/Drupal-8-Entities-and-data-integration

Comments

Quick question

tinyrobot

What are the benefits of storing fields in separate tables in the database?

I guess some people might say that having composite fields (fields with multiple values) is a benefit, but I believe that those should simply be entities.

If the benefits of fields are not overwhelming couldn't we just get rid of the concept of fields and bundles, and just have entities and properties (data_types, with widgets and formatters)?

This would simplify a lot of things moving forward toward better data handling.


What you call volatile fields

pounard

What you call volatile fields are often additional data for which consistency does not really matter (you can lose it without losing the real data, and most of it can be rebuilt): I think they should be excluded from entity storage and offloaded to another storage backend, totally decoupled from the Entity API and handled by the business layer instead (the core entity API wouldn't know anything about it). Example: a comment count is something that can be rebuilt, and node history can be lost without real side effects. The whole software will be more scalable if those used another storage backend (and the data could even be dropped, or fetched another way, if this storage backend is down, and the site would still run fine).

These storage engines for volatile data can, for the most part, definitely be handled by key-value databases, which would allow using stuff such as Redis in parallel for this data.

I was about to post a design proposal for abstracting this need into core and making it scalable and pluggable.

Plus, you use the word volatile for "frequently updated", but I'd rather use volatile for "temporary, non-essential data", which is not exactly the same: using another storage backend such as Redis does not make the data volatile, since it's a persistent storage engine; using Memcache would make it volatile. Cache is volatile in general: a comment count is a performance denormalization, so it can be considered cache and is therefore volatile. But node history cannot be rebuilt exactly the same way if we lose it, so that's not volatile data; it's more like "additional persistent non-essential data that we don't really care about losing for a short amount of time: the site won't be down".

Pierre.

So the problem with

catch

So the problem with completely decoupling these from the entity API / entity storage is if you need to do things like order by vote count, select unread nodes etc.

This is also a problem if you're using different storage for different fields, but in practice most people use exclusively SQL or exclusively MongoDB etc. so we can definitely leave it up to people.

However I'd ideally want EntityFieldQuery or similar to work with this stuff, which means some common API needs to know about it, even if it's not baked into entities or fields specifically.

Agreed these aren't really 'volatile', I've been using that to mean updated outside of entity CRUD functions more than anything else.

Agree with the limitation

pounard

Agree with the limitation when it comes to sorting or joining, but there are alternative methods to achieve this. Whatever the other storage engine is, you can still keep mappings on the secondary storage side to sort and/or filter this data. You do an additional remote server query in this particular case in order to save on, say, 80% of the other requests: but once again the figure is totally arbitrary; it really depends on the site's business orientation.

Example:

With full SQL:
- 1 SQL query to fetch identifiers, all sorts and filters
- 1 SQL query (multiple load) to fetch all entities
- 1 multiple cache query (object being cached or not will cause the query to be done)

With additional storage engine (optimistic scenario):
- 1 external query to sort and filter (with what you can) and fetch the identifiers and the additional data
- 1 SQL query (multiple load) to fetch all entities
- 1 multiple cache query (object being cached or not will cause the query to be done)

With additional storage engine (worst case scenario):
- 1 external query to sort and filter (with what you can) and fetch the identifiers and the additional data
- 1 SQL query to filter the identifiers (adding new sort and filters clauses that the secondary engine could not do)
- 1 SQL query (multiple load) to fetch all entities
- 1 multiple cache query (object being cached or not will cause the query to be done)

I am not saying this is the best approach, but depending on the business (and only the business can tell) it can be worth the shot in some pieces of the API. That's why I support the idea that, even if it does not go to another storage engine, this stuff at least remains outside of the Entity API itself.

In a perfect world, EFQ would be a specialization of SelectQuery (which it is NOT right now) on which additional modules could call condition(), addExpression() and other DBTNG cool stuff directly: this would solve the problem of putting these additional fields into the knowledge of the Entity API, simplify the whole API, and let specific business code (example: the comment module is really pure business code using the Entity API, but is NOT part of the Entity API itself) do its own queries using EFQ, without EFQ itself having knowledge of what this pure business module is going to do (I hate metadata and descriptive arrays, BTW).

Handling all this kind of additional data in a unified way via the Entity API seems a bit like over-engineering to me, since each module that can/will add this kind of data has a specific business purpose and will probably expect different things from the Entity API than the others. It would make you responsible for anticipating every kind of business matter that can be done with this API, which will complicate it a lot and reduce ease of maintenance.

The Entity API should remain really simple and lightweight to ensure that everyone understands and uses it well. That does not mean it can't do complex things; it just means that it should care about its own business and not about others' business: it may become a leaky abstraction otherwise (note that I say "may", not "will"; nothing is defined here).

Pierre.