BADCamp code sprint summary

heyrocker

I had several conversations around CMI at BADCamp a few weeks ago, which I have attempted to summarize here.

First off I had a long conversation with yched about how to implement fields in the config api. There are several aspects to this that are tricky.

Field API: Config files layout

We are talking about a file layout like this:

field.<field_name>.xml (field definition)
<entity_type>.bundle.<bundle_name>.xml (bundle definition)
<entity_type>.bundle.<bundle_name>.instance.<field_name>.xml (instance of a field on a specific bundle)

A nice thing about this is we can get to the bundle metadata without loading all the instances too. While these names are going to get long, they allow us to do things like ask for all the instances on a bundle without introspecting into the files. The different ways in which these might be queried was a topic of conversation several times through the weekend. It may be that at some point we will have to iterate through files to get data out for queries, like right now field_read_fields() and field_read_instances() are just big query builders around the {field_config[_instance]} tables, but obviously we would like to avoid this.

One thing the CMI API needs that came out of this discussion is the ability to list all config entries with a given prefix, but only one level deep (e.g if I want to list all bundles in an entity type, I don’t want the “instance” files).
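For illustration, that "one level deep" listing could look like the sketch below. `config_names_with_prefix()` is a hypothetical helper, not a real CMI function; it simply filters a list of config object names:

```php
<?php

// Hypothetical helper: filter config object names by prefix, keeping only
// entries at most $depth name segments below the prefix.
function config_names_with_prefix(array $all_names, $prefix, $depth = 1) {
  $matches = array();
  foreach ($all_names as $name) {
    if (strpos($name, $prefix . '.') === 0) {
      $suffix = substr($name, strlen($prefix) + 1);
      // "One level deep" means the remainder contains no further dots.
      if (count(explode('.', $suffix)) <= $depth) {
        $matches[] = $name;
      }
    }
  }
  return $matches;
}

// Asking for all bundles of an entity type skips the "instance" entries:
$names = array(
  'node.bundle.article',
  'node.bundle.page',
  'node.bundle.article.instance.field_body',
);
// Keeps only 'node.bundle.article' and 'node.bundle.page'.
$bundles = config_names_with_prefix($names, 'node.bundle');
```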

Apart from fields and instances, Field API also stores properties about bundles, currently in the {variables} table (yes, “ew”): the list of view modes using dedicated display settings, and the order/visibility of “extra fields” - e.g. title, poll results... - in forms or displays. The new location for those is not fully decided yet, but probably somewhere under <entity_type>.bundle.*.

Right now instance definitions contain all display settings for all view modes regardless of the one you need, causing memory bloat. Yched and swentel have been considering the introduction of “displays” as standalone animals, collecting display properties for all components in an entity (fields, extra fields, contrib components like field_groups...) for a given view mode. Those would probably live in <entity_type>.bundle.<bundle_name>.display.<view_mode>.xml. Possible clash with the current CMI concepts: granularity for module defaults. The current vision is that modules can provide default config files that will get copied into the actual config store on install. If I’m a module that adds its own fields and instances, I’ll provide the xml files for the fields and instances, and I’ll also want to provide the corresponding parts of the display file. That issue is possibly not limited to Field API: what if I want to provide a default for the ‘page cache’ property without affecting the other properties stored in system.performance.xml?

Field API: active fields, deleted fields - “state”, not “definition”

One of the oddities of the field system is that when a field is deleted, it isn’t really ‘deleted’. It is marked as deleted, and then the field values are deleted over time through cron. Once all the data is deleted, then the field definition can be deleted.
Also, Field API needs to track which field/instance definitions refer to field types that are currently absent (i.e. the providing module is disabled), and keep them out of the way.

That information (‘pending deletion’, ‘active’) is currently written directly in the {field_config} tables. We decided that it is not part of the actual definitions living in the config system (e.g. we don’t want it hand-edited), but pertains to the “state” of the site (along with “cron timestamp” and “search index progress”), and therefore should be stored somewhere in the database - but *not* in the {variables} table :-).

API needs: a generic key-value store for what we would currently put in {variables} and that doesn’t fit in config files? (But Field API can use a dedicated table of its own meanwhile.)

The disappearance or appearance of a file will signal that the instance (bundle, etc.) should be deleted or added. This means we will need a way to get the current list of configuration and compare it against the files in the file system to determine what is new and what is missing when a load operation is triggered.
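A minimal sketch of that comparison, assuming both stores can hand back a map of config names to file signatures (hashes); the function name and the signature mechanism are assumptions, not settled API:

```php
<?php

// Hypothetical diff between the file store and the active store. Both sides
// are represented as name => signature (hash) maps.
function config_diff(array $files, array $active) {
  $diff = array(
    // In the files but not in the active store: new config.
    'added' => array_keys(array_diff_key($files, $active)),
    // In the active store but no longer on disk: deleted config.
    'deleted' => array_keys(array_diff_key($active, $files)),
    'changed' => array(),
  );
  // Present on both sides: compare signatures to spot edits.
  foreach (array_intersect_key($files, $active) as $name => $signature) {
    if ($signature !== $active[$name]) {
      $diff['changed'][] = $name;
    }
  }
  return $diff;
}
```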

Note: the numeric, serial $field['id'] used to track fields (i.e. to allow “delete a field, then immediately create a new field with the same name while the first one is still pending full deletion”) will probably need to be replaced with a UUID for deployability.

CMI: the “from files to active store” flow

This then led us into another discussion on Sunday with davidstrauss, yched, sun, crell, beejeebus, webchick and michaelfavia (and maybe some more). The main focus of this discussion was how we are going to respond to config being loaded off of disk (i.e. config files deployed or edited by hand) in situations where a module needs to do something more complex than just reading data out of the file and into the active store (for instance, when creating a new field or bundle). Here is what we came up with.

  1. A reload of the active store from files is triggered. This can happen either through the admin UI or through an alternate mechanism like Drush.
  2. You are shown the list of files that have been changed, added, or deleted, and asked if you accept these changes.
  3. If you say yes, then the config system starts calling a hook in every module in the system, in order based on the directed acyclic graph (http://en.wikipedia.org/wiki/Directed_acyclic_graph) of the dependencies as defined in the module .info files. This should ensure that operations happen in the correct order: for instance, if a bundle has been created and a field instance has been added to that bundle, the bundle has to be created first.
  4. This function gets passed an object which points to the entirety of the new config. The old config can be accessed through the existing config system if need be.
  5. The modules look at the new config, compare to the old config in any way they care, and act based on it.
  6. When all this is complete (or potentially as it happens), the new config completely overwrites the old config in the active store.
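Step 3's ordering amounts to a topological sort of the module dependency graph. A sketch with hypothetical function names (core's actual ordering code will differ, and a real implementation would also need cycle detection):

```php
<?php

// Order modules so that every module comes after everything it depends on.
// $dependencies: module name => array of modules it depends on.
function module_reload_order(array $dependencies) {
  $order = array();
  $visited = array();
  $visit = function ($module) use (&$visit, &$order, &$visited, $dependencies) {
    if (isset($visited[$module])) {
      return;
    }
    $visited[$module] = TRUE;
    // Visit dependencies first, so they land earlier in the order.
    foreach ($dependencies[$module] ?? array() as $dependency) {
      $visit($dependency);
    }
    $order[] = $module;
  };
  foreach (array_keys($dependencies) as $module) {
    $visit($module);
  }
  return $order;
}

// 'field' is ordered before 'node', which is ordered before 'my_feature',
// so bundles exist before instances get attached to them.
$order = module_reload_order(array(
  'my_feature' => array('node', 'field'),
  'node' => array('field'),
  'field' => array(),
));
```

Each module's reload hook would then run only after every module it depends on has had its turn.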

The important breakthrough in this conversation was that the config system should not track who owns or needs to act on changes. It is up to the modules themselves to act however they need to based on whatever criteria they have for their use cases. This reload operation will also have to happen on any save() or delete() functions, because it should be triggered whenever the active store is written to. This means CRUD API functions on config objects (e.g. field_create_field()) act merely by writing the new config on disk and triggering a “config reload”. The actual action (e.g. create the storage table) takes place in the ‘config reload hook’.

In theory this system could be modified to handle situations wherein the ordering provided by module dependencies isn’t enough. For instance, say a module is called but it can’t complete its task for some reason. This task could be added to the bottom of the graph to be re-called after the original graph has been completed. As long as the graph keeps shrinking you keep running through it. If it doesn’t shrink for two iterations you can throw an exception.
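That retry idea could be sketched as follows; the function name is made up, and the two-iteration threshold simply follows the description above:

```php
<?php

// Keep re-running failed tasks; as long as the set shrinks we are making
// progress. Each task is a callable returning TRUE on success, FALSE to retry.
function run_until_stable(array $tasks) {
  $stalled = 0;
  while ($tasks) {
    $remaining = array();
    foreach ($tasks as $task) {
      if (!$task()) {
        // Re-queue the failed task at the bottom of the graph.
        $remaining[] = $task;
      }
    }
    // If the set didn't shrink for two iterations in a row, give up.
    $stalled = (count($remaining) === count($tasks)) ? $stalled + 1 : 0;
    if ($stalled >= 2) {
      throw new Exception('Config reload cannot make progress.');
    }
    $tasks = $remaining;
  }
}
```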

Implementing that flow is the primary blocker before Field API can try to start moving over to the config system.

Comments

That information (‘pending

catch

That information (‘pending deletion’, ‘active’) is currently written directly in the {field_config} tables. We decided that it is not part of the actual definitions living in the config system (e.g. we don’t want it hand-edited), but pertains to the “state” of the site (along with “cron timestamp” and “search index progress”), and therefore should be stored somewhere in the database - but *not* in the {variables} table :-).

API needs: a generic key-value store for what we would currently put in {variables} and that doesn’t fit in config files? (But Field API can use a dedicated table of its own meanwhile.)

This is exactly what http://drupal.org/node/1175054 is for. I had been planning to work on that issue since I've spent a tonne of time on the current variable system trying to work around the fact we currently use it both for configuration and state (and individually fixing contrib modules and some core subsystems that abuse the hell out of it even given that dual use). But... have not done so yet :(

The important breakthrough in this conversation was that the config system should not track who owns or needs to act on changes. It is up to the modules themselves to act however they need to based on whatever criteria they have for their use cases. This reload operation will also have to happen on any save() or delete() functions, because it should be triggered whenever the active store is written to. This means CRUD API functions on config objects (e.g. field_create_field()) act merely by writing the new config on disk and triggering a “config reload”. The actual action (e.g. create the storage table) takes place in the ‘config reload hook’.

Avoiding magic in figuring out what needs to respond to what sounds good, but this brings up new (and old) questions for me:

It outlines the need to keep state tracking out of the config system - this sounds at least as expensive as variable_set() is currently.

If it triggers a reload operation, what happens in this situation:

  • you have some changes in files that have not been loaded into the active store yet.

  • you create a field via the UI or similar which triggers a reload.

  • do you get a choice to only reload the parts you changed in the UI? Or do you now need to reload everything all at once?

  • also the example given is creating a new field, but presumably a full reload has to happen on updating a field as well? Does this mean you can't have things like 'overridden' views any more (since it will always get written back to disk)?

  • some time ago there was discussion about the disk store not being required, sounds like it went back to being required again?

Generic key-value store

yched

Yup, I keep an eye on http://drupal.org/node/1175054 :-)

However, giving it more thought, it's very possible that Field API will use its own 'state' tables anyway. As mentioned in the OP, field_read_[fields|instances]() are currently query-builders that let you say "give me all the fields of a given type", or "give me all the instances of that field", and I'm not sure we can do without this "queryability" - my patch for smart static caching of fields and instances heavily relies on it.

So I'm thinking of keeping trimmed-down versions of the current {field_config} and {field_config_instance} tables, with only the queryable columns (id, field name, field_type, entity type, bundle, deleted, active...). Those tables are derived (and rebuilt) from the definitions in the config files. Querying them only gives you a field name / id, and you need to go through the config system to get the full definitions, but most of the time you only want a field name (which is partly why the current field_read_field() function sucks).
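In Drupal terms that could be a plain schema definition holding just the queryable columns. This is a guess at the column list, not an agreed design, and the table/function names are illustrative:

```php
<?php

/**
 * Hypothetical trimmed-down index table, derived (and rebuildable) from the
 * definitions in the config files. Only the queryable columns survive here;
 * the full definition still has to be fetched through the config system.
 */
function field_schema_sketch() {
  return array(
    'field_config_index' => array(
      'fields' => array(
        'id' => array('type' => 'varchar', 'length' => 36, 'not null' => TRUE),
        'field_name' => array('type' => 'varchar', 'length' => 32, 'not null' => TRUE),
        'field_type' => array('type' => 'varchar', 'length' => 128, 'not null' => TRUE),
        'deleted' => array('type' => 'int', 'size' => 'tiny', 'not null' => TRUE, 'default' => 0),
        'active' => array('type' => 'int', 'size' => 'tiny', 'not null' => TRUE, 'default' => 1),
      ),
      'primary key' => array('id'),
      'indexes' => array(
        'field_name' => array('field_name'),
        'field_type' => array('field_type'),
      ),
    ),
  );
}
```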

Or is that a slippery slope - putting config-derived data back in the db ?

do you get a choice to only

heyrocker

do you get a choice to only reload the parts you changed in the UI? Or do you now need to reload everything all at once?

Right now you need to reload everything at once.

also the example given is creating a new field, but presumably a full reload has to happen on updating a field as well? Does this mean you can't have things like 'overridden' views any more (since it will always get written back to disk)?

Yes, I believe this whole concept of default/overridden configuration is going to go away entirely. That is the approach I am taking in terms of image styles anyways.

some time ago there was discussion about the disk store not being required, sounds like it went back to being required again?

It is not required but it will be on by default, with an option to turn it off. There are still some arguing that it should be off by default (most notably beejeebus).

read on import, write on export

beejeebus

just to clarify reading/writing to disk - i'm arguing that reading only happen on import, and writing only happen on export.

and writing only happen on

yched

and writing only happen on export

What's "export" ? I don't think I heard about that notion of "export" so far in CMI.

  • CMI before BADcamp has always been "config changes are saved to both active and file stores".
  • The approach taken at the BADcamp sprint (field_create_field() and friends write straight to disk, and hook_config_reload() triggers business action by diffing files and active store) shifts that motto to "in 'some cases' (TBD ?), config can get written to disk only" - which I try to challenge in my "Reload flow" comment below.

But "export" is still new to me ?

All in all, it doesn't look like we have a clear vision of the config flow right now :-/

export == dump current state of active store to disk

beejeebus

AFAIK, even if i fail to get my way here, we're still going to allow site admins to turn off the 'write to file on every change' setting, and in that case, files on disk can get out of sync with the active store.

so, if you decide you want to edit some files and then get the config reloaded, you prolly want to 'export' the current state, then edit, then import.

so, if drupal shops that use version control for CMI in D8 want to publish new config from testing --> production, they prolly want to export production and bring it in to testing as part of the release cycle.

hope that makes sense, and you can see that writing config to disk all the time is not at all necessary for either use case, given we're going to need an explicit 'dump all config to disk' no matter which way we go.

"Reload" flow

yched

A couple thoughts that came to me after BADcamp or while helping heyrocker with the summary above.

4. [hook_config_reload() or whatever name] gets passed an object which points to the entirety of the new config.

I guess this object lazy-loads the config files you explicitly ask for? Because loading the entirety of the config files into memory sounds like a no-go.
D7 currently loads all active fields and instances in memory on each request, and this is a real issue for sites with many fields / bundles / entity types (a D8+D7 patch is on the way). So we can't be talking about loading all fields, instances, image styles, views, panels, rules, ... on a 'config reload'.
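The lazy-loading idea could be sketched like this: the object handed to hook_config_reload() only reads a config file from disk the first time it is asked for. Class, method, and file-layout details are made up for illustration:

```php
<?php

// Hypothetical lazy-loading snapshot of the incoming config.
class LazyConfigSnapshot {
  protected $directory;
  protected $loaded = array();

  public function __construct($directory) {
    $this->directory = $directory;
  }

  public function get($name) {
    if (!array_key_exists($name, $this->loaded)) {
      $file = $this->directory . '/' . $name . '.xml';
      // Only the files a module explicitly asks for ever hit memory.
      $this->loaded[$name] = file_exists($file) ? file_get_contents($file) : NULL;
    }
    return $this->loaded[$name];
  }
}
```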

I'm also worried that hook_config_reload() is left alone with "the new state of the files" as a whole. This means that field_config_reload() needs to iterate on each existing field and instance, load old and new state, and deep-compare them to find out if there was an update. Really ? And the same for each subsystem whose config objects need to react on an update ? (e.g. image.module needs to flush the corresponding folder when an image style is changed). The config files have signatures, so it looks like the config system could at least point us to the files that actually changed ?

This means CRUD API functions on config objects (e.g. field_create_field()) act merely by writing the new config on disk and triggering a “config reload”. The actual action (e.g. create the storage table) takes place in the ‘config reload hook’.

That's the approach that came up in BADcamp when I asked "so what should field_create_field() do exactly ? Is it the function that writes the config, or is it the function that gets called when config files have changed ? when do I write the config, when do I take my business action ?".

Now I'm not sure this really flies, especially given the considerations above about the memory/CPU cost of config_reload().

  • First off, it sounds like this makes UI operations more sluggish, since they issue a whole config_reload().
  • Then, doing a series of CRUD operations is not unusual :
    - creating several fields and instances in a hook_module_install()
    - Submitting either "Manage fields" or "Manage display" forms means one field_update_instance() on each field instance in the bundle (to update weights).
    - deleting a node type can mean 10s of field_delete_instance() and field_delete_field(), removing an entity type can mean 100s.
    Does each one trigger a separate config_reload() ? Please no.
  • Also, this approach means that UI operations write to the file store but not to the active store. That's a non-minor shift from the current approach, so far it's always been "UI changes will get written to *both* the database table *and* the file system so the two are in sync" (in several places in the "Config management architecture" doc), and the current API in the CMI sandbox only saves conf to both stores.

    Would we have different flows ? :
    - "simple" config (variable_get()-like) saves to both stores
    - "complex" config (field_create_field()-like) saves to file store only, and moves over to the active store through config_reload().
    Seems confusing, and I'm not sure where we'd draw the line.

Alternate, stupid proposal: each CRUD function on a config object needs an additional $write_to_config param.

<?php
/**
 * API function, called by other modules or by Field UI.
 */
function field_create_field($field, $write_to_config = TRUE) {
  // Take the business action:
  // create the storage table, etc.

  if ($write_to_config) {
    // Save the corresponding config (both active and file stores).
    $config = the_corresponding_config_object($field);
    $config->save();
  }
}

/**
 * Implements hook_config_reload().
 */
function field_config_reload() {
  // (Pseudocode) If I figure out a new field needs to be created:
  if ($new_field_detected) {
    field_create_field($field, FALSE);
  }
}
?>

- This lets us keep one single "save config" flavor (save to both stores), and the only time the file and active stores are out of sync is when files have been edited/deployed and are pending reload.
- An API call doesn't trigger a full config reload. A full config reload only happens after deployment. Then it can be as inefficient as we want (sort of)

Maybe even: if we make all config objects derive from a base ConfigObject class, maybe we could have a standardized way, just by looking at the incoming $field, to know whether we should write it to config or not, and ditch the extra param.
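That base-class idea could look something like this sketch (all names hypothetical): the class records whether the object was built from the config store, so the CRUD function can decide on its own whether a config write is still needed.

```php
<?php

// Hypothetical base class for config objects: remembers whether this object
// came out of the config store, so CRUD functions know whether to write it.
abstract class ConfigObject {
  protected $values;
  protected $fromConfig = FALSE;

  public function __construct(array $values) {
    $this->values = $values;
  }

  // Factory used by the config reload path.
  public static function fromConfigStore(array $values) {
    $object = new static($values);
    $object->fromConfig = TRUE;
    return $object;
  }

  public function needsConfigWrite() {
    // Objects coming out of hook_config_reload() are already on disk.
    return !$this->fromConfig;
  }
}

class FieldConfig extends ConfigObject {}
```

field_create_field() could then test $field->needsConfigWrite() instead of taking an extra $write_to_config parameter.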

Config objects with CRUD will need a UUID

yched

About the need for $field['id'] to become a UUID :

Although $field['id'] only exists in D7 because of Field API's "delayed deletion", a deployable ID will be needed anyway besides the field name in order to disambiguate the following case :
- create field_foo
--> deploy
- delete field_foo
- (possibly weeks later) create a new field named field_foo
--> deploy
Without an ID, we can't tell “update” from “delete + create new”.

This is critical in the case of Field API, because the latter means data loss. Maybe other config objects will be less sensitive to that (Views possibly doesn't care about the difference?), but I guess Field API won't be alone in this case.

It doesn't necessarily mean that the config system should handle those UUIDs for us, but I'm wondering if the API / concepts can somehow help avoid each developer getting their own headslap moment when they realize the above subtlety (there might be other nasty combinations?). Maybe it's just a matter of documentation.
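The disambiguation itself is simple once a UUID exists; a sketch, with a hypothetical helper name and assumed 'field_name'/'uuid' keys:

```php
<?php

// Compare an existing and an incoming definition to classify the change.
// Hypothetical helper; the 'uuid' and 'field_name' keys are assumptions.
function field_change_type(array $existing, array $incoming) {
  if ($existing['field_name'] !== $incoming['field_name']) {
    return 'unrelated';
  }
  if ($existing['uuid'] === $incoming['uuid']) {
    return 'update';
  }
  // Same name, different identity: the old field (and its data!) must be
  // deleted before the new one is created.
  return 'delete_and_create';
}
```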

UUID

Crell

Adding a UUID to eliminate that problem entirely seems much better to me than documenting it and hoping people don't reuse names. Especially in deeper deployment pipelines I could see that biting someone hard.

It wouldn't really be

yched

It wouldn't really be "documenting it and hoping people don't reuse names", but rather "documenting it and hoping maintainers of subsystems defining config objects with a CRUD cycle read that doc, figure if their own objects are affected, and if so add UUIDs themselves and rely on them when tracking config changes".

Again, I'm not sure to what extent the Config API can streamline this. Most probably this should be (if anything) a 2nd-phase enhancement once we have an actual, working config flow in place.

questions

beejeebus

"1. A reload of the active store from files is triggered. This can happen either through the admin UI or through an alternate mechanism like Drush.
2. You are shown the list of files that have been changed, added, or deleted, and asked if you accept these changes."

we figure that out by reading the list of all file sigs, then creating a new sig for every, single, config file on disk? are we going to lock all changes to the active store during a reload, so that the comparison the admin agrees to and what actually happens are consistent? so any code that calls config->save() once a reload has started should fail, and that would be a feature, right? or, we kill the reload if we detect a change to the active store, and force the user to start again?

or we're going to allow the admin to accept changes, but then end up with values in the active store that are different to what they agreed to?

also, what does accepting those changes look like? will it show just the file names that have changed? or the names and values of our completely unfriendly machine-name-readable keys that have changed? or a diff-like output of lines of XML/JSON? both?

a single file can contain multiple config values. so if we update at the scope of a file, a config value i don't want to change can be overwritten or deleted when trying to introduce some other change in the new config file. how do we handle that? we'll offer value-by-value granularity of which changes should happen? or, assuming you even spot the issue, you quit the reload, redo your config file changes, then try again?

"This reload operation will also have to happen on any save() or delete() functions, because it should be triggered whenever the active store is written to. This means CRUD API functions on config objects (e.g. field_create_field()) act merely by writing the new config on disk and triggering a “config reload”. The actual action (e.g. create the storage table) takes place in the ‘config reload hook’."

wait, what? what if the files we're going to write to have changed because of a deployment or ftp or editing or because someone enabled a module that added a field to a content type? we can't reject all of those cases in code via a lock(s) around our config operations. so we can read in values from disk that are different from those written by field module "merely writing the new config on disk". or a few different combinations, depending on when you read the sigs from the backend, or generate a sig of the file on disk.

and this window of fail is present for normal site building operations.

Deployment & Build Systems & Change Management
