Configuration management sprint - file formats

gdd's picture

Welcome to another discussion from the configuration management initiative! So I believe that in the last thread we pretty much got everybody on board with the idea of 'we can use pure JSON as our file format if we use a hashed directory name and don't write the files by default.' This is great; however, it then led to more discussion of 'Why are we using JSON anyway? Why aren't we using PHP/YAML/INI/LOLCODE, etc.?' Some good points were brought up in this discussion and I think it is worth hashing this out now.

NOTE: please leave aside, for the moment, the idea of pluggable formats. We will need to choose a default for core and contrib to ship with, and that is what we should be focusing on for this discussion. The option of pluggable formats, and how to implement them, can come another day.

So as background to this discussion, the day after the code sprint was over at DrupalCamp Colorado, a large cadre of very active Drupal developers (me, webchick, chx, damz, davereid, cwaegans, eclipsegc, crell, and probably more I'm forgetting) met to hash out the question of what file format we will use for the configuration system. It seemed obvious at the time that PHP and JSON were the primary contenders (with the exception of XML, which found a strong advocate in EclipseGC), but at the end of the day we all agreed that JSON was our best option. However, for the sake of completeness, we considered a wide variety of options. I took notes, and have transcribed them below.

Format pros and cons:

YAML

Pros:
  • Pretty printed by default (it is the nature of the format.)
  • Stores mixed data formats (aka objects and/or arrays.)
  • Interoperability with other tools (native to Puppet.)
  • Very human readable
Cons:
  • Parser not included in PHP by default, we would have to find one.
  • Performance impacts of a userspace library.
  • Relies on .htaccess protection to protect it from being world-readable.

XML

Pros:
  • Can store mixed data formats (aka objects and/or arrays), but only if we come up with our own standard for storing them (EclipseGC suggested we could use XSLT to do transformations of this data.)
  • Lots of open source tools available for working with it.
Cons:
  • No default pretty printer, we'd have to add one. (Is this true? I have it in my notes but now I am thinking that SimpleXML does it.)
  • Will either have to use SimpleXML (slow, but included in PHP by default) or an external userspace parser like possibly QueryPath or SAX (faster but not included by default.)
  • Even then we will have to write our own wrapper to convert the XML to appropriate data formats aka arrays, objects, etc.
  • Larger memory footprint.
  • Relies on .htaccess protection to protect it from being world-readable.

PHP

Pros:
  • Can be read and parsed very fast.
  • Can store whatever data formats we want.
  • Natively executable so we don't have to jump through the security hoops we would with the other formats.
Cons:
  • Not pretty printed by default, but we can just use drupal_var_dump().
  • Lack of interoperability with existing tools.
  • If we include() it, it will seriously mess with sites running APC if there are a large number of files.
  • Alternatively we could read and eval() them, which is very very slow.
  • When we read in a config object, its memory would not be freed. Whereas with the other formats, they would die when they fall out of scope.

JSON

Pros:
  • json_decode() and json_encode() included in PHP by default, highly performant (see the sketch below.)
  • High level of interoperability.
  • Offers ability to easily interact with client-side tools for browsing and analyzing configuration (aka who is overriding what values where.)
  • Human readable (although not as much as YAML.)
Cons:
  • Cannot store mixed objects and arrays, also some limitations with binary/utf8 data.
  • Cannot add comments (only format for which this is true.) We would probably get around this by implementing our own metadata in the structures.
  • No default pretty printer, we'd have to add a userspace one.
  • Relies on .htaccess protection to protect it from being world-readable.

INI

Pros:
  • They are a well understood, reasonably readable standard.
  • Native parser in PHP.
Cons:
  • Can only store arrays, not objects.
  • While there is a native parser, there is not (that I could find) a native encoder/writer.
  • We would now have INI files as well as .info files (which are sort-of-but-not-really-ini-files) which could cause developer confusion.
  • Relies on .htaccess protection to protect it from being world-readable.

Note that we did not actually discuss INI at the time (I brought it up and everyone just groaned) but I included it here for the sake of completeness.
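For concreteness, here is a minimal sketch of what reading and writing a config object with the built-in JSON functions could look like. The file name and array contents are invented for illustration; nothing here is an agreed API.

<?php
// Hypothetical example only: encode a config array to JSON and read it back.
$config = array(
  'site_name' => 'My site',
  'page_cache' => TRUE,
  'cache_lifetime' => 300,
);

// json_encode()/json_decode() are bundled with PHP; no userspace parser needed.
file_put_contents('system.settings.json', json_encode($config));

// Decode as an associative array (TRUE) rather than stdClass objects.
$loaded = json_decode(file_get_contents('system.settings.json'), TRUE);
?>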

So if you have anything else to add to the pro/con for any format, please do bring it up and we can keep this post updated. Please remember we have the following goals:

  • The format must be performant to encode/decode, especially when rebuilding all config objects after a clear and reload. Please provide actual benchmarks when discussing this, we need data not general impressions or links to random blog posts.
  • The format must be at least reasonably human readable/editable.
  • We want as low a memory footprint as possible, and this means as small a file size as possible.
  • The ability to interoperate with other tools easily is a huge plus.

Remember that we will have to weigh all these items against each other, and make a decision based on priorities. This is not a popularity contest or a vote-counting thread. Argue your case and make clear points. A simple +1 to any format doesn't help anyone or add to the cause.

Thanks everyone, we all really appreciate the input.

Comments

I don't mind JSON, because it

Josh Benner's picture

I don't mind JSON, because it is relatively simple, interoperable, and non-executed. I think lacking comments is a big deal.

PHP is like having home field advantage, but it loses interoperability. Furthermore, having Drupal do include() (or eval) on files that it writes seems like an invitation for security problems. If malicious content is written to PHP files that are included, it doesn't matter if we have .htaccess protection or if the files are outside docroot, since a bootstrapped Drupal is including them for execution. So, instead of security in depth, all it takes is a single vulnerability for complete compromise of a site/system.

Drupal writes out settings.php -- but it only does that once, and we consider a writable settings.php to be a threat.

Now, I'm sure a security team member would be better suited to provide some perspective on whether this is really a concern. Is a vulnerability that leads to malicious executable code in PHP config files feasible/(un)likely/impossible/whatever?

Any web-server-writeable

gdd's picture

Any web-server-writeable directory would offer this same vulnerability. Just because we store JSON in there by default doesn't mean a hacker who gained access couldn't just write out a .php file in the first place. This is where the signing came in (and I believe it is a good idea for it to remain.)

It's almost the same. There

Josh Benner's picture

It's almost the same. There is the extra (potential/probable) layer of .htaccess in that case, whereas there is no potential for such an additional protection for PHP directly executed by Drupal itself.

We could actually still

gdd's picture

We could actually still prevent the PHP from executing just like in the files directory, because the current plan is not to execute these files, but to read and eval() them.

.htaccess prevents direct

Josh Benner's picture

.htaccess prevents direct execution by browsing, but read+eval() is still executing code written by Drupal. The benefits of that approach over include()ing are:

  1. we can easily throw a signature line in there
  2. (as you noted) it won't muck up APC cache

The signing would help reduce the threat of arbitrary malicious code being executed when Drupal calls eval() on file contents, since we could likely detect tampering. This addresses my concern with include()ing.

So, for read+eval(), the only concern left would be whether or not its speed is a problem?

EDIT: Excellent points are made further down about include() and eval() both having serious memory-usage concerns.
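To make the signing idea concrete, here is a rough sketch of verifying an HMAC signature line before eval()'ing file contents. The function name, the signature-line convention, and the key handling are all hypothetical, not an agreed design.

<?php
// Hypothetical sketch only: verify a signature line before eval()'ing config code.
// Assumes the file holds raw PHP statements (no opening tag), preceded by a
// signature line that Drupal wrote when it saved the file.
function config_read_signed($filename, $secret_key) {
  $contents = file_get_contents($filename);
  list($signature_line, $code) = explode("\n", $contents, 2);
  $stored = trim(str_replace('// SIGNATURE:', '', $signature_line));
  $expected = hash_hmac('sha256', $code, $secret_key);
  if ($stored !== $expected) {
    // Tampering detected: refuse to execute the file.
    return FALSE;
  }
  // Signed and verified, but eval() is still slow and its memory is not freed
  // (see the memory discussion further down the thread).
  return eval($code);
}
?>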

eval() is bad, include should

pounard's picture

eval() is bad, and include should definitely be replaced by include_once(). The PHP file magic is that APC will keep it in memory and won't stat it. If you want to use PHP but not rely on this, then PHP should definitely be thrown away.

Pierre.

JSON has comments

montyd's picture

Hi Josh

I'm the author of a CMS which has used JSON for its configuration management for a year. My friend Kyle found this old (from late 2005) post from Douglas Crockford himself on comments in JSON: "A JSON encoder MUST NOT output comments. A JSON decoder MAY accept and ignore comments." Read more from Kyle on this topic on his blog at http://blog.getify.com/json-comments/

People have loved using my utf-8 encoded JSON with comments because it is so simple, self-documenting, and easy.

Monty

I have a couple of questions

andremolnar's picture

  1. I am getting the impression that there is some kind of requirement for page-load runtime parsing of configuration formats. Is that the case, and if so, why?
  2. Related to the first question, I get the impression that there is no plan to aggregate and cache configuration (since there is mention of the overhead of loading many, many configuration files, etc.).

Would assuming aggregation / compiling / caching of configuration for optimal run time execution change how people think about this question of file format?

My understanding: The

Josh Benner's picture

My understanding:

The original proposal put forward the approach that configuration would be loaded from defaults in modules, and written to a central file store that represented "current" configuration (called Layer 1). Atop that, a cache of the configuration would be stored in the DB (Layer 2). Layer 2 is what would be read-heavy, acting as the source of configuration data during a typical page load, and it would only be written when changing a setting or explicitly rebuilding from data in Layer 1. Layer 1 would only be written on a config change (so both read and write on Layer 1 are infrequent).

While performance of Layer 1 is unlikely to be a general Drupal performance bottleneck, I see it as an Admin UX as well as a DX issue -- imagine having to wait annoyingly long when changing each setting while building a site! So while I don't believe Layer 1 has to be blazingly fast, speed is a virtue and should carry some weight.

Correct me if I'm wrong!

There's a couple of issues, I

catch's picture

There are a couple of issues. I thought I was clear on the runtime loading until I discussed it with heyrocker; now it seems a bit more open than I thought.

Non-runtime:

I'm assuming that when defaults are copied from modules, and/or Level 1 file storage is copied into Level 2 storage, there will be a need to scan and parse many configuration files at once (let's assume the worst case, which is all of them). If this is happening, and can be triggered by the UI, then it needs to be fast enough (i.e. more than a second isn't going to be pretty when submitting an admin form), but more importantly it shouldn't cause memory bloat. Since it happens alongside other large rebuild operations like menu rebuilds, the theme registry, etc., it's going to raise the floor of memory usage requirements for Drupal again (the current variable_init() is not great either, of course). It's not the most important thing, but a lot of Drupal has been designed with "this only happens on a cache rebuild" in mind, and cold starts look really ugly now, so I'm pushing for that to be borne in mind.

Runtime:
This is the bit I'm not sure if I understand or not. The level 2 store is pluggable, but there has been some discussion of storing the encoded JSON as a string in the database, fetching it from the database and running json_decode() at runtime as the default. While I'm sure json_decode() is fast enough (although, for example, eval() or possibly a custom parser for some other format might not be), my main worry would be multiple database round trips to fetch different config objects during the same request, assuming that multiple different modules may be checking their configuration while building pages. Another worry is that config objects are going to contain keys that weren't requested, and defaults that have never even been set - the latter is not a problem with the current variable system and could cancel out the memory gains we're expecting to see otherwise.

So assuming those worries aren't unfounded, I would want to see us lazy caching the configuration i.e. pull the full config object out from the level 2 store first time it's requested, figure out which keys are being requested, then cache the object with just those keys - in a single cache entry alongside other config objects - we might have more than one cache entry for the site, but only one would be used on each request. This means we need to confirm that the following is not going to be lossy:

$foo = format_decode($string);
$bar = serialize($foo);
$baz = unserialize($bar);
$string = format_encode($baz);

I don't see how it could be - any format should be more restrictive than serialized PHP objects, which are pretty flexible, but you never know. This is especially important if there is additional metadata attached to the configuration - I do not want the description of the admin e-mail variable to be loaded into memory every time I want the site name, if that can be avoided.

This latter problem is going to exist with any format other than PHP (and depending how the PHP is stored and parsed, possibly with that too), but yeah I dunno. These are all implementation details but fully agreed it is hard to make decisions on the file format without knowing all this.
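A minimal sketch of the lossiness check described above, using JSON as a stand-in for the format (any candidate's encode/decode pair could be swapped in):

<?php
// Sketch: confirm decode -> serialize -> unserialize -> encode is not lossy.
$original = '{"site_name":"My site","cache_lifetime":300,"langs":["en","fr"]}';

$foo = json_decode($original, TRUE);   // format_decode()
$bar = serialize($foo);                // what the level 2 cache might store
$baz = unserialize($bar);
$roundtrip = json_encode($baz);        // format_encode()

var_dump($roundtrip === $original);    // TRUE if nothing was lost or reordered
?>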

I like both PHP and

Sylvain Lecoy's picture

I like both PHP and INI.

PHP:

  • Common pattern among the PHP community for configuration, new users can understand easily and I think using standards are the way to go.
  • A sysadmin who doesn't know JSON can configure it without having to learn the Drupal way.
  • It's worked like that for 11 years, without any security issues.

INI: why not? We are not storing objects; even Drupal's APIs work with arrays. So storing the config in *.ini arrays is not weird and is extremely respectful of the Drupal logic and implementation.

I don't like JSON because it needs special care:

  • You need to properly configure your .htaccess to avoid prying eyes, something not all users can do.
  • Or you can work around that by adding a *.json.php extension.
  • You need to hide the configuration folder behind some hashed - impossible to guess - directory name.
  • Or it needs to be outside the webroot, which is not possible for all of us.

By the way, for level 1, remember that performance is not an issue, and PHP, INI or JSON will, in my opinion, be extremely close in terms of performance.

Note that all of the issues

gdd's picture

Note that all of the issues you list for JSON would also exist for INI or any other non-executable format. I should add that to the pros/cons list.

Ha yes you right, .INI would

Sylvain Lecoy's picture

Ha yes, you're right, .INI would have the same issues as JSON.

Memory

Crell's picture

Another significant factor is memory usage, which is not necessarily parallel with CPU performance. Parsing JSON or INI (both have built-in parsers in C code) or INFO (our parser is an ugly regex) is a reasonably memory-friendly action, since you start with a string, the parser is not complex, and at the end you throw out the string. The only memory left over is the config object you parsed, which if you save it to a DB and then delete the object you clean up as well. So all of those would have an O(1) memory usage, which is a good thing.

XML has the same advantage, but its parser would, I suspect, be more memory intensive. I haven't benchmarked that, I grant, but that's my suspicion at the moment. However, the same ability to "throw it away" when done still exists, so it's still an O(1) memory usage, albeit a larger value of 1. :-) (That may or may not be the case for YAML as well, as I don't know how its parser works.)

For PHP, if we include() a file (or maybe eval() it too, not sure) then the memory used by the parsed code is never freed. Ever. It stays there until the end of the request. That makes the memory usage of PHP O(n), where n is the number of config files we parse on a given request (eg, when doing a rebuild). That is the worst possible case of any of the options here, and given how bad Drupal is with memory usage already I consider that enough reason to dismiss PHP.

catch already started some benchmarks of JSON and PHP: http://drupal.org/node/1198924
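For anyone wanting to reproduce this kind of measurement, here is a rough sketch of the approach (this is not the benchmark from that issue, and the file name is made up):

<?php
// Rough sketch of measuring parse memory, not the benchmark from the issue above.
$json = file_get_contents('some.config.json');

$before = memory_get_usage();
$config = json_decode($json, TRUE);
$after = memory_get_usage();
unset($config);
$freed = memory_get_usage();

// With json_decode() the memory drops back once $config falls out of scope;
// with include()/eval() of PHP code it does not.
printf("parse: %d bytes, after unset: %d bytes\n", $after - $before, $freed - $before);
?>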

Memory use

gdd's picture

Catch just posted some tests here

http://drupal.org/comment/reply/1198924#comment-4675550

that seem to confirm that memory eaten by reading and eval()'ing PHP does not in fact get freed, so I have added that to the con column for PHP.

INI vs. INFO

Crell's picture

INI files are a non-starter, too. They only support one layer of data, i.e., you get only one set of array nesting. That means we are severely limited in the sort of data we are able to store. While most modules will only need a few primitives, there will be config objects with much more complex data structures. Think Views configuration. INI simply cannot handle that.

That's one of the reasons we developed our own .info format, which is a proprietary extended-ini that supports nested array structures. I can't say I like that format, but it is able to do things that ini files simply cannot.

So I don't see ini as a viable option, simply because its capabilities are too limited.

INI files can support deep

pounard's picture

INI files can support deep hierarchy, so your argument is invalid. The only problem could be that INI would store only primitive data types; but in that case (configuration) it would be heretical to attempt any other kind of data type storage, since we are aiming for interoperability.

Any complex object is a hierarchical tree of primitive ones, including Views objects configuration.
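For what it's worth, parse_ini_string() itself only returns one level of array nesting, so deeper hierarchy needs a key-naming convention that a loader expands afterwards. A purely illustrative sketch of one such convention (dotted keys):

<?php
// Illustrative only: expand dotted INI keys into a nested array.
$ini = <<<INI
site.name = "My site"
cache.page.enabled = 1
cache.page.lifetime = 300
INI;

$flat = parse_ini_string($ini);
$config = array();
foreach ($flat as $key => $value) {
  $ref = &$config;
  foreach (explode('.', $key) as $part) {
    if (!isset($ref[$part])) {
      $ref[$part] = array();
    }
    $ref = &$ref[$part];
  }
  $ref = $value;
  unset($ref);
}
// $config is now array('site' => array('name' => ...), 'cache' => array('page' => ...)).
?>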

Pierre.

PHP from a hook

e2thex's picture

I want to throw out the idea of just having hook_configs and having them return the configuration, something along the lines of:

<?php
/**
 * Implements hook_config().
 */
function MODULE_config(&$config) {
  $config['MODULE_config'] = array(/* ... */);
}
?>

and then other modules could come along with

<?php
/**
 * Implements hook_config_#decimal#().
 */
function OTHERMODULE_config_111(&$conf) {
  $conf['MODULE_config']['thing_i_want_to_change'] = 'change';
}
?>

This is a standard paradigm that I think will work very well for putting config in code

and then this can be run and stored to the db for an active state (in mongo or some other store the whole tree could be saved as one doc, where in sql maybe each top level is stored as a row)

This would also allow contrib to really have access to the config system, and come up with ways to store it in any format they want.

Module-centric

Crell's picture

Just dumping mega-arrays into hooks is what we want to get away from. It has a number of serious problems:

1) Memory. The hook rebuild process for info hooks of that sort is very memory intensive. Arrays may be CPU cheap in PHP, but they are not memory cheap. A huge part of our memory problem right now, according to catch's benchmarks, is all of these big arrays we keep building and keeping around.

2) Module-centric. Not all configuration is tightly coupled to a given module. In fact most useful configuration isn't, or rather shouldn't be. Think node type configuration. A given node type is one config object, or rather it should be. We shouldn't have to jump through all sorts of hoops to allow modules to add node-type configuration for their functionality. That is a common case, not an edge case.

3) "Come up with ways to store it in any format they want": as heyrocker said in the OP, that's out of scope for this thread. Moreover, not having a standard format is suicide. Telling module developers or site builders they have to learn potentially any format someone happened to want to use is a terrible idea. Plus, as discussed in other threads, different formats have different capability trade-offs, so it is not in fact possible to make the file format completely pluggable. We have a single universal standard format for our info files, and we've done just fine. Not being able to change that format for each module's personal preference has not hurt us, and neither will having a single standard config file format. In fact, it will help us immensely.

1) I am not suggesting that

e2thex's picture

1) I am not suggesting that we store this data in memory. I am suggesting that we use this system to build an active config that is stored in the db, and that can be broken up so that only parts of it need to be retrieved and cached.
Maybe we do it with a two-hook solution so that parts of it can be rebuilt:

<?php
function MODULE_config_info() {
  return array('MODULE_config_stuff');
}

function MODULE_config_MODULE_config_stuff(&$config) {
  // ...
}
?>

or maybe we do it with some kind of class def (although I feel like what we really would want is just new instances of a mostly-base class).
I guess what I am looking for is the ability to affect config in a natural, drupally way.

2) I do think that any config system will have to be able to store config in modules, and I think your example of content types is where that gets used a lot on distro-based sites. As an example, a current site I am working on has 8 different modules with content types defined in each (some have more than one); the idea is that these can be turned on and off, adding and removing functionality.

And as config is going to have to be in modules, I am not seeing why we want to create a new place for it to be stored as well. It is very likely that there will be lots of sites that have a custom_config module with most of the config exported to it. Why is that a bad thing? What is the need for a super space for config?

3) I guess I was just suggesting that if this is the standard format, it makes it easy for others to do something else. But you are right, he said specifically that this conversation is not about that. Sorry :)

INI: no standard

chx's picture

INI is an absolute no-go. There is no standard unless "whatever parse_ini_file happens to parse" is a standard in your neck of the woods. Also, I am unaware of any encoder. We would need to write one and hope to god we do not miss any quirk in parse_ini_file. For any other format there are existing encoders to work with, and standards in most cases. The only one without a formal standard is PHP, and yet the existing encoders are known to work and work well -- ctools_var_export became drupal_var_export and is battle-tested over the years.

Config Objects

andremolnar's picture

Consider the following. Note that I don't do this to derail this conversation or even talk too deeply about specific implementations. I am posting the following to illustrate that, depending on the implementation, some of the pros and cons may not be pros or cons. If people want to talk about implementation details, I'm sure everyone in this thread would appreciate it if that were moved to a different thread. That said:

What I've been thinking about is the following

<?php
// pseudo code
class mymoduleConfig extends Config implements ConfigInterface {

  function __construct() {
    // parent::hydration_method (e.g. read out of level 2)
    // if fail - hydrate from config file (any format) - using $this->set()
    // or custom hydration method
    // or a series of $this->foo = 'whatevers';
    //   which self-hydrates from coded values
    // or some combination of the above, allowing for default values
  }

  function get($var) {}

  function set($var) {
    // parent::storage->write_or_whatever
    // my own storage foo
  }

  function dump($format) {
    // dump configuration to whatever format on disk
  }

}
?>

During page execution configuration objects can be lazy loaded - you only ever have as much configuration as you need. Also I think it would be very rare for ALL configuration caches to be cold (empty) at once. So you'd never be reading TONS of configuration files at once. It could be possible to have these classes auto generated from a configuration file (e.g. XML, YAML, JSON whatever) if a module developer doesn't write their own.

So if we had something like the above (or an entirely different implementation. Remember this isn't about this specific implementation) some of the pros or cons for a given format may not apply.

If that's the case, maybe it just comes down to what people like the most, e.g. human readability and hand configuration of config files becomes most important, or performance and memory usage is still most important. It's just hard to know without some thought about implementation.

I'll say again, I'm sure it would be appreciated if people want to argue over implementation details we move such a conversation elsewhere.

YAML Parsers

DjebbZ's picture

Just bringing this Stack Overflow discussion about YAML PHP parsers: http://stackoverflow.com/questions/294355/php-yaml-parsers
This list can also be found on the official YAML website: http://www.yaml.org/
Needs some benchmarking of course, like the good ones catch already did for PHP.

Seems like INI files are a no-go because of the lack of flexibility in the configuration data format, and PHP too because of memory. And JSON may be a problem, especially the impossibility of comments (without writing more code, which is not a good idea: the best code is the code you don't need to write). And XML is memory-heavy too (noted in its cons). So YAML remains, but I have no experience with it. That's why I posted the links, for someone more knowledgeable to chip in.

Symfony YAML

andremolnar's picture

Not on the YAML.org list - https://github.com/symfony/Yaml

parser, dumper etc.

Thanks, but please don't continue

chx's picture

That SO thread is very thorough. I think it gives us a thorough picture of what YAML parsers are available, so thanks for it. However, debating which one to use does not belong here. It's enough to see there are existing YAML parsers, both userspace and C, used widely and tested.

Edit: I have unpublished andremolnar's comment which tried to continue this conversation.

I republished the post,

catch's picture

I republished the post; whether there is a decent parser or not is a central point against YAML, so just listing some possible options seems completely reasonable.

Wonderful

chx's picture

OK I am out of here. When I am doing simple moderation, I am accused of censorship and also others decide to totally disregard my efforts trying to keep this discussion in line.

Edit: it was not catch who accused me of that. I am really really sick that whenever I try to stop something bad I am left holding the bag. When I have yelled at someone on IRC who was apparently about to send around a private email I get yelled at. Here I am sick of the third thread about to go off topic and extremely few of the hundreds of replies containing anything meaningful, no research, no benchmark, no testing, no ideas, just yapping and now again I look bad. Awesome.

JSON has comments

montyd's picture

Hi Djebbz,

The idea that comments are prohibited in JSON is a myth.

More code? Wrapping json_encode is easy, advised, and the code to strip out comments is trivial.
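For reference, a minimal sketch of the kind of wrapper being described: a naive strip of full-line // comments before decoding. The function name is made up, and a production version would need more care (e.g. block comments, comments after values).

<?php
// Naive sketch: strip full-line // comments, then hand the rest to json_decode().
function json_decode_with_comments($json, $assoc = FALSE) {
  // Only remove comments that start a line, so "http://..." in values survives.
  $stripped = preg_replace('@^\s*//.*$@m', '', $json);
  return json_decode($stripped, $assoc);
}

$raw = <<<JSON
{
  // The human-visible name of the site.
  "site_name": "My site",
  "cache_lifetime": 300
}
JSON;

$config = json_decode_with_comments($raw, TRUE);
?>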

I'm the author of a CMS which has used JSON for its configuration management for a year. My friend Kyle found this old (from late 2005) post from Douglas Crockford himself on comments in JSON: "A JSON encoder MUST NOT output comments. A JSON decoder MAY accept and ignore comments." Read more from Kyle on this topic on his blog at http://blog.getify.com/json-comments/

Comments in JSON are not only allowed, most people like it that way.

Monty

External tools?

webchick's picture

It'd be nice to have a round-up of the external tools we're worried about being able to interface with and what formats they can read/write to.

chx mentioned Capistrano can deal with YAML. I think Puppet is another such system, not sure what it can read/write to. Anyone have some background on this?

Ruby is YAML-friendly

justintime's picture

The majority of the external tools that get brought up are written in Ruby, and Ruby is more than happy to parse YAML.

Capistrano, Puppet, Chef, and Vagrant are all tools written in Ruby.

BCFG2 is written in Python, and uses XML extensively for configuration, but can parse YAML as well.

CFEngine is written in C, and afaik, doesn't understand YAML.

Ruby is also very JSON friendly, so you can do a s/YAML/JSON/g on the above and still be valid.

YAML/JSON or PHP

boombatower's picture

The point made by chx that INI is not a standard is a very good point and seems valid for not focusing on it. PHP can store anything we need, parses quickly, and is familiar to our developer base.

XML is nice as a standard and such, but configuration files tend to be edited by hand, for which XML can be quite slow due to its redundancy.

YAML is similar to INI in terms of format, so if we like INI it would seem the way to go, since it is actually a standard. As for the interoperability notes, I don't quite see the point: unless we use the exact same structure for storing certain values, there is very little in the way of interoperability and even less that is useful. Seems like JSON is on a similar level to INI and YAML.

Based on that I suggest YAML/JSON or PHP.

Another strike

Crell's picture

Another thing to consider is error handling. What happens if there's a syntax error in the config file when parsing it?

With PHP and include() or eval(), the result is a fatal error.

With any of the others, assuming the parser is not stupidly written the result is a FALSE return, an exception, or some other catchable and handle-able situation. That's much easier to recover from. (Some syntax errors may still choke the parser, but they are far fewer.)

Granted, that's been the case for settings.php for years and it's not been a massive problem, but that was for a file touched maybe once, mostly by the install script and then never again except by experts in special cases. We're talking about files that would be far more frequently touched, by humans or by code, than settings.php ever was.
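A small sketch of the difference being described, assuming JSON for the non-PHP case:

<?php
// Non-PHP format: a syntax error is detectable and recoverable.
$broken = '{"site_name": "My site",}'; // trailing comma is invalid JSON
$config = json_decode($broken, TRUE);
if ($config === NULL) {
  // Recoverable: fall back to defaults, log the problem, show a useful error...
  // (json_last_error() is available as of PHP 5.3 to say why it failed.)
}

// PHP format: a syntax error in an include()d or eval()'d file is a fatal
// parse error, which cannot be caught and recovered from in the same request.
// include 'broken.config.php';
?>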

Good point

DjebbZ's picture

Heyrocker, can you add this to the list of pros and cons? Error handling definitely pushes PHP out of the game.

On a side note, using external tools like Chef or Puppet means we're raising the bar in terms of DX for single developers or small teams without a sysadmin. But for bigger projects and bigger teams, it means using industry-standard tools to manage configurations, and sysadmins could handle this part of the project very well - am I making a wrong assumption?

Fatal, yes

chx's picture

But this only means loading fails -- the active store won't have any errors in it. Granted, it's not nice to bomb on the user -- especially since, depending on a lot of factors, this can easily mean a superbly informative WSOD.

Using @eval() You can get

boombatower's picture

Using @eval() you can get the result of even incorrect syntax without a fatal error.

Please don't advocate using

mike503's picture

Please don't advocate using the @silence operator. It invokes extra PHP overhead. The error is still added to the stack internally, just isn't displayed.

Of course, but considering

boombatower's picture

Of course, but considering that it allows the use of PHP, it makes sense. This is cached, so it's not going to be called on a regular basis.

.info files ?

DjebbZ's picture

Why don't we consider .info files? Just a question to be sure we're looking everywhere.

isn't .info the same as INI?

mike503's picture

Isn't .info the same as INI? :) All of the files I've ever seen are INI syntax, assuming [] means arrays (which, as of 5.3 I believe [possibly before], is part of the INI configuration for certain PHP features like FPM).

Nope

Crell's picture

No, .info is an ini-like format created by Drupal. Its only formal specification is the regex somewhere in core that we use to parse files. I've not heard of array-ish syntax in PHP's ini parser, which also lacks an independent specification and relies on "Whatever the code happens to do this week" to define its format.

IMO whatever format we use should have an implementation-independent format specification, even if we define that ourselves. (Which would include an XML schema were we to go that route.)

Close enough.

mike503's picture

It very closely resembles INI format (in everything I've seen), enough to be considered INI (it doesn't provide anything better or worse) - so for the sake of discussion here, it should be lumped under INI.

If you want to see how PHP has used INI for arrays (which Drupal does too), go to example #1 here. Starting in PHP 5.3.3, when FPM was bundled with PHP core, this was accepted.
http://www.php.net/manual/en/install.fpm.configuration.php

I don't think it was used at all up until FPM was added. I never saw it before that.

+1 YAML

Dave Reid's picture

With the listed pros and cons, I'm really pro-YAML now for config management. The Symfony YAML library looks really great, and has the added plus of being able to export objects and associative arrays natively, which almost everything else cannot - so we remove a restriction on the config side. It would be heavily dependent on getting dual licensing.

I can't remember exactly why we so easily discounted YAML at DrupalCamp Colorado - probably because there was no native parser. This fact hasn't been a problem for other frameworks, so I don't think it should be as high a roadblock as maybe others see it.

Senior Drupal Developer for Lullabot | www.davereid.net | @davereid

I think it was dismissed

webchick's picture

I think it was dismissed because we were thinking L1 and L2 had to be the same type of data storage, but then it was later pointed out that they could actually be separate, and L1 could choose a human-editable version, where L2 could be optimized for performance.

Symfony

Crell's picture

I am considering pushing for the inclusion of the Symfony HTTP library anyway as part of WSCCI, which would require the same licensing discussion. So were we to go that route it would be a good collective discussion to have regarding using Symfony components.

Not much to discuss

chx's picture

INI is out, XML is out due to the extreme difficulties of parsing.

The on-disk format is very likely YAML, due to it being the most human-friendly. It could be JSON. One needs to study all the YAML parsers above with regard to the http://en.wikipedia.org/wiki/YAML#Pitfalls_and_implementation_defects pitfalls. This would be way more useful than reposting the list of YAML parsers from the Stack Overflow post one by one.

The active store, now that's tricky. PHP has a memory burden for sure, and also a shared memory burden with APC. Not good. JSON: if we use JSON in the active store then we cannot support both objects and arrays, only one of them. We discussed this at DC Colorado and it seemed that literals and associative arrays are enough. And then we can use YAML. How fast are the available implementations? More importantly, how much do we care? We ship with a userspace parser, but either contrib or even the core implementation can fall back to the yaml PECL extension, which I boldly presume to be much faster. But it still needs benchmarking, both userspace and PECL, compared to json_decode().

Useful replies to this post include analysis of pitfalls and YAML benchmarks. Not that I think we will get any (unless of course catch benchmarks -- seemingly he has a license to do that and no one else dares). I am out of here, anyway.
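In the spirit of that request, a rough outline of the kind of benchmark that would settle it. It assumes the Symfony YAML component is autoloadable and that sample files exist; this is an outline only, not results.

<?php
// Rough benchmark outline, not results. Assumes an autoloader for the
// Symfony YAML component (Symfony\Component\Yaml\Yaml) has been registered.
use Symfony\Component\Yaml\Yaml;

$yaml_string = file_get_contents('sample.settings.yml');
$json_string = file_get_contents('sample.settings.json');
$iterations = 1000;

$start = microtime(TRUE);
for ($i = 0; $i < $iterations; $i++) {
  Yaml::parse($yaml_string);
}
$yaml_time = microtime(TRUE) - $start;

$start = microtime(TRUE);
for ($i = 0; $i < $iterations; $i++) {
  json_decode($json_string, TRUE);
}
$json_time = microtime(TRUE) - $start;

printf("YAML: %.4fs, JSON: %.4fs for %d decodes each\n", $yaml_time, $json_time, $iterations);
?>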

geevee's picture

Sorry if this annoys people :-), but I think this might be worth having a look at.

Will Drupal benefit from using more advanced serialization tools like Apache Thrift/Protocol Buffers/Apache Avro?

Some of them will auto-generate code to create RPC servers. That sounded useful for the Services Initiative. They have good support for PHP. In fact, Apache Thrift was developed by Facebook for use with PHP.

Only Apache Thrift has

geevee's picture

Only Apache Thrift has support for PHP, and even there JSON serialization is missing.

I love JSON, however...

mike503's picture

I love JSON, but I only love it as a lightweight transport dialect. Not a configuration file syntax. It's great for machine <-> machine consumption, and is somewhat readable if you need to, but it is not the best for defining stuff.

As for defining objects in a configuration file, I don't see the need. If normal scalars or arrays can't handle it, what the hell are you doing to begin with?

PHP syntax is nice because it has all the control structures, can be tricked out (if hostname == this, use these settings) - which has given me the capability to do things like drive multiple WordPress installations from a single codebase, for example.

XML is bulky, but the most universal and readable. While I hate it, there is an easy way to convert it into native PHP format, assuming you can get it into SimpleXML and do the json_decode(json_encode(), TRUE) trick. The json_encode() will take an object structure and encode it into JSON, and the json_decode() will reconvert it back to an array - it works on the entire tree as well, instead of trying to typecast a nested object tree or mixed object/array tree. Think of it like object_to_array() or something (which would be a nice feature to include in PHP to begin with).
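Spelled out, the trick looks like this (with the usual caveat that XML attributes and CDATA do not survive the round trip cleanly):

<?php
// The json_decode(json_encode(...)) trick for turning SimpleXML into arrays.
$xml = <<<XML
<config>
  <site_name>My site</site_name>
  <cache_lifetime>300</cache_lifetime>
</config>
XML;

$simplexml = simplexml_load_string($xml);
// Encode the SimpleXMLElement tree to JSON, then decode it back as arrays (TRUE).
$config = json_decode(json_encode($simplexml), TRUE);
// $config is now array('site_name' => 'My site', 'cache_lifetime' => '300').
?>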

XML, A clarification

EclipseGc's picture

I just want to take a moment to clarify my position on XML.

  1. First of all, I am not pushing for XML, it will either stand or fall on its own merits but I will continue to point them out when it seems opportune.
  2. I still really feel XML is a bit more readable than any of the other options here (or at least could be made so if we're careful with our spec)
  3. IF we choose XML, we aren't "stuck" with it. Transformations could be made to any other reasonable format (YAML, JSON, whatever), which seemed to be a bit of a potential issue, as we wanted 3rd party products to be able to consume our data if they had that ability. XML/XSLT gives us the flexibility long term to output our data in any format we might desire for any 3rd party product that might be interested... we can even convert to another XML standard, or PHP, or or or...

All that to say, I understand the concerns with the format, but there's a lot of flexibility inherent in the format as well, and as an L1 storage, I think it has a lot of possibilities. Also, this may be me showing my security ignorance here but... if we DID use XML, couldn't we just encapsulate that in PHP, and send headers to say it is XML? In this way we should be able to devise a non-.htaccess security method for that file?

Eclipse

webchick's picture

Problems with JSON:
- Can't comment the metadata files (this is a biggie)
- No way to tell the difference between arrays and objects
- No way to put your own data types (e.g. an object of class type EntityFieldQuery) in there
- No way to preserve key order.

Pros with JSON:
- Much less verbose syntax than PHP or XML.
- Compatibility with external tools
- Native encoder/decoder

YAML seems to fix all of the cons of JSON, while retaining all of its pros except the native encoder/decoder. However, others have pointed out there's a PECL replacement for high-performance sites out there, and additionally if L1 and L2 use different storage (for example, L2 uses serialized PHP or JSON), the performance concerns are far, far reduced.

http://en.wikipedia.org/wiki/YAML was quite elucidating, particularly http://en.wikipedia.org/wiki/YAML#Data_types and http://en.wikipedia.org/wiki/YAML#JSON. YAML also has some kind of linting tools http://yamllint.com/ so there's that. It's also a format widely understood by Ruby programmers, which may be a potential way to bring in new contributors who are slightly less grumpy at having to deal with PHP. :)

Not entirely agreed

Crell's picture

I will agree that the lack of comments is an issue, but in practice we expect these files to be mostly machine read and written. Maintaining comments through that process would be a PITA anyway. I don't expect comments out of the default files that ship with modules.

Also, the more I think about it, the more I think not supporting extremely complex data is not a bad thing. In fact, do we even want complex class definitions within a config object?

Think: The persistence layer is a serialization mechanism for state inside the config objects. The config objects will sometimes be classed, but what we're storing isn't classed data; it's the serialized state OUT OF a classed object. That classed object is where the logic lives.

The choice of data format does have a significant impact on the API functionality and capability. Also recall that if we use JSON as an L2 format, then we still have all of the same limitations as if we'd used it at L1, in terms of what data formats it supports. So if we use YAML for L1, then L2 is either YAML or serialized PHP. There are no other options there. (Similarly, if we go JSON on L1, then L2 is either JSON or PHP.)

I think a point Crell makes

Josh Benner's picture

I think a point Crell makes here needs to be emphasized.

A lot of us (myself included) have been pointing at JSON's lack of comments as a big negative, and as a positive for those formats that support comments. However, the Layer 1 store is auto-generated based on the defaults that modules provide. I don't imagine that we're going to go through the trouble of preserving comments into the Layer 1 store.

I can only see comments in Layer 1 being possibly useful if you do all of your configuration by editing the Layer 1 files. This might sound good -- but as my understanding goes right now, I expect to be doing a mix of manual editing and UI editing, so comments would seem likely to be obliterated when I save Layer 2 configs back out to Layer 1.

My preference is for YAML and

moshe weitzman's picture

My preference is for YAML and then JSON. I'll leave it at that, since others have covered the details.

As someone who alters

fabsor's picture

As someone who alters configuration exported from Panels, Views and exported fields by hand from time to time because some things go faster than doing it in the UI, I would really like to push for YAML, since it is way easier to deal with when doing changes to configuration by hand than JSON. Comments are also a big win.

//Fabian Sörqvist

Ask anyone that ever edited a

KrisBuytaert's picture

Ask anyone that has ever edited a YAML file manually if they want to do it again.

Spaces vs. tabs etc. in YAML is a pain, especially when you are in an unfamiliar environment with your tools/editor configs unavailable.

Just my $2c

spaces vs. tabs

catch's picture

Most Drupal developers should be very familiar with distinguishing between spaces and tabs. There may be other reasons not to use it, but I can't think of an open source project that cares more about indentation.

perusio's picture

I must say that I hate things like YAML with a passion, even more than the extremely verbose XML. Relying on whitespace to outline a structure makes sense for a document, where there's not much harm done if you misplace a space somewhere. But for things to be consumed by machines, where there's an implied machine logic, it is a very bad idea. There's a lot of room for hair-pulling sessions when trying to debug a config file :(

I continue to believe that JSON, albeit not perfect, offers the best solution, be it in terms of DX or in terms of interoperability.

Agree that YAML is a bad idea

javi-er's picture

Although Python proves that whitespace can work very well for defining structures, I agree that in the case of YAML, if you store complex data structures it can become a pain to debug.

javi-er's picture

Personally I'm a big fan of JSON and I would love to see plain JSON configuration files without PHP in them. However, in this case, where the objective is to have fool-proof configuration management in files, I believe the best alternative is to keep doing what Drupal has done until now and go with PHP data structures.

PHP is the only alternative that will work in the dumbest environment, and if the web server returns PHP files as plain text, the site has bigger issues than exposed configuration files.

Wordpress uses PHP

Sylvain Lecoy's picture

WordPress uses PHP configuration as its active configuration layer.

The goal would be, if APC is enabled, to totally skip database queries and read directly from the cache.

I mention WordPress because if you compare the total requests to display a page, in Drupal it's about one hundred**, while in WordPress it's only about 10 requests. The habit of leaning on the DB in Drupal has led to bad performance, so using straight PHP files could help fill the gap.

** http://mikeschinkel.com/blog/17-reasons-wordpress-is-a-better-cms-than-d...

Level 2

Crell's picture

You're confusing Level 1 with Level 2. Level 1 is (will be?) a fixed-format canonical storage, not something regularly read. Level 2 is where most of the action will be (viz., the one that needs to be fast), and Level 2 we want to make pluggable. I don't see why one couldn't write a PHP-on-disk Level 2 implementation if it made sense to do so. (I'm not entirely convinced it would, since that has its own issues as an approach, but it's certainly possible.)

Bear in mind that in Drupal 7, if you move caching to memcache then the DB traffic is fairly light, too. It could be argued that the fact we need that much caching is itself a design flaw, and I'd likely agree, but that's a separate question.

No i'm not, I'm saying APC

Sylvain Lecoy's picture

No I'm not; I'm saying APC will pass through level 2, completely skipping database calls.

level 2 vs. level 1 and APC

catch's picture

You need to actually store something in your level 2 store. All the discussion so far has pointed towards an optional level 1 store, and a pluggable level 2 store.

Whatever file format is decided on, nothing will stop you from building a level 2 store that is PHP files on disk - because any file format is eventually going to be resolved to PHP structures in the end.

However even if the default and level 1 storage was also PHP files on disk, they would need to be different sets of files.

A configuration where you had PHP files for the level 1 store, and those PHP files were also pointed to by the level 2 store would mean there'd be no way to stage changes, no ability to rollback changes from the UI etc. - the principal idea of having two separate stores is that they can diverge and be synced, so a 'fall-through' to the same set of files goes against this.

Since the Level 2 store is pluggable, and the Level 1 store is optional you could skip having the level 1 store altogether. Then the only reading from the default files happens when modules are installed.

Installing and uninstalling modules is not a runtime performance issue (there are memory, caching, locking and various other issues, but raw performance from reading configuration files not so much here).

As defaults are read once then never accessed again, having those in PHP files does not benefit from APC whatsoever, since the APC cache for that file will always be cold when it is accessed, and then it will cycle out of that cache before the next time it's accessed (which feasibly might be never again). Files don't get cached by APC until they're accessed, and caching a file in APC does no good if it's only accessed once on a cache miss.

A configuration where you had

Sylvain Lecoy's picture

"A configuration where you had PHP files for the level 1 store, and those PHP files were also pointed to by the level 2 store would mean there'd be no way to stage changes, no ability to rollback changes from the UI etc."

That is mainly the goal, more for developers than users, for the reasons you mentioned: this would allow better performance, better control over replication, and an easier version control workflow. Thanks for highlighting it. That's what I had in mind.

What is the difference

catch's picture

What is the difference between this, and having just a level 2 store (which could likely still be locked from any UI changes in production)?

All your runtime configuration could still then be read from PHP. With this setup it does not matter at all what the defaults or Level 1 store uses - unless you want to rebuild configuration from disk back to the PHP files, but that is going to be rare (and could be a pre-deploy thing - nothing in particular stops you keeping the level 2 store in version control if you wanted to, either).

I believe the "this" you

Sylvain Lecoy's picture

I believe the "this" you mention refers to Level 1 canonical store and Level 2 active store, while having just a level 2 store refers to my idea. (Tell me if I'm wrong).

Well, with "this" solution you have the level 1 store, which is under version control, and which is the canonical configuration. The level 2 store is not necessary stored on disk as the default implementation will use Database.

My proposition is if we use PHP as canonical configuration store (level 1), when we swap the implementation of the level 2 by APC for instance, we can read straight from those PHP files the configuration. I believe it will not change a lot in term of performances (reading from RAM is always better than a SQL Query), but in term of version control, as soon as the cache is refreshed, once you commit a change, it get replicated with the cost of nothing. Thing which is not possible if level 1 is not PHP compliant.

The thing that I dislike would be to have a level 1 PHP store, with a level 2 PHP store, being a replica (+ deltas) of the former.

In my opinion Level 0 (the

pounard's picture

In my opinion Level 0 (the files) and Level 1 (the raw cache) should be pluggable, but not Level 2. If I understood well, Level 2 is optimized data, ready to be used directly (maybe some serialized objects and stuff), probably the nearest thing you have to a cache (or a formalized data storage, I don't care; it's ready-to-use data for core, and it doesn't need any kind of interoperability). As you describe it (not catch particularly, but everyone), Level 2 is either a cache or the variable table equivalent; in both cases I don't see how or why it should be pluggable, because the data you manipulate here has already been loaded and altered by the lower layers.

Pierre.

oh dear

catch's picture

There is a lot wrong with Drupal 7 in terms of performance, and some of that involves configuration (although rarely the number of database queries executed), but you will not find it in flamebait blog posts by people who admit they hadn't even looked at Drupal 7 when they wrote them.

fwiw a more or less stock Drupal 6 install could take around 100 queries on most pages. This was not due to storing configuration in the database but due to path aliases.

A reasonably stock Drupal 7 install on my localhost takes 21 queries, 14 of those are from cache_get().

Logged in as user/1 with the toolbar and other modules, it is closer to 35 queries (with over 20 of those from cache_get()). Still nowhere near 100.

You're also ignoring that most of this discussion is talking about the Level 1 store - which is not the active layer. As Crell's post points out, level 2 is likely to be limited to the level 1 format + PHP - which means it should always be possible to read from PHP (serialized or potentially included from disk) during runtime, regardless of the chosen format.

oh dude

Sylvain Lecoy's picture

It wasn't flamebait at all, otherwise I would have moved to WordPress months ago! I actually hadn't reinstalled the devel module myself to check Drupal 7's performance, thank you for correcting me :)

I am perfectly aware that the discussion is about the level 1 format. I noted that if the level 1 format is PHP, then we can use it as the active layer when APC is enabled, for instance, as read-only (or read/write if the UI can generate PHP back, which would not be too hard to do).

Special case

Crell's picture

You don't need the L1 format to be PHP for the L2 storage to be PHP-on-disk-with-APC. With L2 pluggable, you could write such an L2 implementation quite easily if it made sense for your use case to do so. It would just write to a different directory than the L1 config directory.

Such an L2 implementation is a fine idea to experiment with; I don't know if it would be great or terrible for performance, but it's certainly worth investigating. However, using PHP as the L1 canonical format is not a prerequisite to doing so. That's the point that catch and I are making. It doesn't have to be a "null" implementation in order to work, which would have the downside of not being able to diff config changes (which is a benefit of having the separate L1 and L2 layers).
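To illustrate, a hypothetical sketch of what such a PHP-on-disk L2 store might look like. The class and method names are invented for this example; they are not a core API.

<?php
// Hypothetical sketch of a PHP-on-disk level 2 store; names are illustrative only.
class PhpFileActiveStore {
  protected $directory;

  public function __construct($directory) {
    $this->directory = $directory;
  }

  public function read($name) {
    $file = $this->directory . '/' . $name . '.php';
    if (!file_exists($file)) {
      return array();
    }
    // The file returns an array; APC caches the compiled file after first use.
    return include $file;
  }

  public function write($name, array $data) {
    $code = "<?php\nreturn " . var_export($data, TRUE) . ";\n";
    file_put_contents($this->directory . '/' . $name . '.php', $code);
  }
}
?>

Whether include()-per-object plus APC actually beats a serialized blob in the database is exactly the kind of thing that would need benchmarking.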

Just to clarify, I meant the

catch's picture

Just to clarify, I meant the Wordpress article was flamebait (and they had clearly not themselves looked at Drupal 7). Posting the article was more regurgitating flame bait ;)

There something bothering me

pounard's picture

There something bothering me with this whole configuration discussion:

  1. If you want interoperability, you cannot use configuration object specializations provided by modules. For interoperability you can only rely on primitive type storage (possibly ordered lists, depending on the chosen formalism), but you cannot go further. Any complex and specialized configuration object storage destroys that.

  2. There is more than one level of configuration (not speaking about storage levels here). Basically, finding out something for actual variables would be so great, while the rest (Views and content types) can definitely wait. Views and content types are higher-level business objects. They indeed almost are site configuration somehow, but not real low-level configuration (or I don't know how to say it, but the nearest thing we have to a Windows/GConf-like registry). You should separate those. If you want to store Views in JSON, I really don't care, but if you want to store variables in JSON, we are really not going to be friends, because I'm really eager to edit my variables manually on a day-to-day basis.

2b. From a certain point of view, Views are more entities (related to content) than real configuration. Views' place is between (but not in) configuration and content.

You should definitely separate those two different things into two different discussion threads, if not done already. If it is, where is the discussion about variables? Because variables are the configuration, not Views. Views is pure business-oriented/data-related integration.

Pierre.

2 points about

kika's picture

2 points about yaml:

  1. Symfony has a nice parser:
    http://symfony.com/doc/current/reference/YAML.html
    http://api.symfony.com/2.0/Symfony/Component/Yaml.html

  2. module.info.yaml ?

Google ditched XML and now uses JSON

cosmicdreams's picture

Here's a recent presentation about the benefits of ditching XML and using JSON as a basis for public APIs. If you envision a public system driving the configuration of a site, this is 100% applicable.

http://goo.gl/x1rEj

Perhaps we should re-evaluate the decision to choose XML. If JSON is robust enough for Google's APIs, why can't it work for us?

Key slides:
10: detriments of using XML, as found from their experience dealing with developers who implement their APIs.
14, 16 & 17: new features built on top of JSON.
21: example of how the API could be consumed with PHP.
31: JSON-implemented objects.
38: an example of a large object.

Software Engineer @ The Nerdery

Well

perusio's picture

at the time the many faults of XML were extolled. But nevertheless, somehow XML was chosen. I didn't attend the core session at DC London where the rationale for choosing XML was supposedly explained. I ended up voting with my feet, meaning I'm not committing any time to supporting a legacy format for config management. If by any means the situation turns around, then I'm back in the loop.

Short-sighted

Crell's picture

There's a lot that needs to go into a config system other than the file format. It's really rather short-sighted and petty to "take your ball and go home" just because of one encapsulated decision you disagree with. Many faults of JSON were extolled as well in those epic threads, and it was clear that there was not going to be a complete consensus around either format. As initiative lead, Greg made a call to use XML so that we could move forward rather than continue bikeshedding ad nauseam. I'm not a huge fan of XML either, but the important point is that we can move forward and build the API we need. And if it's done properly (ie, well-encapsulated and abstracted), we could potentially change the format later without unraveling the whole system.

If you don't have time to help, or it's not your area of expertise, or a problem space you're interested in, that's fine. But refusing to help ensure the system is well-encapsulated and abstracted because you disagree with one part of it isn't voting with your feet. It's pouting.

Very well put. My vote was

frob's picture

Very well put.

My vote was for JSON (insofar as it was against XML); however, XML will work, and if the functionality is encapsulated then a move to another independent intermediary format should be relatively painless.

So let's stop bitching and let's start coding.

Cannot +1 this enough... :(

patcon's picture

Cannot +1 this enough... :(

Google ditched XML and now uses JSON

montyd's picture

Yes, JSON is the data language of Ajax, the data language of Web 3.0, and as the spec for JSON allows comments, it can be wonderfully human-readable and self-documenting. It's not just Google that is ditching XML for JSON... almost every new API uses JSON. +1!

A note on the JSON UTF8 limitation

neclimdul's picture

So I can find it again: the referenced UTF-8 limitation for JSON in the summary is that the 5.3 encoder (and 5.4 without a flag) escapes UTF-8 entities, making it harder to read (and edit) on disk, which is a valid concern for configuration files. Fixing it would require a wrapper for the JSON functions, which is less than ideal.
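For reference, a small illustration of the behaviour being described; JSON_UNESCAPED_UNICODE is the PHP 5.4 flag mentioned above.

<?php
// PHP 5.3 json_encode() escapes non-ASCII, which hurts on-disk readability:
echo json_encode(array('site_name' => 'Café Drüpal'));
// {"site_name":"Caf\u00e9 Dr\u00fcpal"}

// PHP 5.4+ can keep the raw UTF-8 with a flag; on 5.3 a wrapper would have to
// un-escape the \uXXXX sequences itself.
if (defined('JSON_UNESCAPED_UNICODE')) {
  echo json_encode(array('site_name' => 'Café Drüpal'), JSON_UNESCAPED_UNICODE);
  // {"site_name":"Café Drüpal"}
}
?>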

Please don't respond and reopen the thread, this is just a note for posterity. The decision was already made.