Initial format discussion

bjaspan's picture

[ This group initially started as an email thread. This discussion is for re-posting the content of the thread and then continuing the discussion about the goals and layout of the archive format. ]

My initial idea is that the basic format will look like what Drupal Gardens already generates and what I think Pantheon imports: a gzip'ed tarball with code, ./files, and a .sql dump file in the root. However, there are many other things the format could support, such as:

  1. A specifiable format for the sql dump: mysqldump, pure SQL, whatever.
  2. Where the code lives (e.g. ./ or ./docroot), where the files live (./files, ./sites/default/files), where the database lives (ditto) within the tarball.
  3. Multi-site archives, with a separate database and files dump for a specified list of sites directories.
  4. A public key signature.
  5. Environment dependencies, e.g. PHP & MySQL version, PHP extensions, etc.

My thought is that the archive format should be gzip'ed tarball with an optional MANIFEST.txt file specifying all the above info. Every field in the manifest would have a default, and all the defaults taken together (which is the same as having no manifest) leads to the basic tarball+mysqldump format.

Comments

I'm not aware of any

webkenny's picture

I'm not aware of any particular solution to this problem but it has other hidden benefits in the deployment space. Well, at least in initial deployment. Anything beyond would be configuration management and best handled elsewhere in Drupal. From a demand perspective, this is high among developers -and- stakeholders so I'm happy to see this conversation taking shape formally.

My number one technical concern is making sure the export or process that creates the manifest has some knowledge of, or is able to be given knowledge of by way of the exporting system's cues, what should and should not be exported. At least in the Drupal Gardens sense, we don't send Theme Builder along with every export so we'd need to be able to tell the export that. Similarly, other providers (especially those where SaaS or other IP based on Drupal is concerned) will want to have the same conversation with its export.

From a purely functional standpoint, not having that ability and just sending out a "dumb" exportable (thanks to jbrauer for that term) could mean broken sites even in the non-imported sense. (As in, "Hey, Where the hell is module X this database says I have?") - It could also undermine the reason for this effort. Easy portability.

I'm more alluding to the fact that while an Export this Site button is great for the average implementation, others could benefit from an Export API which the exporting system could then leverage to create this format based on the specification the community provides.

Lastly, I'd be more inclined to see manifest.txt take on a better-known, already accepted format for defining the export rather than another "let's roll our own" implementation (Remember Bones?).


Kenny Silanskas
Proud Member of the Acquia Support Team

Manifest destiny

joshk's picture

Can we stick with the good old ini/.info format for the MANIFEST? There's a handy PHP function for parsing these and it's already something of a standard in Drupal.

More importantly, I want to second this:

Every field in the manifest would have a default, and all the defaults taken together (which is the same as having no manifest) leads to the basic tarball+mysqldump format.

I would love to see us (the community at large, but especially people who work on these sorts of wide-scale deployments) come up with a set of consensus Best Practices for the "right" way to structure an export.

This would also, by design, provide people with a "right" way to structure a running site's code tree. There are a lot of equally-functional ways to do this, which makes it confusing and tempts people to try and invent their own solution — not the greatest investment of creative energy IMHO. It's also something I see newcomers (even/especially those with a related background in software) stub toes on often.

My number one technical

scor's picture

My number one technical concern is making sure the export or process that creates the manifest has some knowledge of, or is able to be given knowledge of by way of the exporting system's cues, what should and should not be exported. At least in the Drupal Gardens sense, we don't send Theme Builder along with every export so we'd need to be able to tell the export that.

Along the same lines, there are tables you will most likely want to leave out when dumping a database (such as the bloated cache_form and friends) though you do want to export their structure. Other tables should not be in the db dump at all (e.g. theme builder related tables maybe). In the Drush command dgb, I use drushrc.php much like a custom MANIFEST to tell mysqldump what tables to dump or ignore. It has some logic to handle a set of default tables to leave out using wildcards and ignoring non existing tables. These features might land in Drush 4.
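To make the table-selection idea concrete, here is a minimal Python sketch. It is not the logic of the dgb command; the function names and the default patterns are hypothetical. It splits the tables that actually exist into full dumps, structure-only dumps and skipped tables using wildcards, then builds mysqldump argument lists (only the real `--no-data` option is assumed).

```python
from fnmatch import fnmatchcase


def plan_dump(tables, structure_only=("cache", "cache_*"), skip=()):
    """Sort existing tables into full, structure-only and skipped lists.

    Patterns that match no existing table are simply never used, so
    non-existing tables are ignored automatically.
    """
    full, structure, skipped = [], [], []
    for table in tables:
        if any(fnmatchcase(table, pattern) for pattern in skip):
            skipped.append(table)
        elif any(fnmatchcase(table, pattern) for pattern in structure_only):
            structure.append(table)
        else:
            full.append(table)
    return full, structure, skipped


def mysqldump_commands(database, tables, **patterns):
    """Build one argv list per mysqldump call (hypothetical helper)."""
    full, structure, _ = plan_dump(tables, **patterns)
    commands = []
    if full:
        commands.append(["mysqldump", database] + full)
    if structure:
        # --no-data writes the CREATE TABLE statements but no rows.
        commands.append(["mysqldump", "--no-data", database] + structure)
    return commands
```

A generator could run the resulting commands one after the other and concatenate the output into a single .sql file.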

Obviously I'm glad to see some initiative in standardizing a site archive format. While I'm interested in making sure it integrates with a VCS like git (I use it for backup and deployment), I do see some big benefits for making tarballs too.

Anyone care to take a stab at

moshe weitzman's picture

Anyone care to take a stab at an example MANIFEST file? I agree that .info format makes sense.

Has anybody put any more

ronan's picture

Has anybody put any more thought into this? I like this initiative and I hope it keeps moving forward. Here are some of my reactions:

1) Ditto on .info format.

2) I'd favor standardization over specification. In other words, rather than allowing the author to specify the location of the db dump, simply require it to be at the top level of the docroot, perhaps with a standardized name or naming convention. There should be some flexibility but IMHO we don't need variation just for the sake of variation. That will make it much easier to write code that consumes these archives.

3) Would it be useful to have an additional file extension or another way to identify a site archive without having to untar it and inspect its contents? I don't think an additional extension before the tar/tgz/zip extension would do any harm (ie: sitename.xxx.tar.gz). Some ideas for the extension: .drupal, .drop

4) Do we even need to account for multi-site archives? It seems like the purpose of a site archive is to package up a single site into a reusable format. Multi-site is really just a factor of how that site is hosted and maybe not relevant to this problem. Having the possibility of multiple sites in a single archive or having a site not in 'sites/default' makes consuming these archives quite a bit more complicated and reduces the portability of the archive. I wouldn't be against a variation of the archive format that can store multiple sites in one, but I'm not sure the specifics of the multi-site setup are particularly relevant even then. I might be off-base though since I don't use multi-site much myself.

Proposed MANIFEST format

ronan's picture

Ok, I'll have a go at formatting the manifest file:

MANIFEST.info

; Some brief intro could go here for any human who is curious enough to open this file.

; Archive information
datestamp = "1291500673"
formatversion = "1.0"
generator = "Module/Service Name"
generatorversion = "7.x-2.3"

; Descriptive information set by the user who created the archive.
description = "A user definable description of the archive."
tags[] = "tags"
tags[] = "defined by user"

; Site specific information
sites[0][version] = "7.1"
sites[0][name] = "Site Name Here"
sites[0][docroot] = "path/to/docroot"
sites[0][sitedir] = "path/to/site/directory"
sites[0][files][public] = "path/to/public/files"
sites[0][files][private] = "path/to/private/files"
sites[0][database][default][file] = "path/to/dbdump.sql"
sites[0][database][default][driver] = "mysql"
sites[0][database][another][file] = "path/to/otherdbdump.sql"
sites[0][database][another][driver] = "sqlite"

; Second site in multisite configuration
sites[1][version] = "7.1"
sites[1][name] = "Other Multi-site Name"
sites[1][sitedir] = "path/to/othersite/directory"
sites[1][docroot] = "path/to/other/docroot"
sites[1][files][public] = "path/to/public/files"
sites[1][files][private] = "path/to/private/files"
sites[1][database][default][file] = "path/to/onemoredbdump.sql"
sites[1][database][default][driver] = "mysql"
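For anyone consuming this outside of PHP, here is a rough sketch of how the nested keys above could be read. It is written in Python and is not Drupal's own parser; `parse_info` is a hypothetical name, it only handles one `key[sub][sub] = "value"` assignment per line plus `;` comments, and it keeps every key as a string.

```python
import re

# key, optional [sub][keys], then "=" and the rest of the line as the value
LINE = re.compile(r"^([^=;\[\]\s]+)((?:\[[^\[\]]*\])*)\s*=\s*(.*?)\s*$")


def parse_info(text):
    """Parse .info-style text into nested dicts with string keys."""
    data = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        match = LINE.match(line)
        if not match:
            continue  # not an assignment, ignore it
        keys = [match.group(1)] + re.findall(r"\[([^\[\]]*)\]", match.group(2))
        value = match.group(3)
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # strip surrounding quotes
        node = data
        for key in keys[:-1]:
            if key == "":
                key = str(len(node))  # "[]" means "next index"
            child = node.get(key)
            if not isinstance(child, dict):
                child = node[key] = {}
            node = child
        last = keys[-1] if keys[-1] != "" else str(len(node))
        node[last] = value
    return data
```

With the example manifest above, `parse_info(text)["sites"]["0"]["database"]["default"]["driver"]` would be `"mysql"`, and the two `tags[]` lines end up under the keys `"0"` and `"1"`.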

And if there is no MANIFEST.info we assume the following defaults

formatversion = "1.0"
sites[0][docroot] = "./"
sites[0][sitedir] = "./sites/default"
sites[0][files][public] = "./sites/default/files"
sites[0][database][default][file] = "./database.sql"
sites[0][database][default][driver] = "mysql"

Which roughly equates to the existing de-facto tarball+mysqldump format. It differs from what Drupal Gardens currently exports because it specifies a standard name for the database dump while Gardens uses 'sitename.sql'. We could account for this by allowing wildcards and making the default:

sites[][database][default][file] = "./*.sql"

But that makes the job of the archive consumer a little bit harder. I suspect that most consumers would want to implement this feature whether it's a part of the standard or not since Gardens' archives are an important precedent.
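As a sketch of how a consumer might apply the implied defaults listed earlier (the fixed "./database.sql" variant, not the wildcard one), the manifest can be layered over a defaults structure. This is Python, the names are hypothetical, and it assumes the manifest has already been parsed into nested dicts with string keys. It also assumes the defaults apply to sites[0] only, which the proposal does not actually settle.

```python
import copy

# The implied defaults from the proposal, as nested dicts.
DEFAULTS = {
    "formatversion": "1.0",
    "sites": {
        "0": {
            "docroot": "./",
            "sitedir": "./sites/default",
            "files": {"public": "./sites/default/files"},
            "database": {
                "default": {"file": "./database.sql", "driver": "mysql"},
            },
        },
    },
}


def with_defaults(manifest, defaults=DEFAULTS):
    """Return the defaults with manifest values layered on top."""
    merged = copy.deepcopy(defaults)

    def merge(target, source):
        for key, value in source.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                merge(target[key], value)  # descend into nested sections
            else:
                target[key] = copy.deepcopy(value)

    merge(merged, manifest or {})
    return merged
```

Passing `None` (no MANIFEST.info in the archive) returns the plain defaults, so the consumer code after this point does not need a special case for manifest-less tarballs.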

The format allows for specifying both public and private files for D7, but since D6 only allows for one or the other and not both, that needs to be handled differently. We could say that D6 archives always specify sites[][files][public] for the files directory regardless of the actual file mode set for the site since it probably doesn't matter and it makes the job of the consumer easier. It makes the format a little misleading though. If people feel strongly against this, then I'd recommend:

sites[0][filemode] = "private"
sites[0][files][private] = "path/to/files"

or
sites[0][filemode] = "public"
sites[0][files][public] = "path/to/files"

To keep the job of the consumer easier. This would add:
sites[0][filemode] = "public"

to the implied default.

I have eased-off on point 4 from my last post and added multi-site to this proposal since it's easier to specify it and allow a consumer that didn't support multi-site to just read from sites[0]*. I've also completely reversed my position on point 1 and gone with way-over-specification instead of standardization for this proposal. The reason for this is that I wanted to maintain the current tarball+mysqldump standard but also allow for the layout I would favour which would be something more like:

MANIFEST.info
docroot/
  authorize.php
  CHANGELOG.txt
  COPYRIGHT.txt
  cron.php
  ...
database/
  somedatabasefile.sql
  anotherdatabase.sql

I prefer this layout since it keeps the database and manifest out of the codebase so it reduces the chance that a user will accidentally upload these files to the root of their site and expose their data to the public.

I didn't address all of Barry's ideas for the manifest because I couldn't really envision how all that info would be added, so feel free to weigh in on other stuff to add. Whatever we put in the spec, we should allow archive generators to add additional fields that they feel might be useful and allow consumers to specify additional optional fields that have some meaning for that consumer. We could specify a required prefix or section for these custom fields if we think it necessary, but there will probably be few enough consumers and generators that collisions may not be a huge worry.

Thanks for reading all the way through this long post. Please weigh in with comments, suggestions and criticisms.

Excellent.

bjaspan's picture

Ronan, this is excellent and exactly the kind of thing I had in mind. Detailed comments coming.

Here are some initial

bjaspan's picture

Here are some initial comments.

We should clarify the distinction between a "docroot" (codebase) and a "(multi-)site" (files, db, and settings.php file); I think Aegir uses the terms "platform" and "site" for this. In your example file, you have sites[0] with a docroot, files, and db, and sites[1] with the same, which makes it awkward to specify one docroot with multiple sites.

So, what if we could specify multiple docroots, then relate each site to a docroot, e.g.:

docroots[0][version] = '7.x'
docroots[0][path] = 'path/to/docroot'
; optional info
docroots[0][distro] = 'Drupal|Pressflow|Drupal Commons|OpenAtrium|...'
docroots[0][description] = 'My code base'

docroots[1]...

sites[0][docroot] = 0
sites[0][name] = "default"
sites[0][files]...
sites[0][database]...

sites[1][docroot] = 0
sites[1][name] = "mydomain.com"
sites[1][files]...
sites[1][database]...

sites[2][docroot] = 1

In other words, the sites[] section is basically laying out the sites directory.

For the default (no-MANIFEST) settings, I suggest:

  • The default database file be './*.sql', i.e.: any file name ending in .sql, provided there is just one. This gets the .sql out of the docroot to prevent accidents and does not impose rules on how it is named.
  • The default docroot be dirname("./*/index.php"), i.e.: any directory name containing a Drupal docroot, provided there is just one of them.
  • An archive with no specified database file and no ./*.sql be declared as a valid archive file, just with no database. Combined with the previous two rules, this makes the standard Drupal distro format into a valid site archive so they can be imported by the same tools.

These rules are easy to implement; Acquia Hosting's site import code already does, and in fact our "Site install from Distro" code just uses the site import logic on a locally cached copy of the standard distro tarball.
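To illustrate the three rules, here is a minimal Python sketch that works on the list of member names in a tarball. It is not Acquia's import code; `detect_layout` is a hypothetical name, and treating "a directory containing index.php" as a Drupal docroot is a simplifying assumption.

```python
import posixpath


def detect_layout(member_names):
    """Apply the no-MANIFEST rules to a list of archive member names."""
    names = set()
    for name in member_names:
        clean = posixpath.normpath(name)  # "./a/b" and "a/b/" become "a/b"
        if clean != ".":
            names.add(clean)

    # Rule 1: a single *.sql file at the top level of the archive.
    dumps = sorted(n for n in names if "/" not in n and n.endswith(".sql"))
    if len(dumps) > 1:
        raise ValueError("more than one top-level .sql file: %s" % dumps)

    # Rule 2: exactly one top-level directory that contains index.php.
    roots = sorted(
        posixpath.dirname(n)
        for n in names
        if n.count("/") == 1 and posixpath.basename(n) == "index.php"
    )
    if len(roots) != 1:
        raise ValueError("expected one docroot, found %d" % len(roots))

    # Rule 3: no database is still a valid archive.
    return {"docroot": roots[0], "database": dumps[0] if dumps else None}
```

A standard distro tarball, which expands to a single directory with index.php inside it, passes rule 2 and comes back with `database` set to `None`. The names can be taken from `tarfile.open(path).getnames()` without extracting anything.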

I like the ability to specify [files][public] and [files][private] separately.
I don't think the archive format needs to worry about preventing a D6 archive from specifying public and private files. For all you know, someone is using a contrib module that supports that.

So, what if we could specify

ronan's picture

So, what if we could specify multiple docroots, then relate each site to a docroot

I would argue that this unnecessarily complicates the parsing of and consumption of an archive. Having multiple different Drupal bases in one archive is probably a rare edge-case (I can't personally think of why one would generate this or how one would practically consume it, Drupal multi-site certainly doesn't allow mixing of d6 and d7 sites) and repeating the same base info for each 'site' in the archive doesn't seem like a big deal, especially since these files won't likely be built or edited by hand.

I like the addition of 'distro' as long as it's optional (since I'm not aware of any way to reliably tell, programmatically, what distribution you are running).

The default docroot be dirname("./*/index.php"), i.e.: any directory name containing a Drupal docroot, provided there is just one of them.

Do we need to establish some sort of rules or logic for a consumer to implement to determine what is a Drupal docroot? It may not matter since any producer concerned with ambiguity will provide a manifest. And any producer creating archives that are structurally ambiguous and lacking a manifest can expect their archives not to work for most consumers. The defaults should be there just to allow consumers to be backwards compatible with a few known pre-existing archive types (distro tarballs, Gardens exports, etc.)

An archive with no specified database file and no ./*.sql be declared as a valid archive file, just with no database

Agreed a producer should have the option to include/exclude whatever parts (of codebase, db, site files) it wants.

Combined with the previous two rules, this makes the standard Drupal distro format into a valid site archive so they can be imported by the same tools

That's a cool feature. I may be misunderstanding the directory structure of tarballs, but doesn't ("./*/index.php") mean that your docroot has to be one directory down from the root of the tarball? Wouldn't that mean that Drupal tarballs are not valid archives? If I'm misunderstanding the terminology, then I agree completely, and the paths in my last post are not what I meant at all. I was thinking that a site archive tarball would expand to a single directory and that directory (like a Drupal tarball or Gardens export does) would be considered '.' in the paths I listed above.

I don't think the archive format needs to worry about preventing a D6 archive from specifying public and private files. For all you know, someone is using a contrib module that supports that.

I'm fine with that as long as a consumer with no knowledge of that hypothetical contrib module knows where to find the 'normal' D6 Drupal files directory (be it specified as public or private by configuration) within the archive. The use case I'm thinking of here is the ability to restore/import just the files from a site archive to the current site's files directory.

Also, when exporting from a D6 site, an archive generator probably only knows about the files directory specified in the config and has to label that directory in the manifest one way or the other. The second of my previous suggestions (specifying as a separate value whether the main files directory is public or private and having that value be the key for the path value) will make parsing easy and still allow additional files directories to be added for consumers that know what to do with them. This is trivial for a generator if it has access to these Drupal settings at creation time (eg: the generator is a module running on the site being archived). If this info is not available (eg: an external archiving application that doesn't want to peek into the db), then the generator could be allowed to just make one up:

sites[0][filemode] = 'unknown'
sites[0][files][unknown] = 'path/to/files/dir'

Or maybe even omit the path to the files dir (if it doesn't know that either). A consumer would then be under no obligation to specifically address the files directory unless it exists at the default location ('./sites/default/files').
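On the consumer side, the lookup this implies could be as small as the following Python sketch. The function name is hypothetical, the site entry is assumed to be already parsed into nested dicts, and the fallback order after filemode (public, then private) is my own assumption.

```python
DEFAULT_FILES_DIR = "./sites/default/files"


def main_files_dir(site):
    """Return the path a consumer should treat as the main files dir."""
    files = site.get("files", {})
    mode = site.get("filemode")
    if mode in files:
        # filemode ("public", "private", "unknown", ...) is the key
        # under which the main files directory is listed.
        return files[mode]
    for key in ("public", "private"):
        if key in files:
            return files[key]
    # Nothing specified: the consumer only needs to look at the
    # default location, and may ignore it if it does not exist.
    return DEFAULT_FILES_DIR
```

So a manifest with filemode 'unknown' still gives the consumer one unambiguous path without it needing to know what 'unknown' means.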

db logging format

damienmckenna's picture

I'm a MAJOR fan of the following mysqldump arguments:
-Q --add-drop-table -c --order-by-primary --extended-insert=FALSE --no-create-db=TRUE
The main reason I use this is that it makes each INSERT statement its own separate & completely valid & debuggable statement, which means you can grep the file for specific values. It might make the file larger, but the greatly improved usability of the SQL file is well worth it.
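For readers who don't have the short flags memorised, here is the same set of options spelled out with their long names, as a small Python sketch that builds the argument list. The function name is hypothetical, and `--result-file` is my addition so the output lands in a file.

```python
def grep_friendly_dump_args(database, result_file):
    """Build a mysqldump argv that writes one INSERT per row."""
    return [
        "mysqldump",
        "--quote-names",           # -Q: backtick-quote identifiers
        "--add-drop-table",        # DROP TABLE before each CREATE TABLE
        "--complete-insert",       # -c: include column names in INSERTs
        "--order-by-primary",      # stable row order, friendlier to diff
        "--skip-extended-insert",  # same as --extended-insert=FALSE
        "--no-create-db",          # same as --no-create-db=TRUE
        "--result-file=" + result_file,
        database,
    ]
```

The list can be handed to `subprocess.run(args, check=True)` as-is; nothing here goes through a shell.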

Are you suggesting the

bjaspan's picture

Are you suggesting the mysqldump options should be specified as part of the manifest file or the specification? I'd say not. Certainly there are good reasons to choose some options over others, but whatever program generates the archive file should get to choose (possibly under human control).

Probably there should be a place in the manifest data structure to specify the mysqldump (or other db export) options used so they can be displayed by a UI along with all the other data associated with a file to be imported.

I agree. I don't think that

ronan's picture

I agree.

I don't think that we should specify exact formatting for the dump files other than to require that they live up to some usable standard. For mysql, for example, they should be a series of valid MySQL commands or comments (and make the ability to use '\.' (source) to read them with the MySQL command line the pass/fail test for compatibility, say).

We should have similar simple litmus tests for other DB drivers.

Recording those options in the manifest certainly doesn't hurt though and will at the very least help those who write archive-consuming software to debug.

I just committed a new

moshe weitzman's picture

I just committed a new archive-dump command to drush core. It pretty much implements what Ronan proposed. See http://drupalcode.org/viewvc/drupal/contributions/modules/drush/commands...

  1. It would be easy for me to add multiple docroots in the same file as Barry suggests.
  2. What about backing up custom stuff like mongodb field storage? Out of scope?
  3. I haven’t tested nor thought much about how symlinks should be archived and restored.
  4. I'm hoping someone else is up for writing an archive-restore command.
  5. I'm OK if we decide to move this outside of drush core for some reason.

Yeah, I'm up for a --no-core

moshe weitzman's picture

Yeah, I'm up for a --no-core option. Easy to add. Does sites/all go into the Aegir tarball? anarcat makes it sound like it is just sites/example.com

For those who can't seem to try out archive-dump for themselves, http://cyrve.com/a.tar.gz is the site archive for my test Drupal 6 site. Also see the transcript below to see how the sausage gets made. http://cyrve.com/MANIFEST.txt is the MANIFEST.info for that archive.

Comments welcome.

~/htd/d6$ drush archive-dump default,620 --destination=/tmp/a.tar -v
Initialized Drupal 6.20-dev root directory at /Users/mw/htd/d6          [notice]
Initialized Drupal site default at sites/default                        [notice]
Executing: tar --exclude 'sites/*' -cf '/tmp/a.tar' '.'
Executing: tar -rf '/tmp/a.tar' './sites/all'
Executing: mkdir '/tmp/drush_tmp_1297033985'
Executing: mysqldump --result-file /tmp/drush_tmp_1297033985/d6.sql --single-transaction --opt -Q  d6
Calling chdir(/tmp/drush_tmp_1297033985)
Executing: tar -rf '/tmp/a.tar' 'd6.sql'
Calling chdir(/Users/mw/htd/d6)
Executing: mkdir '/tmp/drush_tmp_1297033986'
Executing: mysqldump --result-file /tmp/drush_tmp_1297033986/620.sql --single-transaction --opt -Q  620
Calling chdir(/tmp/drush_tmp_1297033986)
Executing: tar -rf '/tmp/a.tar' '620.sql'
Calling chdir(/Users/mw/htd/d6)
Executing: tar -rf '/tmp/a.tar' './sites/default'
Executing: tar -rf '/tmp/a.tar' './sites/620'
Calling chdir(/tmp/drush_tmp_1297033986)
Executing: tar -rf '/tmp/a.tar' 'MANIFEST.info'
Calling chdir(/Users/mw/htd/d6)
Executing: gzip -f '/tmp/a.tar'
Command dispatch complete                                               [notice]
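For anyone who wants to experiment without drush, the sequence in the transcript can be approximated with Python's tarfile module. This is only a sketch of the same steps and not the drush implementation; `build_site_archive` and its parameters are hypothetical, and the .sql dumps are assumed to have been produced beforehand (for example by mysqldump).

```python
import io
import os
import tarfile


def build_site_archive(docroot, destination, sites, sql_dumps, manifest_text):
    """Write a gzip'ed tarball: code, selected sites, dumps, manifest."""
    with tarfile.open(destination, "w:gz") as tar:
        # 1. The code base, leaving out everything under sites/.
        for entry in sorted(os.listdir(docroot)):
            if entry != "sites":
                tar.add(os.path.join(docroot, entry), arcname=entry)
        # 2. sites/all plus only the site directories that were asked for.
        for site in ["all"] + list(sites):
            site_path = os.path.join(docroot, "sites", site)
            if os.path.isdir(site_path):
                tar.add(site_path, arcname="sites/" + site)
        # 3. One dump per database, stored at the top of the archive.
        for dump in sql_dumps:
            tar.add(dump, arcname=os.path.basename(dump))
        # 4. The manifest, written straight from memory.
        payload = manifest_text.encode("utf-8")
        info = tarfile.TarInfo("MANIFEST.info")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
```

Unlike the transcript, this writes the compressed file in one pass instead of appending to a .tar and gzipping at the end, and the destination should live outside the docroot so the archive does not try to include itself.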

Correction -

moshe weitzman's picture

Correction - http://cyrve.com/MANIFEST.txt is the manifest file.

Aegir tarballs do not have

anarcat's picture

Aegir tarballs do not have sites/all, by definition (they are done relative to sites/example.com). A compatible aegir dump would indeed be just sites/example.com, although this does create problems for us - if, for example, you want to clone site example.com to clone.example.com on the same platform, the backup format will be problematic because you need to restore to a temporary directory and move it in place instead of just untarring within sites/clone.example.com.

That's a general problem with the new form - you can't just untar them in place (since you may be changing the site name) - but I feel this is something that can be worked around. It's also a feature because it documents the old_uri of the site (in case you change the site name, Aegir fixes your files/node_revision/etc tables to follow the new urls...)

Moshe, you rock. :-) Replies

bjaspan's picture

Moshe, you rock. :-) Replies to your comments:

  1. As per my reply to anarcat below, perhaps we should just skip multiple docroot support at this point, i.e.: in version 1, make docroot (location of the code) a global, not per-site, setting. Until we have a clear use-case, simplicity is better.

  2. Well, we do have the ability to specify database format. I guess for "stuff like mongodb" we'd need multiple databases per multi-site. Frankly this can be useful for pure mysql as well, some sites use more than one db. I'd be happy to include this or skip it for v1.

  3. ISTM that in general symlinks should be resolved during generation; the goal is to create a self-contained archive. The generator can/should certainly have an option to not resolve them.

  4. I think that archive-restore is going to be very environment-specific. On Acquia Hosting, for example, based on a site's configuration, we need to put the code into a repo, the files into a network filesystem, and the database(s) onto the correct servers; then we need to modify the appropriate settings.php files to use our "settings include file" that manages the db connections, failover, etc. (Incidentally, I've written this importer already as a drush command, we use it for our Site Import functionality.) Even in a simple environment, importing will require knowing the correct docroot path. I guess we could write a simple importer that accepted args for docroot path and db name/creds. Should the standard importer modify settings.php?

  5. I presently have no opinion re: drush core or not.
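On point 3, here is a small Python sketch of what "resolve by default, with an option not to" could look like in a generator. The function name is hypothetical; the only thing relied on is the `dereference` flag of Python's tarfile module, which stores the target's contents in place of the link.

```python
import os
import tarfile


def archive_tree(source_dir, destination, resolve_symlinks=True):
    """Archive a directory, optionally replacing symlinks by their targets."""
    top = os.path.basename(os.path.normpath(source_dir))
    # dereference=True stores the file a link points to, so the
    # archive stays self-contained; False keeps the link itself.
    with tarfile.open(destination, "w:gz", dereference=resolve_symlinks) as tar:
        tar.add(source_dir, arcname=top)
```

One caveat with resolving: a dangling symlink makes the add fail, so a real generator would need to decide whether to skip or report those.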

anarcat's picture

Just a quick note to mention that aegir already has a metadata format. It doesn't have a manifest file, because it doesn't need any - a backup is from a specific "platform" (a drupal code instance), and can be deployed on other instances with the provision-deploy commands, which readily runs drush updatedb for you if necessary.

In aegir backups, each site is in its own tarball. The tar is created as if you were within the sites/example.com directory so you have:

files/
modules/
themes/
settings.php
database.sql

... in the tarball.

The only thing missing to make this really portable is a platform descriptor. In Aegir, we're trying to focus on makefiles so that we have a clear idea (instead of a bunch of random code) of what the site is (openatrium, acquia, etc). I was thinking of adding a .make to that backup to make sure that you can replicate the platform the site was sitting in (could be created with drush generate-make if drush make is around). The problem is that then you basically rely on drush make for portability, which may not cover all use cases (especially if you have custom, non-distributed modules).

So the proposal here is interesting for us to implement what we call exportable backups.

I'll try to think more about the proposed format, but right now, these are the issues I can think of:

  1. I am not quite clear why we need a manifest in the first place, if files are placed in a proper standard location during backup, restore can just rebuild cleanly without a manifest
  2. It's a problem for Aegir if the format includes the whole "platform" as we use backup/restore to transfer sites between servers and do mass upgrades - basically when you control the platform you don't need to bundle the whole code... A format that would allow for that would be more useful for Aegir. Maybe a --no-core flag?
  3. I am really unclear as to why we have a format that specifies multiple docroots - why don't we have a tarball <-> Drupal multisite one to one mapping at least? One drupal multisite, one tarball. Makes the whole thing much simpler at every step of the way...

That's all I can think of right now. I am sorry I didn't see that thread earlier.

Answers

bjaspan's picture
  1. The manifest file is useful for a number of reasons. The db dump may be mysql, or sqlite3, or mongo, or multiple different dbs, etc. The files directory may not be stored in sites/*/files. We should probably provide some form of signature for verification. Even just displaying meta-data during import is pretty useful.

  2. It's fine for the archive format to support not containing core. This requires specifying the exact platform (in Aegir-speak) that is expected if the import is going to be portable to a non-Aegir environment, which is another useful bit of information for the manifest file. :-)

  3. Actually I'm not sure that we do need multiple docroots in one archive. Looking at my original post on this thread I certainly did not seem to envision it. Multiple multi-sites per docroot is very important, though.

I would strongly encourage

anarcat's picture

I would strongly encourage not having multiple docroots, so I am happy to feel that we're in agreement here. This will make everything clearer and simpler.

For the DB types, can't the file extensions be sufficient for describing the data (*.mysql...). Just a guess... I wonder if we could have portable ANSI dumps... ;)

We do not absolutely need to specify the parent platform - sometimes that's just impossible to do when it's a really custom platform, maybe in a makefile ...

It sounds like aegir's format

ronan's picture

It sounds like aegir's format would work as a site archive by just adding this manifest.txt file to the root of the tarball:

datestamp = "x"
formatversion = "xx"
generator = "Aegir"
generatorversion = "??"

sites[0][docroot] = ""
sites[0][sitedir] = "./"
sites[0][database][default][file] = "database.sql"
sites[0][database][default][driver] = "mysql"

I'm all for having each of the components (root, site, files, db) be optional, but we should probably have a standard for specifying a none (I've used empty string above) since the implied defaults assume a certain value.
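Generating a manifest like the one above is mostly a matter of flattening a nested structure back into bracketed keys. Here is a minimal Python sketch; `dump_info` is a hypothetical name, values are written as quoted strings without any escaping, and an empty string is kept as the "none" marker used above.

```python
def dump_info(data, prefix=""):
    """Flatten nested dicts into .info-style 'key[sub] = "value"' lines."""
    lines = []
    for key, value in data.items():
        name = "%s[%s]" % (prefix, key) if prefix else str(key)
        if isinstance(value, dict):
            lines.extend(dump_info(value, name))  # descend one level
        else:
            lines.append('%s = "%s"' % (name, value))
    return lines


aegir_manifest = {
    "generator": "Aegir",
    "sites": {
        "0": {
            "docroot": "",  # empty string: no code base in this archive
            "sitedir": "./",
            "database": {
                "default": {"file": "database.sql", "driver": "mysql"},
            },
        },
    },
}
manifest_text = "\n".join(dump_info(aegir_manifest)) + "\n"
```

Because dicts keep insertion order, the lines come out in the order the generator filled in the structure, which keeps the file readable for humans too.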

I'd say it's probably ok to allow/disallow multiple docroots and multiple sites in the format as long as we don't require generators or consumers to support more than one at a time (ie: if you import a site with 'Bob's Drupal Builder' it can just pull the first site listed in the MANIFEST if that's all that makes sense). Both of the manifest formats discussed above allow but don't require multiple sites and different docroots. The cost is repeating the docroot path (and maybe version etc.) once for each site in the archive, so we're talking about a couple of extra bytes in a compressed tarball. Hopefully this flexibility will allow us to have a backwards and forwards compatible v.2 of the format if and when somebody comes up with a use-case for multiple roots etc.

For the DB types, can't the file extensions be sufficient for describing the data (*.mysql...)

Sure, but since that info is probably available at the time of generation it should be trivial to add it to the manifest. That'll make consumption that much easier.

default manifest data

anarcat's picture

I like that manifest - can we make that the default? ;)

Basically, could we say that a file without a manifest is the builtin aegir format? That would make my life a whole lot simpler...

Maybe aegir and acquia can

ronan's picture

Maybe aegir and acquia can arm-wrestle for the default :)

Makefile as code backup

ronan's picture

@anarcat also brings up a possibility that had occurred to me as a possible future feature for the format. The idea of using a makefile or makefile formatting to specify the code that the site is made of rather than (or in addition to) adding the code itself to the archive.

It certainly won't cover all use cases, and requires a knowledge of how a site is built that most generators won't have (like whether modules have been patched, where to find non-contrib modules etc.) so it's probably a pretty complicated thing to build. Since well over 90% of the code in most drupal installs is available through public cvs/svn/git somewhere this could be a nice way to make archives more efficient. There'd be a lot of details to work out though.

I don't think this is practical to add to a v.1 of the spec, but it's a cool possibility for later versions.

Another generator

ronan's picture

Well, a spec can't be complete until there are 3 independent implementations, so here's mine:

https://github.com/ronan/backup_migrate_archive

It's a pretty rough proof of concept and it requires Backup and Migrate (and only works on D6). It's also a PHP-only solution (no command line tar or mysqldump) so it should run just about anywhere but only on small sites.

Sample output at http://gortonstudios.com/archiver-2011-02-07T23-11-22.sitearchive.tar.gz

Nice work! A couple

moshe weitzman's picture

Nice work! A couple differences I noticed from my implementation.

You put all the code under a docroot directory. That makes sense. So shall we abandon the idea that default drupal distro is an archive? That's the only reason to put code in the root of the archive.

You used the name MANIFEST.txt but your spec says MANIFEST.info :)

You put all the code under a

ronan's picture

You put all the code under a docroot directory. That makes sense. So shall we abandon the idea that default drupal distro is an archive? That's the only reason to put code in the root of the archive.

I actually like that the format allows either layout and that the manifest makes it trivial for a consumer to support either one. I think it's still valuable to have the drupal distro be an archive (with an implied manifest specifying that the docroot is at the root of the archive). This also allows the current Gardens export and Pantheon import formats to be de-facto site archives--making Gardens our 3rd generator implementation for v0.1, and Pantheon our first consumer :). The Aegir format can be an archive by the addition of a manifest file.

You used the name MANIFEST.txt but your spec says MANIFEST.info :)

Ooops :). Fixed that on github.

Existing support for site archives

bjaspan's picture

"This also allows the current Gardens export and Pantheon import formats to be de-facto site archives--making Gardens our 3rd generator implementation for v0.1, and Pantheon our first consumer :)."

FYI, site import using the current de-facto format (for uploaded tarballs and Drupal distros) has been supported live, publicly available, on Acquia Hosting since December. See http://acquia.com/blog/importing-drupal-site-acquia-hosting where I demo exporting from Gardens and importing into Acquia Hosting. Sorry, just couldn't let this pass by. :)

At Drupalcon Chicago we'll be announcing Acquia DevCloud, our developer-focused Drupal hosting offering. I'll be very happy to announce official import and export support for the Drupal Site Archive format if it is ready by then. The fact that there is already a Drush export command and our internal import command written makes it pretty easy. :-)

FYI, site import using the

ronan's picture

FYI, site import using the current de-facto format (for uploaded tarballs and Drupal distros) has been supported live, publicly available, on Acquia Hosting since December. See http://acquia.com/blog/importing-drupal-site-acquia-hosting where I demo exporting from Gardens and importing into Acquia Hosting. Sorry, just couldn't let this pass by. :)

I stand corrected :) So that makes 2 real-world importers and 2 proof-of-concept generators (and a bajillion de-facto archives in the wild). Pretty good.

At Drupalcon Chicago we'll be announcing Acquia DevCloud, our developer-focused Drupal hosting offering.

Nice. I'm looking forward to learning more.

I really like the idea of

bjaspan's picture

I really like the idea of allowing the standard distro format to also be compatible with the standard site archive format; it has a nice sense of symmetry and it also means all archive importers will be able to import all existing Drupal distros with no further effort. It's fine for backup-and-migrate to put in an extra level, but is there an advantage to making that the default?

I agree, the assumed default

ronan's picture

I agree, the assumed default should be the format I described above:

formatversion = "1.0"
sites[0][docroot] = "./"
sites[0][sitedir] = "./sites/default"
sites[0][files][public] = "./sites/default/files"
sites[0][database][default][file] = "./*.sql"
sites[0][database][default][driver] = "mysql"

Which more or less describes the standard distro package (except for the .sql dump) and exactly matches the Gardens import/export format and the Pantheon import. Generators would be free to change this structure as needed as long as they specify the structure in a MANIFEST file.

Either I don't understand

bjaspan's picture

Either I don't understand what you are suggesting or we don't agree. The standard Drupal distro format includes a top-level directory with an arbitrary name which contains the docroot directly (not in another subdir). For example:

$ tar tzf drupal-7.0.tar.gz | grep index.php
drupal-7.0/index.php

So, shouldn't the default manifest file be:

formatversion = "1.0"
sites[0][docroot] = "./*" # must match only one dir containing index.php
sites[0][sitedir] = "[docroot]/sites/default"
sites[0][files][public] = "[docroot]/sites/default/files"
sites[0][database][default][file] = "./*.sql" # must match only one file
sites[0][database][default][driver] = "mysql"

So the root directory of the tarball contains (1) a directory containing a Drupal docroot, e.g. index.php etc. and (2) a .sql file.
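
As a rough illustration of the no-manifest default described above, here is a minimal Python sketch that picks the docroot and the dump out of a tarball listing. The function name find_default_layout is hypothetical, and the "exactly one match" rule is taken from the comments in the proposed manifest rather than from any agreed spec.

```python
import posixpath


def find_default_layout(member_names):
    """Guess docroot and SQL dump for an archive that has no manifest.

    member_names: paths as printed by `tar tzf` (or tarfile.getnames()).
    Returns (docroot, sql_file); raises ValueError if either is ambiguous.
    """
    # Treat "./drupal-7.0/index.php" and "drupal-7.0/index.php" alike.
    names = [posixpath.normpath(n) for n in member_names]

    # Docroot candidates: top-level directories that directly hold index.php.
    docroots = sorted(
        {n.split("/")[0] for n in names
         if n.count("/") == 1 and n.endswith("/index.php")}
    )
    # Dump candidates: *.sql files sitting in the archive root.
    dumps = sorted(n for n in names if "/" not in n and n.endswith(".sql"))

    if len(docroots) != 1:
        raise ValueError(f"expected exactly one docroot, found {docroots}")
    if len(dumps) != 1:
        raise ValueError(f"expected exactly one .sql file, found {dumps}")
    return docroots[0], dumps[0]


listing = [
    "./drupal-7.0/index.php",
    "./drupal-7.0/includes/common.inc",
    "./site.sql",
]
layout = find_default_layout(listing)
print(layout)
```

The sketch only looks at member names, so it can run against a listing without unpacking anything.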

This actually may not be the exact format Gardens currently exports, but Gardens will change to match whatever format we decide on.

No, you're right. My

ronan's picture

No, you're right. My explanation was reflecting my ongoing misunderstanding of tarballs. I always forget that the tarball structure does not have an implied base directory. I think my brain refuses to accept that '../../../somefile.txt' can technically be added to a tarball :). All of my descriptions assume that the tarball contains 1 and only 1 directory, that that directory contains the MANIFEST.txt (if there is one), and that all other files are either in or below that directory.

Unless an existing format already breaks this assumption, I'd say it makes sense to make it part of the spec. It seems like good manners to have a single base directory (with whatever name the generator chooses) in the tarball. The MANIFEST should be inside that directory, and all paths listed in the manifest should be relative to the MANIFEST, not to the root of the tarball. An archive would also need to have a single base directory and the default MANIFEST I've listed above would assume that '.' is that directory. I believe this describes the distro format as well as the Gardens exports (at least it describes the one I downloaded from my site) and the format described here: https://wiki.getpantheon.com/display/PANTHEON/Importing+Existing+Sites

Unless there's a good reason to allow anything other than a single directory at the base of the tarball, I believe adding this requirement to the spec puts us back on the same page. It also removes the need for a consumer to resolve paths such as './/.sql' and '[docroot]/xxx'.
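
A consumer could enforce the single-base-directory idea before extracting anything. The following is only a sketch under that assumption; single_base_directory is a hypothetical helper, not part of any existing implementation, and it inspects member names only.

```python
import posixpath


def single_base_directory(member_names):
    """Return the one top-level entry shared by all archive members.

    Raises ValueError if a member is absolute, escapes the archive with
    "..", or if the members do not share exactly one top-level entry.
    """
    tops = set()
    for raw in member_names:
        name = posixpath.normpath(raw)
        if name == ".":
            continue  # the bare "./" entry some tar implementations emit
        if name.startswith("/") or name == ".." or name.startswith("../"):
            raise ValueError(f"unsafe path in archive: {raw!r}")
        tops.add(name.split("/")[0])
    if len(tops) != 1:
        raise ValueError(f"expected one base directory, found {sorted(tops)}")
    return tops.pop()


base = single_base_directory([
    "archiver-2011-02-07T23-11-22/MANIFEST.txt",
    "archiver-2011-02-07T23-11-22/docroot/index.php",
])
print(base)
```

A path like '../../../somefile.txt' is rejected outright, which is the case that makes an implied base directory unsafe to assume.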

For what it's worth, my POC implementation creates archives which match what you describe.

$ tar tzf archiver-2011-02-07T23-11-22-3.sitearchive.tar.gz | grep MANIFEST.txt
archiver-2011-02-07T23-11-22/MANIFEST.txt
$ tar tzf archiver-2011-02-07T23-11-22-3.sitearchive.tar.gz | grep index.php
archiver-2011-02-07T23-11-22/docroot/index.php

Sorry about the confusion. I'll try and keep this straight as we go on :)

We still aren't on the same

bjaspan's picture

We still aren't on the same page. I am suggesting that in the absence of a MANIFEST.txt file, the docroot is whatever top-level directory contains an index.php file. This is the example I gave:

$ tar tzf drupal-7.0.tar.gz | grep index.php
drupal-7.0/index.php

So "drupal-7.0" is a top-level directory in the tarball, and the files comprising the docroot live there. So drupal-7.0/includes/common.inc, drupal-7.0/modules/system/system.module, etc.

For an archive with a manifest file, I suggest it be at the top level, e.g.:

$ tar tzf foobar.tar.gz
./MANIFEST.txt
./whatever/path/the/manifest/says/index.php
... etc ...

By comparison, your example shows everything one level deeper down, which is different than the standard Drupal distro format.
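
For the with-manifest case, a consumer only has to look for the file at the root of the tarball. Here is a minimal Python sketch using the standard tarfile module; read_top_level_manifest is a hypothetical name, and the in-memory archive exists only to exercise it.

```python
import io
import posixpath
import tarfile


def read_top_level_manifest(archive, filename="MANIFEST.txt"):
    """Return the manifest text from the archive root, or None if absent."""
    for member in archive.getmembers():
        # "./MANIFEST.txt" and "MANIFEST.txt" both name the root-level file.
        if member.isfile() and posixpath.normpath(member.name) == filename:
            return archive.extractfile(member).read().decode("utf-8")
    return None


# Build a tiny gzip'ed tarball in memory to try the function out.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as out:
    data = b'formatversion = "1.0"\n'
    info = tarfile.TarInfo("./MANIFEST.txt")
    info.size = len(data)
    out.addfile(info, io.BytesIO(data))

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    manifest_text = read_top_level_manifest(archive)
print(manifest_text)
```

A return value of None would mean falling back to the no-manifest defaults.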

We still aren't on the same

ronan's picture

We still aren't on the same page. I am suggesting that in the absence of a MANIFEST.txt file, the docroot is whatever top-level directory contains an index.php file.

Agreed.

For an archive with a manifest file, I suggest it be at the top level

That's fine by me. My preference would be for some sort of base directory to make extraction a little cleaner but it's not a big deal. If we're all in agreement that the MANIFEST.txt should be at the base level of the tarball then I'll make that change to my proof of concept code.

Custom File Extension

ronan's picture

Does anybody else have any opinion on using a custom file extension (in addition to the real file extensions, of course)? As a consumer it would be a nice benefit to be able to identify an archive by its name without having to unpack it and inspect its contents. Backup and Migrate uses file extensions to determine what has been uploaded and how to handle it on restore, so this would be nice to have to distinguish the archives from the old tarballs B&M created. This would obviously not be absolutely required (in keeping with allowing existing distros to be grandfathered into the spec) but could maybe be highly encouraged for generators. I'm not aware of any downsides to adding an extension to the file other than consuming a few more characters of the filename limit.

This could also help 'brand' the idea a little for users too.

In my proof of concept I've added .sitearchive, which is pretty self-explanatory but a little lengthy. I've also suggested .drop and .drupal above (.archive and .site have other usages so should probably be avoided).

Any thoughts on making this a part of the standard?
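
To show what a consumer gains from this, here is a small Python sketch that recognizes an archive by name alone. The .sitearchive marker is the one from my proof of concept; the function name and the accepted .tar.gz/.tgz suffixes are assumptions for illustration.

```python
def is_site_archive(filename, marker=".sitearchive"):
    """True if filename carries the marker before a gzip'ed-tarball suffix."""
    lowered = filename.lower()
    for suffix in (".tar.gz", ".tgz"):
        if lowered.endswith(suffix):
            # Strip the real extension, then look for the marker before it.
            return lowered[: -len(suffix)].endswith(marker)
    return False


print(is_site_archive("archiver-2011-02-07T23-11-22.sitearchive.tar.gz"))
print(is_site_archive("drupal-7.0.tar.gz"))
```

Plain distro tarballs fail the check, so a consumer would still have to unpack those to decide how to handle them.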

I think a common extension is

bjaspan's picture

I think a common extension is a fine idea. Java uses .jar (which I think is in ZIP format). We could use .dar. However, I think we want to use gzip'ed tarballs since that is what everyone is doing already. .dar.tar.gz is too long. Just .dar loses the fact that it is gzip'ed. I guess we could use .dar or .dgz. I'm not sure how much value that adds.

Something longer is clearer but then kinda verbose.

Shrug. I have no clear opinion.

Drupal.org distribution file format discussion

anarcat's picture

See also this thread: http://drupal.org/node/914284 that talks about such formats for Drupal.org

Drush module?

jcsalem's picture

Moshe,

I went to download the Drush module and the link is broken now.

Any chance you could put it back and, perhaps, bring it up-to-date with Ronan's?

Jim

Stale thread?

pearcec's picture

Apparently this functionality already exists in 5.x?

Looks like bjaspan, among others, is making forward progress

http://drupal.org/node/1152020

It also looks like Acquia Dev Cloud is planning on supporting this format. They claim it is available in Drush 4.5, which is interesting because we don't have 4.5. They also have a download for archive.drush.inc, but it is empty.

--
Christian

Full on!

anarcat's picture

That's right - archive-dump was imported in 5.x and 4.x (so it will be in the next release, 4.5). Acquia will support this, from what I understand, and so will Aegir 2.0.

Indeed!

joshk's picture

We are excited to support this at Pantheon also.

If anyone can share their

moshe weitzman's picture

If anyone can share their restore code for a site archive, that would be swell. I need to write archive-restore for core drush since our unit test system will use site archives to cache built environments. Would be nice to start from someone else's working code.

I don't see any examples of a

moshe weitzman's picture

I don't see any examples of a nested array like sites[0][docroot] = "/Users/mw/htd/d7" being read by parse_ini_file(). When I do that, I get the error syntax error, unexpected TC_SECTION, expecting '=' in ./MANIFEST.info. It seems that one level of hierarchy is OK but not two. So, sites[docroot] is fine but sites[0][docroot] is not.

We could copy Drupal's ini parser, but WTF. We should try to be compatible with PHP's lame parser.

How about we use sections to delimit sites, and stick to one level with files-public and such? That parses properly. For example

[Global]
formatversion = .2
generator = "Drush archive-dump"

[Site 0]
sitedir = sites/default
files-public = sites/default/files

[Site 1]

+1

joshk's picture

I like this since, assuming it solves the data structure issue, it is also more human-readable than the alternative.

Also agreed that doing a more complex "drupal only" thing doesn't make any sense in this modern era. If it turns out we can't possibly keep compatibility with php's lame "limited" parse_ini_file() function, I would favor going the full monty and just having these written in JSON. That should support any data structure we want, and if formatted prettily can be decent for hand-editing.
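
For comparison, here is a rough sketch of how the nested structure might look as JSON, written in Python only because its json module is handy. The field names are borrowed from the nested sites[0][...] proposal earlier in the thread; nothing here is an agreed format.

```python
import json

# Hypothetical manifest using the field names from the nested proposal.
manifest = {
    "formatversion": "1.0",
    "sites": [
        {
            "docroot": "./",
            "sitedir": "./sites/default",
            "files": {"public": "./sites/default/files"},
            "database": {"default": {"file": "./d7.sql", "driver": "mysql"}},
        }
    ],
}

text = json.dumps(manifest, indent=2)  # pretty-printed for hand-editing
restored = json.loads(text)            # round-trips without a custom parser
print(text)
```

Two levels of nesting (sites, then database) come through unchanged, which is the case the ini parser trips over.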

I'm okay with either

bjaspan's picture

I'm okay with either approach: sections, or json.

Mostly what I want is to have Drush 4.5 shipped with the archive-dump command included so people can at least start using it for single sites. If discussion about changing the manifest file format will hold that up, I suggest just removing the manifest file completely. That would mean we'd have to restrict the command to accepting a single-site alias, which would be fine with me for v1.

OK, I am changing

moshe weitzman's picture

OK, I am changing archive-dump to produce sections. FYI, the database items get a couple dashes. See last items below.

[Global]
formatversion = ".2"
generator = "Drush archive-dump"

[Site 14]
name = "Site-Install"
docroot = "/Users/mw/htd/d7"
sitedir = "sites/default"
files-public = "sites/default/files"
database-file-default = "./d7.sql"
database-driver-default = "mysql"
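
The sectioned layout also reads cleanly outside PHP. As a sketch, here is the same manifest parsed with Python's standard configparser; read_manifest and the "Site " prefix test are assumptions for illustration, not part of archive-dump.

```python
import configparser

MANIFEST = """\
[Global]
formatversion = ".2"
generator = "Drush archive-dump"

[Site 14]
name = "Site-Install"
docroot = "/Users/mw/htd/d7"
sitedir = "sites/default"
files-public = "sites/default/files"
database-file-default = "./d7.sql"
database-driver-default = "mysql"
"""


def read_manifest(text):
    """Parse a sectioned manifest into (global_settings, sites)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    def section(name):
        # configparser keeps the surrounding double quotes; strip them here.
        return {key: value.strip('"') for key, value in parser[name].items()}

    global_settings = section("Global")
    sites = {name: section(name) for name in parser.sections()
             if name.startswith("Site ")}
    return global_settings, sites


global_settings, sites = read_manifest(MANIFEST)
print(global_settings["generator"])
print(sites["Site 14"]["database-file-default"])
```

Each site becomes one flat dictionary, so keys such as database-file-default need no nested lookup.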

Just committed this change. I

moshe weitzman's picture

Just committed this change. I also changed the name of the file to MANIFEST.ini from MANIFEST.info.

Documentation?

helmo's picture

After all this discussion I was hoping to find a specification draft....

I opened an issue in the drush queue: http://drupal.org/node/1294632

Where do we work on this...

helmo's picture

Moshe thinks this should be hashed out in this group instead of a drush queue.

I agree that we need participation beyond just drush to get this in shape.

But it's awfully quiet around here...

I just made a bit of progress on the html file I started:
http://drupalcode.org/sandbox/helmo/1277350.git/blob_plain/refs/heads/ma...
