[ This group initially started as an email thread. This discussion is for re-posting the content of the thread and then continuing the discussion about the goals and layout of the archive format. ]
My initial idea is that the basic format will look like what Drupal Gardens already generates and what I think Pantheon imports: a gzip'ed tarball with code, ./files, and a .sql dump file in the root. However, there are many other things the format could support, such as:
- A specifiable format for the sql dump: mysqldump, pure SQL, whatever.
- Where the code lives (e.g. ./ or ./docroot), where the files live (./files, ./sites/default/files), where the database lives (ditto) within the tarball.
- Multi-site archives, with a separate database and files dump for a specified list of sites directories.
- A public key signature.
- Environment dependencies, e.g. PHP & MySQL version, PHP extensions, etc.
My thought is that the archive format should be a gzip'ed tarball with an optional MANIFEST.txt file specifying all the above info. Every field in the manifest would have a default, and all the defaults taken together (which is the same as having no manifest) lead to the basic tarball+mysqldump format.

Comments
I'm not aware of any
I'm not aware of any particular solution to this problem but it has other hidden benefits in the deployment space. Well, at least in initial deployment. Anything beyond would be configuration management and best handled elsewhere in Drupal. From a demand perspective, this is high among developers -and- stakeholders so I'm happy to see this conversation taking shape formally.
My number one technical concern is making sure the export or process that creates the manifest has some knowledge of, or is able to be given knowledge of by way of the exporting system's cues, what should and should not be exported. At least in the Drupal Gardens sense, we don't send Theme Builder along with every export so we'd need to be able to tell the export that. Similarly, other providers (especially those where SaaS or other IP based on Drupal is concerned) will want to have the same conversation with its export.
From a purely functional standpoint, not having that ability and just sending out a "dumb" exportable (thanks to jbrauer for that term) could mean broken sites even in the non-imported sense (as in, "Hey, where the hell is module X this database says I have?"). It could also undermine the reason for this effort: easy portability.
I'm more alluding to the fact that while an Export this Site button is great for the average implementation, others could benefit from an Export API which the exporting system could then leverage to create this format based on the specification the community provides.
Lastly, I'd be more inclined to see manifest.txt take on a better-known and accepted format for defining the export rather than another "let's roll our own" implementation (Remember Bones?).
Kenny Silanskas
Proud Member of the Acquia Support Team
Manifest destiny
Can we stick with the good old ini/.info format for the MANIFEST? There's a handy PHP function for parsing these and it's already something of a standard in Drupal.
More importantly, I want to second this:
I would love to see us (the community at large, but especially people who work on these sorts of wide-scale deployments) come up with a set of consensus Best Practices for the "right" way to structure an export.
This would also, by design, provide people with a "right" way to structure a running site's code tree. There are a lot of equally-functional ways to do this, which makes it confusing and tempts people to try and invent their own solution — not the greatest investment of creative energy IMHO. It's also something I see newcomers (even/especially those with a related background in software) stub toes on often.
https://pantheon.io | http://www.chapterthree.com | https://www.outlandishjosh.com
My number one technical
Along the same lines, there are tables you will most likely want to leave out when dumping a database (such as the bloated cache_form and friends) though you do want to export their structure. Other tables should not be in the db dump at all (e.g. theme builder related tables maybe). In the Drush command dgb, I use drushrc.php much like a custom MANIFEST to tell mysqldump what tables to dump or ignore. It has some logic to handle a set of default tables to leave out using wildcards and ignoring non existing tables. These features might land in Drush 4.
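To make the table-selection idea concrete, here is a minimal sketch in Python (not the actual Drush or dgb code). It assumes the generator already has the list of tables that exist in the database, and sorts them into "dump everything", "dump structure only" and "leave out entirely" using shell-style wildcards. The function name, the pattern lists and the themebuilder_session table name are made up for illustration. Patterns that match no existing table are simply ignored.

```python
from fnmatch import fnmatchcase


def split_tables(existing_tables, structure_only_patterns, skip_patterns):
    """Sort existing tables into (full dump, structure only, skipped)."""
    full, structure_only, skipped = [], [], []
    for table in existing_tables:
        if any(fnmatchcase(table, p) for p in skip_patterns):
            skipped.append(table)          # not in the dump at all
        elif any(fnmatchcase(table, p) for p in structure_only_patterns):
            structure_only.append(table)   # CREATE TABLE only, no rows
        else:
            full.append(table)             # structure and data
    return full, structure_only, skipped


# Hypothetical table list; "sessions" is listed as a pattern but does not
# exist here, so it is silently ignored.
tables = ["node", "users", "cache", "cache_form", "themebuilder_session"]
full, structure_only, skipped = split_tables(
    tables,
    structure_only_patterns=["cache", "cache_*", "sessions"],
    skip_patterns=["themebuilder_*"],
)
```

A generator could then dump the structure-only tables with mysqldump's --no-data option and pass the skipped ones via --ignore-table, but how the lists are used is up to the generator.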
Obviously I'm glad to see some initiative in standardizing a site archive format. While I'm interested in making sure it integrates with a VCS like git (I use it for backup and deployment), I do see some big benefits for making tarballs too.
Anyone care to take a stab at
Anyone care to take a stab at an example MANIFEST file? I agree that .info format makes sense.
Has anybody put any more
Has anybody put any more thought into this? I like this initiative and I hope it keeps moving forward. Here are some of my reactions:
1) Ditto on .info format.
2) I'd favor standardization over specification. In other words, rather than allowing the author to specify the location of the db dump, simply require it to be at the top level of the docroot, perhaps with a standardized name or naming convention. There should be some flexibility but IMHO we don't need variation just for the sake of variation. That will make it much easier to write code that consumes these archives.
3) Would it be useful to have an additional file extension or another way to identify a site archive without having to untar it and inspect its contents? I don't think an additional extension before the tar/tgz/zip extension would do any harm (i.e. sitename.xxx.tar.gz). Some ideas for the extension: .drupal, .drop
4) Do we even need to account for multi-site archives? It seems like the purpose of a site archive is to package up a single site into a reusable format. Multi-site is really just a factor of how that site is hosted and maybe not relevant to this problem. Having the possibility of multiple sites in a single archive or having a site not in 'sites/default' makes consuming these archives quite a bit more complicated and reduces the portability of the archive. I wouldn't be against a variation of the archive format that can store multiple sites in one, but I'm not sure the specifics of the multi-site setup are particularly relevant even then. I might be off-base though since I don't use multi-site much myself.
Proposed MANIFEST format
Ok, I'll have a go at formatting the manifest file:
MANIFEST.info
; Some brief intro could go here for any human who is curious enough to open this file.
; Archive information
datestamp = "1291500673"
formatversion = "1.0"
generator = "Module/Service Name"
generatorversion = "7.x-2.3"
; Descriptive information set by the user who created the archive.
description = "A user definable description of the archive."
tags[] = "tags"
tags[] = "defined by user"
; Site specific information
sites[0][version] = "7.1"
sites[0][name] = "Site Name Here"
sites[0][docroot] = "path/to/docroot"
sites[0][sitedir] = "path/to/site/directory"
sites[0][files][public] = "path/to/public/files"
sites[0][files][private] = "path/to/private/files"
sites[0][database][default][file] = "path/to/dbdump.sql"
sites[0][database][default][driver] = "mysql"
sites[0][database][another][file] = "path/to/otherdbdump.sql"
sites[0][database][another][driver] = "sqlite"
; Second site in multisite configuration
sites[1][version] = "7.1"
sites[1][name] = "Other Multi-site Name"
sites[1][sitedir] = "path/to/othersite/directory"
sites[1][docroot] = "path/to/other/docroot"
sites[1][files][public] = "path/to/public/files"
sites[1][files][private] = "path/to/private/files"
sites[1][database][default][file] = "path/to/onemoredbdump.sql"
sites[1][database][default][driver] = "mysql"
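For anyone consuming this outside of PHP, here is a rough sketch of how the proposed manifest could be read into a nested structure. It is a simplified stand-in for Drupal's .info parsing, not a port of it: it assumes one key = value pair per line, ';' comment lines, optionally quoted values, and it stores numeric indexes as string keys. The function name parse_info is made up for this example.

```python
import re

KEY_RE = re.compile(r"([^\[\]]+)((?:\[[^\[\]]*\])*)$")


def parse_info(text):
    """Parse .info-style text into nested dicts keyed by strings."""
    data = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching quotes around the value.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        match = KEY_RE.match(key.strip())
        if not match:
            continue
        # "sites[0][files][public]" -> ["sites", "0", "files", "public"]
        parts = [match.group(1)] + re.findall(r"\[([^\[\]]*)\]", match.group(2))
        node = data
        for part in parts[:-1]:
            if part == "":             # "[]" means "next free index"
                part = str(len(node))
            node = node.setdefault(part, {})
        last = parts[-1] if parts[-1] != "" else str(len(node))
        node[last] = value
    return data


sample = """
; Archive information
formatversion = "1.0"
tags[] = "tags"
tags[] = "defined by user"
sites[0][docroot] = "path/to/docroot"
sites[0][database][default][driver] = "mysql"
sites[0][database][another][driver] = "sqlite"
"""
manifest = parse_info(sample)
```

A consumer that doesn't support multi-site can then just look at manifest["sites"]["0"] and ignore the rest.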
And if there is no MANIFEST.info we assume the following defaults
formatversion = "1.0"
sites[0][docroot] = "./"
sites[0][sitedir] = "./sites/default"
sites[0][files][public] = "./sites/default/files"
sites[0][database][default][file] = "./database.sql"
sites[0][database][default][driver] = "mysql"
Which roughly equates to the existing de-facto tarball+mysqldump format. It differs from what Drupal Gardens currently exports because it specifies a standard name for the database dump while Gardens uses 'sitename.sql'. We could account for this by allowing wildcards and making the default:
sites[][database][default][file] = "./*.sql"
But that makes the job of the archive consumer a little bit harder. I suspect that most consumers would want to implement this feature whether it's a part of the standard or not, since Gardens' archives are an important precedent.
The format allows for specifying both public and private files for D7, but since D6 only allows for one or the other and not both, that needs to be handled differently. We could say that D6 archives always specify
sites[][files][public]
for the files directory regardless of the actual file mode set for the site, since it probably doesn't matter and it makes the job of the consumer easier. It makes the format a little misleading though. If people feel strongly against this, then I'd recommend:
sites[0][filemode] = "private"
sites[0][files][private] = "path/to/files"
or
sites[0][filemode] = "public"
sites[0][files][public] = "path/to/files"
to keep the job of the consumer easier. This would add:
sites[0][filemode] = "public"
to the implied default.
I have eased off on point 4 from my last post and added multi-site to this proposal, since it's easy to specify it and to allow a consumer that doesn't support multi-site to just read from sites[0]*. I've also completely reversed my position on point 2 and gone with way-over-specification instead of standardization for this proposal. The reason for this is that I wanted to maintain the current tarball+mysqldump standard but also allow for the layout I would favour, which would be something more like:
MANIFEST.info
docroot/
  authorize.php
  CHANGELOG.txt
  COPYRIGHT.txt
  cron.php
  ...
database/
  somedatabasefile.sql
  anotherdatabase.sql
I prefer this layout since it keeps the database and manifest out of the codebase, so it reduces the chance that a user will accidentally upload these files to the root of their site and expose their data to the public.
I didn't address all of Barry's ideas for the manifest because I couldn't really envision how all that info would be added, so feel free to weigh in on other stuff to add. Whatever we put in the spec, we should allow archive generators to add additional fields that they feel might be useful and allow consumers to specify additional optional fields that have some meaning for that consumer. We could specify a required prefix or section for these custom fields if we think it necessary, but there will probably be few enough consumers and generators that collisions may not be a huge worry.
Thanks for reading all the way through this long post. Please weigh in with comments, suggestions and criticisms.
Excellent.
Ronan, this is excellent and exactly the kind of thing I had in mind. Detailed comments coming.
Here are some initial
Here are some initial comments.
We should clarify the distinction between a "docroot" (codebase) and a "(multi-)site" (files, db, and settings.php file); I think Aegir uses the terms "platform" and "site" for this. In your example file, you have sites[0] with a docroot, files, and db, and sites[1] with the same, which makes it awkward to specify one docroot with multiple sites.
So, what if we could specify multiple docroots, then relate each site to a docroot, e.g.:
docroots[0][version] = '7.x'
docroots[0][path] = 'path/to/docroot'
; optional info
docroots[0][distro] = 'Drupal|Pressflow|Drupal Commons|OpenAtrium|...'
docroots[0][description] = 'My code base'
docroots[1]...
sites[0][docroot] = 0
sites[0][name] = "default"
sites[0][files]...
sites[0][database]...
sites[1][docroot] = 0
sites[1][name] = "mydomain.com"
sites[1][files]...
sites[1][database]...
sites[2][docroot] = 1
In other words, the sites[] section is basically laying out the sites directory.
For the default (no-MANIFEST) settings, I suggest:
These rules are easy to implement; Acquia Hosting's site import code already does, and in fact our "Site install from Distro" code just uses the site import logic on a locally cached copy of the standard distro tarball.
I like the ability to specify [files][public] and [files][private] separately.
I don't think the archive format needs to worry about preventing a D6 archive from specifying public and private files. For all you know, someone is using a contrib module that supports that.
So, what if we could specify
I would argue that this unnecessarily complicates the parsing of and consumption of an archive. Having multiple different Drupal bases in one archive is probably a rare edge-case (I can't personally think of why one would generate this or how one would practically consume it, Drupal multi-site certainly doesn't allow mixing of d6 and d7 sites) and repeating the same base info for each 'site' in the archive doesn't seem like a big deal, especially since these files won't likely be built or edited by hand.
I like the addition of 'distro' as long as it's optional (since I'm not aware of any way to reliably tell, programmatically, what distribution you are running).
Do we need to establish some sort of rules or logic for a consumer to implement to determine what is a Drupal docroot? It may not matter, since any producer concerned with ambiguity will provide a manifest, and any producer creating archives that are structurally ambiguous and lacking a manifest can expect their archives not to work for most consumers. The defaults should be there just to allow consumers to be backwards compatible with a few known pre-existing archive types (distro tarballs, Gardens exports, etc.)
Agreed a producer should have the option to include/exclude whatever parts (of codebase, db, site files) it wants.
That's a cool feature. I may be misunderstanding the directory structure of tarballs, but doesn't ("./*/index.php") mean that your docroot has to be one directory down from the root of the tarball? Wouldn't that mean that Drupal tarballs are not valid archives? If I'm misunderstanding the terminology, then I agree completely, and the paths in my last post are not what I meant at all. I was thinking that a site archive tarball would expand to a single directory and that directory (like a Drupal tarball or Gardens export does) would be considered '.' in the paths I listed above.
I'm fine with that as long as a consumer with no knowledge of that hypothetical contrib module knows where to find the 'normal' D6 Drupal files directory (be it specified as public or private by configuration) within the archive. The use case I'm thinking of here is the ability to restore/import just the files from a site archive to the current site's files directory.
Also, when exporting from a D6 site, an archive generator probably only knows about the files directory specified in the config and has to label that directory in the manifest one way or the other. The second of my previous suggestions (specifying as a separate value whether the main files directory is public or private, and having that value be the key for the path value) will make parsing easy and still allow additional files directories to be added for consumers that know what to do with them. This is trivial for a generator if it has access to these Drupal settings at creation time (e.g. the generator is a module running on the site being archived). If this info is not available (e.g. an external archiving application that doesn't want to peek into the db), then the generator could be allowed to just make one up:
sites[0][filemode] = 'unknown'
sites[0][files][unknown] = 'path/to/files/dir'
Or maybe even omit the path to the files dir (if it doesn't know that either). A consumer would then be under no obligation to specifically address the files directory unless it exists at the default location ('./sites/default/files').
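On the consumer side, the filemode idea could be used roughly like this. This is only a sketch under my own assumptions: the site argument is one sites[N] entry already parsed into a dict, and the helper name files_dir_for is invented for the example. It returns a candidate path; checking that the directory actually exists in the archive is left to the caller.

```python
def files_dir_for(site, default="./sites/default/files"):
    """Pick the main files directory for one parsed sites[N] entry."""
    files = site.get("files", {})
    mode = site.get("filemode")
    if mode in files:
        # filemode names the key holding the main files directory,
        # whether that is "public", "private" or "unknown".
        return files[mode]
    if "public" in files:
        return files["public"]
    # Nothing specified: fall back to the default location.
    return default


d6_site = {"filemode": "unknown", "files": {"unknown": "path/to/files/dir"}}
main_files = files_dir_for(d6_site)
```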
db logging format
I'm a MAJOR fan of the following mysqldump arguments:
-Q --add-drop-table -c --order-by-primary --extended-insert=FALSE --no-create-db=TRUE
The main reason I use this is that it makes each INSERT statement its own separate, completely valid and debuggable statement, which means you can grep the file for specific values. It might make the file larger, but the greatly improved usability of the SQL file is well worth it.
Are you suggesting the
Are you suggesting the mysqldump options should be specified as part of the manifest file or the specification? I'd say not. Certainly there are good reasons to choose some options over others, but whatever program generates the archive file should get to choose (possibly under human control).
Probably there should be a place in the manifest data structure to specify the mysqldump (or other db export) options used, so they can be displayed by a UI along with all the other data associated with a file to be imported.
I agree. I don't think that
I agree.
I don't think that we should specify exact formatting for the dump files other than to require that they live up to some usable standard. For MySQL, for example, they should be a series of valid MySQL commands or comments (with the ability to read them using '\.' in the MySQL command line as the pass/fail test for compatibility, say).
We should have similar simple litmus tests for other DB drivers.
Recording those options in the manifest certainly doesn't hurt though, and will at the very least help those who write archive-consuming software to debug.
I just committed a new
I just committed a new archive-dump command to drush core. It pretty much implements what Ronan proposed. See http://drupalcode.org/viewvc/drupal/contributions/modules/drush/commands...
Yeah, I'm up for a --no-core
Yeah, I'm up for a --no-core option. Easy to add. Does sites/all go into the Aegir tarball? anarcat makes it sound like it is just sites/example.com
For those who can't seem to try out archive-dump for themselves, http://cyrve.com/a.tar.gz is the site archive for my test drupal 6 site. also see transcript below to see how the sausage gets made. http://cyrve.com/MANIFEST.txt is the MANIFEST.info for that archive.
Comments welcome.
Correction -
Correction - http://cyrve.com/MANIFEST.txt is the manifest file.
Aegir tarballs do not have
Aegir tarballs do not have sites/all, by definition (they are done relative to sites/example.com). A compatible Aegir dump would indeed be just sites/example.com, although this does create problems for us: if, for example, you want to clone site example.com to clone.example.com on the same platform, the backup format will be problematic because you need to restore to a temporary directory and move it in place instead of just untarring within sites/clone.example.com.
That's a general problem with the new form - you can't just untar them in place (since you may be changing the site name) - but I feel this is something that can be worked around. It's also a feature because it documents the old_uri of the site (in case you change the site name, Aegir fixes your files/node_revision/etc tables to follow the new urls...)
Moshe, you rock. :-) Replies
Moshe, you rock. :-) Replies to your comments:
As per my reply to anarcat below, perhaps we should just skip multiple docroot support at this point, i.e.: in version 1, make docroot (location of the code) a global, not per-site, setting. Until we have a clear use-case, simplicity is better.
Well, we do have the ability to specify database format. I guess for "stuff like mongodb" we'd need multiple databases per multi-site. Frankly this can be useful for pure mysql as well, some sites use more than one db. I'd be happy to include this or skip it for v1.
ISTM that in general symlinks should be resolved during generation; the goal is to create a self-contained archive. The generator can/should certainly have an option to not resolve them.
I think that archive-restore is going to be very environment-specific. On Acquia Hosting, for example, based on a site's configuration, we need to put the code into a repo, the files into a network filesystem, and the database(s) onto the correct servers; then we need to modify the appropriate settings.php files to use our "settings include file" that manages the db connections, failover, etc. (Incidentally, I've written this importer already as a drush command, we use it for our Site Import functionality.) Even in a simple environment, importing will require knowing the correct docroot path. I guess we could write a simple importer that accepted args for docroot path and db name/creds. Should the standard importer modify settings.php?
I presently have no opinion re: drush core or not.
there's already provision-backup and provision-restore out there
Just a quick note to mention that aegir already has a metadata format. It doesn't have a manifest file, because it doesn't need any - a backup is from a specific "platform" (a drupal code instance), and can be deployed on other instances with the provision-deploy commands, which readily runs drush updatedb for you if necessary.
In aegir backups, each site is in its own tarball. The tar is created as if you were within the sites/example.com directory so you have:
files/
modules/
themes/
settings.php
database.sql
... in the tarball.
The only thing missing to make this really portable is a platform descriptor. In Aegir, we're trying to focus on makefiles so that we have a clear idea (instead of a bunch of random code) of what the site is (openatrium, acquia, etc). I was thinking of adding a .make to that backup to make sure that you can replicate the platform the site was sitting in (could be created with drush generate-make if drush make is around). The problem is that then you basically rely on drush make for portability, which may not cover all use cases (especially if you have custom, non-distributed modules).
So the proposal here is interesting for us to implement what we call exportable backups.
I'll try to think more about the proposed format, but right now, these are the issues I can think of:
That's all I can think of right now. I am sorry I didn't see that thread earlier.
Answers
The manifest file is useful for a number of reasons. The db dump may be mysql, or sqlite3, or mongo, or multiple different dbs, etc. The files directory may not be stored in sites/*/files. We should probably provide some form of signature for verification. Even just displaying meta-data during import is pretty useful.
It's fine for the archive format to support not containing core. This requires specifying the exact platform (in Aegir-speak) that is expected if the import is going to be portable to a non-Aegir environment, which is another useful bit of information for the manifest file. :-)
Actually I'm not sure that we do need multiple docroots in one archive. Looking at my original post on this thread I certainly did not seem to envision it. Multiple multi-sites per docroot is very important, though.
I would strongly encourage
I would strongly encourage not having multiple docroots, so I am happy to feel that we're in agreement here. This will make everything clearer and simpler.
For the DB types, can't the file extensions be sufficient for describing the data (*.mysql...)? Just a guess... I wonder if we could have portable ANSI dumps... ;)
We do not absolutely need to specify the parent platform - sometimes that's just impossible to do when it's a really custom platform, maybe in a makefile ...
It sounds like aegir's format
It sounds like aegir's format would work as a site archive by just adding this manifest.txt file to the root of tarball:
datestamp = "x"
formatversion = "xx"
generator = "Aegir"
generatorversion = "??"
sites[0][docroot] = ""
sites[0][sitedir] = "./"
sites[0][database][default][file] = "database.sql"
sites[0][database][default][driver] = "mysql"
I'm all for having each of the components (root, site, files, db) be optional, but we should probably have a standard for specifying a none (I've used empty string above) since the implied defaults assume a certain value.
I'd say it's probably ok to allow/disallow multiple docroots and multiple sites in the format as long as we don't require generators or consumers to support more than one at a time (ie: if you import a site with 'Bob's Drupal Builder' it can just pull the first site listed in the MANIFEST if that's all that makes sense). Both of the manifest formats discussed above allow but don't require multiple sites and different docroots. The cost is repeating the docroot path (and maybe version etc.) once for each site in the archive, so we're talking about a couple of extra bytes in a compressed tarball. Hopefully this flexibility will allow us to have a backwards and forwards compatible v.2 of the format if and when somebody comes up with a use-case for multiple roots etc.
Sure, but since that info is probably available at the time of generation it should be trivial to add it to the manifest. That'll make consumption that much easier.
default manifest data
I like that manifest - can we make that the default? ;)
Basically, could we say that a file without a manifest is the builtin aegir format? That would make my life a whole lot simpler...
Maybe aegir and acquia can
Maybe aegir and acquia can arm-wrestle for the default :)
Makefile as code backup
@anarcat also brings up a possibility that had occurred to me as a possible future feature for the format. The idea of using a makefile or makefile formatting to specify the code that the site is made of rather than (or in addition to) adding the code itself to the archive.
It certainly won't cover all use cases, and requires a knowledge of how a site is built that most generators won't have (like whether modules have been patched, where to find non-contrib modules etc.) so it's probably a pretty complicated thing to build. Since well over 90% of the code in most drupal installs is available through public cvs/svn/git somewhere this could be a nice way to make archives more efficient. There'd be a lot of details to work out though.
I don't think this is practical to add to a v.1 of the spec, but it's a cool possibility for later versions.
Another generator
Well, a spec can't be complete until there are 3 independent implementations, so here's mine:
https://github.com/ronan/backup_migrate_archive
It's a pretty rough proof of concept and it requires Backup and Migrate (and only works on D6). It's also a PHP-only solution (no command line tar or mysqldump) so it should run just about anywhere but only on small sites.
Sample output at http://gortonstudios.com/archiver-2011-02-07T23-11-22.sitearchive.tar.gz
Nice work! A couple
Nice work! A couple differences I noticed from my implementation.
You put all the code under a docroot directory. That makes sense. So shall we abandon the idea that the default Drupal distro is an archive? That's the only reason to put code in the root of the archive.
You used the name MANIFEST.txt but your spec says MANIFEST.info :)
You put all the code under a
I actually like that the format allows either layout and that the manifest makes it trivial for a consumer to support either one. I think it's still valuable to have the Drupal distro be an archive (with an implied manifest specifying that the docroot is at the root of the archive). This also allows the current Gardens export and Pantheon import formats to be de-facto site archives--making Gardens our 3rd generator implementation for v0.1, and Pantheon our first consumer :). The Aegir format can be an archive by the addition of a manifest file.
Ooops :). Fixed that on github.
Existing support for site archives
"This also allows the current Gardens export and Pantheon import formats to be de-facto site archives--making Gardens our 3rd generator implementation for v0.1, and Pantheon our first consumer :)."
FYI, site import using the current de-facto format (for uploaded tarballs and Drupal distros) has been supported live, publicly available, on Acquia Hosting since December. See http://acquia.com/blog/importing-drupal-site-acquia-hosting where I demo exporting from Gardens and importing into Acquia Hosting. Sorry, just couldn't let this pass by. :)
At Drupalcon Chicago we'll be announcing Acquia DevCloud, our developer-focused Drupal hosting offering. I'll be very happy to announce official import and export support for the Drupal Site Archive format if it is ready by then. The fact that there is already a Drush export command and our internal import command written makes it pretty easy. :-)
FYI, site import using the
FYI, site import using the current de-facto format (for uploaded tarballs and Drupal distros) has been supported live, publicly available, on Acquia Hosting since December. See http://acquia.com/blog/importing-drupal-site-acquia-hosting where I demo exporting from Gardens and importing into Acquia Hosting. Sorry, just couldn't let this pass by. :)
I stand corrected :) So that makes 2 real-world importers and 2 proof-of-concept generators (and a bajillion de-facto archives in the wild). Pretty good.
Nice. I'm looking forward to learning more.
I really like the idea of
I really like the idea of allowing the standard distro format to also be compatible with the standard site archive format; it has a nice sense of symmetry and it also means all archive importers will be able to import all existing Drupal distros with no further effort. It's fine for backup-and-migrate to put in an extra level, but is there an advantage to making that the default?
I agree, the assumed default
I agree, the assumed default should be the format I described above:
formatversion = "1.0"
sites[0][docroot] = "./"
sites[0][sitedir] = "./sites/default"
sites[0][files][public] = "./sites/default/files"
sites[0][database][default][file] = "./*.sql"
sites[0][database][default][driver] = "mysql"
Which more or less describes the standard distro package (except for the .sql dump) and exactly matches the Gardens import/export format and the Pantheon import. Generators would be free to change this structure as needed as long as they specify the structure in a MANIFEST file.
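One way a consumer might apply these implied defaults is to treat them as a base layer and put whatever the MANIFEST says on top, so a missing manifest and an empty manifest behave the same. A rough sketch, assuming the manifest has already been parsed into nested dicts with string keys (the with_defaults name and that dict layout are my own choices for the example, not part of the proposal):

```python
import copy

# The implied defaults listed above, as nested dicts.
DEFAULTS = {
    "formatversion": "1.0",
    "sites": {
        "0": {
            "docroot": "./",
            "sitedir": "./sites/default",
            "files": {"public": "./sites/default/files"},
            "database": {"default": {"file": "./*.sql", "driver": "mysql"}},
        }
    },
}


def with_defaults(manifest, defaults=DEFAULTS):
    """Return the defaults with the manifest's values layered on top."""
    merged = copy.deepcopy(defaults)

    def merge(dst, src):
        for key, value in src.items():
            if isinstance(value, dict) and isinstance(dst.get(key), dict):
                merge(dst[key], value)      # descend into nested sections
            else:
                dst[key] = copy.deepcopy(value)

    merge(merged, manifest or {})
    return merged


# A manifest that only moves the docroot keeps every other default.
effective = with_defaults({"sites": {"0": {"docroot": "./docroot"}}})
```

Whether defaults should apply field by field like this, or only when there is no manifest at all, is something the spec would need to state.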
Either I don't understand
Either I don't understand what you are suggesting or we don't agree. The standard Drupal distro format includes a top-level directory with an arbitrary name which contains the docroot directly (not in another subdir). For example:
$ tar tzf drupal-7.0.tar.gz | grep index.php
drupal-7.0/index.php
So, shouldn't the default manifest file be:
formatversion = "1.0"
sites[0][docroot] = "./*" # must match only one dir containing index.php
sites[0][sitedir] = "[docroot]/sites/default"
sites[0][files][public] = "[docroot]/sites/default/files"
sites[0][database][default][file] = "./*.sql" # must match only one file
sites[0][database][default][driver] = "mysql"
So the root directory of the tarball contains (1) a directory containing a Drupal docroot, e.g. index.php etc. and (2) a .sql file.
This actually may not be the exact format Gardens currently exports, but Gardens will change to match whatever format we decide on.
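Here is a small sketch of how a consumer might resolve those two wildcards for an archive with no manifest: find the single top-level directory that contains index.php and the single top-level .sql file. It is an illustration in Python, not Acquia's import code. The resolve_defaults name is made up, and allowing the .sql file to be absent (so that a plain distro tarball still passes) is my assumption. The demo builds a tiny tarball in memory rather than reading a real export.

```python
import io
import tarfile
from pathlib import PurePosixPath


def resolve_defaults(names):
    """Apply the no-manifest rules to a list of archive member names."""
    docroots, dumps = set(), set()
    for name in names:
        parts = PurePosixPath(name).parts   # a leading "./" is dropped
        if len(parts) == 2 and parts[1] == "index.php":
            docroots.add(parts[0])
        elif len(parts) == 1 and parts[0].endswith(".sql"):
            dumps.add(parts[0])
    if len(docroots) != 1:
        raise ValueError("need exactly one top-level dir containing index.php")
    if len(dumps) > 1:
        raise ValueError("need at most one top-level .sql file")
    return docroots.pop(), (dumps.pop() if dumps else None)


def demo_archive():
    """Build a tiny in-memory tarball shaped like the example above."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("drupal-7.0/index.php",
                     "drupal-7.0/includes/common.inc",
                     "mysite.sql"):
            tar.addfile(tarfile.TarInfo(name))  # empty placeholder files
    buf.seek(0)
    return buf


with tarfile.open(fileobj=demo_archive(), mode="r:gz") as tar:
    docroot, dump = resolve_defaults(tar.getnames())
```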
No, you're right. My
No, you're right. My explanation was reflecting my ongoing misunderstanding of tarballs. I always forget that the tarball structure does not have an implied base directory. I think my brain refuses to accept that '../../../somefile.txt' can technically be added to a tarball :). All of my descriptions assume that the tarball contains 1 and only 1 directory, that that directory contains the MANIFEST.txt (if there is one), and that all other files are either in or below that directory.
Unless any existing formats don't follow this already, then I'd say it makes sense to make that assumption part of the spec. It seems like good manners to have a single base directory (with whatever name the generator chooses) in the tarball. The MANIFEST should be inside that directory, and all paths listed in the manifest should be relative to the MANIFEST, not to the root of the tarball. An archive would also need to have a single base directory and the default MANIFEST I've listed above would assume that '.' is that directory. I believe this describes the distro format as well as the Gardens exports (at least it describes the one I downloaded from my site) and the format described here: https://wiki.getpantheon.com/display/PANTHEON/Importing+Existing+Sites
Unless there's a good reason to allow anything other than a single directory at the base of the tarball, I believe adding this requirement to the spec puts us back on the same page. It also removes the need for a consumer to have to resolve paths such as './/.sql' and '[docroot]/xxx'.
For what it's worth, my POC implementation creates archives which match what you describe.
$ tar tzf archiver-2011-02-07T23-11-22-3.sitearchive.tar.gz | grep MANIFEST.txt
archiver-2011-02-07T23-11-22/MANIFEST.txt
$ tar tzf archiver-2011-02-07T23-11-22-3.sitearchive.tar.gz | grep index.php
archiver-2011-02-07T23-11-22/docroot/index.php
Sorry about the confusion. I'll try and keep this straight as we go on :)
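The two points above (a tarball can technically carry a path like '../../../somefile.txt', and a consumer may want to know what sits at the base of the tarball) can be sketched in Python. This is only an illustration, not part of any proposed spec: the function names unsafe_members, top_level_entries and build_demo_tarball are made up for this example, and the demo archive is built in memory with empty files.

```python
import io
import posixpath
import tarfile


def unsafe_members(names):
    """Return member names that are absolute or climb out of the extraction dir."""
    bad = []
    for name in names:
        normalized = posixpath.normpath(name)
        if posixpath.isabs(name) or normalized == ".." or normalized.startswith("../"):
            bad.append(name)
    return bad


def top_level_entries(names):
    """Return the set of first path components, ignoring a leading './'."""
    tops = set()
    for name in names:
        normalized = posixpath.normpath(name)
        if normalized in (".", ""):
            continue
        tops.add(normalized.split("/")[0])
    return tops


def build_demo_tarball(names):
    """Build an in-memory gzip'ed tarball holding empty files with the given names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            archive.addfile(tarfile.TarInfo(name), io.BytesIO(b""))
    buffer.seek(0)
    return buffer


# Hypothetical archive laid out like the proof-of-concept listing above.
demo = build_demo_tarball([
    "archiver-2011-02-07T23-11-22/MANIFEST.txt",
    "archiver-2011-02-07T23-11-22/docroot/index.php",
])
with tarfile.open(fileobj=demo, mode="r:gz") as archive:
    names = archive.getnames()

print(top_level_entries(names))
print(unsafe_members(names + ["../../../somefile.txt"]))
```

A consumer could reject the archive when unsafe_members() returns anything, and could use the size of top_level_entries() to tell a single-base-directory archive from a flat one.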
We still aren't on the same
We still aren't on the same page. I am suggesting that in the absence of a MANIFEST.txt file, the docroot is whatever top-level directory contains an index.php file. This is the example I gave:
$ tar tzf drupal-7.0.tar.gz | grep index.php
drupal-7.0/index.php
So "drupal-7.0" is a top-level directory in the tarball, and the files comprising the docroot live there. So drupal-7.0/includes/common.inc, drupal-7.0/modules/system/system.module, etc.
For an archive with a manifest file, I suggest it be at the top level, e.g.:
$ tar tzf foobar.tar.gz
./MANIFEST.txt
./whatever/path/the/manifest/says/index.php
... etc ...
By comparison, your example shows everything one level deeper down, which is different than the standard Drupal distro format.
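The no-manifest rule described in this comment (the docroot is whatever top-level directory contains an index.php file, plus a single .sql file at the root) could be sketched roughly as follows. This is a hedged illustration in Python, not anyone's actual implementation: find_default_docroot and find_default_sql_dump are hypothetical names, the member list is made up, and the "must match only one" behaviour is an assumption taken from the comments in the default manifest proposed earlier in the thread.

```python
import posixpath


def find_default_docroot(names):
    """Return the one top-level directory that directly contains index.php."""
    candidates = set()
    for name in names:
        parts = posixpath.normpath(name).split("/")
        if len(parts) == 2 and parts[1] == "index.php":
            candidates.add(parts[0])
    if len(candidates) != 1:
        raise ValueError("expected exactly one docroot, found %d" % len(candidates))
    return candidates.pop()


def find_default_sql_dump(names):
    """Return the one .sql file that sits at the root of the tarball."""
    dumps = []
    for name in names:
        parts = posixpath.normpath(name).split("/")
        if len(parts) == 1 and parts[0].endswith(".sql"):
            dumps.append(parts[0])
    if len(dumps) != 1:
        raise ValueError("expected exactly one .sql file, found %d" % len(dumps))
    return dumps[0]


# Hypothetical member list, shaped like `tar tzf` output for a distro-style archive.
members = [
    "drupal-7.0/index.php",
    "drupal-7.0/includes/common.inc",
    "drupal-7.0/modules/system/system.module",
    "./d7.sql",
]
print(find_default_docroot(members))
print(find_default_sql_dump(members))
```

Raising an error on zero or multiple matches keeps the defaults unambiguous; an archive that needs anything else would carry a manifest.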
We still aren't on the same
Agreed.
That's fine by me. My preference would be for some sort of base directory to make extraction a little cleaner but it's not a big deal. If we're all in agreement that the MANIFEST.txt should be at the base level of the tarball then I'll make that change to my proof of concept code.
Ok, that change is up at:
Ok, that change is up at: https://github.com/ronan/backup_migrate_archive
New example archive at: http://ronandowling.com/archiver-2011-02-14T22-37-11.sitearchive.tar.gz
Custom File Extension
Does anybody else have any opinion on using a custom file extension (in addition to the real file extensions, of course)? As a consumer it would be a nice benefit to be able to identify an archive by its name without having to unpack it and inspect its contents. Backup and Migrate uses file extensions to determine what has been uploaded and how to handle it on restore, so this would be nice to have to distinguish the archives from the old tarballs B&M created. This would obviously not be absolutely required (in keeping with allowing existing distros to be grandfathered into the spec) but could maybe be highly encouraged for generators. I'm not aware of any downsides to adding an extension to the file other than consuming a few more characters of the filename limit.
This could also help 'brand' the idea a little for users too.
In my proof of concept I've added .sitearchive, which is pretty self-explanatory but a little lengthy. I've also suggested .drop and .drupal above (.archive and .site have other usages so should probably be avoided).
Any thoughts on making this a part of the standard?
I think a common extension is
I think a common extension is a fine idea. Java uses .jar (which I think is in ZIP format). We could use .dar. However, I think we want to use gzip'ed tarballs since that is what everyone is doing already. .dar.tar.gz is too long. Just .dar loses the fact that it is gzip'ed. I guess we could use .dar or .dgz. I'm not sure how much value that adds.
Something longer is clearer but then kinda verbose.
Shrug. I have no clear opinion.
Drupal.org distribution file format discussion
See also this thread: http://drupal.org/node/914284 that talks about such formats for Drupal.org
Drush module?
Moshe,
I went to download the Drush module and the link is broken now.
Any chance you could put it back and, perhaps, bring it up-to-date with Ronan's?
Jim
@Jim http://drupalcode.org/pr
@Jim
http://drupalcode.org/project/drush.git/blob/HEAD:/commands/core/archive...
Stale thread?
Apparently this functionality already exists in 5.x?
Looks like bjaspan, among others, is making forward progress
http://drupal.org/node/1152020
Also looks like Acquia Dev Cloud is planning on supporting this format. They claim it is available in Drush 4.5, which is interesting because we don't have 4.5. They also have a download of archive.drush.inc but it is empty.
--
Christian
Full on!
That's right - archive-dump was imported in 5.x and 4.x (so it will be in the next release, 4.5). Acquia will support this, from what I understand, and so will Aegir 2.0.
Indeed!
We are excited to support this at Pantheon also.
https://pantheon.io | http://www.chapterthree.com | https://www.outlandishjosh.com
If anyone can share their
If anyone can share their restore code for a site archive, that would be swell. I need to write archive-restore for core drush since our unit test system will use site archives to cache built environments. Would be nice to start from someone else's working code.
I don't see any examples of a
I don't see any examples of a nested array like
sites[0][docroot] = "/Users/mw/htd/d7"
being read by parse_ini_file(). When I do that, I get the error
syntax error, unexpected TC_SECTION, expecting '=' in ./MANIFEST.info
It seems that one level of hierarchy is OK but not two. So, sites[docroot] is fine but sites[0][docroot] is not.
We could copy Drupal's ini parser but WTF. We should try to be compatible with PHP's lame parser.
How about we use sections to delimit sites, and stick to one level with files-public and such? That parses properly. For example
[Global]
formatversion = .2
generator = "Drush archive-dump"
[Site 0]
sitedir = sites/default
files-public = sites/default/files
[Site 1]
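To make the proposal above concrete, here is a rough Python sketch of how a generator might turn a nested structure into one section per site with a single level of dash-joined keys such as files-public. It is only an illustration under assumptions: flatten and to_sectioned_ini are hypothetical helpers, the input data is invented, and quoting every value is a choice made for this example rather than something the thread has settled.

```python
def flatten(prefix, value):
    """Yield (key, scalar) pairs, joining nested dict keys with dashes."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}-{key}" if prefix else key
            yield from flatten(child_prefix, child)
    else:
        yield prefix, value


def to_sectioned_ini(global_values, sites):
    """Render a [Global] section followed by one [Site N] section per site."""
    lines = ["[Global]"]
    for key, value in global_values.items():
        lines.append(f'{key} = "{value}"')
    for index, site in enumerate(sites):
        lines.append(f"[Site {index}]")
        for key, value in flatten("", site):
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"


# Hypothetical input: one site with a nested "files" entry.
manifest_text = to_sectioned_ini(
    {"formatversion": ".2", "generator": "Drush archive-dump"},
    [{"sitedir": "sites/default", "files": {"public": "sites/default/files"}}],
)
print(manifest_text)
```

Because every key ends up at one level inside a section, the result avoids the two-level bracket syntax that parse_ini_file() rejected.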
+1
I like this since assuming it solves the data structure issue it is also more human-readable than the alternative.
Also agreed that doing a more complex "drupal only" thing doesn't make any sense in this modern era. If it turns out we can't possibly keep compatibility with PHP's "limited" parse_ini_file() function, I would favor we go the full monty and just have these written in JSON. That should support any data structure we want, and if formatted prettily can be decent for hand-editing.
https://pantheon.io | http://www.chapterthree.com | https://www.outlandishjosh.com
I'm okay with either
I'm okay with either approach: sections, or json.
Mostly what I want is to have Drush 4.5 shipped with the archive-dump command included so people can at least start using it for single sites. If discussion about changing the manifest file format will hold that up, I suggest just removing the manifest file completely. That would mean we'd have to restrict the command to accepting a single-site alias, which would be fine with me for v1.
OK, I am changing
OK, I am changing archive-dump to produce sections. FYI, the database items get a couple dashes. See last items below.
[Global]
formatversion = ".2"
generator = "Drush archive-dump"
[Site 14]
name = "Site-Install"
docroot = "/Users/mw/htd/d7"
sitedir = "sites/default"
files-public = "sites/default/files"
database-file-default = "./d7.sql"
database-driver-default = "mysql"
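For consumers that are not written in PHP, a sectioned manifest like the one above is also easy to read with a stock INI parser. The following is a minimal Python sketch using the standard configparser module, not the Drush implementation: read_manifest is a hypothetical helper, and stripping the surrounding double quotes by hand is an assumption about how the values should be interpreted (configparser keeps them as part of the value).

```python
import configparser

# The manifest text from the example above, embedded for a self-contained demo.
MANIFEST_TEXT = '''[Global]
formatversion = ".2"
generator = "Drush archive-dump"
[Site 14]
name = "Site-Install"
docroot = "/Users/mw/htd/d7"
sitedir = "sites/default"
files-public = "sites/default/files"
database-file-default = "./d7.sql"
database-driver-default = "mysql"
'''


def read_manifest(text):
    """Split a sectioned manifest into global values and per-site values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    global_values = {}
    sites = {}
    for section in parser.sections():
        # configparser keeps the double quotes, so remove them here.
        values = {key: value.strip('"') for key, value in parser.items(section)}
        if section == "Global":
            global_values = values
        elif section.startswith("Site "):
            sites[section[len("Site "):]] = values
    return global_values, sites


global_values, sites = read_manifest(MANIFEST_TEXT)
print(global_values["generator"])
print(sites["14"]["database-file-default"])
```

Keying the sites by the text after "Site " keeps whatever identifier the generator chose (14 in the example) instead of assuming the sections are numbered from zero.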
Just committed this change. I
Just committed this change. I also changed the name of the file to MANIFEST.ini from MANIFEST.info.
Documentation?
After all this discussion I was hoping to find a specification draft....
I opened an issue in the drush queue: http://drupal.org/node/1294632
Where do we work on this...
Moshe thinks this should be hashed out in this group instead of a drush queue.
I agree that we need participation beyond just drush to get this in shape.
But it's awfully quiet around here...
I just made a bit of progress on the html file I started:
http://drupalcode.org/sandbox/helmo/1277350.git/blob_plain/refs/heads/ma...