Building Drupal projects on Git

joshk's picture

Building a Drupal project using Git is different from building Drupal itself, and requires its own workflow. I've been kicking ideas back and forth with Sam Boyer lately about how to make this process take advantage of all of Git's power, but also be newbie-safe and as frictionless as possible. I think what we've come up with is pretty good: there's even code written! The process I am going to explain allows the following:

  • Git-based updates for Drupal core and contrib
  • The ability to patch/tweak core/contrib without the complexity of vendor-branches
  • Portability for local development or Git-based deployment
  • Unrestrained custom development: feature branches, tags, multiple repos
  • Safe patterns that minimize conflicts and provide a clear resolution process

Pretty cool, eh? Expect a larger manifesto post from Sam in the near future, but for now here's where we are heading.

Using Two Git Remotes: upstream and collab

In a vanilla git situation, cloning an existing repository creates a single remote called origin, allowing you to push and pull code to/from that remote. However, a single origin isn't going to be enough to enable a best-practice Drupal on Git workflow. Why? Well, any good Drupal project should start with a clone of the canonical Drupal core project repository (or a repo that tracks core like Pressflow). This lets you pull updates for security and easily contribute back any innovations you make. Anything else is starting off on the wrong foot.

However, this immediately creates a problem because any project is going to need to add code in addition to core. Unless you're Dries, webchick (or davidstrauss) you're not going to be able to push back to the single origin, meaning you can't work as part of a team or use the power of git in any deployment workflow. No bueno.

Luckily this is something git was explicitly built to handle. The answer is to take a small step beyond the vanilla git workflow, and create two remotes: one for upstream, and one for collab. As you might have guessed, you'll use upstream as a "pull-only" source to get updates and make patches, while collab will hold your custom modules or themes, and allow you to work with a team and implement git-powered workflows.

The actual host for collab can be anything. You could use your own private repository, github (public or private), or even a drupal.org-hosted sandbox if you don't mind your work being completely public. Likewise, your upstream could be any valid source. Drupal core from drupal.org is always a good choice, but any repository that starts with the canonical Drupal history is also valid. You might want to use Pressflow, or maybe your team maintains its own "drupal-plus" repository (e.g. a distribution or quick-start set) which you use as the upstream for projects.
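To make the two-remote setup concrete, here is a hand-rolled sketch of it. Local bare repositories stand in for the real hosts (in practice "upstream" would be git.drupal.org/project/drupal.git and "collab" your team's repo on github or a private server); all paths here are throwaway stand-ins.

```shell
# Local bare repos stand in for the real upstream and collab hosts.
rm -rf /tmp/dogdemo && mkdir -p /tmp/dogdemo && cd /tmp/dogdemo
git init --bare upstream.git
git init --bare collab.git

# Seed the stand-in upstream with one commit so there is history to clone.
git clone upstream.git seed
git -C seed -c user.name=Dev -c user.email=dev@example.com \
    commit --allow-empty -m "Drupal core (stand-in)"
git -C seed push origin HEAD

# The workflow itself: clone core, rename origin to upstream, add collab.
git clone upstream.git mysite
cd mysite
git remote rename origin upstream
git remote add collab /tmp/dogdemo/collab.git
git push collab HEAD     # publish the project to the team repo
git fetch upstream       # later: pull core/security updates from upstream
git remote -v
```

From here on, day-to-day work pushes and pulls against collab, while upstream is only fetched from.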

Contrib Modules as Git Submodules

A best-practice workflow will follow the same pattern for Drupal contrib as we've described for core: allowing project builders the ability to pull upstream updates and easily contribute back their changes if they want. There's a problem though: git.drupal.org necessarily separates every contrib module into its own repository. If your project started off as a clone of Drupal core, how can you include a separate repository for Views?

The answer is Git submodules, which are designed to handle this specific problem. However, these are an advanced feature, and it's important for us to have a consistent pattern for using them.

Luckily the use-case for contributed modules and themes is consistent, and the commands you'll need to add them as Git submodules, as well as to update them later, are the same every time. In the event that you need to apply a patch or make an enhancement ahead of the upstream maintainer, the same process of adding a collab remote applies.
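Done by hand, the contrib pattern looks roughly like this. A local bare repository stands in for git.drupal.org/project/views.git so the sketch is self-contained; the `protocol.file.allow` setting is only needed because the stand-in lives on the local filesystem.

```shell
# Stand-in for the real contrib repo (git.drupal.org/project/views.git):
rm -rf /tmp/dogsub && mkdir -p /tmp/dogsub && cd /tmp/dogsub
git init --bare views.git
git clone views.git seed
git -C seed -c user.name=Dev -c user.email=dev@example.com \
    commit --allow-empty -m "views (stand-in)"
git -C seed push origin HEAD

# The project repo (in practice, your clone of Drupal core):
git init project
cd project
git -c user.name=Dev -c user.email=dev@example.com \
    commit --allow-empty -m "core (stand-in)"

# Add the contrib as a submodule at the conventional path, then commit
# the .gitmodules entry and the submodule pointer together:
git -c protocol.file.allow=always \
    submodule add /tmp/dogsub/views.git sites/all/modules/views
git -c user.name=Dev -c user.email=dev@example.com \
    commit -m "Add views as a submodule"

# A teammate cloning the project materializes the submodule with:
cd ..
git clone project teammate
git -C teammate -c protocol.file.allow=always submodule update --init
```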

Custom Development in collab

Development particular to your project happens directly in the primary repository, and is tracked in the collab remote. This lets you work with a team, taking full advantage of feature-branching, local development, and branch/tag-based deployment workflows. With the small change of using collab where you're used to using origin, the usual git workflow of checkout, add, commit, pull and push works the same as ever.

This also means you should be able to use your favorite Git GUI or other power tools with no problems.

The only complication here is the case where you have multiple developers adding Git submodules as described above. In that case, in addition to pulling code from collab as usual, it is necessary to run the git submodule update command, and potentially rebase your code if you've added the same submodule as someone else and have a tree conflict.
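The happy-path version of that scenario can be sketched end to end: one developer adds a submodule and pushes, and the other must both pull and run `git submodule update --init` before the new module appears. Local bare repositories again stand in for the collab host and for git.drupal.org; all names and paths are stand-ins.

```shell
# Stand-ins: a collab repo and a contrib repo.
rm -rf /tmp/dogteam && mkdir -p /tmp/dogteam && cd /tmp/dogteam
git init --bare collab.git
git init --bare cck.git
git clone cck.git seed
git -C seed -c user.name=Dev -c user.email=dev@example.com \
    commit --allow-empty -m "cck (stand-in)"
git -C seed push origin HEAD

# Alice starts the project and pushes it to collab; Bob clones it.
git clone collab.git alice
git -C alice -c user.name=Alice -c user.email=a@example.com \
    commit --allow-empty -m "initial project commit"
git -C alice push origin HEAD
git clone collab.git bob

# Alice adds a contrib submodule and pushes.
cd alice
git -c protocol.file.allow=always \
    submodule add /tmp/dogteam/cck.git sites/all/modules/cck
git -c user.name=Alice -c user.email=a@example.com \
    commit -m "Add cck submodule"
git push origin HEAD

# Bob catches up: a plain pull fetches the submodule pointer, but the
# submodule's contents only appear after `git submodule update --init`.
cd ../bob
git pull
git -c protocol.file.allow=always submodule update --init
```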

Visualize It

[Diagram: example project layout with upstream and collab remotes]

In this version we have the sitename project residing in /var/www/drupal on a server. Its main upstream is the official git.drupal.org/project/drupal.git and its main collab is on github at sitename/drupal.git.

Additionally, we have added views, wysiwyg and jquery_ui as submodules from drupal.org, and the tinymce library from github. We have created a collab repo for jquery_ui because we needed to update some of its libraries.

The site's custom modules and theme are stored in the primary collab repository.

Introducing dog

The workflow described above is safe and solid, but running all those git operations is a practical nightmare:

  1. Repetitive stress injury is no laughing matter.
  2. Missing a step or making a typo means you're at risk.
  3. Git has all the info you need, but quickly assessing the status of a complex repository is a multi-step process.

As I'm fond of saying, human beings are really bad at repetitive rote tasks. It's not what we evolved to do, and we're unhappy and error-prone when subject to those conditions. Computers, on the other hand, love repetition and rote tasks. So let's make the robots do the $&*%'ing work!

dog = a Drush extension for "Drupal on Git"

The Drupal project already has a wonderful robot helper tool in Drush. Since the patterns we are describing are completely regular, this is a perfect use-case. Better yet, code is online here:

http://drupal.org/project/dog

Contributions are encouraged. As of right now, here's what dog is specced to do for you:

dog-init [--upstream] [--branch] [--collab] [<directory>]

Initializes a new local project repository for building Drupal on Git.

  • --upstream defaults to latest major stable branch (e.g. Drupal 7.x), accepts drush dl style shorthand for drupal.org sources, or a full git url for using non-drupal.org remotes.
  • --branch local branch name; defaults to master
  • --collab remote collab repository; defaults to upstream
  • <directory> where to make the repository locally; defaults to the repository name of upstream

Example: drush dog-init --upstream=6.x --collab=git://github.com/joshk/my-drupal-project my-new-drupal6-project

dog-dl [--collab] [<project>] [<destination>]

Downloads a contrib from Drupal, sets up the submodule, and updates the main collab repo with the new information.

  • --collab optional collab repo for this contrib. Necessary if you don't have write access to the drupal source and intend to make local changes. Can be added later.
  • <project> project from drupal in drush dl style; also accepts a full git uri to support non-drupal remotes
  • <destination> destination for the module/theme; defaults to sites/all/modules or sites/all/themes

Example: drush dog-dl views-6.2.x

dog-collab [<uri>] [<directory>]

Adds a new collab remote to a module, theme or the main repository if one was not set up initially.

  • <uri> the location of the collab remote
  • <directory> path to the module or theme directory, or the Drupal root; defaults to current working directory

Example: drush dog-collab git://github.com/joshk/my-views-patches sites/all/modules/views

dog-catchup

Pulls collab updates and automatically brings new submodules in/up to date.

Example: drush dog-catchup

dog-upstream-update [<directory>]

Pulls upstream updates and commits them to the collab remote if one exists.

  • <directory> optionally specify a directory to update; defaults to current working directory and works recursively.

Example: drush dog-upstream-update sites/all/modules/views

dog-status

Parses main repository and submodule status and presents an overview of the entire project.

Possible alias: dog-vet

dog-remove [<directory>]

Completely removes a contrib added via dog-dl and pushes that change to collab.

  • <directory> directory of contrib to remove

Example: drush dog-remove sites/all/modules/views

Possible alias: dog-gone

Project Manifest

In order to maintain the integrity of a project and ensure portability for local development and deployment, dog maintains a manifest file for the current local project. This allows us the potential to dog-rollup a project into a manifest file and then dog-rollout the same project elsewhere, in a similar fashion to drush_make.

However, the dog manifest is entirely git-centric and must include the upstream and collab information. It will likely also be stored in JSON format.
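Purely as illustration of the idea (the format was not finalized at the time of writing, and every key name below is a guess, not dog's actual schema), such a manifest might record something like:

```json
{
  "core": {
    "upstream": "git://git.drupal.org/project/drupal.git",
    "collab": "git@github.com:sitename/drupal.git",
    "branch": "master"
  },
  "contrib": {
    "sites/all/modules/views": {
      "upstream": "git://git.drupal.org/project/views.git",
      "collab": null,
      "branch": "6.x-2.x"
    }
  }
}
```

The point is simply that both remotes, per-contrib, live somewhere git itself cannot record them.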

In the longer-run we hope to see more convergence between drush_make, the dog manifest file and possibly the site archive format since these are all different approaches to describing a Drupal project.

Scriptability

As a tool designed to automate the low-level git workflow, dog is itself designed with scriptability in mind. All commands which allow interaction should include a -y flag to run non-interactively, and all commands should support a --backend or --json flag to produce their output as script-friendly JSON.

Future Potential

We're hoping to get many Drupal projects "on the dog sled" to help "vet" these patterns and create critical mass around a set of best practices. There are also obvious implications for Drupal distributions, as well as the update manager. The sky is the limit here.

Comments

This is very close to what

mikey_p's picture

This is very close to what I've been using for some time, with one exception: 'collab' vs. 'origin'. Why learn another Drupalism when anyone who has been using git for some time will be very used to using origin as a push destination, etc.? I'd rather not have another non-standard here when typing 'origin' is second nature, and well set into my muscle memory.

@mikey_p It is explained in

omega8cc's picture

@mikey_p

It is explained in "Using Two Git Remotes: upstream and collab" - you will often work with projects that already use origin elsewhere (on d.o, GitHub, etc.); you can't add another origin, and you can't push to the "original origin" for projects you didn't start. Hence, using upstream and collab as a new standard in this workflow makes perfect sense.

I think this is a point worth debating

joshk's picture

I'm not sure mikey_p isn't onto something. I'm definitely against introducing new "drupalisms" unnecessarily, and it's true that people are used to "origin", and it still makes semantic sense in that the scope of the repository is the project itself (the site being built) and thus having the repository where custom code is developed called the "origin" seems fitting.

We're talking about projects that we are setting up ourselves, starting with Drupal and then adding our own code. The case in which there's a third source that must be called "origin" seems rare to me. Do you think this will occur often? Maybe I'm missing something...

It is not only about possible

omega8cc's picture

It is not only about possible conflicts with some existing "origin" somewhere.

Even if "collab" == "origin", there will be still "upstream" added anyway.

I found it confusing many times while working with third-party and my own stuff, especially with cloned versions of the same module, so I stopped using "origin" at all and instead use more meaningful names for my remotes, like "github", "gitorious", "drupal-sandbox" and so on. This is why I like ideas like "dog-vet" and "collab" + "upstream": we are not machines, and we like funny or at least meaningful associations, since they help us remember where we are pushing/pulling the code :)

Grace ~ http://omega8.cc

Using origin makes sense

jrsinclair's picture

Yeah, I'd add a +1 for having collab changed to 'origin'. I think this would help reduce confusion for newbie users who might be prone to manually entering something like git push without specifying a repository. If the collab/origin/working repo is called 'origin' then git will push there automatically rather than throwing an error about not specifying a repo.

I know 'origin' probably isn't as semantically descriptive as 'collab', but I think the benefits of not adding drupalisms and not confusing newbies are probably worth it.

The more I think about it the more I agree

joshk's picture

Sticking with the normal behavior is the right idea, and the whole Dog automation system wouldn't really work if you didn't start out with a dog-style (furiously resists making pun) repository. In other words, taking over an existing git project which already had an "origin" would require minor surgery anyway.

Plus it's not that hard to rename remotes. Let's see if we can get Sam to weigh in.

Kind of nice to see us

sdboyer's picture

Kind of nice to see us building a (small! seriously not complaining!) bikeshed on this topic. I'll take that as an indicator of general interest in this approach on the whole, that we can focus on something like remote naming semantics.

I'm actually just working on the DogSled...err, manifesting system now, which is the central place from which a decision about the naming of the 'collab'-purposed remote would really happen. And for the most part, everyone here is right - it doesn't matter. And I don't care. It's quite easy to have the names used for both 'upstream' and 'collab' be set on a per-Dog-managed-project basis. And it probably is best to have it per whole project, not per instance - you should be using the same word to refer to the same remote as all your colleagues are.

The place where it really matters, IMO, is the docs/helptext. e.g., in the help for dog-init:

'collab' => 'The URI to use as the collab repository for this instance. If unspecified, the upstream URI is used.'

I don't know how to make that text work if we replace collab with origin, as the assumptions about origin actually put the reader at a deficit; after all, in the dog-init case, we're not cloning from collab at all. Judging from http://drupal.org/node/1122642 , it seems like that confusion could already be bad enough, without even calling it 'origin'.

On a more general note, remote naming (it being a completely local-to-your-repo thing) is one of the safest areas in Git to add drupalisms. People name remotes crazy things; at least these names are pretty descriptive. Hell, see the git.git merge logs - Junio Hamano (Git's maintainer) names all his remotes as two-letter abbreviations of the most frequent contributors.

Agreement

jrsinclair's picture

@sdboyer You're totally right about the bikeshed. Everyone seems down with the dual repo, and that's probably the most important thing to grasp about the proposed dog workflow.

In defense of the 'origin' naming proposal, I do agree with @joshk's point that using origin would make it slightly easier to shoehorn dog into existing projects. At the same time, clear documentation is going to make life easier for everyone. Whatever helps people start working smarter faster.

On the subject of helping newcomers, perhaps a diagram that makes the dual-repo concept painfully clear would help. Something like the following?

[Diagram: the dual-repo (upstream/collab) concept]

Apologies for the dodgy clouds - the diagram was done in a hurry.

Definitely a case where a

sdboyer's picture

Definitely a case where a picture is worth a thousand words. You're entirely right that the goal is people working smarter, more standard-ly, faster - and while the diagram could maybe use a little work, it definitely captures the basic idea.

Just to clarify, I think the

mikey_p's picture

Just to clarify, I think the basic idea behind the two remotes, upstream and collab, is very sound, and is what I do now. I suppose my overall takeaway point would be that making the actual remote name used for different purposes configurable via a drushrc or something like that would be an excellent idea. (This of course would also support overriding via a site/project-specific drushrc as well ;)

I also want to make it clear that I'm hugely in support of this proposal, as it has the potential to remove the biggest barrier to this workflow currently, which is that it's just plain a lot of work to check the status of each submodule, and handle its upstream vs. collab when needed. Automating that step alone would make a night-and-day difference in enabling a proper all-git workflow.

FWIW, I've noticed that the Kohana folks seem to encourage a somewhat similar workflow for Kohana core and the core Kohana modules. (They don't go into much detail on the collab side, but encourage your actual Kohana checkout to use submodules for each Kohana core module.)

I'd love to see what the implications of moving Drupal core development over to that type of approach would be, but I imagine that our dependencies are too intertwined to support that kind of workflow for core modules and subsystems.

Very cool :-) Going to try!

wojtha's picture

Very cool :-) Going to try!

Manifesto

jrsinclair's picture

P.S. After thinking about this, I'm very, very interested in reading this manifesto that joshk mentions. Any chance of a vague guess at when we might be able to see it? Next year? Next month? Next week?

Somewhere between next week

sdboyer's picture

Somewhere between next week and next month. With any luck, next week. The two basic goals of the manifesto, as I'm writing it right now, are to a) step through the process of building a dog-managed site, and b) explore the world of possibilities that the standardized git approach opens up.

subscribing

clintthayer's picture

+1

Learning more about Git and looking for a means to manage drupal projects for the long term. If others have good white papers on this topic would love to see them. Thanks.

This is great! The Drush

Owen Barton's picture

This is great!

The Drush maintainer team has been talking about something like this for a while - we also really want to make this kind of sophisticated git workflow easy to use. We have actually been considering making the next version of the "pm" commands highly git-centric (with a much simplified wget function as fallback). From my point of view it looks like what is described here is exactly what we need - I think we would want to include it in some higher-level interfaces (project name parsing/validation, dependencies, pm-info, pm-updatecode etc), but this would get us a long way. For reference see http://drupal.org/node/908212, http://drupal.org/node/814174, http://drupal.org/node/759906, http://drupal.org/node/797190.

The only thing I am not sure I understand is the manifest file - doesn't this just replicate the contents/purpose of config and .gitmodules? The URL for the collab repository should already be sufficient to rebuild the exact tree using "git clone --recursive git://host/repo.git", so I am not sure that distributing a file to do the same thing adds much. Of course, it would be really useful and important to be able to list out a manifest to summarize "what is in the site and where does it come from/go to", since collecting that info with submodules is really not that fun to do by hand. This feels more like human-readable command output, though, not something that needs to live in a file.

There's some background on

eliza411's picture

There's some background on the manifest file at http://drupal.org/node/914284. Not sure if it will help it make more sense, but you'll get an idea of who was concerned about it, and they might be able to explain more clearly.

The Drush maintainer team has

sdboyer's picture

The Drush maintainer team has been talking about something like this for a while - we also really want to make this kind of sophisticated git workflow easy to use. We have actually been considering making the next version of the "pm" commands highly git-centric (with a much simplified wget function as fallback). From my point of view it looks like what is described here is exactly what we need - I think we would want to include it in some higher-level interfaces (project name parsing/validation, dependencies, pm-info, pm-updatecode etc), but this would get us a long way.

I considered writing this as patches to drush core (albeit quietly) for a couple weeks before deciding to roll it out as a separate package. Lemme quickly be clear about what, in my mind, having it as a separate package does/does not mean:

I ultimately arrived at dog as the right approach after considering a number of different levels of drush integration and considering the implications that each level would have on ease-of-use for users, consistency of the overall experience, the ability for the system to 'right' itself after user interference, etc.
I do not see dog as a challenge to or replacement for the existing pm functions, or even necessarily the git_drupalorg package handler. They work well for the patch-together-your-workflow case, and however cool dog might get, we shouldn't ever take away the swiss army knife. Folks can be luddites if they want :)
Also, I have no problem with eventually rolling dog directly into drush once it's more complete, if we all think that's the best way to go. I think it'd be great, actually - would certainly make provisioning with dog easier. But for now, I'd like to keep it separate until it matures, at least into alpha code.

That's where I'm at on it. Sorry to have been a bit quiet about it and just handwave about having a manifesto (which is, yes, still in progress); I kinda went dark on big, wide discussion when I realized that this needed a foundation. Now that this bit is out at least, I'd really welcome some wider discussions with drush peoples - though waiting for the manifesto could really give the fullest possible context.

The only thing I am not sure I understand is the manifest file - doesn't this just replicate the contents/purpose of config and .gitmodules? The URL for the collab repository should already be sufficient to rebuild the exact tree using "git clone --recursive git://host/repo.git", so I am not sure that distributing a file to do the same thing adds much. Of course, it would be really useful and important to be able to list out a manifest to summarize "what is in the site and where does it come from/go to", since collecting that info with submodules is really not that fun to do by hand. This feels more like human-readable command output, though, not something that needs to live in a file.

Buncha reasons for the manifest. Here's a super-simple one to start: pretty much every dog repository is going to have two remotes - upstream and collab. There's no way to record that directly using submodules - it needs to be kept somewhere else.

There's also the case where you want to be able to make a pseudo-submodule that tracks a particular collab/ in a particular subrepository. You can't really take advantage of submodules very effectively at all in that case (nor will a recursive clone get you anything). That case is not on my immediate critical path with this, but it still shouldn't be impossible.

There's also the case where you might want to use a working directory for a repo that's outside of the webroot (helps address http://drupal.org/node/1119802). No way to set that up natively with submodules, but it's easy if you have a meta-system like dog retaining config settings for core.worktree that it can roll out in every new instance.

That last case highlights where I expect a lot of additions to be useful for the manifest file - custom repo config (as in .git/config) that is impossible to transmit using any built-in git functionality. And not even just config - this system will seriously leave GUI users behind unless we can roll at least some of the functionality into git hooks. And ensuring that the right sets of git hooks are attached to a repo can't be done without local action - managed by something like dog, reading from its manifest.

So while I most definitely agree that part of dog's responsibility is to automate complex-multi-step processes into unified concepts with single commands (because, let's be honest - submodules are really frustrating and annoying), it'll definitely need its own manifest data to do it properly.

I do not see dog as a

msonnabaum's picture

I do not see dog as a challenge to or replacement for the existing pm functions, or even necessarily the git_drupalorg package handler. They work well for the patch-together-your-workflow case, and however cool dog might get, we shouldn't ever take away the swiss army knife.

I don't see it as an either/or situation. If you need to circumvent pm to implement dog, I see that as a problem. I'd rather use this as an opportunity to fix pm, or detach the VCS stuff from pm like we've discussed before.

Regardless of how widely accepted dog becomes, there will be other approaches, so drush should enable projects like this to be very lightweight additions to the existing apis.

I ultimately arrived at dog

Owen Barton's picture

I ultimately arrived at dog as the right approach after considering a number of different levels of drush integration and considering the implications that each level would have on ease-of-use for users, consistency of the overall experience, the ability for the system to 'right' itself after user interference, etc.
I do not see dog as a challenge to or replacement for the existing pm functions, or even necessarily the git_drupalorg package handler. They work well for the patch-together-your-workflow case, and however cool dog might get, we shouldn't ever take away the swiss army knife. Folks can be luddites if they want :)
Also, I have no problem with eventually rolling dog directly into drush once it's more complete, if we all think that's the best way to go. I think it'd be great, actually - would certainly make provisioning with dog easier. But for now, I'd like to keep it separate until it matures, at least into alpha code.

To be clear, this was me saying "yay - you just did exactly what we were thinking", not suggesting that this should have gone into Drush core, or even needed a bunch of prior hand-waving/discussion. As I suggested, the current PM commands split the package handler and version control functions into separate plugins/engines (which made a lot of sense in the CVS/svn world) and so are currently architecturally not able to do the kind of things dog does - trying to wedge a dog in here would have resulted in a dead dog! We do want to fix PM to allow this, of course - but we need to get a better handle on what the next-gen API should look like.

I think it totally makes sense to have this as its own codebase, especially at this stage. Once it is a bit more mature we can more clearly identify the interfaces (as in API, not UI) that PM would need (add project, upgrade project etc), and get dog, wget and git_drupalorg to all speak the same language. This would mean that normal pm commands could work the same way, from a UI point of view, on both dog and non-dog sites. I don't think we are at a point where it makes sense to attempt this, although it might be interesting to look at the existing package handler (e.g. wget) and version control (e.g. svn) interfaces, and see if/how they map onto dog commands/functions.

So while I most definitely agree that part of dog's responsibility is to automate complex-multi-step processes into unified concepts with single commands (because, let's be honest - submodules are really frustrating and annoying), it'll definitely need its own manifest data to do it properly.

Thanks for your answer here - this makes a lot of sense.

I guess my only question is the usability implications of users bypassing dog and doing things with git directly - obviously some git operations (commit, push...) should be fine and used frequently, but other operations could get the git metadata out of sync with the dog metadata, and/or break dog functionality. How do users know what is safe/unsafe? Of course you could break dog functionality even without a manifest, but it seems like the possibility of an out-of-sync manifest may make this more fragile. I am not sure to what extent this is really a problem, or what the solution is, though. Perhaps some dog-managed git config (auto-alias commands to include warnings if you try something risky, or use hooks to do the same?) could help prevent this?

Incidentally, have you seen http://drupal.org/project/githook?

@msonnabaum & @Owen Barton

sdboyer's picture

@msonnabaum & @Owen Barton re: eventual integration & dog/pm - awesome. It sounds like we're all quite on the same page, so I look forward to trying and figuring out lots of shit as dog evolves, all the while being conscious of the lessons & ideas we learn wrt the existing pm system.

I guess my only question is the usability implications of users bypassing dog and doing things with git directly - obviously some git operations (commit, push...) should be fine and used frequently, but other operations could get the git metadata out of sync with the dog metadata, and/or break dog functionality. How do users know what is safe/unsafe? Of course you could break dog functionality even without a manifest, but it seems like the possibility of an out-of-sync manifest may make this more fragile. I am not sure to what extent this is really a problem, or what the solution is, though. Perhaps some dog-managed git config (auto-alias commands to include warnings if you try something risky, or use hooks to do the same?) could help prevent this?

Incidentally, have you seen http://drupal.org/project/githook?

I think somebody may have pointed me at githook before, not quite sure. What I do know is that using hooks for the sort of validation/reinforcement of git actions done in githook is crucial to this very valid and important question - how do we keep dog and humans playing nice together? My thinking has two parts:

  • dog-vet is our line of defense. It needs to be able to do really thorough vetting of an instance, identify inconsistencies and suggest solutions, or even automagically fix problems where possible.
  • And hooks. Good hooks are what will make it OK when folks make mistakes, or directly use git commands that we provide (necessary) wrappers for. They're also the only thing I can think of that will make this system passable for GUI users.

IMO, dog's acceptance is basically going to turn on how well we handle this problem. We want all these excellent goodies in the background, but if it all ends up meaning that using dog is as or more onerous than using git directly, then dog will be a failure. And rightly so.
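A minimal illustration of the kind of hook meant here (hypothetical sketch, not dog's actual code): a post-merge hook that re-syncs submodules after every pull, so a plain `git pull` in a managed repo behaves roughly like dog-catchup.

```shell
# Demo repo; in practice this would be your dog-managed project checkout.
rm -rf /tmp/doghook && git init /tmp/doghook

# Install a post-merge hook that keeps submodules consistent with the
# tree we just merged to (hypothetical example of dog-managed hooks):
cat > /tmp/doghook/.git/hooks/post-merge <<'EOF'
#!/bin/sh
git submodule update --init
EOF
chmod +x /tmp/doghook/.git/hooks/post-merge
```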

This is great, thanks to

ChrisBryant's picture

This is great, thanks to everyone who's been working on this. We've been evaluating and planning for a similar workflow. It will be great to have a standard around this which will make collaborating even easier and this approach makes a lot of sense. Snoop Dogg would be proud!

This just crossed my mind,

mikey_p's picture

This just crossed my mind, but I imagine it would be a concern for quite a few folks: the number of hosted git repos they may need. Supposing they do some patching of contrib modules and make those public repos on github, that wouldn't count against any quota there and could encourage collaboration*. But if a project needs custom modules, then that could mean at least one additional collab repo on top of the collab repo for core. This could add up fast depending on how your hosting/github/unfuddle/whatever bills you. This could be quite a big downside.

Possible solution: would it be okay to keep things that don't muck with core, but don't have an upstream repo (i.e. custom modules and themes), directly in the core repo? I've been doing this now, and it doesn't seem to interfere with my ability to merge with the upstream repo of Drupal core from git.drupal.org.

  • That's another scenario I hadn't considered, say I take module Foo and apply 3 or 4 patches to it, and submit those on d.o, and push the patched version to my collab repo, is there a way to say reuse that collab repo, or use that collab repo as a starting point for other Drupal projects I work on using dog?

Yeah, we talked about this a

sdboyer's picture

Yeah, we talked about this a fair bit last week. There's no problem with doing that at all. It means you maybe get a slightly mucky history in your core repo, but oh well, we're all used to that. Best part is, if later on you decide to take that custom module and make it its own real repo that you contribute back, you just run a quick git filter-branch and separate out the commits on the subtree for the theme/module and make it its own separate repo. Ipso facto magico, push it up to d.o.

Also, if repo proliferation is an issue, you can also just use a d.o sandbox for that code (if the code is appropriate to put in public).
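For reference, the filter-branch dance Sam describes might look roughly like this. Paths are illustrative, and it's safest to work on a throwaway clone, since filter-branch rewrites history:

```shell
# Hypothetical: pull sites/all/modules/custom/foo out of the site repo
# into its own standalone repository.
git clone /path/to/site-repo foo-extract      # path is illustrative
cd foo-extract
# keep only the commits that touch the foo subtree, re-rooted at /
git filter-branch --subdirectory-filter sites/all/modules/custom/foo -- --all
# the repo now contains only foo's files and history;
# add a remote for the new project on d.o and push
```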

Deployability

bjaspan's picture

I confess to being a git neophyte but certainly plan to support it in Acquia Cloud soon. My primary input into the discussion, therefore, is to make sure that that whatever scheme we come up with is easily, automatically, and reliably deployable. This is critically important for all hosting environments like Dev Cloud and Pantheon that want to automate the deployment process.

In my mind, deployability requires:

  1. Accessible code

Somewhere, accessible to the hosting environment, there exists a single repo that contains 100% of the code that makes up the site's docroot as it should be deployed at any given moment. The repo can contain one or many branches, and the owner can do whatever they want with it, but when the hosting customer tells the environment, "please deploy the HEAD of branch ABC, or symbolic tag XYZ", the hosting environment needs to know exactly where to find all of the code. If that requires contacting multiple remote repos to get submodules or whatever, that is likely to cause an unreliable deployment process, because at any given moment some of those repos might be unreachable.

Also, if the hosting environment needs to know anything about the methodology the repo owner is using in order to find the code (e.g. "this is a DOG repo"), that is likely to lead to complexity and unreliability, because there will always be edge cases, people who need to do things slightly differently, whatever. So saying, "there are remote repos you need to contact, but you can find them in the file foobar/.manifest", while seemingly implementable, is going to be painful long term.

To understand the need for this, consider a cloud hosting environment. It might want to build new web nodes on the fly as load increases, so it needs to know where to get the code without any real-time intervention from the site owner. Or in a single-server setup, the server instance and disk that the site is currently running on might suddenly fail, so the environment needs to rebuild that server as fast as possible.

Those last examples imply another desirable property, if not a requirement. It should be possible, and in fact the most common setup, to have the primary repo for a site hosted at the hosting environment. When a new server needs to be launched, it is much more likely that the hosting environment can reach a repo on its own servers than that it can reach github or whatever, especially when the site is hosted in Singapore or Australia or on the moon.

  2. Commit notification

DC (and I think Pantheon) are set up to automatically deploy any new commits to the currently deployed branch. This means we need to know when a new commit arrives. If the repo is hosted locally, that's easy. If it is remote, either that repo provider needs to provide a remote callback for new commits (e.g. a URL they invoke), or else we'll have to poll it, which won't win any friends. I guess this is not really relevant to the repo methodology being discussed here, but I just thought I'd mention it. :-)
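For what it's worth, the "remote callback" usually takes the shape of a post-receive hook on the repo host. A minimal sketch (the notify URL is made up, and the actual POST is left commented out):

```shell
#!/bin/sh
# Hypothetical post-receive hook: git feeds one "<old> <new> <ref>" line
# per updated ref on stdin; report each, and optionally POST to the
# hosting environment's callback URL (illustrative).
NOTIFY_URL="https://hosting.example.com/deploy-hook"
while read old new ref; do
  echo "notify: $ref updated to $new"
  # curl -fsS --data "ref=$ref&rev=$new" "$NOTIFY_URL" >/dev/null \
  #   || echo "warning: could not reach $NOTIFY_URL" >&2
done
```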

  3. Writeability

Both Dev Cloud's and Pantheon's UIs allow the customer to perform various repo write operations (commits, branches, tags, etc.) from the UI. This requires us to have write capability on the repo. Currently, we both operate with a local repo that we control, so write access is easy to arrange. For a remote repo at github or wherever, presumably the user could give us credentials to that repo for us to use. This isn't that complicated.

However, it does point out one potential wrinkle. It occurs to me that one response to the need to have a single repo containing all the code will be, "well, the site developers can just push all the changes they want to the hosting environment's repo." In that case, though, if the hosting environment is deploying from a local repo but the primary repo lives elsewhere, then when the user performs an action from the UI (e.g. "create a new tag to deploy right now"), that tag will be created in the local repo---or else the environment will need to create it in the remote repo, then pull the changes, and then deploy them, and that might violate the expectations of a dev team that thought, "hey, we're pushing all our changes to the hosting env, so what are you doing writing to our primary repo?"

Sam: I'm looking forward to discussing all of this at our call next week. :-)

Sam: I'm looking forward to

sdboyer's picture

Sam: I'm looking forward to discussing all of this at our call next week. :-)

Me too :) But in advance of that, let me clarify one thing, re: this point -

Those last examples imply another desirable property, if not a requirement. It should be possible, and in fact the most common setup, to have the primary repo for a site hosted at the hosting environment. When a new server needs to be launched, it is much more likely that the hosting environment can reach a repo on its own servers than that it can reach github or whatever, especially when the site is hosted in Singapore or Australia or on the moon.

I think there's a key misconception here that threads throughout, so let me at least clear that one up. While 'upstream' repos are at some canonical location (e.g., drupal.org), 'collab' repos can be located anywhere. And 'collab' is the only repo from which a new site instance is ever rolled out. So a hosting provider invested in dog can (and should, to smooth the process) provide collab repos for everything - so all the repos are indeed locally available within the hosting provider's network. Which also means writeability, and notification hooks that are readily available. My thought all along has been that hosting providers like Dev Cloud would do all repo hosting locally, and maybe in the future allow hosting from an external provider.

One other point to address, though:

Also, if the hosting environment needs to know anything about the methodology the repo owner is using in order to find the code (e.g. "this is a DOG repo"), that is likely to lead to complexity and unreliability, because there will always be edge cases, people who need to do things slightly differently, whatever. So saying, "there are remote repos you need to contact, but you can find them in the file foobar/.manifest", while seemingly implementable, is going to be painful long term.

The goal of dog is to create a portable Drupal package that is both versioned and deployable, both human & machine read/writable, and easily encapsulated within wrapping deployment workflows, config management, and/or build systems. In the current implementation/spec, it carries a pretty minimal amount of information required to do the internal assembly - and most of that is just git settings to pass around. To that end, I should note that building workflow-specific applications on top of Git is not something we should be circumspect about - because the Git developers aren't. Read the git dev list for a bit - you'll see that the express intention is for people to build systems around Git.

That said, I entirely agree that one of the primary requirements for dog is that it is internally robust - that it is nigh-impossible to get projects managed by it into unrecoverable, or even unpredictable, states.

love the idea of In the

mike stewart's picture

love the idea of

In the longer-run we hope to see more convergence between drush_make, the dog manifest file and possibly the site archive format since these are all different approaches to describing a Drupal project.

--
mike stewart { twitter: @MediaDoneRight | IRC nick: mike stewart }

Site Builder Guide

logickal's picture

Along these lines, I've posted a first draft of a rewrite of the Site Builder Guide back at http://drupal.org/node/803746.

This thread has been very helpful in a number of ways, so I'd love for you all to take a look and let me know your feedback - file issues against it if you feel so inclined.

I should preface it by reiterating something that I tried to convey in the introduction - it is meant to be an entry-level guide to using Git in this manner, without any additional tools such as Drush or Dog. I definitely think that there will be additional pages outlining those workflows as well, but I felt that the best start was to approach it from a lower level, to give people a direction to follow and build upon with the additional tools.

Let me know your thoughts!

Good work

alanburke's picture

As the person whose work you wiped away - I couldn't be more delighted!
I really like how the instructions are independent of DOG or Drush.
We can add on other pages detailing where DOG and Drush can help automate certain tasks.
Some good diagrams are in order, that could help clarify that there are multiple 'read-only' external repos - for core, each of the individual contrib modules and themes etc, but only one 'writeable' repo is really needed [unless you patch/hack a module].

Thanks!

logickal's picture

First of all - let me just say I breathe a sigh of relief to read this. :) I felt a little bit of angst at totally blowing away your original documentation, but in the end (and with a little bit of convincing) I decided it was better to ask for forgiveness rather than permission.

Yes, I made a conscious effort to make it a workflow that could be implemented by someone with only Drupal and Git at their disposal. Even after we build documentation around using the additional tooling, I think it's important that there's a resource that explains the process from a lower level. I fully expect that this first-ish draft will change as we get more best practices defined based around what Drush and Dog do.

Also, good thought on diagrams - I agree 100%. One of my partners is going to be presenting at the local Drupalcamp and he's creating diagrams for the presentation that could very well make it into this document once they're done.

Example of sexy git diagrams

Liked how this was written;

Antoine Lafontaine's picture

Liked how this was written; great documentation!

Saw a few things I stumbled on while learning git fairly well explained.

Would really like to see something about subtree instead of submodules... I remember that there was a nice thing written about those in the git pro book just after the chapter on sub-modules (and how not to use them).

Now that you’ve seen the difficulties of the submodule system, let’s look at an alternate way to solve the same problem.

http://progit.org/book/ch6-7.html

Need to read more of the thread to see if this is discussed further down, but I'd like to see if some thought has been put into using subtree merges instead of submodules in this proposed workflow.

Switching Cores

alanburke's picture

One workflow situation has occurred to me.
How easy would it be to switch cores in this git scenario.
For example, we start a project with a clone of Drupal core from Drupal.org,
add modules, themes etc from individual drupal.org repos.

Time passes, the site gets popular, we need to move to Pressflow Drupal.
How easy is to do that, and maintain full history etc?

Would Pressflow have to start a new Repo that forks Drupal.org drupal core,
and once that was in place, we could switch upstream to Pressflow easily.

Or could we use the Pressflow git mirror as it currently stands?

As long as that alternative

sdboyer's picture

As long as that alternative core is a good steward and is itself based on a clone of the original Drupal core repository, then it's not too tough. It'd be more or less the same as a standard update from upstream, actually - except that you change the git URI for your upstream source first.
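In practice, the switch described here can be as small as repointing the remote. The URL and branch name are illustrative:

```shell
# Repoint the pull-only upstream remote at the alternative core...
git remote set-url upstream git://github.com/pressflow/6.git
git fetch upstream
# ...then merge exactly as you would for a normal core update
git merge upstream/master
```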

All variants should start w/canonical drupal core

joshk's picture

Would Pressflow have to start a new Repo that forks Drupal.org drupal core, and once that was in place, we could switch upstream to Pressflow easily.

This is correct. Ideally, any "alternative core" like Pressflow would start with the canonical Drupal upstream. That way you will be able to merge them in cleanly. I know this is the plan for Pressflow 7. It is also a fine plan for people who want to make very involved installation profiles or "drupal plus" applications.

In the realm of Drupal 6, you can generally do this with a "rebase". Git is smart enough to realize when files are the same, but it's a lengthy process and doesn't give you quite as nice a version history.

I'm new to git, but

jschumann's picture

I'm new to git, but understand the project layout you're talking about. I like the idea of letting dog handle the messy stuff for me, but I need to start working today, and dog isn't ready yet (or so I understand).

Can you publish the git commands you're using to structure a project using dog, so that I can create the layout now, manually, and let dog take over when it's ready?

Very valuable question,

sdboyer's picture

Very valuable question, should have laid this one out. Unfortunately, it'll probably be a bit tough for newbies to slog through - as you noted, what's under the hood here is messy. But if you follow these basics:

  • Use dog-init to set up the repository, as that's pretty much ready.
  • Use git-submodule to clone every upstream module/theme that you get from drupal.org.
  • For any custom module/theme, work right there in the main repository, don't bother with subrepositories.
  • Follow a branching strategy that's at least similar to git-flow.

What you create should be basically compatible with dog, once it reaches maturity.
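A rough manual translation of those bullets might look like the following. All URLs and names are illustrative, and this is only a sketch of the layout, not dog's actual commands:

```shell
# Start from canonical core, keeping d.o as a pull-only source
git clone git://git.drupal.org/project/drupal.git mysite
cd mysite
git remote rename origin upstream
git remote add collab git@example.com:team/mysite.git

# Upstream contrib as submodules
git submodule add git://git.drupal.org/project/views.git sites/all/modules/views

# Custom code lives directly in the main repository
mkdir -p sites/all/modules/custom/mymodule

# git-flow-style integration branch, pushed to collab
git checkout -b develop
git push collab develop
```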

Thanks a lot, definitely want

donquixote's picture

Thanks a lot, definitely want to try this.

Just one question.
Separate repositories for processed releases.

What about the stuff added by the packaging script - UPDATE.txt, expanded $Id$ and info file?
If we pull from git, we don't get this stuff, do we?
This is especially relevant for existing projects that already contain the added stuff. Switching to the unprocessed git version of modules and core will make a huge diff with a lot of pointless noise.

Switching costs

joshk's picture

Switching costs moving from an existing project to this style are unavoidable. The diff really shouldn't be that bad as a one-time thing. Adding the upstream via git surgery will be harder, and I don't imagine that would be an "automatic" feature for quite some time. The near-term answer is going to be to start new projects with dog, or be prepared to spend some time on the transition.

Having the processed core and

donquixote's picture

Having the processed core and contrib as "upstream", instead of the unprocessed one, would make the gap to a traditional work flow smaller, which would be a good thing. Or do you think the expanded $Id$ would have any unpleasant effects? Such as, more noise in changesets.

The expanded $Id$ is a big

sdboyer's picture

The expanded $Id$ is a big PITA. That's why we removed it as part of the migration. Look through all the new Git repositories, you'll see they've been removed. And look in any tarballs generated from releases made after the migration was completed - you'll see there are no expanded tags in there, either.

Ah, I actually misunderstood

sdboyer's picture

Ah, I actually misunderstood this a bit at first because I didn't read the link. So:

  • As I said, $Id$ tags should be gone. They no longer have any place in Git repositories...though I'm a bit ashamed to admit that, after spending an inordinately long time getting all the logic in place to make that happen, it turns out there's the gitattribute ident that would allow us to continue simulating the feature. Oh well. Anyway, point is that the tags should pretty much be gone from repositories.
  • LICENSE.txt won't make it in to repositories anytime soon. It's been suggested to me by legal-y people that this is a potential issue, though that conversation's gone dead in the water, and I'm operating under the assumption that it's a small enough problem that we can deal with it.
  • The information in the .info file is for the update manager. That's rendered totally unnecessary by git_deploy.
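For the record, the `ident` gitattribute mentioned above is just a per-path declaration, so a project that wanted to simulate the old keyword expansion itself could do so in a couple of lines (the path patterns are illustrative):

```shell
# Hypothetical: once a path is covered by the ident attribute, git
# expands $Id$ to the blob SHA-1 on checkout, a bit like CVS keywords.
printf '*.module ident\n*.inc ident\n' >> .gitattributes
git add .gitattributes
git commit -m 'Simulate CVS-style $Id$ expansion via gitattributes'
```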

Updating existing projects to use dog is not something I'm looking to deal with right away. It's important, obviously, but the basic functionality needs to be working for new sites before we can think about scooting existing ones into this format.

And to be clear, it's not just "especially" relevant for existing projects. It's ONLY relevant for existing projects. If you're using dog, you should never ever ever EVER have a single tarball from d.o in your system. All git repos, all the time, period. Hybridizing makes things unnecessarily complicated.

never ever ever EVER have a

donquixote's picture

never ever ever EVER have a single tarball

The idea was, if the processed stuff (that is, LICENSE.txt included) was provided as a (read-only) repo, then we could start from there.
For the git workflow it would be nice to have a LICENSE.txt added, but we don't need the expanded $Id$ (good to know it's gone), and we can discuss whether we want the stuff in *.info or not.

I personally like the *.info stuff, because it's an easy way to know the version of a module. And probably the "available updates check" also uses this information. Yes we don't drush up anymore, but we might still want to have the warning messages about available security updates.

So, I think it is reasonable to ask in that linked issue, if d.o. could provide a repo with the processed module releases. And once we have that, I imagine we all want to switch to that one, if only for the LICENSE.txt.

We could even think about an intermediate repo that only has the LICENSE.txt, but nothing else.
And if d.o. does not want to, someone could set up the same thing on github or somewhere else.

Individual projects can add a

sdboyer's picture

Individual projects can add a LICENSE.txt. Truth is that core, at the very least, should probably put one in. That's how it gets in the git repo.

So when you say ~"a read-only repo of the processed stuff," there are a few possibilities to what that could mean:

  • A fresh new git repo containing a generated tarball that's been checked in, with one commit per tarball.
  • A repo with the additional files/changes made in a new commit on the tip of every branch.

The first proposal has already been out there for a long time: http://drupal.org/node/806484 . I don't like release repositories, because IMO they solve a problem that doesn't really exist - and in the process obliterate everything useful about git history. If you want to use tarballs, then USE TARBALLS. Don't just wrap their data in a git repository because "hey, we use Git now!" If you want to do that locally, fine - but I don't see a reason to invest infra time and resources in doing it. Beyond that, I see it as an inferior method for sitebuilding, so I'd actually rather we not support it at all, as that'll give the impression that it's a good idea.

The second method is simply not feasible, period. We'd have to have background workers do nothing but continually rebase a commit on top of tens of thousands of repositories - ALL of which are copies of the real repos, and need their own repo location strategy, management when things go wrong, etc. And all of that so that people have a repo they can clone which does an upstream rebase on every single push. So every single merge from upstream will be painful and nasty.

As for the information in the .info file and managing upstream updates (security or otherwise), I'll say it again: git_deploy takes care of that.

The real question is - what problem are you trying to solve with this?

And..."I like doing it this way" isn't a reason. Dog is about codifying some best practices into real, assumable rules - not about accommodating every possible way to put together a Drupal site. That's what we've already had for ten years.

The second method is simply

donquixote's picture

The second method is simply not feasible, period.

All of this can be automated.
Maybe it will be resource-expensive - in that case we should probably discard the idea. But maybe it is not.
The release repo would have the original repo as a remote, and it would have its own branches for all published releases.

For every new release, it would check out that version from the original repo, add the LICENSE.txt and *.info stuff, and commit the result into the release branch. That's the minimal thing, which does not require any merge or rebase or whatever.

The benefit is small, but so is the cost - or if it is not, we just say goodbye to this idea.

If we want to be a bit smarter, then we need to somehow make both origin-1.1 and release-1.0 parents of release-1.1. Not sure how exactly we would do that, probably involves merge and/or rebase. But still, it would be automatic.
And, in case of merge conflicts, we can always take the version from origin, then add the usual stuff (LICENSE + info), and declare this to be our merge result.

I need to read a bit more about git, but from what I know so far, this should work.
Expensive or not, only a test can tell.
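As a sanity check on feasibility: the "two parents" step is an ordinary merge commit. One hedged sketch of what the per-release job could do (tag, branch and path names are illustrative, and files deleted upstream would need extra handling, since `git checkout <tag> -- .` only overwrites, never removes):

```shell
# Start from the previous release commit on the release branch
git checkout release
# Record the upstream tag as a second parent, without taking its tree yet
git merge --no-commit -s ours 7.x-1.1
# Take upstream's files verbatim, then re-add the packaging extras
git checkout 7.x-1.1 -- .
cp /path/to/packaging/LICENSE.txt .    # path is illustrative
git add -A
git commit -m "Release 1.1 = upstream 7.x-1.1 + packaging files"
```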

This discussion is now

sdboyer's picture

This discussion is now officially completely off topic for dog. Let's please move it to the issue you opened.

git_deploy is a real dog

pillarsdotnet's picture

As for the information in the .info file and managing upstream updates (security or otherwise), I'll say it again: git_deploy takes care of that.

So dog will depend on a working version of git_deploy, whose project description says, in total:

Placeholder for an analogue to the CVS deploy module. This needs to become a real, finished and tested module for phase 2 of the migration to be finished.

The real question is - what problem are you trying to solve with this?

See #1013302: Versioned dependencies fail with dev versions and git clones.

Bravo!!

kbell's picture

+1 (subscribing), and:
after learning to use Git and drush over the last year, I have been longing for (and making my own private - and rudimentary by comparison to this discussion - forays into) developing a process similar to the one(s) described above. I just want to thank all of you for bringing this discussion out in the open and for the work you're doing. The biggest "hole" in working with Drupal for me has always been about process, and I'm frankly happy to be alive to see this all happen - such exciting work!

Equally exciting for me is seeing the birth of Pantheon (for the same reasons stated above), and when these processes mature - Pantheon + DOG - life as a Drupalist(a) will be sweet indeed.

Thank you all very VERY much!

--Kelly Bell
Gotham City Drupal
twitter: @kbell | @gothamdrupal
http://drupal.org/user/293443

DOG or Aegir+Drush Make?

Korchkidu's picture

Hello,

I can't seem to understand why we should use your workflow instead of the one explained here:
http://greenbeedigital.com.au/content/drupal-deployments-workflows-versi...

This is a real question and not a remark. I am really trying to make a comparison between both workflows. So, what are the pros and cons from your experience?

Best regards.

Not either/or

joshk's picture

DOG is similar to, but somewhat more ambitious than, Drush Make. You should review what we're talking about in terms of using git submodules.

The Aegir+Drush Make approach creates "one big repo" for each platform. That's fine, but you're stuck doing all upgrades by hand. A big part of the value of the DOG process is having the ability to pull updates/upgrades to Core, Contrib and your own Custom code from their actual sources. This lets you really leverage git much more.

Whether or not you then use Aegir to deploy is a totally other question.

I fail to see how building

langworthy's picture

I fail to see how building drupal sites as a profile+make file creates "one big repo". I've found them to be incredibly slim since all you track in version control is the .profile file, the .make file, custom modules and themes, and usually a shell script to easily rebuild the codebase using drush make.

Granted upgrading core/contrib is easier using DOG but not by a wide stretch. With drush make you just type in a new version number to use.

"One big repo" is in contrast

sdboyer's picture

"One big repo" is in contrast to a cluster of repositories, which is the approach dog takes.

Granted upgrading core/contrib is easier using DOG but not by a wide stretch. With drush make you just type in a new version number to use.

I disagree. With drush make you just type in a new version to use, IF:

1) You have no patches to that module that need to be applied.
2) Any patches you do have still apply cleanly after the upstream changes.

If you've been a good steward and gotten your patch contributed back upstream, you get penalized - the patch will fail, and you have extra work to do to clean it up. Not a lot of work - but enough that it makes it not a safe task for machines to run.

And that's a lot of the point here - some of the differences between drush make and dog certainly are minor (I'll reiterate that the original plan WAS to just extend drush make for this). But they're enough to make the difference between a system that's machine read/writable throughout the entire lifecycle of the project, vs. a system that can only automate certain initial/setup-type tasks.

We have skipped over the

sdboyer's picture

We have skipped over the "why" a bit in this post, and I'm probably going to continue to skip the more in-depth discussion here as I'm working on explaining that elsewhere.

Summary version, though: drush make was the best we could do in the days when CVS was the upstream. And if you're unwilling to deeply embrace Git, it's still pretty much the best out there. But now that we've migrated to Git, there's a whole new world of possibilities. The problem is that while I know there's amazing, robust stuff you can do with Git+Drupal, it takes a fair amount (and in some cases, a very high amount) of knowledge to unlock it. Dog is really an attempt to bring that powerful workflow potential to bear in a way that anyone can use.

Some bulleted benefits I see of dog vs. drush_make, in particular v. the link you shared:

  • Because dog can rely directly on Git instead of custom versioning & heuristics, it is FAR better at managing upstream updates than drush_make could ever be
  • Communicating changes back to upstream is also much easier
  • The structured nature of dog adds long-term stability & maintainability to any site
  • All your code, and potentially even some of your attached data, is managed by one unified system. No patchfiles scattered all over the place
  • Probably most important, drush_make is fundamentally done after initially setting up an instance (there's drush_make generate, but it's error-prone). Dog is a tool that you use for initial setup and throughout the development process - in many cases, it replaces direct git commands. In other words, it's not *part* of a workflow, it's a *whole* workflow.

so, that's off the top of my head :)

What I really like about

langworthy's picture

What I really like about drush make is that you can look in one place (the .make file) to see a full definition of a site's structure: versions of all contrib, patches (there aren't any patchfiles scattered around, they are all defined in a single place), libraries, etc. I worry that with a system like DOG I would lose that.

That's what dog-vet - which

sdboyer's picture

That's what dog-vet - which reports the overall status of a given instance - is for. Which has the advantage of being able to differentiate between what should be there (based on manifests) vs. what is there.

* Because dog can rely

Korchkidu's picture

* Because dog can rely directly on Git instead of custom versioning & heuristics, it is FAR better at managing upstream updates than drush_make could ever be

Any example? I don't see what you mean by "custom versioning & heuristics"

* Communicating changes back to upstream is also much easier

Same here. What do you mean? Are you talking about patching drupal core?

* The structured nature of dog adds long-term stability & maintainability to any site

Is it a fact? I mean, how could we ever know?

* All your code, and potentially even some of your attached data, is managed by one unified system. No patchfiles scattered all over the place

I thought Drush Make was able to get the patches by itself?

* Probably most important, drush_make is fundamentally done after initially setting up an instance (there's drush_make generate, but it's error-prone). Dog is a tool that you use for initial setup and throughout the development process - in many cases, it replaces direct git commands. In other words, it's not *part* of a workflow, it's a *whole* workflow.

This is indeed one important point. You could use DOG from the very beginning of a project. With Drush Make, you need to wait until the end, and then it is less interesting.

Thanks for your answers
Best regards.

What do you mean? Are you

sdboyer's picture

What do you mean? Are you talking about patching drupal core?

Core yes, but much more likely, downloaded contrib.

Is it a fact? I mean, how could we ever know?

Not a quantifiable one. But consider the difference between inheriting a site built using whatever random methodology the devs decided to use, vs. inheriting a site built with a structured methodology like dog. Or even just coming back to maintain a site you built a year ago - do you really remember all the tweaky little things you did? Dog would.

I thought Drush Make was able to get the patches by itself?

It is. Question is, where do you put those patches in the first place for drush make to grab? It's out-of-band data that you have to come up with a storage strategy for yourself.

Like I said, there's a bigger article coming on this - I'm sorry to dangle that, but writing these things up takes me a lot of time, and I want to do it properly at the right time, once the picture has cohered a little more.

"Like I said, there's a

Korchkidu's picture

"Like I said, there's a bigger article coming on this - I'm sorry to dangle that, but writing these things up takes me a lot of time, and I want to do it properly at the right time, once the picture has cohered a little more."
Please, buy a good coffee machine, and release it asap...;) If you can put some Aegir integration in there too, it would be great;)

Anyway, thanks for your answers, I believe I am starting to understand what you are trying to achieve.

Best regards.

I get the feeling I'm using

langworthy's picture

I get the feeling I'm using drush_make differently than others. I use it from the very beginning of a project by forking Build Kit and then building up the .profile a little bit and the .make file a lot as I add to the project. Why would you only begin to use drush_make at the end of a build?

From a quick glance at Build

sdboyer's picture

From a quick glance at Build Kit, it looks like it's achieving a similar thing to what dog does. My guess would be that the difference is that dog is structurally built around the idea of an arbitrarily large number of "Build Kit"-like baselines, which can act as anything from full-blown install profiles to just quick helpers. Dog's role, then, is quickly and easily manipulating such kits.

Same principle, though. The fact that Build Kit requires git is what makes the big difference between it and plain drush_make's capabilities.

And actually, I realized I

sdboyer's picture

And actually, I realized I didn't comment on the Aegir part - really, Aegir has minimal feature overlap with dog. Dog could (and, we hope, will) be paired with Aegir and act as the engine for managing sites in the same way drush_make does; drush_make is just a less end-to-end-capable engine. Or dog will act that way, rather - it isn't built yet.

Also, just to be clear again, I'm not slagging drush_make here. It's an excellent tool - but it allows any sort of sources & doesn't care about the underlying vcs your site is stored in. Allowing that flexibility means it can never tackle cohesive, complete workflows in the way Dog can. That from a guy who's worked on the cross-vcs platform that runs d.o's git infra :)

As others explained above

omega8cc's picture

As others explained above already, DOG could work as a perfect tool to manage your code, while Aegir and Drush Make are designed to work at a different level: they help to manage your sites and their environment (the platform), but they don't help with managing the code at the low level at all.

It helps to think that in Aegir context the app is not a site, it is an install profile with corresponding makefile, used by Drush Make to create an environment (the platform) where the app lives, and the site is just a deployed and managed (with the help of Aegir) instance of the app.

But that instance (the site) and its environment (the platform in Aegir) still has the code you may want to maintain both on the platform level and on the site level (for site specific stuff), and you don't want to maintain separate platforms per (every) site, plus, Aegir will not help at all with the code in the site specific space (it just moves it between platforms as-is w/o any comparison checks etc), so you still need something to manage/track the changes at the code level, and this is where the DOG can be a perfect match for Aegir and Drush Make, not a replacement for anything, imo.

But that instance (the site)

sdboyer's picture

But that instance (the site) and its environment (the platform in Aegir) still has the code you may want to maintain both on the platform level and on the site level (for site specific stuff), and you don't want to maintain separate platforms per (every) site, plus, Aegir will not help at all with the code in the site specific space (it just moves it between platforms as-is w/o any comparison checks etc), so you still need something to manage/track the changes at the code level,

Yeah, absolutely agree. Describes the separation of responsibilities quite well.

and this is where the DOG can be a perfect match for Aegir and Drush Make, not a replacement for anything, imo.

I'm not quite in harmony on this point, though, at least not within our initial plans. Because...

It helps to think that in Aegir context the app is not a site, it is an install profile with corresponding makefile, used by Drush Make to create an environment (the platform) where the app lives, and the site is just a deployed and managed (with the help of Aegir) instance of the app.

The initial plans are not really targeted towards being able to make a dog instance out of just anything. They need to be crafted in a specific way. Which means dog will need to do the initial setup as well as ongoing code management - it won't be able to deal with something drush_make has built.

Of course, that's a problem I'd love to see solved, and frankly it shouldn't be that hard to solve it. There are two basic approaches, both of which are worth pursuing - teaching dog to read drush makefiles, or teaching dog to convert an existing site tree into something dog-compatible. But from the perspective of completing dog's basic featureset, that's out of scope: the bulk of what makes dog dog are all the things it can do in a working setup. Until those things get built, dog is vaporware, so there's no point in investing effort to allow multiple routes to creating a working setup.

git-subtree

gagarine's picture

Submodules are great but definitely a complicated concept, hard to follow and deploy. You fracture the project into pieces, but when you want to move all those pieces to your dev/staging/prod server or to a co-worker it quickly becomes a mess (or perhaps I'm missing something). With one repository you can just do a "git push test-server" and use some magic git hook to checkout the code where you want.

So for Drupal projects I'm looking more at the https://github.com/apenwarr/git-subtree approach than at having 50 submodules.

You fracture the project in

sdboyer's picture

You fracture the project into pieces, but when you want to move all those pieces to your dev/staging/prod server or to a co-worker it quickly becomes a mess (or perhaps I'm missing something).

Yep, you've missed the point of dog - take that fractured repository and weave it back together, automatically and transparently.

The histories created by the subtree approach are a mess, unfortunately.

I see the dog-rollup and

gagarine's picture

I see the dog-rollup and dog-rollout. But how do you push regularly to a staging server? If you push to a remote (a bare repository) and check out a branch outside the repository, can you still build submodules?

I add my servers as remotes, and when I want to update the code I just push to those remotes (like http://toroid.org/ams/git-website-howto). So the version on the test server is almost always the current state of the code, without the need to really deploy anything.

Dog's goal is not to be a

sdboyer's picture

Dog's goal is not to be a transport mechanism between multiple instances. It's based around a hub-and-spokes model, with the central collab repositories at the center, and every functioning instance (be it for dev, staging, prod, whatever) acting as other independent spokes. So you don't perform a deployment from your local dev instance - at least, not with dog's built-in command set. That's a separate layer of responsibility - one that dog is very much interested in working with and providing useful data to, but not directly responsible for. Maybe you wire up your post-receive hooks to trigger a deployment on a particular server whenever a push comes in; maybe you trigger a build system like phing or ant; maybe you run it all via Jenkins. The idea behind dog is to expose information that makes it trivially easy for you to construct your own deployment events like that, but not to be directly responsible for it. Like I said, its concern is with the communication between hub and an individual spoke, not spoke-to-spoke. There are simple enough tricks that you can employ to make that happen, but it's not our core focus. At least not for dog 1.0.

If you really wanted to set up a test server that you directly pushed to per your example link, it could be done - but it would require a post-receive hook on that repository that triggered up some dog logic to ensure everything gets put in the right place. Can't do it (reliably) just with core.worktree or GIT_WORK_TREE, unfortunately.

Thanks for this interesting

gagarine's picture

Thanks for this interesting and complete answer. I'm impatient to try it out on a real project, but I will wait for a 1.0 and first adapt my deployment process.

noob question

stormwatch's picture

So, should I use --upstream=7.x or leave the default? Would there be any difference?

never mind

stormwatch's picture

I see the default is the latest stable branch. Stupid question anyway...

aegir

stormwatch's picture

Is it possible to use dog-init --upstream= with aegir repository? If so, which one? provision? hostmaster? ??

Got the reply from sdboyer at

stormwatch's picture

Got the reply from sdboyer at #drush
eventually the plan is to make it work so that someone else can prep an upstream dog repo for you and then you use dog-init to grab it down, but it's gonna be a little while before we get all that in order

Nice approach … but

Pisco's picture

The approach dog takes is very similar to what we do at my company, but it lacks one feature, important at least for us, which I'll talk about in a second.
Our goal is to stay as close to upstream Drupal as possible, but when we write patches we need them in our testing/staging/production environments as soon as possible. We therefore have our own Drupal repo, the one dog likes to call contrib. (On side note: I think it's a terrible idea to call that remote contrib, Git best practice is to have an origin and an upstream remote!). Those patches are very general and not at all project specific; this means that when we build a new Drupal project we like to reuse that same Drupal repo and share it across all our projects.

The solution we came up with we like to call deployment-platforms; it's basically a very small Git repo with a lot of submodules. It looks like this:

.git/
.gitmodules
drupal/      <-- a Git submodule
libraries/   <-- contains various git submodules
modules/     <-- contains various git submodules
themes/      <-- contains various git submodules

For this to work we had to convert:

sites/all/modules/
sites/all/libraries/
sites/all/themes/

to relative symbolic links:

sites/all/libraries@ -> ../../../libraries
sites/all/modules@ -> ../../../modules
sites/all/themes@ -> ../../../themes
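For illustration, that conversion to relative links can be sketched with a few shell commands (a throwaway layout; drupal/ stands in for the Drupal checkout, which in this setup is the submodule):

```shell
# Throwaway sketch of the relative-symlink layout. The repo root
# holds libraries/, modules/, themes/ next to a drupal/ checkout;
# sites/all/* inside drupal/ point back up to them.
set -e
root=$(mktemp -d)   # stand-in for the deployment-platform root
mkdir -p "$root/drupal/sites/all" \
         "$root/libraries" "$root/modules" "$root/themes"

cd "$root/drupal/sites/all"
ln -s ../../../libraries libraries   # sites/all/libraries -> repo root's libraries/
ln -s ../../../modules   modules
ln -s ../../../themes    themes
```

Because the links are relative, they survive the repo being cloned to any path.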

This allows us to pull in our very own shared Drupal repo at very specific versions for each project/deployment-platform. We can use the deployment-platforms to install code on test/staging/production environments using Git:

git clone <url>
git submodule update --init

Two very simple commands.
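As a runnable sketch of that two-command deployment (all repos here are throwaway local stand-ins: "lib" plays the shared Drupal repo, "platform" the deployment-platform; protocol.file.allow=always is only needed because the submodule URL is a local path):

```shell
# Throwaway demonstration of the deployment-platform pattern:
# a thin repo pins a shared repo at a tag via a submodule, and a
# target machine materializes it with clone + submodule update.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# stand-in for the shared Drupal repo, with a tagged release
git init -q "$work/lib"
git -C "$work/lib" commit -q --allow-empty -m "initial"
git -C "$work/lib" tag 7.0

# the platform repo records lib as a submodule pinned at tag 7.0
git init -q "$work/platform"
git -C "$work/platform" -c protocol.file.allow=always \
  submodule --quiet add "$work/lib" lib
git -C "$work/platform/lib" checkout -q 7.0
git -C "$work/platform" commit -q -am "pin lib at 7.0"

# "deployment": the two commands from the comment above
git clone -q "$work/platform" "$work/deploy"
git -C "$work/deploy" -c protocol.file.allow=always \
  submodule --quiet update --init
```

After the last two commands, deploy/lib sits at exactly the commit the platform repo pinned.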

Now, Git is NOT a software deployment tool! So, what we're trying to do, is to have Jenkins/Hudson use the deployment-platforms to run tests and build Debian packages for Drupal, modules, themes, libraries. Those packages are automatically injected to our package repositories, which later are used by APT and Puppet for deployment to the target systems.

Working with the deployment-platforms during development is fairly easy and involves only a few extra but simple Git commands like:

git submodule add
git submodule sync
git submodule update --init

Coming back to the important missing feature I mentioned earlier. With the approach dog takes, it is not possible to reuse the contrib Drupal repo across multiple projects. You could probably have different branches for different projects, each branch having different submodules at different versions. But that's a mess when switching branches, because Git at the moment is not able to remove unused submodules when switching branches. Then, depending on your workflow, you might have different branches per project (master/staging/production), so that you would have 3*(number of projects) branches plus all the branches coming from upstream, plus the feature branches you use for development. This is worse than ugly!

I like the idea and approach of dog and I hope our approach gives you some ideas for dog.

Lot of great points here,

sdboyer's picture

Lot of great points here, thanks for the thoughtful response. Took me a while to respond because I was thinking a lot about it :)

Coming back to the important missing feature I mentioned earlier. With the approach dog takes, it is not possible to reuse the contrib Drupal repo across multiple projects. You could probably have different branches for different projects, each branch having different submodules at different versions. But that's a mess when switching branches, because Git at the moment is not able to remove unused submodules when switching branches. Then, depending on your workflow, you might have different branches per project (master/staging/production), so that you would have 3*(number of projects) branches plus all the branches coming from upstream, plus the feature branches you use for development. This is worse than ugly!

First and quickly, on the annoyances of submodules (such as them sticking around across branch switches): yep, and if you read over some of the other comments in this thread, you may see that while I was initially inclined towards always using submodules everywhere, problems such as that, as well as the spammy history it can result in, led me to only use them in certain cases. Where submodules are used and a branch switch occurs, the post-checkout hook is where we do the cleanup. Getting the right hook into place in an automated fashion is one of those things dog is good at doing.
Feasibility of sharing a single contrib repo (e.g., Views) across multiple site projects: when I was first thinking about dog, the case where a shop would want a base install to start with was actually near the front of my mind. And yes, if you use a remote configuration that's unmodified from the base set up by git clone, e.g., this:

[branch "master"]
        remote = collab
        merge = refs/heads/master

then you'll have a proliferating mess of branches. But do just a little namespacing magic, say for a project called "projectfuntime":

[branch "master"]
        remote = collab
        merge = refs/heads/projectfuntime/master

And your local master branch is linked to a namespaced branch in the collab repo called "projectfuntime/master". (There are a number of other tricks, but that's a start). It is true that there's no git-native way to just fetch a namespaced subset of branches with a glob (fetching a single branch is quite easy, but we also need to be able to discover feature/hotfix/etc. branches), which means the remote listings will get cluttered - unless you do some cleanup in some custom porcelain. Like, say, dog.
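A runnable sketch of that namespacing trick, with a throwaway bare repo standing in for collab and the made-up project name from above; the two git config lines produce exactly the [branch "master"] stanza quoted above:

```shell
# Throwaway demonstration of per-project branch namespacing on a
# shared collab remote. "projectfuntime" is the made-up project name.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q --bare "$work/collab.git"     # stand-in for the collab repo
git init -q "$work/site"
cd "$work/site"
git symbolic-ref HEAD refs/heads/master   # pin the branch name used below
git commit -q --allow-empty -m "initial"
git remote add collab "$work/collab.git"

# publish local master under the project's namespace on collab
git push -q collab master:refs/heads/projectfuntime/master

# link local master to the namespaced branch; this writes the
# [branch "master"] stanza quoted above into .git/config
git config branch.master.remote collab
git config branch.master.merge refs/heads/projectfuntime/master
```

From here, a plain `git pull collab` on master merges from projectfuntime/master, while other projects use their own namespaces in the same collab repo.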

So, I strongly disagree that it's impossible to reuse the collab repos on multiple projects. In fact, looking at your description, I don't actually see how your pattern makes anything more reusable than the above outline for dog does - it's still just submodules pegged to a specific ref. Dog can do that, or it can attach to the repository more loosely. Where the subrepo is ultimately placed within a particular dog-managed project is irrelevant to reusability - you still have all the same basic problems of many-branches-to-contend-with, unless you make the assumption that all your projects will be in lockstep on their shared repos.

Also, a quick note - I'm hoping to roll Phing in to dog somewhere, also probably pretty transparently to the top-level interface, as the tool simply won't be that useful unless you have a way of managing target-specific variations between instances (e.g., differing mysql connection strings on your local dev vs. staging boxes). Build targets ftw.

(On side note: I think it's a terrible idea to call that remote contrib, Git best practice is to have an origin and an upstream remote!)

It's collab, not contrib - contrib wouldn't make sense at all. I disagree that it is a "best practice" to use origin, especially in a case (like this one) where you have two very important remotes that need to be interacted with regularly. Using "origin" works best when there is just one remote repository you interact with regularly; I came up with this convention in the first place specifically because I wanted to avoid those connotations, and the names reflect the purpose.

The solution we came up we like to call deployment-platforms and it's basically a very small Git repo with a lot of submodules. It looks like this:

I considered a repository layout like this, with an outer super-repo that contains the core clone. Truth is, there are advantages to having the base dog repo != the webroot - a lot of potential advantages - and it's my preferred solution for the long term. Many sites, especially bigger sites, want to put stuff into the repository that really shouldn't be under the webroot, for example. I initially rejected it, though, because it hadn't occurred to me that we could place all repos in the thin, outer super-repo, thus avoiding the hellish commit noise of submodule commits from a module echoing first up into the core repo, then again up into the super-parent dog repo.

However, having reflected for a few days on this layout you've suggested, I'm now about 60% sure that I'm going to take a note from it. We'd adopt a strategy where we have a base repo that contains ALL other repos, including core (there's a nice symmetry to that). The biggest drawback is an ease-of-use consideration - before, you only needed to set up one collab repo per project. Making this switch would mean at least two are necessary - the dog super-repo, and the core repo. I worry that that could be just enough of a hurdle right at the outset to keep people from adopting dog.

Now, Git is NOT a software deployment tool! So, what we're trying to do, is to have Jenkins/Hudson use the deployment-platforms to run tests and build Debian packages for Drupal, modules, themes, libraries. Those packages are automatically injected to our package repositories, which later are used by APT and Puppet for deployment to the target systems.

We're pretty well in agreement there. Neither Git, nor even dog, are deployment systems in and of themselves. I think people get confused because they're mistaking necessity for sufficiency: you need good versioning & collaboration tools to build your project in the first place. Add in the fact that git can do all kinds of transport, and it's easy to make the mistake of thinking that deployment is just a hopskipjump away from your normal dev process. But it isn't - or if it is, you probably lucked out. Dog is really about building that portable package that a true deployment-focused tool can easily roll out elsewhere. The set of tools you've described is a shining example of how these tools are package-builders that slot nicely into tools which really do deployment and provisioning. Actually, I'm very interested by what you guys have set up and would love a tour, if that's possible? :)

... Now, Git is NOT a

cweagans's picture

... Now, Git is NOT a software deployment tool! ...

...We're pretty well in agreement there. Neither Git, nor even dog, are deployment systems in and of themselves....

I disagree with you both here. There's no reason that Git cannot be used as a deployment mechanism. It's very handy to be able to do git push www1 master and the post-receive hook on that remote runs git checkout -f into the worktree (expanded guide here: http://caiustheory.com/automatically-deploying-website-from-remote-git-r...).

Something like that scales for multiple web heads, and makes for nice easy deployments, IMO. There's the added benefit of "I already use this", so I don't have to learn yet another tool to do a deployment.
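That recipe can be sketched end-to-end with throwaway local repos (the names www1.git, webroot and dev are made up for the example):

```shell
# Throwaway end-to-end sketch of the push-to-deploy recipe: a bare
# "server" repo with a post-receive hook that checks master out
# into a webroot. Names (www1.git, webroot, dev) are made up.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q --bare "$work/www1.git"   # the repo you push to
mkdir "$work/webroot"                 # the live document root

# the hook: force-checkout the pushed master into the webroot
cat > "$work/www1.git/hooks/post-receive" <<EOF
#!/bin/sh
GIT_WORK_TREE=$work/webroot git checkout -f master
EOF
chmod +x "$work/www1.git/hooks/post-receive"

# developer side: commit, then deploy with a plain git push
git init -q "$work/dev"
git -C "$work/dev" symbolic-ref HEAD refs/heads/master
echo "hello" > "$work/dev/index.html"
git -C "$work/dev" add index.html
git -C "$work/dev" commit -q -m "first deploy"
git -C "$work/dev" push -q "$work/www1.git" master
```

After the push, the hook has populated the webroot with the committed files.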

--
Cameron Eagans
http://cweagans.net

Using the wrong tools for the wrong tasks

Pisco's picture

This is going a bit off-topic, I don't mean to hijack this thread, but I've got to respond to this.

Excuse me for being blunt, @cweagans, but what you say is naive and plain wrong.
Git is a DVCS, full stop. It's a brilliant piece of software, and that's why it can be (ab)used in ways no one would have thought of. Of course it can be abused for software deployment; I use it for that too (unfortunately). But not because it's my tool of choice for that job - it's not even suitable for it! I just haven't found the time to assemble the proper tools and develop the missing pieces for the targeted workflow.

When you set up web-servers to run Drupal on them, do you install Apache, Postfix, Nginx, Varnish, APC, PHP, … with Git? I hope not! Do you download the source code, configure, build and install it manually each time? I hope not! Why do you think that should be the way to do it with Drupal?

Git lacks basic features of a package management system. But that's not a problem, because it was not designed as a package management tool. Deploying and installing software is not the purpose of Git.

Of course one can use a chisel instead of a screwdriver, but those tools serve completely different purposes. If you're doing your job well, you're not going to use one for the other. And you shouldn't be telling others to do so.

I'd love to see Drupal people spending their time building reasonable and useful tools, embracing what's already there (package management: apt, yum, macports, …; configuration management: puppet, chef; continuous integration: jenkins, hudson, …) and respecting best practices, instead of wasting their time thinking of ways to abuse existing tools and letting others believe that this is the way to go.

The guys at Debian (and Ubuntu) do a proper job of packaging Drupal and modules for those platforms, and then there's the wonderful dh-make-drupal. But, for good reasons, their package repositories do not hold recent enough packages, at least not recent enough for us; that's why I'd like to build a workflow that involves Git, Jenkins, dh-make-drupal and reprepro.

I'd love to see d.o providing package repositories (deb and rpm) for Drupal itself and its modules. I think this is the only way to go if Drupal wants to be a grown up, industry grade project. An initiative for Drupal 9 or later? I hope so!

Giving a tour

Pisco's picture

Hi Sam

Sorry for letting you wait so long for a reply and thank you for reading and replying so thoroughly to my post!

I'd be happy to give you a tour of our setup. Either this August in London, or earlier through any means of communication, if you wish.

DrupalCon London BOF notes

michaellenahan's picture

Hi there, dog gang!

Here are my rough notes from today's BOF.

My own use of git is basic - which is why I think dog is such a cool project - but it also means the lines below will need some elaboration (and/or correction) by others.

So, what can people work on for dog?
At the meeting some folk volunteered for some of these but I didn't catch which ...

  1. Config Editor: A way of managing edits to the sled.xml manifest file
  2. A simple method to ensure all repos are clean
  3. Rollback to a known state if something goes wrong during writes to git repos (transactional consistency) using git reflog
  4. dog needs signal handlers - pcontrol php system
  5. Work on vet command - a full inspection of the system - this will emit results in json format
  6. drush archive dump - redhatmatt will look at this
  7. optimize ways of putting mysql dumps into git
  8. libgit2

Use Gitolite instead of Gitosis

Pisco's picture

During the BoF in London I forgot to mention a recommendation I wanted to share with you: in the presentation, Sam mentioned gitosis for hosting your own Git repositories. Rather than gitosis, I'd recommend using Gitolite. Unlike gitosis, Gitolite is actively maintained and much more feature-rich. You will find a how-to in the Pro Git book by Scott Chacon: http://progit.org/book/ch4-8.html. There's an easily installable package for Debian and Ubuntu!

Ah yes - I should have

sdboyer's picture

Ah yes - I should have mentioned both. The key feature is on-demand creation of new repositories, which apparently gitolite supports as well. I'd only known it to be a feature of gitosis, but apparently gitolite does it now, too.

I've been using Gitosis for a

Antoine Lafontaine's picture

I've been using Gitosis for a while now, but would also recommend Gitolite over it just for the feature set and control granularity.

I really like the idea of

Antoine Lafontaine's picture

I really like the idea of defining a set of common methodology to manage a drupal (pick one) site/app workflow, but I wonder why I don't see many more opinions on using submodules vs. using subtree merges (except from the gentleman here http://groups.drupal.org/node/140949#comment-501244 ) for managing contrib modules in the proposed workflow.

I do not have a complete proposal, but I'll paste an example of how I've managed contrib modules (including drupal core) in a test project repo while trying out a subtree-merge workflow for a Drupal project.

git init

#Creating a staging branch - could follow any environment/branching naming/convention your team uses
git checkout -b staging

#Adding Drupal core as a remote
git remote add drupal_core http://git.drupal.org/project/drupal.git
git fetch drupal_core
git checkout -b drupal_core drupal_core/7.x
git checkout staging
git read-tree --prefix=htdocs/ -u drupal_core
git commit -m"Added drupal core"

#Now adding Views
git remote add modules/views http://git.drupal.org/project/views.git
git fetch modules/views
git checkout -b views modules/views/7.x-3.x
git checkout staging
git read-tree --prefix=htdocs/sites/all/modules/views/ -u views
git commit -m"Added Views"

I hope this is clear enough to give a basic idea of how this could easily be automated and simplified using drush (which I considered at one point but haven't done yet). "Dog" could be highly opinionated about many (default) decisions: how we handle subtree merges, naming conventions for remotes (modules/module-name, modules/core, and whatnot), and where read-tree (the --prefix target) should put a module, theme, library, core and so on.

I've already mentioned this a bit up there in the thread, but there's a good read in the Pro git book on how to use subtree merges as an alternative to sub-modules. http://progit.org/book/ch6-7.html

Again, if I'm missing something please guide me to the light :)

Thanks for all this by the way, I'll be keeping an eye on how things progress.

So this made me at least

sdboyer's picture

So this made me at least revisit the idea of subtrees, as I realized I'd dismissed them too summarily before. And there are some key aspects of it that really are pretty nice... but also some significant drawbacks. Here's my basic pro/con chart:

Pros

  • Gives us a more unified, simple actual working tree with less work than with full repos, where we need to symlink them around for them to reach their final, working location.
  • Eliminates issues related to double-committing changes in submodules (since you have to first commit in the submodule, then in the parent repo).
  • Much more transparent to any GUI tools, since it's all a unified tree in a single repo (and virtually every GUI tool is designed to interact with one repo at a time)
  • There's just something really nice, clean and elegant about picking up one git root tree object and attaching it to another tree object.

Cons

  • That whole ability to contribute back easily to an upstream project disappears, since you no longer have a real repo with full history to operate in. It would be particularly difficult to deal with it when we start having per-issue repos and you want to track some of the work being done in them, since you're playing with a remote that's already very transitory.
  • The specific strategy described in your example code (and in progit) would entail some very destructive behaviors for actually *doing* an update - you'd be shifting your entire site out while you quickly switch and update your local branch. This could be avoided by directly merging from the remote branch instead of keeping a local tracking branch, but that's not ideal because it means you're chasing a branch, not a tag. Which leads to tags...
  • Git deals kinda poorly with tags; unlike branches, which have this separation between local and remote branches, when you pull down tags from a remote, they're (by default) automatically brought into the default local tag namespace, refs/tags/[tag name]. This is manageable by creating a custom refspec for tags in one of the remotes that's used for subtree data in order to pull them into a separate namespace (e.g. refs/tags/views/[tag name] for views tags), and/or by skipping the retrieval of tags entirely by using git fetch -n or remote branch trimming. But the bottom line is that there's some pollution of the tag namespace that will occur - Views alone has 63, as of this writing. That would need to be carefully managed.
  • There's an unfortunate reality when it comes to adding new remotes that have no commits in common with the existing repository: it's fucking slow, and gets slower with each additional repository. git iterates over every single one of the known commits in the local repo in order to find common commits with the remote repo. We noticed this a while ago when we were using big colocated cache repos in drush, and it got unbearably slow (fixed in this issue: http://drupal.org/node/1242200). Like maybe five, ten minutes to add a new remote... and that's on an SSD.

I feel like I've missed a couple things there, but it gives a sense of the shape of it. Ideally I WOULD like to be able to incorporate subtrees for at least some stuff, but those cons are enough for me to not want to incorporate them right now.
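To illustrate the tag-pollution workaround from the cons above, here's a runnable sketch (a throwaway local repo stands in for Views): auto-following is disabled with --no-tags, and an explicit refspec maps the remote's tags into a refs/tags/views/ namespace instead:

```shell
# Throwaway demonstration of namespacing a remote's tags. The
# "views" repo here is a local stand-in, not the real project repo.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

git init -q "$work/views"
git -C "$work/views" commit -q --allow-empty -m "initial"
git -C "$work/views" tag 7.x-3.0          # a release tag to fetch

git init -q "$work/site"
cd "$work/site"
git commit -q --allow-empty -m "initial"

# --no-tags disables default tag auto-following for this remote...
git remote add --no-tags views "$work/views"
# ...and an explicit refspec pulls its tags into refs/tags/views/*
git config --add remote.views.fetch '+refs/tags/*:refs/tags/views/*'
git fetch -q views
```

After the fetch, the tag shows up as views/7.x-3.0 rather than polluting the top-level tag namespace.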

Thanks for taking the time to

Antoine Lafontaine's picture

Thanks for taking the time to give some thought to this and for the comprehensive reply.

I still need to digest some of the points you're mentioning in the "Cons" since I feel I do not grasp all of it yet...

Anyhow, I'd still like to point you to a few things I've found in my research that I feel might be interesting:

  • Concerning the ability to contribute back, I would think that any code modifying some module (e.g. views) would be done on a separate branch, based on the proper remote (e.g. git checkout -b module/views-fix-something module/views/7.x-3.x - based on my simplified proposition above). Then that fix could be subtree-merged into the "project branch", bringing the changes into your project but also giving the possibility to submit a patch to views upstream. If the changes get committed upstream, you pull those changes to your "pristine" views branch, (re) subtree-merge that one back into your project (maybe it requires removing it from your project first), and then you can just remove your fix branch from your repo to keep things leaner.

    There's also an interesting short post here: http://posterous.timocracy.com/git-sub-tree-merging-back-to-the-subtree-for
    And a plugin that handles subtree merges and splits: https://github.com/apenwarr/git-subtree

    Haven't tested that yet though.

  • When you say destructive, are you implying that this would require changing (checking out) a branch in a repo that is used on a production site?... I'm not sure I fully understand the destructive nature you're describing.
  • My understanding of tags is limited, but I think I understand the issue here. This is indeed to be avoided. Will have to read more and try it on my own to understand the issue better.
  • I didn't know about this issue and that explains why this takes so much time to get new remotes in the repo... quite unfortunate.