Drupal core initiatives: Working with git

We encourage users to post events happening in the community to the community events group on https://www.drupal.org.
You are viewing a wiki page. You are welcome to join the group and then edit it. Be bold!

This is a summary of the main points I got from a long IRC chat I had with Sam Boyer about how the initiative leads should best work with git. (heyrocker)


One sandbox vs multiple sandboxes
One sandbox - All code in one place, separate projects into branches. Bummer is that issues are all mixed up in one queue.
Multiple sandboxes - Get nicely isolated issue queues, but code is scattered all over, and it can make it harder for someone wanting to get an overview of the entire initiative. Also people working on multiple bits have to jump all around.

Recommendation: One sandbox per initiative

Merging back into core
[09:28] sdboyer there needs to be some planning around how you want the merge workflow to work back into core
[09:28] sdboyer the two basic approaches i can think of are:
[09:29] sdboyer 1) one branch, like an 8.x-config, which is defined as being *the current, unified state of the cfg mgmt initiative and then lots of topic branches which are based on it
[09:30] sdboyer or, 2) multiple topic branches that pop straight off of main core's 8.x, and are directly reintegrated as each segment is finished
[09:31] hejrocker My inclination is that we do the latter, at least at first
[09:32] hejrocker because we have a lot of work going on in areas that are reasonably isolated
[09:32] sdboyer i agree
[09:32] hejrocker like the config-in-json-on-disk stuff is completely separate from the uuid stuff
[09:32] sdboyer right
[09:32] hejrocker and we want to be able to pull that stuff in in pieces
[09:33] sdboyer whereas wscci has a fairly unified line of work that probably deserves a central initiative branch pretty much from the word go, cfg mgmt does indeed seem separate
[09:33] hejrocker right

Will depend on the nature and workflow of the initiative but I think we agree that it makes sense for every initative to have at least one branch, rather than just cloning from core and working in there directly.

Branch Naming
Create a meaningful branch name that includes the version number (aka 8.x-uuid)

Initial Setup
Take a clone of drupal core
git remote add
git push 8.x

Next steps depend on workflow. Every initiative will have at least one feature branch

git checkout -b -8.x
git push

General principle is that we merge from mainline as little as possible, and when we do we provide a meaningful commit message so that when this gets merged upstream back to core, the commit log isn't filled with 'Merge from '. For day to day testing you can merge your topic branch back to head and test there and reset when you're done, or merge with --no-commit and reset.

When Dries merges these topic branches back to head he will have several options to make the commit log clean, but this is not really our problem. Ideally though, we'll make his job pretty easy by keeping our topic branches clean in the first place.

Comments

On merging back into core

dww's picture

I think it's imperative that initiative maintainers are responsible for keeping their mainline branches "clean" for the final merge back into core. Given the huge demands on Dries's time and his current status as the sole bottleneck for D8 patches being committed, he already has trouble getting the commit message right when he commits single patches. ;) Therefore, any workflow that ends with "and eventually, Dries will get to cleanup the history and rebase/cherrypick the right commits to merge back into core" is a non-starter. We have to assume Dries will have basically 0 time or desire to do anything fancier than "git merge" when the time comes to finally bring code from an initiative sandbox back upstream into the 8.x branch of node/3060. So, we should just plan from the outset that any attempt at making the history sane/useful/pretty will have to rest with the initiative maintainers themselves, not Dries.

And to be clear, that's not a swipe at Dries. He shouldn't be spending time worrying about these details. Although they're important, there are a ton of other things that are vastly more important since only he can do them (like sign off on the direction of non-initiative patches, etc). Keeping the merge history sane is something that can (and therefore should) be offloaded to the initiative maintainers.

Cheers,
-Derek

I agree with 100% dww -

pwolanin's picture

I agree with 100% dww - keeping full history is just silly and is going to inhibit a natural workflow of committing and pushing often for collaboration. Note that I don't think you need a clean "mainline" during ongoing development, but rather you should create one via squash merging once the feature is RTBC (or close to it).

I think we should have some reasonable guideline like 1-10 commits in the branch that's prepped for merge into the core repo, and those should be massaged to show the essential evolution of the feature rather than garbage history.

My $3.50...

webchick's picture

A couple of things jump out at me, here... (Note: This is webchick, not "webchick speaking on behalf of Dries," in this context.)

When Dries merges these topic branches back to head he will have several options to make the commit log clean, but this is not really our problem.

Copy/paste everything dww said. There's absolutely no possible way for a core committer to be able to read through a ginormous log of commits and make a per-commit decision about whether or not it should be squashed. Only the team who actually made those commits can make that call. This needs to be done by the initiative team, before RTBC.

General principle is that we merge from mainline as little as possible, and when we do we provide a meaningful commit message so that when this gets merged upstream back to core, the commit log isn't filled with 'Merge from '.

This sounds like it would result in initiative sandboxes getting terribly out of sync with mainline development, resulting in frustrating conflicts when merges do occur, and all but asks these initiatives to work ahead blindly without knowledge of what's going on around them. I really think this is a dangerous way to go. :\

I'd propose the following counter-workflow to address both of these points. I'm sure this isn't totally right, but this is more what I had in mind:

  1. Create a sandbox project, per initiative: CMI, WSCCI, etc. Assign co-maintainers as appropriate.
  2. For each major "chunk" of work:
    a. Create a "meta" issue in the "Drupal core" queue to announce major progress points to the larger core developer community. ("Add UUID support to core")
    b. Create a feature branch in the sandbox for the work. (uuid)
    c. Create X issues in the sandbox project to track the things that lead to that "meta" issue being fixed. ("Create a UUID management API", "Add UUID columns to all core entity tables", "Add PHPDoc to that change you just made," "Now fix the stupid typo", etc.) Probably use a per-"chunk" tag to track them ("uuid", "json", etc.).
  3. Code your brains out. Commit typo fixes. Merge from upstream. Whatever you need to do. Develop as you normally would.
  4. At any good "stopping point" when you'd like the community to review what's done, post a patch of the work so far to the "meta" issue. This invites a testbot check, and also gives opportunity for incremental feedback from the community. How often this happens will depend on the initiative, but aiming to do this once a week or so would be a good goal.
  5. Keep working until RTBC in the "meta" issue.
  6. Once you hit RTBC status, pull the commits in your feature branch to a "staging" branch (uuid-staging), and at this point squash your commits and whatever other clean-up is required.
  7. After the final, definitive RTBC in the "meta" issue, post instructions for merging in the proper branch to the core committers. (git merge [hocus-pocus])
  8. If further amendments are necessary, do them in the staging branch.
  9. The change gets pulled into core! Everyone rejoices! Ponies sing and kittens dance!

I think this addresses the concern about the core merge getting a lot of static in it ("Merged from upstream" blah blah) while at the same time adhering to standard development workflow we'd use in our own non-d.o projects.

One last thing, which is probably pretty contentious, but...

I don't think git merge is a good idea, and would prefer to stick with patches for the final commit of initiatives. OUT, OUT, BLASPHEMERRRR, I know. :( :(

But I say this because of the following:

We originally had the patch workflow in Git instructions documented as creating a feature branch, committing one or more changes, then git format-patch, then apply with git am. The idea was this glorious utopia where we could provide patch authors real commit credit. Huzzah! The problem? Creating patches went from two easy-to-remember commands to "you need to understand the entire distributed version control workflow plus how branches work in order to upload a patch."

Further, git format-patch on more than one commit creates a patch that is totally un-reviewable, so that was causing some rage from several contrib authors (such as pwolanin, actually ;)).

So these "best practice" instructions were reverted fairly quickly after initial launch to the current instructions, which are just git diff and git apply, very analogous to what we have in CVS. Since this change, I've heard no bewildered people in IRC unable to figure out a patch workflow, unlike before where this was an hourly occurrence (some just gave up).

Why do I tell this rambly story of times gone past? Because 99.9% of core changes are going to use this patch workflow. As a result, if we go with a merge workflow on initiatives:

  1. Our commit logs are going to become a total mish-mash of inconsistency, and extremely difficult to parse for commit statistics and other things.
  2. It'll create inequality because some patch authors will get direct credit and others only as a CSV.
  3. And yes, I know I can specify an author per-commit with some flag, but the basic problem with that is there's only one author per commit. A HUGE portion of core patches are a result of a collaborative effort between multiple people and it would be unfair to pick just one.
  4. Also? I'd have to look up the preferred Git e-mail addresses of all these people and seriously, eff that. I'd much rather spend that time committing/pushing more patches.
  5. Finally, there is no way to credit patch reviewers in this model, and that's something I'm very adamant about supporting.

So the reason I go off on this huge tangent is because I see a lot of energy here being centered around how best to do git merges into core, and I'm not even sure that's feasible right now. Once/If we have per-issue repos, then doing git am would make a lot more sense, because then all core commits would be merges, and it would just be #5 to be concerned about.

So, given that, I think we can amend the above proposed workflow with "who gives a crap about squashing and cleaning up the commit log." Meaning, eliminate 6, 7, and 8 from the list, which makes this even simpler and easier to follow.

I eagerly await your slings and arrows! :)

totally agree

Gábor Hojtsy's picture

Something I take important with Drupal 6 is also giving reviewers, testers credit. That does not work with a commit log.

Also, I'm stubborn on some core practices sorry. Getting old or something :) So I've never even tried that git branch patch flow you mention from the git instructions tab, once I've seen it I knew it was not going to fly :D I'm glad it is gone.

For a patch workflow to work here though, since there is not going to be an issue history of all the work, we need the initiative leads and devs - the patch submitters - to summarize a list of people for the committer. I think this helps and even makes the commit happen faster.

Thanks for writing this up,

sdboyer's picture

Thanks for writing this up, webchick. I think this makes sense, and am generally +1. Actually, I don't think the first part is that different from what I originally discussed with heyrocker. There are some notes, though.

First, lemme respond to the point that you and dww both made about it being up to the initiative leaders to massage the final tree for the merge: yes, absolutely. I wasn't specific enough about that when I originally talked with hejrocker, but that was my thought from the start - core maintainers don't have time to massage. It's definitely up to the initiative leader to present the final branch for merging.

This sounds like it would result in initiative sandboxes getting terribly out of sync with mainline development, resulting in frustrating conflicts when merges do occur, and all but asks these initiatives to work ahead blindly without knowledge of what's going on around them. I really think this is a dangerous way to go. :\

I think merge from mainline as as little possible may not have been the best choice of words, as first David and now you have interpreted this in a way that I really didn't mean. "As little as possible" is not so accurate; "for a good reason" is a better way of putting it. Question is, what constitutes a good reason? My standards are based on my belief that initiatives will be best-served if the leaders provide a stable, reasonably current public branch(es). If initiatives are truly long-running efforts, then it's perfectly acceptable to lag a bit behind mainline - so long as relevant updates to mainline are merged in once pushed. In this sense, I see part of the initiative leader's role as maintaining their branch - ensuring that any relevant mainline changes are brought in immediately, but also insulating the branch from upstream bugs & instability by not chasing after it "just because."

I think this is one of the key benefits of Git (or any dvcs), though it is somewhat subtle. In the days of CVS, there was just one canonical mainline, and everyone had to stay in lock step and chase it all at once. In other words, this challenging scalability issue. Git relieves us of that burden - because it can merge well and easily, it's really ok to fall a bit out of date. I promise, the world won't end. Checking to see if you merge cleanly & if all tests pass post-merge is easy. So let's use initiatives that way: let the leader curate and make sure that relevant mainline changes are pulled in, but balance that against generating unnecessary noise and potential instability for initiative collaborators.

Linus has a famous little email related to this. See especially the sections on merging upstream & downstream. In fact, rereading it, I'd forgotten how much I'd taken from it. His standard is, "can you write a non-spammy commit message for that merge?" So is mine.

Now, as for the latter, blaspheming parts...let me take this bullet by bullet:

Our commit logs are going to become a total mish-mash of inconsistency, and extremely difficult to parse for commit statistics and other things.

Commit statistics have to look at the entire history. So if we EVER switch to merges, this will be a problem. Makes no difference if we do it now or later, so no reason to delay.

It'll create inequality because some patch authors will get direct credit and others only as a CSV.

Yup. But that's also going to be the case even when we do switch to per-issue repos, as my realistic assessment of that at this point is that we'll still need to allow patches.

Really though, this is a question of generating better statistics, which is something I've been thinking about and anticipating for nearly a year now. There are really easy ways to do it right up front, too - if we want to keep CSV stats on the level, then we can just exclude commits that are on a merged branch, or count them only once.

And yes, I know I can specify an author per-commit with some flag, but the basic problem with that is there's only one author per commit. A HUGE portion of core patches are a result of a collaborative effort between multiple people and it would be unfair to pick just one.

If you can give credit that way on a normal patch, great. Otherwise, that's fine - keep on with the standard formatted commit message. It WILL have to be a hybrid.

Also? I'd have to look up the preferred Git e-mail addresses of all these people and seriously, eff that. I'd much rather spend that time committing/pushing more patches.

Yup, that sucks. Fortunately, I'm about ready to push http://drupal.org/node/1127320 live, at which point it seems like dreditor will be able to do it for you.

Finally, there is no way to credit patch reviewers in this model, and that's something I'm very adamant about supporting.

Untrue, though I haven't talked about this at all, so understandable. git-notes is what we want. Notes let us append data (like reviewers, but could be anything) onto a commit similar to the --sign-off option on git commit. Unlike --sign-off, which appends data directly into the commit message, therefore changing the hash of the commit, notes are separate git objects that, while attached to the commits, can also be modified at any time later WITHOUT changing the hash.

If we decide on a convention for notes, setting up a local dev repository to use them properly can be done easily via a one-time run of a script. There's no harm if someone doesn't set it up, and since notes can be made at any time, they can always be added later. I have to double-check for complete certainty to make sure there are no potential potholes with this, but it's a very good system with a ton of potential. FWIW, it is also the system that core git folks prefer for attaching additional commit metadata. Once the conventions are set up, we just teach our statistics gatherers to read the data, and ipso facto, stats become a LOT easier.

core maintainers don't have

webchick's picture

core maintainers don't have time to massage. It's definitely up to the initiative leader to present the final branch for merging.

Cool.

His standard is, "can you write a non-spammy commit message for that merge?" So is mine.

After reading your response and Linus's referenced e-mail, I think this stance makes sense. So basically instead of merging every morning when you get up and are sipping your tea, merge in when a new unstable release happens, or merge in when a major patch that affects your initiative gets committed. Otherwise, leave the sandbox alone so you're working against something relatively stable, and untangling annoying conflicts only need to be dealt with periodically when there's relevant stuff.

Sam and I had a follow-up discussion on IRC after this about the credit parts, and I'm trying to recall all the details, but it basically boiled down to:

  • Sam's stance is basically that:
    a) We'll always need a hybrid first-class/second-class commit credit model, even with per-issue repos.
    b) We're going to need to make this switch to first-class commit credit sooner or later; might as well do it sooner since we're early in the D8 release cycle; we're losing history every time we don't.
    c) Per-issue repos are still vaporware, so he's concerned about pushing off first-class credit until they exist, which could be months or years away, or also never.
  • My concerns are:
    a) Creating a special "first class" commit credit for some special people but not others creates inequality, and inequality creates bad blood in a collaborative team.
    b) Because the only way to get that special first-class credit (commits to "Drupal" showing up on your user profile) is through keeping your code out of patches and in another git repo somewhere (github, sandbox, etc.) it will encourage people to do that, rather than use the core issue queue which is what reviewers/committers actually read.
    c) We've had demonstrably terrible problems already with initiative visibility in non-core issue, external sandboxes (Bartik, Field API, etc.) and I'm concerned this will only make fragmentation/visibility worse. :\
  • Sam's counter to my credit tier concerns is that if credit tier concerns are that big of a deal, it's easier to make adjustments to VCAPI views and the like than it is to develop per-issue repos. Well, I can't exactly argue with that. ;)
  • Sam additionally noted that the concerns with patch reviewer credit could be mitigated by using git notes. A template would have to be worked out to figure out how to do this properly.
  • Once http://drupal.org/node/1127320 is resolved, my concern about having to click 18 times on each patch to figure out what e-mail address to use would be mitigated. It looks like sun is working on a Dreditor plugin thing for it, presumably to auto-generate a commit command to copy/paste.

I feel like this is a fundamental community management issue, and really needs much wider discussion than a several-week old thread buried in a group that only a handful of people can subscribe to. I'll do some pondering about the best way to tackle that, but probably wouldn't have the ability to lead such a discussion for another couple of weeks.

In general, +1 on all of

sun's picture

In general, +1 on all of webchick's concerns.

Regarding git notes, I don't quite see or understand how that would be an improvement to stating credits in the commit message. Unless I'm overlooking something, git notes don't have an --author flag or similar that would result in actual user history data, so even though git notes look very interesting for a multitude of possible use-cases, in this case, we'd be replacing one BLOB of text with another BLOB of text, whereas the new one needs extra work to create...?

Daniel F. Kudwien
netzstrategen

The advantages distill down

sdboyer's picture

The advantages distill down to significantly simplified parsing and normalized form for expressing user information.

  • The advantages distill down to significantly simplified parsing and normalized form for expressing user information.
  • Remember from core tag clouds how "moshe" and "weitzman" show up separately? We can be done with that. No more space, pipe, \0x$(RAINBOW) delimiting of contributor information that, frankly, is out of band to be at the beginning of a commit message anyway. Everyone's identifying user information is expressed in a normalized form, newline-delimited in a note. Far more readable, and we can also put more and better identifying info in there.
  • We can define note namespaces - e.g., one for reviewers, one for contributors, one for robot chickens. Which means we're not exchanging one BLOB for another. We're exchanging a using a subset of the data in one big messy BLOB for dedicated metadata BLOBs with structured rules, and a command set that makes it easy(er) to obey those rules. If it weren't a good idea, then HTTP wouldn't use headers and would just jam everything into content.
  • Notes can be applied/modified after the commit has without changing the commit hash (and therefore invalidating all subsequent history), unlike anything that touches the commit message itself. Not so important for contributors, but essential for reviewers, especially with per-issue repos.

A very accurate chronicling

sdboyer's picture

A very accurate chronicling of our debate :)

If we can use git notes for

pwolanin's picture

If we can use git notes for reviews, then when not use it for everything?

We can (apparently) go back and add them to existing D8 commits, and use them to indicate multiple contributors to any patch. Can we have a server-side hook to validate each contributor noted in a note (to avoid the misspelled name problem)?

Seems to me having one, consistent approach is more important here - I think pulling in others commits will tend to cause confusion about the actual system for people newly coming in.

If we can use git notes for

sdboyer's picture

If we can use git notes for reviews, then when not use it for everything?

Pretty much my thought. They're the route Git core people really want us to use for adding metadata to commits. Hell, we don't even have to stop at D8 - once we agree on a standard, we could add them all the way back to the beginning. If we wanted.

Forking a project into a sandbox

sun's picture

I've written a handbook page that contains git best practices as a step by step guide for how to fork a project into a sandbox:

http://drupal.org/node/1181472

Thanks @msonnabaum for walking me through the best practices and steps in IRC. :)

Daniel F. Kudwien
netzstrategen

Why is this an "invite-only" group?

pillarsdotnet's picture

So I posted Patch guidelines should recommend "git format-patch" rather than "git diff" and I was referred to this group, which proclaims at the top of the page:

  • This is an invite only group. The group administrators add/remove members as needed.

How do I become a member of the elite team that actually receives updates on this sort of thing, so I can stay informed?

The Drupal 8 Initiatives

webchick's picture

The Drupal 8 Initiatives group is invite-only because unfortunately OG has no other way to allow for curated, announcement-only groups. Patches welcome at http://drupal.org/node/1187832.

I've cross-posted this thread to the "Drupal and Git" group http://groups.drupal.org/drupal-and-git, which seemed the most reasonable place to have this as a broader discussion. You can subscribe there.

In a newly-opened Docs issue

eliza411's picture

In a newly-opened Docs issue about the frequently discussed topic of attribution and Git, http://drupal.org/node/1203964, I mistakenly tried to point the author here, totally forgetting this is a targeted conversation about specific concerns taking place in an invite-only group.

Where is the appropriate place to direct people wanting to engage with the community-wide implications of patching, merging, Git and attribution? The Docs queue just doesn't seem like the most effective venue.

Improvements to core?

jhodgdon's picture

There is an "improvements to core" group (http://groups.drupal.org/improvements-core)... would that make sense? This group here (initiatives) I thought was for the Official Dries-Blessed Drupal Initiatives, not for any random community initiatives. Maybe we need a less-strict Community Initiatives group?

Other groups that might be appropriate: Coding standards and best practices (that's probably better than core improvements and seems like a good place to bring up Git process questions?) -- http://groups.drupal.org/coding-standards-and-best-practices

Not using a sandbox

Jacine's picture

Based on feedback I've received from team members of the HTML5 initiative, I've decided that using a sandbox would add another layer of complexity that we don't need or want. We'll likely have a lot of issues with small scale changes, which will be scattered across multiple components of Drupal. I really want to keep them in the core issue queue so that they are visible to a wider audience and most importantly the subsystem maintainers they affect.

The team members don't want to have to worry about stale code, and working with sandboxes and branches, etc. Many potential contributors don't even know how to roll a patch, so we'd like to lower the barrier here as much as possible, and eventually make them feel comfortable contributing to all areas of core.

I've been told by webchick and Dries that this is fine. However, I'm still a tad confused about how the process of committing will work. The other initiatives all have sandboxes and their commits will go in via larger patches in meta issues, which webchick and Dries will ultimately give a once-over and commit. So, how would this work for the HTML5 Initiative?

dww's picture

I totally support the decision not to use a cloned sandbox of core if you don't need that level of complexity. If your work is going to be small scale chunks in separate issues, totally independent of each other, spread across all of core -- all the better. In that case, the issues should be handled just like always:

  • each issue is as small and discrete a change as possible
  • each issue has as small a patch as possible to implement the desired change and tests for it
  • anyone reviews and improves the patches through the life of the issue
  • when it's RTBC, it's a regular commit with the usual commit workflow

Not exactly sure what you're asking, to be honest. ;) If they're just regular core issues, they're regular core issues, using the tried-and-true workflow we're all familiar with.

Do you foresee any problems with that?

I think the only place you might get into trouble is if you start getting into patches that depend on other patches. Our current tools don't handle issue relationships all that well. However #1036132: Provide a mechanism for issue summaries should significantly help that. If you have patches that are blocked on other issues, I'd suggest marking them "postponed" with a [#nid] like to the parent issue that needs to be committed first.

If you have a specific thing you're worried about, please let us know.

Thanks!
-Derek

I misunderstood...

Jacine's picture

Thanks Derek! :)

Ok, so basically I misunderstood and thought this meant that I would be the one committing these issues directly. I was worried about the fact that it would have been different from the other initiatives (how their patches would go through Angie/Dries first), and I was also about knowing when something could actually be committed because of the critical and major limits, etc.

BUT, luckily as @webchick just explained to me, I wouldn't be the one actually committing. I'd be pinging Dries when I've signed off and issues are RTBC.

So, YAY, I'm all clear now. Phew. :)

still pondering about the ideal workflow

Gábor Hojtsy's picture

Just had yet another discussion about this with plach in IRC. Still pondering about how a git clone and a tree of branches could help us. Within the multilingual initiative, we have several relatively big issues which are only marginally dependent on each other (mostly not dependent) BUT touch the same code. So they can theoretically be applied in any order, and in the interest of enabling progress as much as possible, I think its best to let them be applied in any given order prefered by core maintainers.

Some huge issues include

There are also many D8MI issues which are not multilingual issues per say. Making blocks entities and similar issues just enable us to have multilingual support for them and I don't expect we'd work on them in a multilingual branch/clone.

If you look at the above issues, they are big, touch the very same code base and are (mostly) not dependent on each other. Given that basic language support is there in Drupal, lots of the things we are doing are cleaning up various areas of how it is done to achieve consistency and then expanding work on areas where it is missing (once we have a consistent approach to extend from).

I'm not sure how a git clone + branch workflow would help here. We'd either have per-patch branches, which would be just public versions of the patches applied (does not sound too useful?) or choose to set up a tree of merges between the patches, in which case we define the assumed commit order, which given the issues are largely not dependent on each other does not sound healthy either.

How do you imagine a git clone + branch workflow would help my initiative given the kind of problems we are facing?

Update: The background thinking that initiated this is that D8 is still not open for non-backported changes yet, and we are 2.5 months into my initiative with less than a month to go before Drupalcon London. With no changes landed yet and more and more patches to wrangle as a result, I'm not sure what to show of our progress and we can consider we made progress even. I'm not at all convinced that opening our own copy of the codebase where we can do whatever would lead us to any better progress vs. a better illusion of progress.