Using `git notes` for doling out real contribution credit

You are viewing a wiki page. You are welcome to join the group and then edit it. Be bold!

Sam Boyer writes:

Our method for giving contributors credit sucks. We can't do realtime statistics of essentially any kind since we're reliant on parsing commit messages in a broken fashion. In the discussion we had about git workflows, I mentioned that we could use Git's notes feature to solve a ton of our problems related to giving credit where credit is due.

Basically, notes are a way that you can attach additional metadata to a commit. Lemme break that down:

  • The contents of a note is just an uninterpreted, arbitrarily large string - we can put anything in there we want.
  • Each commit can have one note per note namespace.
  • There can be an arbitrarily large number of note namespaces.
  • Whereas modifying the commit directly will change the commit's hash, which breaks history (similar to rebasing), notes can be added or modified at any time without changing hashes. So we can apply this approach retroactively, as well.

Rather than talking abstractly about what this could do, let me just give an example. Imagine that we had namespaces for code, reviews, and documentation. This real commit from core:

$ git log -1 b011fab8
commit b011fab838538b3ec21bff75ea806958570fdeef
Author: webchick <webchick@24967.no-reply.drupal.org>
Date:   Thu Jul 7 21:06:19 2011 -0700

     Issue #949616 by catch, a.mikheychik, adamdicarlo, dixon_, anavarre: Fixed Multiple parent relations confirmation form loses term description.

Could have all this extra metadata added:

$ git log -1 b011fab8
commit b011fab838538b3ec21bff75ea806958570fdeef
Author: webchick <webchick@24967.no-reply.drupal.org>
Date:   Thu Jul 7 21:06:19 2011 -0700

     Issue #949616 by catch, a.mikheychik, adamdicarlo, dixon_, anavarre: Fixed Multiple parent relations confirmation form loses term description.

Notes (code):
     dixon_ <dixon_@239911.no-reply.drupal.org>
     catch <catch@35733.no-reply.drupal.org>
     a.mikheychik <a.mikheychik@167981.no-reply.drupal.org>
     adamdicarlo <adamdicarlo@100783.no-reply.drupal.org>

Notes (documentation):
     adamdicarlo <adamdicarlo@100783.no-reply.drupal.org>
     joachim <joachim@107701.no-reply.drupal.org>

Notes (review):
     joachim <joachim@107701.no-reply.drupal.org>
     realityloop <realityloop@139189.no-reply.drupal.org>

Note: I did try to use people who participated in the issue, but the credit given here is really just me futzing around, not anything real.

As many namespaces as we want, and as many items in each namespace as we want. The same person can get credit for doing multiple things. If/when we get to per-issue repos, we can add buttons to "mark a commit reviewed," which would add our byline to the relevant note. We use the same conventions as how the author field is typically formatted, so it's really standardized.

Here's a couple of the commands I ran to make these additions, you'll get the idea very quickly:

$ git notes --ref code append -m "dixon_ <dixon_@239911.no-reply.drupal.org>" b011fab8
$ git notes --ref documentation append -m "joachim <joachim@107701.no-reply.drupal.org>" b011fab8

The note message could easily be derived from the information provided by the new API for providing authorship credit, as it's the same string.

Once we have all that granular data, the sky's the limit for how we can slice and dice it into much better, smarter statistics.

Comments

While I'm completely

sdboyer's picture

While I'm completely confident about the ability of notes to handle this problem of giving credit to multiple people for multiple different types of contributions, there are some edge cases I still have to investigate wrt notes' internal behavior in Git. I think recent versions of Git have it all pretty much sorted, but I still have to see how notes survive interactive rebases, what additional configuration is necessary in order to make pulling notes work perfectly right alongside pulling code, and how note conflicts get resolved.

This seems like a reasonable

webchick's picture

This seems like a reasonable way to do this, especially since we can back-credit all former commit messages, given enough time/volunteers (or a smart enough regex).

I guess the main thing to figure out is what type of data would we want to capture about an issue's contributors? Maybe "originally reported by" would be valuable? "Signed off by"? (the person who pushed it to RTBC)

We should try and prioritize these facets, because I'm assuming here that committers would need to do this, and this is going to add 10+ minutes to any non-trivial commit, even with just the 3 metric points you've outlined above. Unless it can be easily scripted, and my guess is that most of it really can't and really needs a human's interpretation of all 50+ comments. :\ So I'd prefer to lean towards easier-to-gather, high-impact metrics. This isn't just for me, it's for any other committer of any other project where a large team contributes rather than a maintainer/co-maintainer pair with occasional guest appearances.

Are there any other external projects we can look at who attribute similarly detailed commit credit (even if they don't use git-notes for it)?

Unless it can be easily

sdboyer's picture

Unless it can be easily scripted, and my guess is that most of it really can't and really needs a human's interpretation of all 50+ comments

We seem to trust dreditor to do this right now. I see no reason why this would be any different. And scripting it would be very easy, though not quite trivial.

Also, to be clear - "Signed off by" is something that's attached to the commit message. It is therefore useless to us, because it can't be used after the fact.

It's different because

webchick's picture

It's different because Dreditor is generating a CSV of usernames answering just one question: "Who worked enough on this issue to deserve commit credit?" It can take a reasonable guess by gathering up names of people with an attachment next to their name, and/or > some percentage of comments (I'm not sure what the exact algorithm is).

However, the template you propose above is actually asking 3 questions:
- "Who worked on the code for this issue enough to deserve commit credit?"
- "Who worked on the documentation in this issue enough to deserve commit credit?"
- "Who did a meaningful enough patch review in this issue to deserve commit credit?"

I'm not sure how Dreditor or any other automated thing is going to be able to answer those. They require human consideration, afaik.

Oh, and it goes without

webchick's picture

Oh, and it goes without saying that we need a dead-easy way to get those Git e-mail addresses that is not "click the username, click edit, click E-mail addresses, copy/paste "Primary address" times each. fricking. person.

I know you're working on this problem to some extent though in http://drupal.org/node/1127320

Yep, we'd be able to use

sdboyer's picture

Yep, we'd be able to use exactly that API. I'm planning on deploying it today.

Are there any other external

sdboyer's picture

Are there any other external projects we can look at who attribute similarly detailed commit credit (even if they don't use git-notes for it)?

Not AFAIK. I haven't done an extensive search or anything, though - I just haven't run across any. Tweeted :)

Original reporter seems

catch's picture

Original reporter seems potentially useful, at least for bug reports, and could be automated by dreditor (and possibly scripted retrospectively for older commits).

So:

reported
coded
reviewed
documented
?

I'm not sure how useful RTBCed would be - that can sometimes happen multiple times during an issue by different people - there is a lot of setting issues back to RTBC after the test bot was broken etc. that doesn't necessarily mean the person reviewed the patch, sometimes issues are RTBCed multiple times by different people etc, seems like a lot of extra work compared to just including it in reviewed.

Overall this looks great.

Possible note candidate

amontero's picture

reported
coded
reviewed
documented
+backported?

This sounds awesome. Metadata

wizonesolutions's picture

This sounds awesome. Metadata FTW. Upvoted.

WizOne Solutions - http://www.wizonesolutions.com - Drupal module development, theme implementation, and more
Fill PDF Service - http://fillpdf-service.com - Hosted solution for Fill PDF

I like this

Crell's picture

Overall I like this idea. I don't know how to use git notes yet, so I'm hoping that it's an easy task. :-)

In general though I like the flexibility this offers, especially since it can cover legacy CVS, current Git, and per-issue-repos-and-ponies Git future where committer actually means something. It allows us to capture more raw data than we do now, and we can slice and dice it later however makes sense.

I don't think it's a problem having a human decide who goes where; we have to do that now. It's only a marginal increase to decide "is this person coding or is this person reviewing"? However, I agree with webchick that the CX (Committer Experience) needs to be kept super easy, or no one will bother. Right now it's simple. Type "Issue #123 by Person A, Person B: Do cool stuff", done. It needs to be that easy for the committer to use this approach or it won't get used. We also need to cover, or at least address, the degenerate case of one person who is reporter, author, and committer because it's his module.

Also, I presume if a module maintainer is using git properly and has a zillion small commits in his own name (or anyone else's) that won't screw up any data parsing, right? Because I do that. Like, a lot. :-)

I'm not sure I understand why

deviantintegral's picture

I'm not sure I understand why git notes is worth the additional overhead compared to having separate commits with authorship information. In the example above, could each code and documentation commit be authored by the appropriate contributor, and reviews marked by amending a commit and signing off on it.

I suppose if you want to attribute those who work on an issue without providing any patches then git notes might be the only solution.

In a patch-based workflow,

sdboyer's picture

In a patch-based workflow, multiple commits aren't possible. Even in a commit-based workflow, individual commits may not be the work of only one person. And in a commit-based workflow, amending the commits means changing hashes, forced pushes and rebasing, which isn't particularly friendly to a shared collaborative public repo workflow.

As for "extra overhead," I would argue that the only reason notes some onerous is because they're unfamiliar. Which isn't irrelevant, but...really, notes are the preferred way (according to core Git people) to express additional metadata about a commit. Commit messages are the lazy, hacky way of doing it. Notes give us a nice, structured location for the data. Overall, I think the complexity of notes vs. commit message munging, considering both client & server side commands/work to be done, is either a wash or slightly in favor of notes.

Remember that --sign-off was part of the original approach designed for the kernel, where they don't have public collab repos (they use git-am) and so don't have to care about hash consistency.

"In a patch-based workflow,

gagarine's picture

"In a patch-based workflow, multiple commits aren't possible."

so solve the right problem...

In a no patch workflow we can create a branch for each issue...having pull request and so and so.

This will solve thousands of problem.

simply having pull requests

sdboyer's picture

simply having pull requests doesn't magically solve the problem of doling out credit.

please read the original proposal.

First I didn't say "only pull

gagarine's picture

First I didn't say "only pull request". I write branch by issue, possibility to clone this branch and finally pull request (than will be the equivalent to propose a patch).

It doesn't "magically" solve all the problems, but it do solve some problem related to the code. For example issue with multiple patch (so multiple commit) every one will have is attribution for what he actually write.

Attribution for people doing review is in my opinion overkill. The can do their review by correcting the code and making a commit!

With annotation in a patch based workflow isn't core/contrib maintainer than will have to do the job? Personally, I already don't have enough time to maintain my modules. But if I need to write long and complex annotation this will really kill the fun. I want to do a git add remote / git fetch / git merge / git pull and go on the next one.

The github way is a success because it's simple and works. #sarcasm So when drupal.org is moving to github?

First I didn't say "only pull

sdboyer's picture

First I didn't say "only pull request". I write branch by issue, possibility to clone this branch and finally pull request (than will be the equivalent to propose a patch).

this proposal has little to do with pull requests (though if you read carefully, you'll note i mentioned per-issue repos). it has to do with the manner in which credit is given. we've got plenty of discussions about pull requests (i'd love to see something like it, but there are blockers on the way to it). please don't muddy the waters.

to be clear, however, i am not arguing for this as a means of staying in a patch workflow. far from it. if anything, the structured use of notes is a clear, empirical improvement on pull requests, as it allows for the signed-off-by behavior which email-based (linux kernel, git core) system utilize in an environment where hash changes via rebasing can't (always) be done.

It doesn't "magically" solve all the problems, but it do solve some problem related to the code. For example issue with multiple patch (so multiple commit) every one will have is attribution for what he actually write.

on re-reading my original post, i realize that without an awareness of some of the back-discussion that prompted it, it could seem like i'm advocating patch-based forever. i'm not. i wrote this in the context of something we could do now to improve the way we give out credit, and it would simply grow in its effectiveness as we introduce a per-issue repo/pull request-type system in the future.

With annotation in a patch based workflow isn't core/contrib maintainer than will have to do the job? Personally, I already don't have enough time to maintain my modules. But if I need to write long and complex annotation this will really kill the fun.

as webchick and i were discussing in the other comments, this really isn't worth doing unless it's scriptable. i'm a module maintainer, too. the last thing i want to do is make that job harder.

I want to do a git add remote / git fetch / git merge / git pull and go on the next one.

nothing about this proposed system would prevent you from doing that with per-issue repos. as i alluded to above, this would simply be an additional layer of data on top of the first-class commit data.

#sarcasm So when drupal.org is moving to github

when the benefits of doing so outweigh those of keeping the community centralized and organized.

This sounds awesome!

Wim Leers's picture

This sounds awesome!

Being able to attribute to

boombatower's picture

Being able to attribute to more then just code and pick up old data...this is great.

it is really difficult for me

AFTIPR's picture

it is really difficult for me to manage

PSA about issue

marvil07's picture

Just in case someone missed it, there is a new issue about this in d.o
Add support for reading git-notes.

Improvements to core

Group categories

Category

Group notifications

This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds:

Hot content this week