How Many Contributors Were There for Drupal 6?

greggles's picture

I've tabulated some data based upon the commit messages for Drupal 6. Note that this is not a good judge of contributors; a better judge would be "people who commented on an issue and changed the status", "uploaded an attachment", or even just "posted a comment", but this is one way of gathering the data.

I wanted to capture my method and share the output data. The data is available here: http://spreadsheets.google.com/pub?key=pVuTBhrgLH93h7VQNyyynXw

cvs -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal co drupal
wget http://cvs.drupal.org/viewvc.py/drupal/contributions/tricks/cvs-release-...
mv cvs-release-notes.php\?revision\=1.6 drupal/cvs-release-notes.php
cd drupal
chmod +x cvs-release-notes.php

./cvs-release-notes.php DRUPAL-5-0 HEAD > CVS_CHANGELOG.TXT

Copy and paste the resulting data out of CVS_CHANGELOG.TXT into spreadsheet column H (column H so that this formula will work):
=MID(H2;FIND("by";H2)+3;FIND(":";H2)-FIND("by";H2)-3)

Copy/paste that all the way down the spreadsheet. This finds the 'by *** :' section and extracts just the ***.
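For anyone more comfortable outside a spreadsheet, here is a rough Python sketch of what that MID/FIND formula does: grab the text between "by " and the first ":". The sample message below is made up, not from the actual Drupal log.

```python
def extract_contributors(message):
    """Return the text between 'by ' and ':' in a commit message, or None."""
    start = message.find("by")
    end = message.find(":")
    if start == -1 or end == -1 or end <= start:
        return None
    # +3 skips past "by " (the same offset the MID/FIND formula uses)
    return message[start + 3:end].strip()

print(extract_contributors("#12345 by chx, pwolanin: fixed the menu system"))
# → chx, pwolanin
```

Like the formula, this is naive: a "by" or ":" elsewhere in the message will confuse it, which is why the manual cleanup steps below exist.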

  1. Copy that column.
  2. Paste into a text document.
  3. Copy back into OpenOffice.org Calc (spreadsheet program); this fires the text import wizard.
  4. Split on spaces and commas.
  5. Find and replace all commas.
  6. Find and replace junk: find "and" (match "Entire cells") and replace with a blank space; ditto for "et", "al", "by", etc.
  7. Manually fix certain names (Jose A Reyero, Gurpartap Singh, David Strauss, Moshe Weitzman, Rok Zlender).
  8. Manually fix lots more stuff.
  9. Manually replace "myself" with Goba (at least for this release that's the case).
  10. Put everyone into one column, run a Pivot Table on it, poof.
  11. Publish, get more people to review, manually fix more, repeat.

Comments

thanks for documenting

Gábor Hojtsy's picture

Thanks for documenting, so we can do this in later releases and compare. For the record, that spreadsheet lists 272 contributors, the top ten being:

username      mentions  rank
chx                141     1
pwolanin            87     2
Goba                67     3
dvessel             60     4
keith.smith         55     5
webernet            48     6
webchick            47     7
JirkaRybka          41     8
catch               38     9
bjaspan             36    10

Some people expected that we'd get a number beating the 492 in the Drupal 5.0 release announcement, though. Since the number of issues closed/fixed has grown to more than 1540, the contributor number is expected to grow as well, if we can find out how the 492 was counted. Repeating the "number of people who commented on issues" count on the 1540 issues we had should lead us to a comparable number, I hope. Anyone interested in taking that up? (That definitely requires access to the drupal.org database.)

Here's Drumm's slides with

catch's picture

Here are Drumm's slides with similar stats for Drupal 5 for reference [1]. I guess the queries run for those will be different with IFAC, but maybe some of it could be reused if it's still around?

  1. http://www.slideshare.net/drumm/maintaining-your-own-branch-of-drupal-core/

hm, no queries in the slides

Gábor Hojtsy's picture

Hm, we need to gather these queries from him, hm.

FIND(":";H2) needs to

drumm's picture

FIND(":";H2) needs to include comma and period, both are used.

handled later, right?

greggles's picture

I think that was handled later:

split on spaces and commas

That and the manual editing handles messages like "#123 by someone, another, and another: something".

If there was a commit message like "#1235: by someone" then that would break my parsing steps, but I didn't encounter any of those.

This isn't meant to be perfect (GIGO and all that), just close enough to be useful without taking me 10 hours.
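A slightly more defensive parse than the spreadsheet formula would anchor on "by" first, so that a colon in the issue number ("#1235: by someone") or a period after the names (per drumm's comment above) doesn't break extraction. This is just a sketch; the example messages are hypothetical.

```python
import re

# Capture everything between "by " and the first ":" or "." after it
BY_PATTERN = re.compile(r"\bby\s+(.+?)[:.]")

def names_from_message(message):
    """Return the name list following 'by', or None if no attribution found."""
    match = BY_PATTERN.search(message)
    return match.group(1).strip() if match else None

print(names_from_message("#123 by someone, another: something"))  # someone, another
print(names_from_message("#1235: by someone. fixed it"))          # someone
```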

--
Open Prediction Markets | Drupal Dashboard

That's a neat bit o' data,

keith.smith's picture

That's a neat bit o' data, speaking as someone who ranked #5, #74 and #180 (#180 is probably a duplicate of a commit already counted in #74).

Note though that your rank column is really a "row" column, as rank should be incremented only when the number of commits changes, to properly account for banding (i.e., everyone with a count of 1 should have the same rank).

fixups

greggles's picture

Yes, I was hoping that by publishing this I'd get feedback on names that were mixed up. JirkaRybka vs JirkaRbyka was another one.

I also did a slightly better little ranking formula that relies on a column for the row number: =IF(B1598=B1597;C1597;D1598). This now counts ties in mentions as the same rank and jumps to the row number at the next level after the "tie". That seems like how I've seen it done.

I updated the spreadsheet based on these - same location - http://spreadsheets.google.com/pub?key=pVuTBhrgLH93h7VQNyyynXw

--
Open Prediction Markets | Drupal Dashboard

Thanks for merging those

keith.smith's picture

Thanks for merging those entries, and refining this further. This is getting close. I notice that the Rank jumps some numbers entirely, though, like "12", "14", "18", "22", "23", "70" -"78", etc.

yes, on purpose

greggles's picture

The rank skips 12 on purpose. For example, Crell and merlinofchaos are tied at 30 mentions each, so they are both rank #11. The next person is ranked not 12th but by their row number, since they aren't 12th in line but in fact 13th. At least this is the way I've always seen it done for rankings with ties.
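The tie handling described here is what's usually called standard competition ranking: tied entries share a rank, and the next distinct value gets a rank equal to its 1-based row position. A small Python sketch (the counts are made up):

```python
def competition_rank(counts):
    """counts: mention counts sorted descending. Returns a rank per entry."""
    ranks = []
    for row, value in enumerate(counts, start=1):
        if ranks and value == counts[row - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(row)         # otherwise rank = row number
    return ranks

print(competition_rank([40, 30, 30, 28]))
# → [1, 2, 2, 4]  (the two ties share rank 2; rank 3 is skipped)
```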

--
Open Prediction Markets | Drupal Dashboard

Oh. Ok, yes, I understand

keith.smith's picture

Oh. Ok, yes, I understand what you're doing now. I've usually seen it simpler than that, without the skips, as rank does not necessarily imply a relationship to row order. But, now that I understand it, I see that what you have here is a perfectly reasonable way of doing it. Thanks for explaining (and thanks for the table!).

new technique

greggles's picture

The original script has some problems. It only counts edits to existing files, and it requires a tag, so you could only run it up to a BETA, RC, or UNSTABLE release, not to the tip of HEAD.

Now I've got db access :) Here's what I do:

SELECT DISTINCT cvsm.message FROM cvs_files cf INNER JOIN cvs_messages cvsm on cf.cid = cvsm.cid WHERE nid = 3060 AND cf.cid > 99596 and branch = '';
  • The message is where we know usernames; we DISTINCT it because it could be on multiple commits and/or files.
  • We need cvs_files to know that the branch is '' (i.e., this is HEAD).
  • We need cvs_messages to get the message.
  • nid 3060 is Drupal.
  • cid = 99596 I determined was the first commit to Drupal 7 by guessing based on dates.

I output it to a csv file using:

mysql -u username -p {stuff stuff stuff} -e "select DISTINCT cvsm.message from cvs_files cf inner join cvs_messages cvsm on cf.cid = cvsm.cid where nid = 3060 and cf.cid > 99596 and branch = ''" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > mysql_exported_table.csv

On the d.o hardware this query executes in way less than a second :)

From that point on it's the same process as the original post.

New data due soon. Teaser: 469 contributors to Drupal 7.x core, up from "200 to 300" back when Drupalcon DC happened.

P.S. There are some names that my parsing screws up. I now do a find/replace for " " to "_" before splitting things and then back again after. Here is the current list:
Jody Lynn
Damien Tournoud
David Strauss
Dave Reid
Rob Loach
Jose A Reyero
Gurpartap Singh
Moshe Weitzman
Rok Zlender
David Rothstein
Roger López
John Resig
Josh Waihi
Todd Nienkerk
David Rothstein
Steven Jones
Jody Lynn
Ian Ward
Narayan Newton
Darren Oh
Robin Monks
John Morahan
Uwe Hermann
Wim Leers
Alexander Pas
Ryan Palmer
wretched sinner
Steven Jones
Aron Novak
Ryan Palmer
Garret Albright
Mike Wacker
Gerhard Killesreiter
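The underscore trick above could be scripted like this: protect known multi-word names before splitting on spaces, then restore them afterwards. The name list here is abbreviated from the one in the post.

```python
# Abbreviated from the full list above
MULTI_WORD_NAMES = ["Dave Reid", "Moshe Weitzman", "Jose A Reyero"]

def split_names(field):
    """Split a name field on whitespace while keeping multi-word names intact."""
    for name in MULTI_WORD_NAMES:
        field = field.replace(name, name.replace(" ", "_"))
    names = [n.strip(", ") for n in field.split()]
    return [n.replace("_", " ") for n in names if n]

print(split_names("Dave Reid, chx, Moshe Weitzman"))
# → ['Dave Reid', 'chx', 'Moshe Weitzman']
```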

--
http://growingventuresolutions.com | http://drupaldashboard.com | http://drupal.org/books

code swarm

marvil07's picture

Some time ago I posted a message about making a code swarm video for Drupal, which I made based on CVS users.

Apropos of this post and the message on the devel list, I made a Drupal 7 branch one to test my slow log parser :-p.

So, maybe it could be interesting for some of you to take a look at them.

PS: I do not know why I was detected as a spammer so many times, so I put the files outside (code swarm files and slow log parser).