Statistics to track - fodder for Intro to Drupal / Health of Drupal

Posted by greggles on October 12, 2007 at 9:23pm
Last updated by greggles on Fri, 2010-01-08 17:11

Part of the basis for any marketing or intro is a set of numbers showing how the project and software are changing over time. There have been lots of one-off attempts to gather these statistics. I'd like to collect the metrics that people feel are important and list how each one is useful (what does this number mean?).

This page is to document the statistics we want to measure and identify potential sources. Eventually I feel that each metric (or group of metrics) should get a handbook page and a spreadsheet file in CVS that holds the actual information. Then we can revisit those documents periodically and update them with new numbers. Feel free to add/remove things via the wiki page or comments.

Metric: # of lines of code, comments, blank lines
Significance: Shows general growth, commitment to documentation at the code level, cleanliness of code.
Source: Ohloh, and Dries' thoughts on it

Metric: % of JavaScript, HTML, PHP
Significance: Helps dispel the myth that Drupal is ugly and lacks "Ajax"
Source: Ohloh again

Metric: Drupal.org users, comments, nodes
Significance: Growth of the flagship community hub
Source: Killes? Lots of people have done this; Steven Wittens perhaps did it best

Metric: # of modules & themes & core translations in cvs.drupal.org over time
Significance: Shows growth of "contrib", which is one of the great features of Drupal; dispels the myth that Drupal has no themes; shows international interest
Source: cvs.drupal.org (greggles can quantify this - anyone with some cvs fu and some time can, really)

Metric: drupal.org visitors, bandwidth consumed
Significance: Yet another sign of growing general interest (interest, though not necessarily contributors)
Source: killes?

Metric: # of patches by user, total # of patches committed since the project module was installed, # of contributors to core per release
Significance: Shows that it's more than just a few folks working on core (though we all know chx is at the top of the list)
Source: Drumm's Drupalcon Barcelona presentation; cvs-release-notes.php + grep gives the same data; see also the Drupal 4.7.0 and 5.x.0 release notes

Metric: sites in the wild that appear to be running Drupal
Significance: Currently our closest measure of the "penetration" or spread of Drupal
Source: Maybe Khalid has blogged about this?

Metric: Number of ponies Eaton owes Dries for commits?
Significance: Funny
Source: Eaton Dries?

Metric: Downloads of the Drupal core tarball
Significance: another measure of penetration
Source: Killes? Dries for 2006 and Dries for 2007

Metric: Module downloads over time
Significance: Shows downloads of popular modules over time
Source: November 15, 2006, and my numbers for April 2007, January 2007, and October 2006

Metric: IRC stats
Significance: shows strength of development and support communities
Source: http://drupal.zind.net/drupal.html and http://drupal.zind.net/drupal-support.html

Metric: Page requests/second for each release of core
Significance: Shows improvement in performance of core
Source: Dries for 4.7 vs. 5.0

Dumping ground of queries

(I wrote these up all pretty and then lost my edit! argH!)

Groups by type
SELECT COUNT(DISTINCT n.nid), n.type, td.name FROM node n INNER JOIN term_node tn ON n.nid = tn.nid INNER JOIN term_data td ON tn.tid = td.tid WHERE n.status = 1 AND td.vid = 3 GROUP BY n.type, td.name;

Projects by type
SELECT COUNT(DISTINCT n.nid), n.type, tn.tid FROM node n INNER JOIN term_node tn ON n.nid = tn.nid WHERE n.status = 1 AND tn.tid IN (14, 15, 96, 29) GROUP BY n.type, tn.tid;

Gender
SELECT COUNT(1), p.value FROM profile_values p INNER JOIN users u ON p.uid = u.uid AND u.login > 0 AND u.status = 1 WHERE p.fid = 7 GROUP BY p.value;

People who have committed something
SELECT COUNT(DISTINCT uid) FROM cvs_messages;

Active users
SELECT COUNT(1), status FROM users WHERE login > 0 GROUP BY status;

Comments

Some stats for marketing

Posted by Amazon on October 18, 2007 at 10:44pm

October 17th, 2007

http://drupal.org/forum - 300K posts
General topics: 1069 + 11174 + 2555 + 238 + 704 (15740 general posts)
Support topics: 2336 + 7490 + 1901 + 40035 + 763 + 445 + 2239 + 9327 + 4984 + 495 (264657 Support posts)
Development topics: 191 + 233 + 1268 + 495 + 333 + 269 + 366 (14701 Development posts)

Collaboration of the Drupal development community. Measure unique contributors who have followed up on an issue.
Oct 5, 2006 - Oct 2007
8800 unique contributors who followed up on Drupal.org project issues.
1770 unique contributors who posted patches on Drupal.org project issues.
Growth rates of unique contributors have been doubling since 2004.

A Little Benchmarking Material

Posted by libsys-gdo on October 19, 2007 at 5:34pm

"In open-source development communities, 4% of members account for 50% of answers on a user-to-user help site (Lakhani & Hippel, 2003), and 4% of developers contribute 88% of new code and 66% of code fixes (Mockus, Fielding, & Andersen, 2002)."
http://jcmc.indiana.edu/vol10/issue4/ling.html

The stats are a little out of date; I wonder if the ratios have changed in recent years.
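One way to check whether ratios like these hold for a given project is to compute the share of contributions made by the top few percent of contributors. The following is only a minimal sketch: the function name `top_share` and the patch counts are made up for illustration and are not real Drupal data.

```python
def top_share(counts, top_fraction):
    """Share of all contributions made by the top `top_fraction` of contributors."""
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-zero contribution count")
    ordered = sorted(counts, reverse=True)
    # Always include at least one contributor in the "top" group.
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Hypothetical patch counts for 25 contributors (illustrative only).
patch_counts = [500] + [10] * 4 + [1] * 20
share = top_share(patch_counts, 0.04)
print(f"Top 4% of contributors account for {share:.0%} of patches")
```

Fed with per-user patch or answer counts from the project's own data, the same function would show how the ratios move from year to year.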

From a PM: I realized that

Posted by Bèr Kessels on October 30, 2007 at 10:00am

From a PM:

I realized that many of these are included in xstatistics module or could
be added there.

If you are interested, I will grant you CVS write access to that module, so that we/you can start adding these features in a (6.x?) branch.

I would like to extend the module with the following major
features:

Show things like nodes, users, comments, etc. created within date ranges
since the creation of the site.

I plan to execute the first one by creating a database table
date_begin|date_end|nodes|users|comments|

I'd say that a table dedicated to storing this is not a very good idea. Much better would be to build a more general statistics-summary table: rather than limiting its columns to users, nodes and comments, think of a slightly broader concept.

The thing is that xstatistics itself has a major design limitation: it has the queries hardcoded.
What I was thinking about is giving each query/result(-set) a unique ID. This could be an MD5 hash of the query, or a more human-readable string, like 'count_nodes_in_queue'. Right now we have very few queries, so merely mentioning them in a comment in the code would do, IMO.
Then your summary table could look like:
date_begin|date_end|id|value
Where value is either an integer, a string, or, for now[1], a serialised struct or array. The SQL type should therefore be general enough, something like "text". Reading that is suboptimal, but until we can revert to only ints (see note below) it will suffice. Such tables will probably not be read a whole lot anyway, compared to e.g. node or user tables.
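The query-ID idea and the generic summary table can be sketched roughly as follows. This is an illustration under assumptions, not xstatistics code: the table name `stats_summary`, the helper names, the example query, and the stored numbers are all hypothetical, and SQLite stands in for the site's database.

```python
import hashlib
import sqlite3


def query_id(sql, label=None):
    """Return a stable ID for a statistics query.

    Uses the human-readable label when one is given, otherwise the MD5
    hex digest of the query text with whitespace normalised.
    """
    if label:
        return label
    normalised = " ".join(sql.split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def create_summary_table(conn):
    # Columns follow the layout date_begin|date_end|id|value; value is
    # general TEXT so it can hold an integer, a string, or serialised data.
    conn.execute(
        "CREATE TABLE stats_summary ("
        " date_begin INTEGER NOT NULL,"
        " date_end INTEGER NOT NULL,"
        " id TEXT NOT NULL,"
        " value TEXT NOT NULL)"
    )


def store_result(conn, date_begin, date_end, qid, value):
    conn.execute(
        "INSERT INTO stats_summary VALUES (?, ?, ?, ?)",
        (date_begin, date_end, qid, str(value)),
    )


conn = sqlite3.connect(":memory:")
create_summary_table(conn)
sql = "SELECT COUNT(*) FROM node WHERE status = 1"  # example query only
# Illustrative Unix timestamps and a made-up result value.
store_result(conn, 1191196800, 1193875200, query_id(sql, "count_published_nodes"), 1234)
store_result(conn, 1191196800, 1193875200, query_id(sql), 1234)
```

A readable label is easier to work with by hand, while the hash needs no naming discipline but changes whenever the query text does.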

Next thing is that I think your start/end approach is not the most optimised. After all, the next row's 'start date' is always the same as the current row's 'end date'...
Therefore the "range" can simply be a setting, used only to decide when to store a new row, and each row needs just one timestamp. That would make your columns even simpler:
created_at|id|value
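To illustrate why the begin/end pair is redundant, here is a small sketch that stores only `created_at|id|value` snapshots and rebuilds the ranges from consecutive rows. The table name `stats_snapshot`, the function names, and the sample figures are assumptions for illustration; values are assumed to be integers stored as text.

```python
import sqlite3


def create_snapshot_table(conn):
    # One row per (query id, moment); no explicit range columns.
    conn.execute(
        "CREATE TABLE stats_snapshot ("
        " created_at INTEGER NOT NULL,"
        " id TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (id, created_at))"
    )


def intervals(conn, qid):
    """Rebuild (date_begin, date_end, change) rows from consecutive snapshots."""
    rows = conn.execute(
        "SELECT created_at, value FROM stats_snapshot"
        " WHERE id = ? ORDER BY created_at",
        (qid,),
    ).fetchall()
    # Each row's timestamp is the end of one range and the start of the next.
    return [
        (prev_ts, ts, int(val) - int(prev_val))
        for (prev_ts, prev_val), (ts, val) in zip(rows, rows[1:])
    ]


conn = sqlite3.connect(":memory:")
create_snapshot_table(conn)
# Made-up snapshots of a node count at three moments.
conn.executemany(
    "INSERT INTO stats_snapshot VALUES (?, ?, ?)",
    [(100, "count_nodes", "10"), (200, "count_nodes", "25"), (300, "count_nodes", "45")],
)
print(intervals(conn, "count_nodes"))
```

The "created within a date range" figures then fall out of the differences between snapshots, so nothing is lost by dropping the second date column.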

What do you think? Can we improve this? Am I completely wrong? Do I misunderstand your approach?

[1] I sincerely believe serialised data is BAD, and that it should only be used by people who don't (yet) know how to design a proper database. For now it will do, but a next step would be to make this a pointer (a foreign key) into a table that contains that data in a proper, normalised way.

I think range and frequency of use

Posted by greggles on October 30, 2007 at 11:17am

I think you're probably right about the range. I was thinking the wrong thing about that. The question really is just "how many were there at point X".
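The "how many were there at point X" question maps onto a single lookup: take the most recent snapshot at or before X. A minimal sketch, assuming a hypothetical `stats_snapshot` table with `created_at`, `id` and `value` columns and made-up data:

```python
import sqlite3


def value_at(conn, qid, point):
    """Return the latest stored value for `qid` at or before `point`, or None."""
    row = conn.execute(
        "SELECT value FROM stats_snapshot"
        " WHERE id = ? AND created_at <= ?"
        " ORDER BY created_at DESC LIMIT 1",
        (qid, point),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats_snapshot ("
    " created_at INTEGER NOT NULL, id TEXT NOT NULL, value TEXT NOT NULL)"
)
# Illustrative user counts at three moments.
conn.executemany(
    "INSERT INTO stats_snapshot VALUES (?, ?, ?)",
    [(100, "count_users", "10"), (200, "count_users", "25"), (300, "count_users", "45")],
)
print(value_at(conn, "count_users", 250))
```

Asking for a point before the first snapshot returns None rather than guessing a value.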

I like the idea of a simple table that can be used for a lot of different kinds of values. But I want to balance that against forcing a lot of data into one table just so we don't have to design/build/maintain a normalized data structure. Especially with the Schema API in Drupal 6, that kind of table maintenance has become quite simple.

I'll begin working on patches and we'll see where it goes.

Thanks.