Project metrics for drupal.org redesign


One of the big goals of the drupal.org redesign is to make it easier for end users to find the right modules for the site they're trying to build. With over 5,000 contributed modules, many of them providing similar functionality, it can be extremely difficult to choose. One method is to try to assess the "health" of the module, by how actively it's maintained, used, supported, etc.

One approach to gauging health is posted at Project ratings and reviews for drupal.org redesign. While that proposal deals with subjective factors, this post addresses objective facts about a module that can be computed and displayed for all projects hosted on drupal.org. While no single metric can tell you what module to use, and some knowledge will be required to make the best use of this data, it's important to make these statistics more readily available on drupal.org to empower users to make better decisions.

The redesign prototype for project pages includes the introduction of a sparkline showing the "Activity" for each project.

Project activity chart
The details of this activity chart weren't specified at a technical level during the design phase, but the spirit of the design is that the designers wanted more ways to visualize the health of a project. As we implement the redesign, we've been empowered to provide as many charts as we think are useful, containing whatever specific data will best help end users make sense of what's going on with a project. Read on for our specific proposal, including what metrics to compute and some ideas on how those are going to be visualized on the new drupal.org.

Specific metrics

There are literally dozens and dozens of metrics that could be captured and displayed. We've already got support for the usage of any given project and we're working on support for download statistics. Beyond those, we believe the following are important to assess the overall health of a project at a glance:

  1. Issue activity: Each project category (bug, feature request, etc.) -- number open vs. number fixed/closed
  2. Number of issue reporters (unique users filing new issues in a defined time period)
  3. Number of issue participants (unique users filing issues or comments in defined time period)
  4. Total number of issue comments posted over a week (ideally with a separate total for the number of comments by any of the project maintainers, both of which can be graphed on the same sparkline).
  5. Release activity: Number of releases in each time period

We're proposing to normalize all metrics to a weekly granularity. This would both simplify the storage (so we're not trying to store daily metrics) and the UI (since it'd be best if all the charts used the same granularity to make it easy to compare them).
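
To make the weekly normalization concrete, here is a minimal sketch in Python (not the Drupal implementation) of bucketing events by week and counting unique users per bucket. The choice of Monday as the start of the week, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (Monday start is an assumption)."""
    return day - timedelta(days=day.weekday())


def weekly_unique_users(events):
    """Count unique users per week from (date, user) pairs."""
    users_by_week = defaultdict(set)
    for day, user in events:
        users_by_week[week_start(day)].add(user)
    return {week: len(users) for week, users in sorted(users_by_week.items())}


# Hypothetical sample data: (date an issue was filed, reporter).
issues = [
    (date(2010, 8, 2), "alice"),
    (date(2010, 8, 4), "bob"),
    (date(2010, 8, 6), "alice"),
    (date(2010, 8, 10), "carol"),
]
reporters = weekly_unique_users(issues)
```

The same helper would cover "issue participants" if the input also included comment authors, since only the set of events changes, not the bucketing.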

Additionally, the following metrics could be useful, but we might not have time to implement them for the initial launch of the redesign:

  1. Commit activity (This would be great, but it's not worth our time to add this for CVS with the Git migration imminent).
    • Number of lines added/removed
    • Number of commits
  2. Total number of tests and percentage of tests that pass
  3. Total lines of code vs. lines of comment
  4. Average length of time that issues are open
  5. Number of unique users
    • submitting patches
    • reviewing patches

We're trying to strike a balance between what's relatively easy and sane to compute, implement and display visually, and things that will help users find the best project for their particular needs. Given that, if you have suggestions for other metrics we should be considering, please comment below!

Enter the project_metrics module

We looked at a lot of Drupal charting modules (see below), but basically none of them handle the storage for you. So, no matter how we end up displaying this data, we need somewhere to compute and store it. Enter the project_metrics module.

This would be a new sub-module included directly in the Project project. While the project_usage module is really only relevant to sites that are using Project to manage releases of Drupal code to track update_status usage, the project_metrics module could be useful for just about anyone running the Project suite. It would be responsible for computing the metrics and storing them.

To compute, the idea is we'd write a series of drush commands that would be run periodically to do all the heavy lifting to compute the right statistics for a given week. These commands would insert records into the project_metrics module's DB tables. Then, project_metrics would provide various ways to access and display the data (see below).

The basic architecture of the module is that it would invoke a hook to allow other modules to advertise what metrics they want to provide. The project_metrics module would then be responsible for invoking the appropriate functions in the other modules at the right frequency and storing the results. So, project_metrics itself wouldn't know how to query the issue database tables looking for statistics. That'd still be the responsibility of the project_issue module. However, project_issue wouldn't have to worry about invoking itself via cron, wouldn't have to manage its own tables to store the historical data, etc.
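
The division of labor described above can be sketched roughly as follows. This is a Python illustration of the pattern, not Drupal code: the decorator stands in for the proposed hook, and the registry, table layout, metric name, and canned numbers are all hypothetical.

```python
import sqlite3

# Hypothetical registry: metric name -> function(project, week) -> number.
METRIC_PROVIDERS = {}


def provides_metric(name):
    """Stand-in for the proposed hook: a module advertises a metric it can compute."""
    def register(func):
        METRIC_PROVIDERS[name] = func
        return func
    return register


# What an issue-tracking module might contribute (canned numbers, no real queries).
@provides_metric("open_bugs")
def open_bugs(project, week):
    return {"views": 120, "cck": 85}.get(project, 0)


def collect(db, projects, week):
    """The metrics module's job: call every provider and store the results."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS metrics "
        "(project TEXT, week TEXT, metric TEXT, value REAL)"
    )
    for project in projects:
        for name, provider in METRIC_PROVIDERS.items():
            db.execute(
                "INSERT INTO metrics VALUES (?, ?, ?, ?)",
                (project, week, name, provider(project, week)),
            )
    db.commit()


db = sqlite3.connect(":memory:")
collect(db, ["views", "cck"], "2010-08-23")
rows = db.execute(
    "SELECT project, value FROM metrics WHERE metric = 'open_bugs' ORDER BY project"
).fetchall()
```

The provider knows how to compute its number but never touches storage or scheduling, while `collect` knows nothing about issues. That separation is the point of the design.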

Front-end display

So how would all this data be visible on drupal.org? The key metrics would be exposed via sparklines on the project page itself. Depending on how many metrics we end up with, we might need to add a tab off project pages (or use JS to show/hide the full list of metrics) so that it's possible to drill down and find as many statistics as we provide, without overwhelming the user with all of that data directly on the default project pages.

My vision is that there's an easy way to see 5-10 sparklines, each with datapoints at 1 week granularity, all vertically stacked so the weeks line up. That way, you can see how the different metrics correlate. So long as the scale of the horizontal axis is the same on all graphs (so they line up and are easy to compare), we can use a different scale for the vertical axis for each sparkline so that they all make the most visual sense (e.g. the number of releases in a week is probably going to be 0-4 most of the time, whereas the number of issue comments or lines of code added/removed could be in the hundreds or thousands). With the charts stacked so the weeks line up, you could easily see for example that one week the "number of lines of code added/removed" sparkline goes nuts, and the "number of open bugs" chart started climbing soon thereafter. ;)
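
As a rough illustration of the "shared horizontal axis, independent vertical scale" idea, here is a small Python sketch that renders text sparklines. It is only a stand-in for whatever charting library ends up being used; the metric names and numbers are made up, and it assumes every series covers the same weeks.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render one series, scaled to its own minimum and maximum."""
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero for a flat series
    return "".join(
        BARS[int((v - low) / span * (len(BARS) - 1))] for v in values
    )


def stacked(series_by_name):
    """One row per metric, one character per week, so the weeks line up vertically."""
    width = max(len(name) for name in series_by_name)
    return "\n".join(
        f"{name.ljust(width)} {sparkline(values)}"
        for name, values in series_by_name.items()
    )


# Hypothetical weekly values for two metrics with very different ranges.
weeks = {
    "releases": [0, 1, 0, 0, 2, 0],
    "issue comments": [40, 55, 300, 120, 90, 60],
}
chart = stacked(weeks)
print(chart)
```

Because each row is scaled on its own, a week with 2 releases and a week with 300 comments both reach the top of their row, which keeps the low-volume metrics readable.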

There are some metrics and statistics specifically about the issue queues that are already on drupal.org; they're just mostly hidden. For example, you can view statistics about the Drupal core issue queue. This page will probably get some much-needed attention (it hasn't been touched in years). Although none of the UI parts of this proposal are set in stone, it's likely that we'll update these per-project issue statistics pages to include more of the issue-related metrics discussed above. The idea is that we'd put the current week's raw data in the tables near the top of the page, and then provide sparklines below to see how those values have changed over the weeks of the last year.

Additionally, we're going to expose some of these metrics to Solr to make it possible to filter and sort projects by various metrics. We've already done this with the project usage data (for example, this is the default sort order when you browse module projects on drupal.org). So, in addition to being able to sort by "Most installed" (and hopefully soon, "Most downloaded"), you might also be able to sort by "Most active issue queue", "Smallest % of open bugs", "Most commit activity", etc.
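
For a sense of what a derived sort key such as "Smallest % of open bugs" involves, here is a tiny Python sketch. It has nothing to do with how Solr would actually index or sort the data; the project names, field names, and counts are hypothetical.

```python
# Hypothetical per-project bug counts for the current week.
projects = {
    "alpha": {"open_bugs": 10, "total_bugs": 200},
    "beta": {"open_bugs": 30, "total_bugs": 60},
    "gamma": {"open_bugs": 0, "total_bugs": 0},
}


def percent_open(counts):
    """Percentage of bug reports still open; 0.0 when there are no bugs at all."""
    if counts["total_bugs"] == 0:
        return 0.0
    return 100.0 * counts["open_bugs"] / counts["total_bugs"]


# Smallest percentage first.
ranked = sorted(projects, key=lambda name: percent_open(projects[name]))
```

Note the edge case: a project with no bug reports at all sorts first here, which may say more about how little it is used than about its quality. That is one reason a single sort key should be read alongside the other metrics.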

However, when it comes to visualizing the data that the project_metrics module would be providing, we've investigated a few possible ways to generate the necessary charts:

Sparkline-aware views display plugin

We could expose all of the project_metrics data to views, create views for whatever we care about, and write a Views display plugin that knows how to render our results as a sparkline. This could potentially be done as part of the Views charts project, or as its own new "Views sparkline" contribution. Either way, we'd hope to make use of the Sparkline module. The Views display plugin would simply be glue to take the data from the results of the query that Views ran and format that data in a way that the Sparkline module expects to be able to generate the sparkline itself.

Charts API

We could potentially use the Charts API to handle our charting needs. We'd still probably drive the queries via Views and have a display plugin to render the results via the Charts API. So, this is more an alternative to the Sparkline module itself -- either way we'd probably be exporting the project_metrics data to Views and writing a display plugin.

Quant

We looked at the Quant module, but it doesn't really seem like it gets us very far. It can do really complicated queries to try to figure out the historical data for you, but that's going to probably kill the d.o database server. It doesn't do any storage for you. And, we'd have to write some code to expose our data to Quant. At that point, we might as well just expose it to Views since that seems a lot more flexible and powerful.

Implementation roadmap

From now until around August 20th, we're just going to gather feedback on this proposal. To prevent the discussion from getting fragmented, please add comments directly on this post.

Starting around August 23rd, we're going to begin implementing the project_metrics module, and any changes to the rest of the Project suite, to make it possible to compute all these statistics. We expect all the backend work to take approximately two weeks.

Starting around September 7th, we're going to evaluate the front-end options and pick one to roll out on drupal.org. We're aiming to get these metrics visible on project pages and in the ApacheSolr index on the live drupal.org independent of the launch of the redesign theme (which is called "bluecheese"), unless it involves significant work in the existing drupal.org theme ("bluebeach").

Background reading

Interested readers can check out the following threads for more on this juicy topic:

Comments

What do other systems do?

dww's picture

merlinofchaos asked me in IRC "Has anyone done any research in the sense of checking out what other project/tracker systems are charting?". Good question! Not that I know of. ;)

If anyone has experience with other systems and what kind of stats and charts they provide, please share that info here (ideally with screenshots where appropriate). If anyone wants to volunteer to do some of this research, that'd be a big help. Thanks!

To be honest I really like

mikey_p's picture

To be honest I really like some aspects of what github provides, keeping in mind that they are almost 100% code-centric and not issue-aware.

mikeyp's Profile - GitHub

I like that this gives a quick overview of when it was last updated both with a date, and on the chart itself. I also like that the chart does double duty of showing my activity out of all the activity on the project. This makes it a breeze to see if something has been updated since I last looked at it, or whether or not it's under active development. The only downside is that these little overviews of a project are only available when viewing a user's page that lists all their projects. It does give a way to quickly evaluate a user's activity though, when looking through all their repos, but I wish this information was available in the same format on the project page itself.

sourceforge

tests++

grendzy's picture

I love the idea of putting simpletest metrics on the project page. Let's make it a badge of pride to have good test coverage for your module!

Counting the number of coder.module notices would be interesting to look at as well, although false positives are a concern.

false positives are a concern

mikeytown2's picture

I will agree with this. There are certain times that the coder module is wrong. It's rare but it does happen.

Example
"table names should be enclosed in {curly_brackets}" - critical

<?php
$result = db_query("SELECT * FROM pg_indexes WHERE tablename = '{%s}'", $table);
?>

In this case it's a workaround for postgres. Granted most modules don't mess with the database, so this is fine from my point of view.

The other issue has to do with update_sql():
"Use update_sql() instead of db_query() in hook_update_N()" - normal
Impossible to insert serialized data with update_sql()
Totally valid false positive.

Modules can now selectively ignore certain reports

IceCreamYou's picture

See http://drupal.org/node/311259

This is a good thing for modules that honestly have false positives, and a bad thing for modules that dishonestly abuse this to bring their error count down.

Test++

auzigog's picture

Agreed! This is key as we move into Drupal 7 and simpletests become pervasive.

metric purpose

bekasu's picture

I'd like to see a visual health status of each module as well.

However, I'd like to see this used in an active manner as well as the passive 'reporting' we are discussing here.

Ostensibly we are talking about the health of a module. I assume we are prepared to define what constitutes an unhealthy module. If the module is unhealthy, or becoming less healthy, then we should automate a notification to the module project maintainer(s). Provide recommendations for improvement. Quarantine or archive as needed.

With regard to the data to display, you should read about McCabe and/or Halstead software metrics.
Here is a pretty website with official party line: McCabe
Here is an academic paper about decision trees for software metrics: Software Metrics
And a study of software metrics (master's thesis): Software Metrics Study

These papers are fairly dry. However, they do give you the basics of what you may want to measure and why. There are plenty of similar sites/papers; but, generally these cover the basics of what you need to know.

I'll continue looking for graphics of metrics and will post back here once I find something of value.

Thanks for the input

dww's picture

I just skimmed the McCabe site. Definitely some interesting things to consider in there. However, the vast majority of that stuff isn't going to mean anything to end users of drupal.org trying to build a website. ;) We all need to remember the target audience and the purpose of computing and displaying these metrics.

The problem we're trying to solve: It's hard to find the good modules for building a Drupal site

Giving people a chart of the "Cyclomatic Complexity Metric" over time isn't really going to help them. ;) And having been an Associate Researcher in a major Computer Science department for over 10 years, I'm joking about that as one of a probably very small handful of people in the Drupal community who can read the description of what that means and have a hope of actually understanding WTF they're talking about...

Furthermore, none of these metrics (the ones I proposed above) are completely bullet-proof. I maintain some modules that haven't needed a commit in literally a few years. They're small, they do what they should, they don't have any bugs, so I don't touch them. The fact there have been no commits, no releases, and basically no issues for many months doesn't mean the module is crap and should be quarantined or that I want to start getting weekly emails from d.o about them. These metrics are all just hints, to aid in a decision, not something that can be used to blindly decide (either for the end users or for the drupal.org project system).

Furthermore, the scope of this post isn't a pie-in-the-sky wishlist of things that might be useful someday. It's a battle plan for work that the Drupal Association is sponsoring to get the redesign out the door ASAP. So, we're focused on what's going to be helpful towards solving the problem without massive delays for the redesign as a whole. Automated notifications to project maintainers and project quarantines and such are definitely out of scope for what we're proposing.

All that said, the McCabe site did remind me of some other metrics that might be relatively easy to compute/store and could be helpful in deciding if you want to use something:

Total lines of code vs. lines of comment

Generally speaking, lines of code are dangerous. ;) The more of those you see, the more complex the thing is, the more likely there are bugs, and the more work it'll be to upgrade to a new version of core. The number of comment lines is helpful to gauge if the project maintainers care about documenting their code. If so, chances are fairly good it's been thought through (at least a little), and that you're going to have a prayer if you ever need to look under the hood. If there are almost no lines of comment, chances are very good that the module was thrown together by a programmer without much discipline or care, and it's likely to be a steaming pile of bits. So, I'm going to add this to the "Optional metrics" section of the post.
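
To show how cheap a first cut of this metric could be, here is a rough Python sketch that counts code lines versus comment lines in a piece of PHP source. It is a line-based heuristic and an assumption about how one might do it, not a parser: it ignores trailing comments on code lines and comment markers inside strings, and the sample source is made up.

```python
def count_lines(source):
    """Return (code_lines, comment_lines) using a rough line-based heuristic."""
    code = comments = 0
    in_block = False  # True while inside a /* ... */ block comment
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines count as neither
        if in_block:
            comments += 1
            in_block = "*/" not in line
        elif line.startswith("/*"):
            comments += 1
            in_block = "*/" not in line
        elif line.startswith(("//", "#")):
            comments += 1
        else:
            code += 1
    return code, comments


# Hypothetical module source used only to exercise the counter.
sample = """<?php
/**
 * Implements hook_menu().
 */
function example_menu() {
  // Nothing to register yet.
  return array();
}
"""
code_lines, comment_lines = count_lines(sample)
```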

Thanks again for taking the time to read and comment on this post!

Cheers,
-Derek

Lunch at the University

bekasu's picture

Derek,
Could you tell I'd just spent the day with a group of Ph.D's?

So rather than 'health', you are really looking to provide a visual aid which gives a person an easy way to determine their 'comfort level' with the module.

One measure (e.g., number of comment lines) would be important for a person looking to enhance or modify a module; whereas, another measure (e.g., bugs resolved) would be important for a new drupal user looking to extend their website.

In your example of the small, aging module with minimal changes or bugs, the visual that might be most helpful is number of downloads since the code is stable and reliable.

I agree the links I provided are less relevant. They are much more targeted to development of software and software projects.

Glad you could find some value in the website link.

bekasu

I quite like punch-card

lukus's picture

I quite like punch-card visualisations: http://raphaeljs.com/github/dots.html

I feel that project metrics

ChrisBryant's picture

I feel that project metrics like this are a really important area for the community. I hope to plug in and help with some areas of it. In case anyone hasn't seen it, check out the little stats/charts that Damz did:

http://damz.org/drupal-7/

These are the perfect little visualizations that will be great to have spread over Drupal.org.

about commit activity

marvil07's picture

For commit activity, the versioncontrol module already stores the needed data:
- modified lines at versioncontrol_source_items table, and
- the number of commits could be a db_query('select count(vc_op_id) from {versioncontrol_operations} where type = %d', VERSIONCONTROL_OPERATION_COMMIT).

Not sure if this is the right moment (seems like this is going to happen before the git migration), so just posting to communicate it :-)

DBA perspectives on project health...

jdonson's picture

Hi Everyone. Great set of D7 entries...

** SCM commit quality is driven by exhaustive unit testing. **

The integration of SimpleTest into the continuous integration (test and) commit scheme is a generic code project health opportunity for the Drupal community. http://drupal.org/project/simpletest

Standardizing simpletest design and testing practices, as well as commit log entries seems like important stuff!

Meanwhile, leveraging a common API for SCM across implementations and communities is a serious D7 crisis/opportunity.

( I believe this was what marvil07 was alluding to in his post above: http://drupal.org/project/versioncontrol_project )

Since there has been voiced interest in looking around, the following open api for SCM seems quite relevant:

http://wiki.hudson-ci.org/display/HUDSON/Plugins#Plugins-Sourcecodemanag...

Mostly, I seek discussions about criteria for code project quality that is not myopic.

Otherwise, I expect that 1000 commits per day with GB ascii files of bunk will be a winning metric. :-)

Similarly, project download count is loosely related to actual adoption and use.

In response to the other entries above...

Forgive my shameless ignorance, but Drupal Code Project Health does not occur to me as a standard set of metrics and reports.

A set of standard SCM commit reports is a nice idea too, but one size never fits all.

Also, Drupal.org code projects and Drupal web site projects are easily confused in this set of discussions.

The questions of what data to collect are far more central than how to report effectively, which is always hard to predict.

** Therefore, we might take a look at what is being collected and what will enrich that set of reporting options. **

Jeremy Donson
Database and Systems Engineer
New York City

Online analytics team and proposal for drupal.org

goalgorilla's picture

Repost of: http://groups.drupal.org/node/60828 (hope to find some enthusiastic users here!)

On the SEO project team for the drupal.org redesign we recently discussed doing more with analytics measurement of SEO goals and objectives.

After analysing the Drupal.org Google Analytics data we found no goals are specified currently. Neither does there seem to be any governance or methodology.

To optimize drupal.org we need to take analytics seriously. Therefore we have created a proposal for integrating online analytics into Drupal.org and into the Drupal community. Not just for SEO; we need to improve user experience and gain insights.

For example, Robert Douglass spoke the other day at the DrupalJam Amsterdam of improved performance using Drupal and Solr. We agreed it would be great to back-up his story with actual user data.

We are looking for experts who will support us in setting up this analytics team. Please have a look at the Google Doc and reply on this discussion if you want to be a part (please mention your expertise).
http://docs.google.com/Doc?docid=0AUa5jgV2VYBPZGZwemN2NmJfMTVjbmtydjNjNA...

After finishing our proposal we will send it to the Drupal Association to approve implementation.

Many thanks in advance and looking forward to your replies!

Taco

I would like to join the

jmesam's picture

I would like to join the team.

Metrics mentioned in the Git discussion

Liam McDermott's picture

Many advocated increased use of metrics in the discussion: Source code, projects, and commit access in a post-Git world. Here are some highlights of the metrics people thought up:

  1. Use metrics so security team does not have to cover all contrib modules: 'Installed on > 1,000 sites', 'Passes Drupal coding standards checks' and 'Project has tests' for example*.
  2. "Badges" on project nodes to denote test coverage, coder module compliance, etc. as a quality indicator. There's a bit of discussion around this, search for 'badges' on: http://groups.drupal.org/node/114264 | Some others that were mentioned, which sound like badge-material:
    • This project has received overwhelming community approval.
    • Passes drupal coding standards.
    • Hasn't had a security bug reported in X weeks.
    • Project has recently been audited by a member of the security team.
    • Has opted-in for a code review.
    • Has been through a code review.
  3. This project has XXX,XXX tests and XX% test coverage, with spark line graph showing how that's changed over time.
  4. Whether the project has gone through some sort of code review, see: http://groups.drupal.org/node/114264#comment-367799
  5. A numeric rank (colloquially named ModuleRank™) for ordering module search results and listings, which would take into account the metrics mentioned already and: whether the module has a stable release, passes criteria needed to be supported by the security team, passes coder.module checks. A number that will show both project health and give a metric by which better modules can 'bubble-up' in search listings.
  6. This project was last audited by a security team member on xx/xx/xx.
  7. Coder module compliance was mentioned several times, including Coder and Coder Tough Love tests for all releases, with results near the download links and requiring maintainers to annotate issues found by Coder.
  8. Some sort of performance reviews.
  9. Issue queue stats (which has been discussed elsewhere).
  10. Reviewed by the community - module contributors can do quick reviews of/flag project releases made by other contributors (a kind of speedy peer-review).

    We could have a "Modules I reviewed" page as part of their profile, so if I want to find out what Views related modules merlinofchaos has reviewed I can find them easily.

    See: http://groups.drupal.org/node/114264#comment-367159

Hopefully this information is relevant to the group's interests. :)

* Apologies if this really belongs to another discussion.

Very useful and relevant. Two

greggles's picture

Very useful and relevant. Two quibbles.

  • Hasn't had a security bug reported in X weeks.

I don't think this is really a valuable metric. It has more to do with the number of people looking at the code to find vulnerabilities than it does with the actual security of the code.

  • Project has recently been audited by a member of the security team.

The security team doesn't review modules. It's unlikely they would start and if they did it's unlikely they would want to have their name as having approved a module because that makes them somewhat responsible if an issue is later found in a module.

I get it that people want to try to quantify the security of the code but I think other metrics (passing code style, being used a lot) are better proxies for the security of a module.

Module security reviews?

deviantintegral's picture

In the past, I've run modules by someone on the security team (damz and dww come to mind) before adding them to the vendor branch. It was never very formal (usually just a glance and OK on IRC), but I think it's a good idea to at least do a basic review before adding a new module. That way, we're not deploying code that has obvious novice security issues in it.

sure, but

greggles's picture

There's a long road between "reviewed by someone on the security team to an unknown extent and deployed on drupal.org" and "security team reviewed this and is happy with it."

I take both your points. To

Liam McDermott's picture

I take both your points. To an extent I had the same thoughts writing this up, but since these things were mentioned a few times I thought they should be included.

Thanks, and sorry if these are myths about the security team you have to debunk all the time. :)

Could this be qualified as a student project for GSoC?

syntropy's picture

The drupal.org redesign project has been deprecated since the new site launched and, obviously, the project metrics were not ready for prime time and thus were not included in the new site.

The project module commit log indicates that hunmonk has started to work on this long-wanted feature. I would be very interested in participating in this effort and preparing a Google Summer of Code proposal that would focus on getting the project metrics working.

My question is what do you (the project module maintainers and others interested) think about a GSoC around the project metrics project? Do you think it would be interesting?

Honestly, this isn't that far from being completed

chrisstrahl's picture

During the redesign, hunmonk actually showed off a lot of working code for this sub project. Unfortunately, per syntropy's comment, we didn't release it with the redesign launch because there was still more work to do. However, much of the actual leg work on this has been done, and now it's up to others to carry it through to completion. I believe most of what is left to do is to clean up / optimize some of the code that was written and develop a way of visualizing the results from the code in some meaningful way (hunmonk demoed it to us using the Google Chart API). The demo seemed to be pretty fully featured and good to go - with the obvious needs now being around answering the question "what do we actually want to show on drupal.org?"

Honestly, this was one of those things that I was really sad didn't make the cut for the redesign launch as it was pretty awesome to look at during the demo, and it would be great to have.

status update

hunmonk's picture

at drupalcon chicago, bdragon, dww and i worked on the last cleanups/testing of sampler module (the module that will handle metrics collection and storage), and the four metrics that were slated to be rolled out for the redesign. all this is ready to deploy on d.o as soon as we can coordinate with the infrastructure team on exactly how they want it done.

mind you this deployment only involves the collection and storage of the metrics -- display will still need to be worked out. but it's still a huge step in the right direction.

I'd be happy to help roll

basic's picture

I'd be happy to help roll this out on the infrastructure side. What will rolling this out require?

Flot?

Jelle_S's picture

I'm probably biased, but why wasn't the flot module considered for the charting part?

The Flot module was

mikey_p's picture

The Flot module was considered, but found to not contain the right feature set for our data. However, we will be using the Flot library with the Views Sparkline module. Views Sparkline uses the flot module to provide the capability to render Sampler data in a meaningful way.
