Automation with Project Applications Scraper and Goutte

klausi's picture

The recent little review bonus shitstorm has been a bit of a wake-up call for me. I thought a bit about our goals for the project application process and of course we all know what we want: automation, automation, automation. Last summer I thought that we should postpone that until drupal.org is migrated to Drupal 7, so as not to waste any time on Drupal 6 coding, but unfortunately the drupal.org upgrade is taking longer than expected. So I did not work much on automation (except for a few coder sniffer and pareview.sh improvements) and handled the issue queue manually.

You know the good old Drupal saying: talk is silver, code is gold. I want to get back into automation efforts, so I started to implement Project Applications Scraper. It is a web crawler script that uses Goutte to post comments to the drupal.org issue queue. I have it deployed as a cronjob on a server and it runs every 6 hours. Currently it performs the following tasks:

  • Set a "needs review" application to "needs work" if a link to the sandbox or the Git repository could not be extracted from the issue summary.
  • Post a hint to the review bonus program to "needs review" applications that did not receive such a hint yet.
  • Set a "needs review" application to "needs work" if there has not been any automated review link to ventral.org posted and the result of pareview.sh exceeds a threshold of 30 lines.
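To make the rules above concrete, here is a minimal sketch of the decision logic in Python. It is not the scraper's actual code (that is PHP on top of Goutte); the URL patterns, the function names and the way comments and review output are passed in are all assumptions made for illustration. Only the 30-line threshold and the ventral.org check come from the list above.

```python
import re
from typing import List, Optional

# Assumed URL shapes for sandbox pages and Git repositories on drupal.org.
SANDBOX_RE = re.compile(r"https?://(?:www\.)?drupal\.org/sandbox/[\w.-]+/\d+")
GIT_RE = re.compile(r"git\.drupal\.org[:/]sandbox/[\w.-]+/\d+\.git")

# Threshold named in the post: more than 30 lines of pareview.sh output.
REVIEW_LINE_THRESHOLD = 30


def find_repository_link(summary: str) -> Optional[str]:
    """Return the first sandbox or Git link in the issue summary, if any."""
    for pattern in (SANDBOX_RE, GIT_RE):
        match = pattern.search(summary)
        if match:
            return match.group(0)
    return None


def next_status(summary: str, comments: List[str], review_output: str) -> str:
    """Decide the new status of an application that is in "needs review"."""
    # Rule 1: no extractable link to the sandbox or the repository.
    if find_repository_link(summary) is None:
        return "needs work"
    # Rule 3: no ventral.org review link posted yet and too much
    # pareview.sh output.
    has_review_link = any("ventral.org" in comment for comment in comments)
    too_long = len(review_output.splitlines()) > REVIEW_LINE_THRESHOLD
    if not has_review_link and too_long:
        return "needs work"
    return "needs review"
```

The review bonus hint (the second rule) would be a similar check over the comments, posting the hint only if no earlier comment already contains it.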

Further planned tasks:

  • Check if an applicant has multiple applications and close all but one as duplicates.
  • Automatically close "needs work" applications after 10 weeks with no response.
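The two planned tasks could look roughly like the following Python sketch. Again, this is only an illustration and not the scraper's code: the Application record, the function names and the choice to keep the application with the lowest issue ID are assumptions. The post only says "close all but one" and "10 weeks with no response".

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Application(NamedTuple):
    """Hypothetical record for one open project application issue."""
    issue_id: int
    applicant: str
    status: str
    last_activity: datetime


# From the post: close "needs work" applications after 10 weeks of silence.
STALE_AFTER = timedelta(weeks=10)


def find_duplicates(open_apps: List[Application]) -> List[int]:
    """Return issue IDs to close as duplicates.

    Assumption: the application with the lowest issue ID per applicant
    is the one that stays open.
    """
    by_applicant = defaultdict(list)
    for app in open_apps:
        by_applicant[app.applicant].append(app.issue_id)
    duplicates = []
    for issue_ids in by_applicant.values():
        duplicates.extend(sorted(issue_ids)[1:])
    return sorted(duplicates)


def stale_applications(open_apps: List[Application], now: datetime) -> List[int]:
    """Return IDs of "needs work" applications idle for more than 10 weeks."""
    return [
        app.issue_id
        for app in open_apps
        if app.status == "needs work" and now - app.last_activity > STALE_AFTER
    ]
```

Passing the current time in as a parameter keeps both checks easy to test without touching the live issue queue.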

The scraper currently operates under my user account. Do you think I should create a separate robot account? And should an automated message contain a hint like "this was an automated post by Project Applications Scraper"?

I know the sandbox currently lacks documentation on how to install and set up the script; that will follow later as it matures. Anyway, feedback welcome!

Comments

greggles's picture

Amazing work.

Do you think I should create a separate robot account?

Yes, please.

And should an automated message contain a hint like "this was an automated post by Project Applications Scraper"?

Yes, please, with a link somewhere for more information.

klausi's picture

Done, created the PA robot user account: http://drupal.org/user/2515270 . I also added an automated message hint and I implemented multiple application detection.

jthorson's picture

Great work ... but I wonder about the use of an external scraper, when a large number of these features could be built right into a custom module directly on Drupal.org.

Checks such as the 'duplicate application' check should occur at validation time, preventing the opening of the duplicate module in the first place. Auto-closing of applications could easily be built into a cron run. Auto-followup comments can be performed through project* ... I'd encourage a look at http://drupal.org/sandbox/jthorson/1367220, which has the first feature already built, and enough information regarding the relevant data structures to make the next two trivial to implement.

External tools such as ventral.org are great ... but our goal should be to eventually internalize those capabilities, avoiding the silo scenario where only one individual has access to the tool and knows how it actually operates. By bringing the tools in-house, we can also formalize the support and maintenance of them via the infrastructure and/or drupal.org teams.

klausi's picture

I know your sandbox and that is the way to go in the long term, but right now we absolutely cannot add more features to the Drupal 6 installation on drupal.org. That would increase the upgrade burden to Drupal 7 even more.

Besides that, I do not want to touch any Drupal 6 code when you can do such beautiful things with Composer and Goutte.

So let's re-visit the possible improvements to project.module and friends once drupal.org is on Drupal 7. I want something working right now and I don't want to wait, so the scraper seems to be an appropriate short term solution to me.

jthorson's picture

I guess we disagree on the 'wait for D7 d.o' front. Perhaps I'm somewhat jaded, given that almost everything I have been trying to accomplish was already stalled on the drupal.org D7 port for nearly a year before the port itself stalled out.

The projectapps integration is a relatively straightforward and self-contained thing ... initial development would only take a few hours, and porting to D7 about half that (once the project* interface changes are fully defined). In the end, I see the extra hour or two of porting work being more than worth it, when the alternative is postponing any progress on drupal.org until after a yet-undefined target date, which in all probability is still months away.

klausi's picture

The Drupal Association seems to agree in the recent blog post that Drupal 6 on drupal.org is now frozen. It does not contain a time frame yet, but I expect that the Drupal 7 port might not get finished in 2013. So my hacky scraper scripts will serve us well in the meantime; let's discuss server side implementation in Drupal 7 again in 2014.

jthorson's picture

I've asked a few folks, and the expected timeframe would be quicker than that ... there's a proposed schedule floating around that suggests between 4 and 8 months at the extremes. And the blog post defines the scope of what the DA and D.O. porting team will be doing. It doesn't preclude others from developing new features; or having them ready as fast-follow additions once the migration is complete ... I'd argue that now is the time to discuss things, so that we can have them developed and ready for the fall ... pushing off discussion until 2014 likely means that nothing is implemented for another 6 months or so after that. :)

fuzzy76's picture

Really cool script! I have been thinking of creating some sort of statistics dashboard for the Project Applications queue, and Goutte seems to be great for that. :)

fuzzy76's picture

Is autosubmitting new applications to http://ventral.org/PAReview or running the same kind of checks locally an option?

I totally agree on internalizing this long-term, but for now I think any hack that helps filter the queue is of help.