Introduction to automation for content checking

kimmel's picture

This message has been cross-posted to Drupal.org Improvements, Documentation Team, and Drupal.org Testing Infrastructure.

tl;dr

I want to use automation to improve d.o content quality. I do not have access to do this and I have been unable to get someone with access interested enough to get this project moving.

The long version

My vision for improving the Drupal Documentation and drupal.org can be broken into three equally important ideas:

  • Incremental improvements - Small, frequent improvements keep the audience happier than long waits for major changes. People have short attention spans, so to keep mindshare we have to pop up on the radar more often.

  • Exploring the solution space - Encourage people to think outside the box and test new things out. Make it easier for people to tinker with the docs as a whole. The continued growth of the site requires new scalable approaches to solving problems.

  • Not wasting the volunteer’s time - “Producing Open Source Software” by Karl Fogel states: “Try not to let humans do what machines could do instead. As a rule of thumb, automating a common task is worth at least ten times the effort a developer would spend doing that task manually one time. For very frequent or very complex tasks, that ratio could easily go up to twenty or even higher.” I firmly believe there are more important tasks, still approachable for beginners, than fixing basic spelling and grammar mistakes. Everyone’s time is important, and trivial tasks that can be automated away should be. Automation gives a more consistent level of quality, since a computer applies the same checks every time, and it frees volunteers for tasks that computers cannot simply solve, such as evaluating style and determining whether information is missing.

With those three ideas in mind, I started thinking about what I could personally do to achieve these goals. I came up with the following tasks:

  • Create an automated process for checking page titles on drupal.org - https://drupal.org/node/1441074 . This should simply be a page listing all the page titles that still need to be fixed. Having users hunt around for stuff that is broken is a waste of time.

  • Have an automated test check the encoding of all the Drupal docs to make sure everything is uniform. This can also go into character usage and proper font selection for other mediums.

  • Do some analytics on which docs get viewed the most. Hold those docs to a higher quality standard by some means other than simply locking the page and making users bounce around the issue queue to get it updated. That always leads to delays and volunteers losing interest.

  • Increase the use of visual aids in documentation. There is an entire field of information visualization research we can use to better communicate information. Plain text is not the best method in all cases for everyone.

  • Start a process of automated spell checking, grammar checking, and style guide checking. This guarantees a baseline level of quality.

  • Automate the collection of comments that should be approved for removal. Basically find the “me too” and spammy comments so a person can quickly verify them.
  • https://drupal.org/node/1426262 - Find comments that should be in the documentation and set them up for beginners to work with.
  • Software has been built to check comments and has been tested on YouTube comments.

  • Automate checking the reading level, verb tenses, and other key metrics that determine quality writing. These natural language processing (NLP) tasks are solved by existing, well-tested libraries. None of this is even remotely groundbreaking.
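As a rough illustration of the last item, here is a minimal sketch of one reading-level metric, the Flesch Reading Ease score. The function names are hypothetical, and the syllable counter is a crude vowel-run heuristic standing in for a real NLP library, so treat the scores as approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count runs of consecutive vowels, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Run over a database dump, a score like this could flag the pages that read hardest so a person can look at them first.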

This is not the first time I have brought up automation as a solution for maintaining and improving the Drupal content. I tried a few times last year, but it always dead-ended at the same spot, which is where I find myself now. I have a well-thought-out plan that I can execute and that will help the community. It comes down to access. I need the correct kind of database access, or database dumps, to run tests and make corrections. Having a bot do it through the web interface will just slow d.o down.

Furthermore, this should help reduce the number of items in the issue queue.

Comments

robbt's picture

I think the idea of analyzing the Drupal Documentation and automating as much of the quality control as possible is a good one.

I would think that the place to start is the idea of doing analytics to find the most viewed docs and improving quality control. What specific ideas do people have for alternatives to locking the pages? Certainly, putting things into an issue queue creates a lot of obstacles.

How modular are the content editor permissions on drupal.org? Obviously, preventing spammers from defacing critical documentation is important, but it also makes sense to keep the documentation improving at all times.

As for automated spell checking, font checking, and the like, are there any Drupal modules that could do this?

How would we accomplish this, and what level of access would be needed to implement this solution?

It may not be the best idea to simply run a spell-checking script that checks and corrects mistakes, as it may end up doing some harm as well as making improvements. But if we could analyze the docs and highlight suggested changes to users or editors, this might help improve the state of the docs quite a bit. Even a module that simply read through the docs and created a list of the pages most in need of improvement would help.
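A minimal sketch of that report-only approach, assuming the page text is already available as a mapping of title to body. The misspelling table and function names are hypothetical; a real checker would use a full dictionary.

```python
import re

# Hypothetical sample of common misspellings, for illustration only.
COMMON_MISSPELLINGS = {"teh": "the", "recieve": "receive", "seperate": "separate"}


def suggest_fixes(text):
    """Return (found, suggestion) pairs without modifying the text."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [(w, COMMON_MISSPELLINGS[w]) for w in words if w in COMMON_MISSPELLINGS]


def pages_needing_attention(pages):
    """Rank page titles by number of suggestions, most first; skip clean pages."""
    report = {title: suggest_fixes(body) for title, body in pages.items()}
    flagged = {title: fixes for title, fixes in report.items() if fixes}
    return sorted(flagged.items(), key=lambda item: len(item[1]), reverse=True)
```

Nothing is corrected automatically here; the output is only a ranked list that an editor can review and act on.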

Documentation is always one of the most challenging aspects of an open-source project, because many coders don't spend the time to fully document their modules, and there are a lot of gaps in the form of missing and incomplete documentation. Perhaps a rating system for modules and their documentation, something people could strive for, would help create an incentive and raise awareness about the need for documentation. For example, each project could come with a standard baseline set of documentation, and the parts that are filled in would determine a percentage score. Then you could also implement sorting modules by the completeness of their documentation. Just an idea.
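The scoring idea could look something like the sketch below. The baseline section names are an assumption made up for illustration; nothing in this thread defines what the standard set should be.

```python
# Hypothetical baseline sections; the real list would need to be agreed on.
BASELINE_SECTIONS = ("overview", "installation", "configuration", "usage")


def completeness(doc_sections):
    """Percentage of baseline sections that contain non-blank text."""
    filled = sum(1 for name in BASELINE_SECTIONS if doc_sections.get(name, "").strip())
    return 100.0 * filled / len(BASELINE_SECTIONS)


def sort_by_completeness(projects):
    """Return (project, score) pairs, most complete first."""
    scores = [(name, completeness(sections)) for name, sections in projects.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A section that exists but is blank counts as missing, so an empty template does not earn any credit.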

I just wanted to chime in and offer some thoughts on this. I think that figuring out ways to automate the process, and encouraging the original authors of documentation to do as thorough a job as possible in the first place, will potentially reduce the workload for documentation moderators and make things work more smoothly. A lot of module developers don't document their modules on drupal.org at all; you need to read a README or INSTALL file inside the module zip. While this is fine for savvy users, it would be beneficial, in my opinion, to encourage developers to also add this level of documentation to drupal.org in some manner.

cweagans's picture

Great idea! Check out http://drupal.org/make-drupalorg-awesome for documentation on making changes to Drupal.org. Basically, you just need to request a dev site, write your code, and get it reviewed and deployed. Additionally, you can download the drupalorg_testing profile as an example of how Drupal.org is configured. You should be able to write your content automation stuff as a module.

Right now, new features are not being accepted unless there's also a Drupal 7 version, so you might be better off writing for D7 and waiting till Drupal.org gets upgraded (in progress) to deploy your new code.

--
Cameron Eagans
http://cweagans.net

kimmel's picture

Following the instructions here https://drupal.org/node/1018084
I requested a dev site https://drupal.org/node/1409706 12 months ago but could not get any movement on the issue.

cweagans's picture

Yeah, you'll need significantly more detail on that issue. For instance, what is your planned approach? Will it include installing new modules on Drupal.org? What sort of new loads will it introduce on the infrastructure (content scanning cron jobs? new code that has to be executed when people comment on things? etc.)? How will this affect users of Drupal.org, if at all?

You have a very good amount of information in this GDO post. You should copy/move this info into your development site issue.

You could also consider asking in #drupal-infrastructure. drumm may be able to help you.

However, I do think that you could develop this as a standalone module on a local development site on your machine and accomplish the same goal.

--
Cameron Eagans
http://cweagans.net

giorgio79's picture

I am developing a grammar checker service based on Drupal with an API at http://proofreadbot.com :)

You can see a sample report at http://proofreadbot.com/proofreading/132

I am currently working on making it super easy to contribute code and add new checks. I would love to have Proofread Bot get involved with d.o content testing.

WordPress blogs seem to be loving it: http://wordpress.org/extend/plugins/proofread-bot/stats/

****Me and Drupal :)****
Clickbank IPN - Sell online or create a membership site with the largest affiliate network!
Review Critical - One of my sites
