Rework Memetracker/CRE modules and write Profile API

kyle_mathews's picture

[It looks like I'm a bit late to the party :) But please, anyone who is interested in this functionality, review and rate this project so it'll be ready to roll come next spring. . . thanks!]

All organizations, large and small, have a vital need to deliver relevant and timely information to their members. My project will make it possible for organizations to easily meet this need. I will improve two Drupal modules, write a new module, and write documentation so that organizations can easily add sophisticated news aggregation and recommendation tools to their Drupal websites.

The two modules I will improve are Memetracker and Content Recommendation Engine. I will write a new module, Profile API, which can be used to create profiles that will be used in content recommendation.

I wrote Memetracker as part of the 2008 Google Summer of Code. The Memetracker module uses machine learning algorithms to intelligently filter and group all types of content. The module's purpose is to find and display to a community in real time the most interesting conversations and memes on relevant topics as they emerge.

My goal for the Memetracker module is for it to emulate the functionality of successful commercial memetrackers such as Techmeme, Google News, Tailrank, and Megite. I want it to be a robust open-source implementation of memetracking technology that can be easily plugged into Drupal-based community sites.

The Content Recommendation Engine module is designed to provide personalized content recommendation. It learns what types of content individuals are interested in and recommends new content as it comes in.

Both modules are powerful ideas but need quite a bit of work to be usable in real-life situations.

Some specific things I'd like to change:

CRE module:

  • Make recommendation more flexible. Right now it only considers data from the VotingAPI, which is unnecessarily limiting. CRE's goal is to learn what kind of content an individual likes and then to suggest additional content it guesses might be interesting. CRE should be able to learn from things such as which authors or feeds a person likes, which topics they're interested in, which articles are interesting to other people similar to them, etc. It will also be integrated with the Profile API to further refine content recommendation. A simple example: if Bob always clicks on articles about Mac computers, then any new articles on Mac computers will automatically be added to his feed.

  • Make CRE much more robust under heavy loads. CRE doesn't scale well.

  • Integrate with Views: I think both CRE and Memetracker are natural fits for integrating with Views. It would make these modules easier for new users to learn and would immediately add a great deal of flexibility to CRE/Memetracker in creating custom outputs. This past summer I was stymied when creating the admin interface as I contemplated the hundreds of potential memetracker types I'd have to support with my UI. In Views, these custom memetrackers would be very easy to create. A memetracker that includes two node types written by three authors as well as two different feeds would be easy to build in Views, but it would be difficult to create a custom admin UI to do the same thing.
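The multi-signal learning described in the first CRE item above could be sketched roughly as follows. This is a minimal illustration in Python, not CRE's actual code: the `InterestProfile` class, the signal names, the feature strings, the weights, and the threshold are all hypothetical, chosen only to show how clicks, votes, and questionnaire answers could feed one per-user profile.

```python
from collections import defaultdict


class InterestProfile:
    """Hypothetical per-user profile; not part of CRE or VotingAPI."""

    # Assumed weights: how strongly each kind of signal counts.
    SIGNAL_WEIGHTS = {"click": 1.0, "vote": 2.0, "questionnaire": 3.0}

    def __init__(self):
        # Maps a feature such as "topic:mac" or "author:alice" to a score.
        self.scores = defaultdict(float)

    def record(self, signal, features):
        """Add one observed signal (e.g. a click) on content with these features."""
        weight = self.SIGNAL_WEIGHTS[signal]
        for feature in features:
            self.scores[feature] += weight

    def score(self, features):
        """Sum the learned scores for the features of a new piece of content."""
        return sum(self.scores.get(f, 0.0) for f in features)


def recommend(profile, articles, threshold=2.0):
    """Return titles of articles whose score meets the (assumed) threshold."""
    return [title for title, features in articles
            if profile.score(features) >= threshold]


# Bob clicks on two articles about Mac computers.
bob = InterestProfile()
bob.record("click", ["topic:mac", "author:alice"])
bob.record("click", ["topic:mac", "feed:tech-news"])

new_articles = [
    ("New Mac laptops announced", ["topic:mac", "feed:tech-news"]),
    ("Tomato prices fall", ["topic:agriculture"]),
]
recommended = recommend(bob, new_articles)
print(recommended)
```

Because every signal is reduced to weighted features, adding a new data source (for example, answers from the Profile API) would only require a new entry in the weights table rather than a change to the scoring logic.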

Memetracker:

  • Integrate with Views

  • Turn memes into nodes

  • Create an "archive view" so you can view memebrowsing pages from the past (for example, what was the hot news on December 15, 2007?).

  • Add a classifier algorithm to Memetracker which will automatically place incoming articles into separate memetrackers. For example, say you have an agriculture site and wish to have memetrackers on vegetables, fruit, fertilization, and the farming business environment. Any news source that you aggregate for these four memetrackers would contain articles belonging to more than one of them. I would add an interface with which you could train the classifier on which articles belong in which of the four memetrackers. (See this issue: http://drupal.org/node/292561)

  • Import and display images for memes (See this issue: http://drupal.org/node/283752).

  • Simplify the Memetracker install process (currently it requires installation of several Python libraries; ideally I would rewrite the Python code in C or PHP and ship that in the Memetracker tarball).

  • Detect interlinking between content

  • Add hooks so other modules can easily write their own rules for filtering and sorting memes. For example, a company might want its intranet to always display nodes of type announcement at the top of the page for a certain time period. Or a branded news site might want news from its own companies to be ranked higher than it would be otherwise.

  • Fill out testing coverage.
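The classifier item above does not name an algorithm. As one possible approach, the sketch below uses a small multinomial naive Bayes text classifier with add-one smoothing, trained on articles an admin has assigned to memetrackers. It is an assumption for illustration only: the `MemetrackerClassifier` class, its methods, and the training sentences are hypothetical and are not taken from the Memetracker code.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class MemetrackerClassifier:
    """Hypothetical naive Bayes router; not Memetracker's actual code."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, label, text):
        """Record one article that an admin assigned to a memetracker."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples for the four agriculture memetrackers.
clf = MemetrackerClassifier()
clf.train("vegetables", "carrot and lettuce harvest yields rise")
clf.train("fruit", "apple and pear orchards expect a large crop")
clf.train("fertilization", "nitrogen fertilizer prices and soil nutrients")
clf.train("business", "farm loans, subsidies and commodity markets")

label = clf.classify("apple orchards report record harvest")
print(label)
```

A real training interface would collect far more examples per memetracker than this; the point of the sketch is only that each admin decision becomes one `train` call and each incoming article one `classify` call.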

Profile API module:

The Profile API module would be a small module that would allow the site’s admins to create “questionnaires” for users. Their answers would be fed into the Content Recommendation Engine as a starting point for their content interest profile.

A music site might ask a question like:

Choose between the following:

  1. Wu-Tang Clan
  2. Rolling Stones
  3. King Crimson
  4. Moby
  5. Céline Dion

Using answers to questions like this, the system would find users with similar profiles. This means, for example, that the site can try to send personalized content to new users right after registration.
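One simple way to find users with similar profiles is to compare their sets of answers with Jaccard similarity. The sketch below is an assumption for illustration, not a committed design: the function names, the user names, and the second "format" question are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(new_user, existing, top_n=2):
    """Rank existing users by how much their answers overlap with the new user's."""
    ranked = sorted(existing.items(),
                    key=lambda item: jaccard(new_user, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical questionnaire answers, stored as (question, answer) pairs.
existing_users = {
    "alice": {("band", "King Crimson"), ("format", "vinyl")},
    "bob": {("band", "Moby"), ("format", "streaming")},
    "carol": {("band", "King Crimson"), ("format", "streaming")},
}
new_user = {("band", "King Crimson"), ("format", "vinyl")}

neighbours = most_similar(new_user, existing_users)
print(neighbours)
```

Content that the nearest neighbours liked could then seed the new user's recommendations immediately after registration, before the system has observed any of the new user's own behaviour.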

How does your proposal meet the stated goals of the Knight Drupal Initiative program?: 

There are many tools available to communities to aggregate and distribute information. What's missing are open source tools which leverage not just human intelligence to filter content but also artificial intelligence.

There is far too much information generated daily for any person or organization to sort through manually. These automated tools can be thought of as pre-processors that improve the signal-to-noise ratio, reducing the stress people endure trying to follow the news. With the noise filtered out, important news is much more likely to be identified and acted upon.

My project will most directly meet the third goal of KDI, "To encourage people to improve their communities by supporting the free exchange of information and ideas." I believe these tools will become the basic building blocks of a rich flowering of content aggregation / filtering web applications.

How long will your project take to complete?: 

I estimate the whole project will take 3 months.

1 day to port CRE to Drupal 6.
2 weeks to rewrite the architecture of CRE so it's not solely dependent on data from VotingAPI.
1 week to investigate and fix bottlenecks in CRE.
1 week to write and test the Profile API.
1 week to investigate and test new algorithms for Memetracker.
2 weeks to integrate CRE and Memetracker with Views.
1 week to remove the Memetracker Python dependency.
1 week to write code to detect interlinking between content.
2 weeks to complete other tasks on the Memetracker module.
1 week to write documentation.

How will you implement and distribute your project?: 

I (Kyle Mathews) will be the main programmer on the project. Depending on my time availability when (if?) this project is approved, I may ask other Drupal community members to take on parts of the project.

The three modules will be hosted on Drupal.org.

What is your total budget estimate and how much funding are you requesting: 
  • $40,000 for 1 programmer at $100 / hour
  • $40,000 total request

Comments

Very interesting project, a

patchak's picture

Very interesting project, a great addition to Drupal, exactly something that I was needing! I hope this project gets the attention it deserves cause it's a really important aspect to take Drupal to new levels IMO!

Patchak

Cool project

nickvidal's picture

Hi Kyle,

I think your project is really nice! It explores key areas in CS and it helps people find information more easily!

Gave it a 5!

Best regards,
Nick

+1

julma's picture

Really interesting, I hope your project will be selected.

++1

sumitk's picture

Loved your proposal
Wonderful ideas to implement. I am very interested in this. :)

cheers!!
sumit kataria
www.sumitk.net

1++

dejb's picture

This would be awesome. Exactly what I'm looking for.

can it be done?

lasirena42's picture

The idea is good, but I'm concerned about whether this can actually be done. Both Memetracker and CRE were supported by Google SoC for 3 months each, but the code was still buggy. I was quite frustrated when I tried to use them on my site. What's more, the proposal claims to use complex machine learning algorithms, which are not easy. Unless the author can demonstrate that he has strong enough qualifications to do the job, or that he can build a team that has the required expertise, I'm not convinced that this proposal would turn into a successful Drupal module.

$40,000 is a lot of money. I'd rather see it make some real contributions to the Drupal community, rather than taking the risk of a failed project.