Memetracker module proposal round 2

Posted by kyle_mathews on March 30, 2008 at 7:26am

Round 1 of this proposal can be found here.

I'm putting up a rough draft of my proposal for review. This, I repeat, is a rough draft, so there are some obvious flaws and missing pieces.

What I'm hoping from your review is that you'll read through my proposal (quickly, as it's long) and tell me what it is missing and what unanswered questions you have. I wish the proposal were in better form so as to invite more thorough reviews, but it's late, I'm tired (I've been writing for 9ish hours this afternoon / evening), and I have essentially no time to write tomorrow. I'll update this page early Monday when I have some more time.

Having said that, enjoy the proposal!

Executive Summary:

I propose to write two modules for Drupal as part of Google Summer of Code, one called memetracker and the other called machinelearningapi. The memetracker module will use the algorithms in the machinelearningapi to intelligently filter and group content from designated content sources, both internal and external. The module's purpose is to find and display to a community, in real time, the most interesting conversations and memes within the community as they emerge.

My project will emulate functionality of successful commercial memetrackers such as Techmeme, Google News, Tailrank, and Megite. It will be an open-source implementation of memetracking technology that can be easily plugged into Drupal-based community sites.

Benefits to Drupal/Open Source Community:

The memetracker module will solve a common problem for many online communities. Perhaps the best way to explain how my module will do this is by explaining my experience with this problem and how the memetracker module will help.

What's my Problem?

For eight months now I've been building social learning websites for classes at Brigham Young University where I'm a graduate student. Working with a few professors, I build websites for their classes to be used as a learning tool by them and their students. Our goal is to understand what web2.0 tools / principles help most in education.

Any online community, like the ones I helped form with my class web sites, generates lots of content and conversations. As a community grows, a problem that soon emerges is how to help individual members find the content/conversations they are most interested in. If there's only a few members in the online community, it's easy to follow every conversation. But if there's 100s or 1000s of members, it soon becomes impossible to follow or find every interesting conversation.

In researching ways that online communities help participants find the most interesting content, I've found three patterns which help.

First is the small-world pattern. Via Organic Groups, you split conversations up by topic. Members congregate around only the groups they are interested in.
Second is the Twitter pattern. Using buddylist or user_relationships, members follow friends or people whose ideas they find interesting.
The third pattern is for members to read the most interesting memes as they are somehow determined by the community (this is where my module will help out).

Drupal sites normally implement the third pattern by having human editors manually promote the most interesting content to the front page. But I've found this solution to be inadequate. Human editors miss many interesting pieces of content, are prone to bias, and the solution is labor intensive. Also, promoting content to the front page doesn't group the content by conversation or meme, an important usability improvement.

A much better solution for finding the most important memes is a class of tools called memetrackers, such as Techmeme, Google News, Tailrank, Megite, and others. These memetrackers automatically find the most interesting content bubbling up in the blogosphere. Using these memetrackers, news readers can find in minutes the best content from 1000s of sources.

As the principal tech memetracker, Techmeme plays a hugely influential role in the tech community. Techmeme is the place to find conversation about the biggest memes of the day.

But at the moment there are only commercial memetrackers such as Techmeme. There are no free or open-source implementations and certainly nothing that can be easily plugged into Drupal.

My proposal then is to write a memetracking module that will fulfill the same role as Techmeme does to the tech community for any community website where it's installed. My module will intelligently filter and group community generated content to display to the community in real time the most interesting conversations and memes as they emerge.

I see a few classes of potential users of my module:

Those building community sites using Drupal. They face the same problem I related in my story earlier– how to help their users sift through new content and to find interesting conversations to follow and join in.
Those who want to build sites that support an online community. They'd use FeedAPI to pull content from any number of relevant sources and then use memetracker to display interesting content and conversations from that wider community.
Users who would use Memetracker as a personal news aggregator. They'd add news sources they are interested in to Drupal, and memetracker would learn their specific interests and filter and display only the news they are interested in.
Organizations who use it as a news gathering resource for an internal blogosphere / outside news of interest to the company.

This module will be especially useful to the open source community. Open source projects rely entirely on internet communication technologies. In fact, many commentators have linked the explosion of open source to the opening of the Internet to the general population. Memetracking software is an additional step in the evolution of communication technologies on the internet. It will allow any open source community, large or small, to put up a simple Drupal site and start aggregating developer blogs, forums, mailing lists, etc. into a centralized place.

Another analogy: memetracker = smart aggregator

Another way to think of my proposed module is it'll be a smart aggregator. An aggregator is software like Drupal Planet or any of the many other PlanetPlanet installations out there that aggregates related content from many sources to display in a central location. But current aggregators are dumb. The dumb aggregator knows no better than to pull in new content and order it chronologically.

A memetracker is a smart aggregator. It pulls in new content just as a dumb aggregator does, but it's much smarter about how it displays the content. It can analyze the text of the content, work out which authors are talking about the same topic, and group them together. Not only will the smart aggregator group similar topics, it will also learn which topics and authors are most interesting to members of the community and display those first. No more will you have to scroll through a long list of content, skipping over content you're not interested in; instead, new content will be nicely organized by meme and interestingness in a neat, compact form.

Project Details:

To help you understand how the memetracker will work, I'll walk through the steps the memetracker will take from first aggregating new content to outputting a view to the user.

High level overview

The memetracker will assemble content from two sources (internal content through Drupal and external content through FeedAPI). It will then analyze the content to identify active memes. Then, using what the memetracker has learned from the click history, the memetracker will decide which memes to display and which to discard and which order to place the memes on the page (i.e. the most interesting memes will be at the top). UI code will be written to display the memes in an easy-to-browse fashion.

Detailed walk through

So say a particular memetracker is tracking 100 sources (30 are blogs on the Drupal installation and 70 are various blogs and other news sources you are aggregating with FeedAPI). In the past two days, the 100 sources have created 300 pieces of content. The first pass through the 300 pieces of content will be to find memes. First, it'll check for links between pieces of content -- this indicates they are discussing the same meme. Second, it'll perform textual analysis to determine how "close" the texts are to each other (cluster analysis). Related tags may also be used in identifying memes, but tags are not a reliable way of sorting content, as the quality of the tags depends on the diligence of the content author.
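A minimal sketch of the two signals described above, link detection and text "closeness". Python is used here purely for illustration (the module itself would be PHP for Drupal). The item fields `url` and `html`, the function names, and the choice of cosine similarity over raw word counts are all assumptions made for the example, not decisions taken in the proposal.

```python
import math
import re
from collections import Counter

def links_in(html):
    """Return the URLs found in href attributes (a crude regex, not a full HTML parser)."""
    return set(re.findall(r'href="([^"]+)"', html))

def links_to(a, b):
    """True if item a links to item b, or the other way round."""
    return b["url"] in links_in(a["html"]) or a["url"] in links_in(b["html"])

def word_counts(html):
    """Lowercased bag of words with HTML tags stripped."""
    plain = re.sub(r"<[^>]+>", " ", html)
    return Counter(re.findall(r"[a-z0-9]+", plain.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical items for illustration only.
post_a = {"url": "http://example.com/a", "html": "<p>Drupal 6 released today</p>"}
post_b = {"url": "http://example.org/b",
          "html": '<p>My take on the <a href="http://example.com/a">Drupal 6 release</a></p>'}
post_c = {"url": "http://example.net/c", "html": "<p>Cooking pasta at home</p>"}

sim_ab = cosine_similarity(word_counts(post_a["html"]), word_counts(post_b["html"]))
sim_ac = cosine_similarity(word_counts(post_a["html"]), word_counts(post_c["html"]))
```

Here `post_b` both links to `post_a` and shares words with it, while `post_c` does neither, so either signal alone would already separate the two cases.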

In my example, say the algorithm identifies 10 meme clusters in the 300 pieces of content. 80 of the 300 pieces of content are part of the meme clusters leaving 220 individual memes, not associated with any other content.

You've set the memetracker to only display ~50 links at a time to avoid information overload for those browsing. This means the memetracker has to discard 250 links from the display page. It does this filtering by various means.

First, the memetracker is biased toward keeping meme clusters (remember, the goal is to display the most interesting content -- if two people thought a meme worth writing about, odds are that meme is more interesting to the general community than a meme that only one person talked about). So the memetracker will weight links in meme clusters higher.

Second, it will use a form of authority ranking for the different sources. If one blogger consistently writes interesting content that many people click on to read, then the memetracker will rank any new content by that blogger higher than new content by a less interesting blogger.
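One way the cluster bias could look, as a rough sketch: links that belong to a cluster of more than one item get their score multiplied by a bonus before the top links are chosen. The field names (`base_score`, `cluster_size`), the function name, and the bonus value of 2.0 are hypothetical placeholders, not tuned values.

```python
def rank_links(links, limit=50, cluster_bonus=2.0):
    """Return the top `limit` links, weighting links in multi-item clusters higher.

    Each link is a dict with 'title', 'base_score' and 'cluster_size'
    (all hypothetical field names for this sketch).
    """
    def weighted(link):
        bonus = cluster_bonus if link["cluster_size"] > 1 else 1.0
        return link["base_score"] * bonus

    return sorted(links, key=weighted, reverse=True)[:limit]

links = [
    {"title": "solo post", "base_score": 1.0, "cluster_size": 1},
    {"title": "clustered post", "base_score": 0.6, "cluster_size": 3},
]
top = rank_links(links, limit=1)
```

With these numbers the clustered post wins the single slot even though its base score is lower, which is the bias described above.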
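Authority ranking could be sketched as a smoothed click-through rate per source. This is only one possible definition of "authority"; the class name, the method names, and the prior values (which stop a source with very few views from jumping to the top or bottom) are assumptions for illustration.

```python
from collections import defaultdict

class SourceAuthority:
    """Tracks how often content from each source is clicked when shown."""

    def __init__(self, prior_clicks=1.0, prior_views=20.0):
        # The priors act like imaginary past observations shared by every source.
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)

    def record(self, source, clicked):
        """Record that a link from `source` was displayed, and whether it was clicked."""
        self.views[source] += 1
        if clicked:
            self.clicks[source] += 1

    def score(self, source):
        """Smoothed click-through rate; unseen sources get the prior rate."""
        return (self.clicks[source] + self.prior_clicks) / (self.views[source] + self.prior_views)

authority = SourceAuthority()
for i in range(100):
    authority.record("popular-blogger", clicked=(i < 30))   # 30 clicks in 100 views
    authority.record("ignored-blogger", clicked=(i < 2))    # 2 clicks in 100 views
```

A brand-new source starts at the prior rate, between the two bloggers above, and moves up or down as click data accumulates.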

Third, the memetracker will filter out topics not interesting to the community. Bayesian logic is a possible way to do this. Bayesian logic is often used for spam filtering. Just as spam filters learn that emails with "XXX" or "Hot chicks" are probably spam, the memetracker can learn from clicks: if the community consistently clicks to read content about Drupal and not Plone, a new article about Drupal will be displayed and the Plone article won't.

Fourth, the memetracker will use click momentum. By this I mean it'll take a measure of how many times the content has been clicked on in the last while. If an article is being clicked on a lot, that suggests the article is more interesting to the community and should be moved up on the page.
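To make the spam-filter analogy concrete, here is a small naive Bayes sketch trained on clicked versus ignored titles. The class and method names are hypothetical, and the smoothing (add-one, with one extra slot for unseen words) is just one simple choice; a real implementation would need far more training data and probably a better tokenizer.

```python
import math
import re
from collections import Counter

class InterestFilter:
    """Naive Bayes over words, with two classes: 'clicked' and 'ignored'."""

    def __init__(self):
        self.word_counts = {"clicked": Counter(), "ignored": Counter()}
        self.doc_counts = {"clicked": 0, "ignored": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, clicked):
        label = "clicked" if clicked else "ignored"
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def _log_score(self, words, label):
        vocab = set(self.word_counts["clicked"]) | set(self.word_counts["ignored"])
        total = sum(self.word_counts[label].values())
        # Smoothed class prior, then smoothed per-word likelihoods.
        score = math.log((self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2))
        for w in words:
            score += math.log((self.word_counts[label][w] + 1) / (total + len(vocab) + 1))
        return score

    def probability_interesting(self, text):
        words = self.tokens(text)
        a = self._log_score(words, "clicked")
        b = self._log_score(words, "ignored")
        m = max(a, b)  # subtract the max before exponentiating, for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        return ea / (ea + eb)

f = InterestFilter()
f.train("New Drupal module released", clicked=True)
f.train("Drupal theming tips", clicked=True)
f.train("Plone upgrade guide", clicked=False)
f.train("Plone hosting news", clicked=False)
p_drupal = f.probability_interesting("Drupal performance article")
p_plone = f.probability_interesting("Plone performance article")
```

After this toy training run the Drupal article scores above one half and the Plone article below it, so a cutoff at 0.5 would display the first and drop the second.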
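"Clicked on in the last while" could be measured in several ways. The sketch below uses an exponentially decayed click count, so recent clicks count fully and older ones fade. The function name, the use of Unix timestamps in seconds, and the six-hour half-life are assumptions chosen for the example.

```python
def click_momentum(click_times, now, half_life_hours=6.0):
    """Sum of clicks, each weighted by 0.5 ** (age / half_life).

    `click_times` and `now` are Unix timestamps in seconds.
    Clicks with timestamps in the future are ignored.
    """
    total = 0.0
    for t in click_times:
        age_hours = (now - t) / 3600.0
        if age_hours >= 0:
            total += 0.5 ** (age_hours / half_life_hours)
    return total

now = 1_000_000
fresh = click_momentum([now, now - 60, now - 120], now)                   # three clicks just now
stale = click_momentum([now - 48 * 3600] * 3, now)                        # three clicks two days ago
```

Both articles have three clicks, but the one clicked in the last couple of minutes has far more momentum than the one clicked two days ago.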

Fifth, I will implement a hook that will let other modules set simple rules to affect the filtering. For example, if content from one source must always display, say blog entries from the CEO, you could set a rule that content from this source will always be kept and not filtered out.
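The rule hook might work roughly as follows. This is a Python analogy only: real Drupal hooks are PHP functions discovered by naming convention, and the registration function, verdict constants, and item fields below are all invented for this sketch.

```python
KEEP, DROP = "keep", "drop"
_rules = []

def register_rule(rule):
    """Register a callable that returns KEEP, DROP, or None (no opinion) for an item."""
    _rules.append(rule)
    return rule

def apply_rules(item):
    """Return the first definite verdict from the registered rules, or None."""
    for rule in _rules:
        verdict = rule(item)
        if verdict in (KEEP, DROP):
            return verdict
    return None

def filter_items(scored_items, limit):
    """scored_items is a list of (item, score) pairs.

    Items a rule forces to KEEP come first, items forced to DROP are removed,
    and the remaining slots are filled by score.
    """
    forced, candidates = [], []
    for item, score in scored_items:
        verdict = apply_rules(item)
        if verdict == KEEP:
            forced.append(item)
        elif verdict != DROP:
            candidates.append((score, item))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    room = max(limit - len(forced), 0)
    return forced + [item for _, item in candidates[:room]]

@register_rule
def always_keep_ceo(item):
    # Hypothetical rule from another module: the CEO's blog is never filtered out.
    return KEEP if item.get("source") == "ceo-blog" else None

items = [
    ({"source": "ceo-blog", "title": "Quarterly update"}, 0.01),
    ({"source": "dev-blog", "title": "Release notes"}, 0.90),
    ({"source": "misc", "title": "Lunch menu"}, 0.50),
]
shown = filter_items(items, limit=2)
```

The CEO's post survives despite having the lowest score, and the one remaining slot goes to the highest-scoring ordinary item.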

So once the memetracker has grouped and filtered content down to the proper level, it'll pass data to the UI code to be rendered and pushed out to the browser.

Talk a little about how UI will look – Use screenshot of techmeme as an example

How will the memetracker learn?

The SoC application template asks what aspects of my proposal depend on further research or experiments. The machine learning portion of my proposal, much more than any other part fits this area. Machine learning is researched very actively and many algorithms are now well understood but deciding on the correct algorithms that will fit the requirements of a web application (fast response times, low resource usage) will take considerable research and experimentation. But while the specifics are still unknown, the general direction I'll take is clear.

The general machine learning technique I'll use is called reinforcement learning. In reinforcement learning, the algorithm learns by making a guess and then getting feedback. If the guess is good, that “state” is reinforced; if the guess is bad, that “state” is weakened. In the case of the memetracker, it will guess what memes are interesting to the community and what order to place them in on the page. Then, based on feedback from the community (i.e. what links are clicked on or not clicked on), the memetracker will gradually learn how to select and display the most interesting content.
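The guess-then-feedback loop can be sketched with a very simple bandit-style learner, one of the most basic forms of reinforcement learning. It is not necessarily the algorithm the final module would use. The class name, the neutral starting estimate of 0.5, the learning rate, and the exploration rate are all assumptions for illustration.

```python
import random

class ClickLearner:
    """Keeps a running estimate of how likely each key (topic, source, ...) is to be clicked."""

    def __init__(self, learning_rate=0.1, explore=0.1, seed=None):
        self.learning_rate = learning_rate
        self.explore = explore          # probability of trying a random order
        self.value = {}
        self.rng = random.Random(seed)

    def estimate(self, key):
        return self.value.get(key, 0.5)  # unknown keys start at a neutral 0.5

    def rank(self, keys):
        """The 'guess': order keys by estimate, occasionally shuffling to explore."""
        keys = list(keys)
        if self.rng.random() < self.explore:
            self.rng.shuffle(keys)
            return keys
        return sorted(keys, key=self.estimate, reverse=True)

    def feedback(self, key, clicked):
        """The 'feedback': move the estimate toward 1 on a click, toward 0 otherwise."""
        reward = 1.0 if clicked else 0.0
        old = self.estimate(key)
        self.value[key] = old + self.learning_rate * (reward - old)

learner = ClickLearner(explore=0.0)
for _ in range(20):
    learner.feedback("drupal", clicked=True)
    learner.feedback("plone", clicked=False)
order = learner.rank(["plone", "drupal"])
```

With exploration switched off, the learner ranks "drupal" ahead of "plone" after this feedback; with a non-zero `explore` value it would sometimes try other orders to keep learning about items it currently rates low.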

There are two major machine learning problems to be solved: filtering out uninteresting content, and clustering or grouping content into memes.
When filtering, the machine learning algorithm basically has to answer the question: will this piece of content on this topic written by this author be of sufficient interest to the general community to include on the Meme Browsing page? There are a number of algorithms that I've looked at, including naïve Bayes, backpropagation, and support vector machines. Each will filter adequately for the memetracker's needs; the question to answer is which will perform best under the constraints they'll be under.

The second machine learning problem, clustering or grouping content into memes, is more straightforward to solve. I'll use a technique called agglomerative clustering. This algorithm starts with each piece of content as part of its own meme. You set a threshold, and if two pieces of content are closer than the threshold, they are joined together into a single meme. The algorithm loops through the content, joining together all the content that is closer than the threshold.
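The clustering step described above can be sketched as threshold-based agglomerative clustering with single linkage (two clusters are as close as their closest pair of members). The linkage choice and the function signature are assumptions; the example uses plain numbers and absolute difference as a stand-in for real content and a real text-distance function.

```python
def agglomerative_clusters(items, distance, threshold):
    """Merge the closest pair of clusters repeatedly until none is closer than `threshold`.

    `distance(a, b)` returns a non-negative number for two items.
    Cluster distance is the minimum distance between any two members (single linkage).
    """
    clusters = [[item] for item in items]  # every item starts as its own meme
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] >= threshold:
            break  # nothing left that is close enough to join
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

# Stand-in data: numbers instead of blog posts, absolute difference as the distance.
memes = agglomerative_clusters([1, 2, 10, 11, 50], lambda a, b: abs(a - b), threshold=3)
```

This version recomputes every pairwise distance on each pass, which is fine for a few hundred items but would need a smarter approach for much larger batches.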

Using the above two techniques plus reinforcement learning will create a memetracking system that fits itself to the needs of the community it supports.
In implementing these algorithms, I'll have the support of the excellent machine learning faculty in the Computer Science department at Brigham Young University. I've talked to one professor extensively about my proposal and he (and most likely one of the other professors and several graduate students) is willing to mentor me as I write the machine learning code.

Risk of failure:

Low. Writing a working memetracker is not difficult. The devil, as they say, is in the details. Building a great memetracker is a very difficult task. I don't expect to have built a great memetracker by the end of the summer. Instead, my plan is to lay the foundation and build the necessary pieces (or include an API) on which great memetrackers will eventually be built. But I fully expect to have an adequate memetracker working at the end of the summer.

Deliverables:

Ahhh!!! - I haven't finished this part completely. I'll put stubs for the time being.

machine learning api
memetracker

admin UI
Simple UI for browsing memes

Hook for modules to influence filtering

Hook for javascript widgets

Project Schedule:

Yes, this needs to be more detailed...
I'm done with classes by the end of April and plan to start working on GSoC immediately. In the first 2-3 weeks of May, I will write a very rough version of the UI and machine learning code. In the spirit of release early, release often, I'll install the memetracker on a public-facing website. I'll then put together many, many memetrackers to track different communities of bloggers / other online communities: econ bloggers / liberal bloggers / conservative bloggers / edubloggers / Drupal stuff / Joomla stuff / Ubuntu stuff / and a bunch more.
My goal is to get lots of attention from these different communities so as to get as many people to use the memetrackers as possible. This will enable me to rapidly test and iterate through 100s of variations on different machine learning algorithms testing their performance against a large number of different types of communities.
I'll be working mostly on improving the machine learning algorithms through the latter part of May and all of June. By the end of June I hope to have gotten the machine learning code to a very mature state.
From there I'll turn my efforts to bug fixing, filling out test coverage, documentation, and improvements on the admin UI and meme browsing UI.

Biography:
I was going to include what I had so far for my bio but I reread it and it's too ugly. I'll add it in when I have time to rewrite it.

Comments

I recommend this application

Posted by sime on March 30, 2008 at 7:50am

I recommend this application becomes:
"Student proposal -- community review complete"

after reading the interesting discussion here.

This looks really solid to

Posted by catch on March 30, 2008 at 12:21pm

This looks really solid to me.

The only bit I have questions about really is the UI. It'd be good to see if it could be done using nodequeue (which can then be displayed by Views). Having the memetracker create new queues/subqueues and push content into them means that admins can use a common UI to define and display queues pretty much any way they like.

Since you mentioned possibly using terms at some point (I use 'similar by terms' on my site and it works really nicely - although there's only about 20 people adding the terms and we're quite careful) - it might be worth trying to make this pluggable with some of the methods already out there (taxonomy, voting api, pivots etc.). There's a bunch of related/recommended content modules already: http://www.civicactions.com/blog/similar_content_module_wrapup Along with the pivots module which seems to be doing things along similar lines as well: http://drupal.org/project/pivots A lot of these modules try to do similar things in different ways, but afaik there's no way to use them in tandem, nor much in the way of nodequeue/views support.

I do plan to expose memetracker data to views

Posted by kyle_mathews on March 30, 2008 at 4:19pm

I do plan to expose memetracker data to views (that's a big chunk missing from my write-up... it'd be nice if the world would sit still while GSoC applications are finished). Beyond that, I don't plan on having time to integrate memetracker with other modules this summer. The advantage of integrating my module with views, however (Bill Fitzgerald recommended this path to me), is that it'll ensure that all the metadata my module generates about the content it processes will be exposed via a clean set of APIs. So going through the process of exposing my data to one module (views) will make it much easier for me (or others) to integrate memetracker with nodequeue, the related content modules out there, and anything else.

Kyle Mathews

This just hit me now --

Posted by bonobo on March 30, 2008 at 7:28pm

and I feel pretty silly that I didn't ask this/realize this earlier --

How will memes be stored? In thinking this through, it seems like there are a couple immediate possibilities:

Memes, once identified, get stored as nodes -- items relating to that meme are tracked as child nodes/related nodes -- the advantage of this is that views/nodequeue/panels/access control is supported out of the box -- the disadvantage is that it could limit the scope of what you have in mind for memes.
Memes get stored as taxonomy terms, and nodes relating to a specific meme would be tagged with that term.
Memes get stored as something else entirely, distinct from nodes and taxonomy.

For this stage, this is probably a bit far into the implementation phase, and with that said, this is a strong, well-articulated proposal.

Cheers,

Bill

FunnyMonkey
Tools for Teachers


memes as nodes (or anything else)

Posted by kyle_mathews on March 31, 2008 at 1:43pm

At the moment, I'm not sure that "memes" can be stored as individual items at all. The reason I say that is a meme isn't universally identifiable. What I mean by that is a meme is defined by its context so from memetracker to memetracker, depending on what sources are being pulled in and a number of settings within memetracker, different "memes" will be identified, even given the same content.

An example. Say you have two memetrackers, one tracking the Drupal community and another tracking web developers in Portland, OR. Two Drupal developers both post one day on Drupal, one with a recipe for FormAPI and another with some optimization tips for high-traffic sites. Both developers' blogs are pulled in by both the Drupal memetracker and the Portland web developers' memetracker. The Portland memetracker pulls in the two blogs first and says, "Oh, they're both on Drupal, most content isn't on Drupal so these two pieces of content must be on the same meme." But then the Drupal memetracker looks at the two blogs and comes to the opposite conclusion, because everything it looks at is on Drupal, so for two pieces of content to be part of the same meme, they must not only both be about Drupal but also have considerable other overlap.

So because memes are fragile ethereal things -- I hadn't considered storing them directly as nodes. I do see the value of storing the memes in some form to create a historical memory of the community's conversations. One feature of techmeme I really like that does this is it creates snapshots of the memebrowsing page like this snapshot from when Drupal 6 was released -- http://www.techmeme.com/080213/p99#a080213p99
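The context effect in this example can be shown with TF-IDF weighting, where a word counts for less the more documents in the tracker's own corpus contain it. TF-IDF is my choice for the illustration, not something the proposal commits to, and the tiny corpora below are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {word: weight} dict per document.

    Weight = term count * log(N / number of documents containing the word),
    so a word that appears in every document gets weight 0.
    """
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The same two posts, seen by two different memetrackers (invented data).
formapi_post = "drupal formapi recipe".split()
tuning_post = "drupal performance tips".split()

portland_corpus = [formapi_post, tuning_post,
                   "rails deployment notes".split(), "css layout tricks".split()]
drupal_corpus = [formapi_post, tuning_post,
                 "drupal theming guide".split(), "drupal module release".split()]

p = tfidf_vectors(portland_corpus)
d = tfidf_vectors(drupal_corpus)
portland_similarity = cosine(p[0], p[1])
drupal_similarity = cosine(d[0], d[1])
```

In the Portland corpus the shared word "drupal" is distinctive, so the two posts look somewhat alike. In the Drupal corpus every document mentions "drupal", its weight drops to zero, and the two posts share nothing that counts.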

That would be fairly easy to implement and would provide an easy way to browse past memes.

But having said all that, I've never seriously considered storing memes as nodes. It'd certainly be doable -- I'll have to think more about what value it'd bring and what'd be the best way to implement that.

Kyle Mathews

The storage of memes was

Posted by catch on March 31, 2008 at 2:04pm

The storage of memes was what I was thinking about when I mentioned nodequeue - create a queue per meme, put nodes into it (and this gives you views integration for free). I think there's some feature requests against nodequeue to take a 'snapshot' of a queue at a particular time as well.

A quick idea with little thought behind it :)

Posted by bonobo on March 31, 2008 at 4:09pm

So take this for what it is worth...

How about a meme as a combination of taxonomy terms (both user and machine generated), timestamp, search index results, and (as yet undefined) user interactions?

This would allow one piece of content to belong to multiple memes, and for "definitions" of memes to shift subjectively, depending on user (or admin) defined criteria.

Thoughts?

FunnyMonkey
Tools for Teachers
