Memetracker module proposal

kyle_mathews's picture

Summary:
I want to write two modules for Drupal as part of Google Summer of Code: one called meme_tracker and the other called machine_learning_api. The meme_tracker module will use the machine_learning_api to intelligently filter and group content from both internal and external content sources. The module's purpose is to find and display to a community, in real time, the most interesting conversations and memes within the community as they emerge.

Hello Drupliers. My name is Kyle Mathews. I'm a grad student in Information Systems at Brigham Young University working as a research assistant to several faculty members here. We are studying how and where social software can be used in education.

In the past eight months, I've built a number of classroom websites using Drupal. I've learned a good deal about Drupal in the process, become very involved in the community, and started writing a module (writing_assignment -- still a work in progress, btw). I've loved almost everything about Drupal but have found that it is missing an important component for building the perfect social learning website. This itches. So, in the best open-source fashion, I'm applying to Google Summer of Code to scratch my itch.

What's the Problem?
Any online community generates lots of content and conversations. As a community grows, a problem that soon emerges is how to help individual members find the content and conversations they are most interested in. If there are only a few members in the online community, it's easy to follow every conversation. But if there are hundreds or thousands of members, it soon becomes impossible to follow or find every interesting conversation.

In researching ways that online communities help participants find the most interesting content, I've found three patterns which help.

  1. First is the small-world pattern. Via Organic Groups, you split conversations up by topic. People congregate around only the groups they are interested in.
  2. Second is the Twitter pattern. Using buddylist or user_relationships, you follow friends or people whose ideas you find interesting.
  3. The third pattern is to read the most interesting memes as they are somehow determined by the community (This is where my module will help out).

Drupal sites normally implement the third pattern by having human editors manually promote the most interesting content to the front page. But I've found this solution to be inadequate. Human editors miss many interesting pieces of content, are prone to bias, and the solution is labor intensive. Also, promoting content to the front page doesn't group the content by conversation or meme, an important usability improvement.

A much better solution for finding the most important memes is a class of tools called memetrackers, such as Techmeme, Tailrank, Megite, and others. These memetrackers automatically find the most interesting content bubbling up in the blogosphere. Using these memetrackers, blog readers can in minutes read the best content from thousands of sources.

As the principal tech memetracker, Techmeme plays a hugely influential role in the tech community. Techmeme is the place to find conversation about the biggest memes of the day.

But at the moment there are only commercial memetrackers such as Techmeme. There are no free or open-source implementations and certainly nothing that can be easily plugged into Drupal.

I propose then to write a memetracking module that will fulfill the same role as Techmeme does to the tech community for any community website where it's installed. Using machine learning techniques, my module will intelligently filter and group content from both internal and external content sources to display to the community in real time the most interesting conversations and memes within the community as they emerge.

Other Potential Use Cases:
I think the memetracker module would find value in many areas outside of my initial community website use case. Some other ideas.

  • People could use the software to make Techmeme-like sites for specific topical areas (or think enhanced PlanetPlanet-type aggregation).
    • such as a news aggregator for Africa (aggregating information from the BBC, local media, blogs, etc.)
    • or an aggregator for the edublog community
  • a better personal RSS reader
  • A company rolls out meme_trackers for each employee (each Drupal user could have a meme_tracker). The employees subscribe to feeds (internal and external) they are interested in. Then the best content, as determined by all the company employees, is pushed onto individual employee meme_trackers automatically.

I have many more ideas I've thought of to improve/extend the basic concept but I'll share just one now.

With the RDF module coming along nicely, some amazingly cool possibilities would open up if individual memetrackers could talk to each other and share information via RDF.

A few examples of how that might work:

  • A school (like say, BYU) has a meme_tracker for each college within the university which aggregates content from blogs/og group postings within that college. A BYU-wide meme_tracker then pulls in the best content from each individual meme_tracker and finds memes/conversations underway throughout the entire school.
  • A blogger writes mostly about Ubuntu news. He wants to place a block on his website that displays the top Ubuntu news of the day. He writes a bit of code that queries an Ubuntu memetracker for the top five articles of the day and displays them in a block.
  • My friends and I all read blogs with individual memetrackers we've set up on a server. We want to share with each other articles we find especially interesting. While reading, we mark articles we want shared. Then every so often my memetracker queries my friends' memetrackers and returns blog posts they've marked. These posts are automatically placed into my memetracker and tagged by the friend they came from.

I can imagine that many many more very interesting mashups between memetrackers would emerge if they all expose their information via RDF.

I'm really excited about my idea and would love to help bring it to life. I'd love now to hear feedback from the Drupal community about ways I can improve my proposal before I submit it to Google.

Thank you for your time.

NOTE:
A possible point of confusion: there are significant differences between my proposed modules and existing modules which recommend similar content (Similar by Terms, Content Recommendation Engine, Similar Entries, Conversation Pivots). These modules guide you to content similar to the node you are viewing (for the most part) regardless of the other content's age. My module identifies "emerging" conversations and memes. It groups on a page different perspectives and thoughts on the hottest topics of the day.

Comments

...

icecreamyou's picture

Are you talking about intelligent aggregation of content on the website itself, or on external websites too? I see the value of doing this if it can draw content from external sources, but if you're thinking internal-only then what you want can be done more simply by a voting module like Fivestar.

I am personally more interested in the "Twitter pattern" as you call it, which is why I want to develop a "Trust" module if I ever find time to learn enough PHP.

both internal and external websites

kyle_mathews's picture

It'd aggregate from both internal and external sources. External sources would come in via FeedAPI, but it would mesh both internal and external sources seamlessly.

But to really get an idea what I'm talking about go look at techmeme.com for a bit. Notice its pattern. It'll have a main article at the top. Then just under a main article, it has a "discussion" section. These are blogs that link to the main article. Then, oftentimes, under that will be a "related" content section. These are other articles discussing the same meme.

I want this same intelligent grouping by conversation / meme to be available to any Drupal powered community website.

Kyle Mathews

Great idea btw

kyle_mathews's picture

I really like your "Trust" module idea. I hope someone picks it up and runs with it someday. That's why Twitter is so successful: it's very easy to ignore people you don't want to listen to.

Kyle Mathews

I have already written feedapi_node_discussion module

mustafau's picture

I have already written a module (feedapi_node_discussion) which opens a discussion section on any content coming through FeedAPI Node (part of http://drupal.org/project/feedapi) module.

It was meant to be a personal project; however, if there is a need for this functionality I can publish it at drupal.org.

I'd love to hear more about your module

kyle_mathews's picture

I'd love to hear more about what your module does. Do you have a live example of your "discussion section"? I'm not 100% sure what you mean.

Kyle Mathews

I don't have a live

mustafau's picture

I don't have a live example.

By "discussion section" I mean exactly what you said in your previous comment (http://groups.drupal.org/node/9730#comment-30468).

This module tracks links from feed item nodes to feed item nodes. FeedAPI Node module creates nodes from feed items. Every node created by FeedAPI Node has a URL associated with it. feedapi_node_discussion module extracts links from node body and determines whether there is a node associated with that link or not. When it finds a node associated with a link a discussion is created under that node.
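The link-tracking step described above can be sketched roughly as follows. This is only an illustration in Python, not the module's actual code (which is PHP and has not been posted in this thread); the `nodes` structure and the names `LinkExtractor` and `find_discussions` are assumptions made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_discussions(nodes):
    """Map each linked-to node id to the ids of the nodes that link to it.

    `nodes` is a hypothetical list of dicts with "nid", "url" and "body" keys.
    """
    url_to_nid = {node["url"]: node["nid"] for node in nodes}
    discussions = {}
    for node in nodes:
        parser = LinkExtractor()
        parser.feed(node["body"])
        for link in parser.links:
            target = url_to_nid.get(link)
            # Ignore links to URLs we hold no node for, and self-links.
            if target is not None and target != node["nid"]:
                discussions.setdefault(target, []).append(node["nid"])
    return discussions


nodes = [
    {"nid": 1, "url": "http://example.com/a", "body": "<p>Original story.</p>"},
    {"nid": 2, "url": "http://example.org/b",
     "body": '<p>Reply to <a href="http://example.com/a">this</a>.</p>'},
    {"nid": 3, "url": "http://example.net/c",
     "body": '<p>See <a href="http://example.com/a">a</a> and '
             '<a href="http://nowhere.test/">x</a>.</p>'},
]
print(find_discussions(nodes))  # {1: [2, 3]}
```

Exact URL matching is the simplest possible rule; a real implementation would probably need to normalize URLs (trailing slashes, redirects, tracking parameters) before comparing them.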

Drupal on the sending end

kvantomme's picture

Hi Mustafa, it sounds like you have to set up the source site to send the comment links along with the main content, am I right? Does this only work with a Drupal site on the sending end?

--

Check out more of my writing on our blog and my Twitter account.

No, Drupal powers the

mustafau's picture

No, Drupal powers the receiving end. Content is collected via FeedAPI. Using FeedAPI one can import anything into Drupal.

Please post code somewhere

boris mann's picture

This sounds VERY interesting, Mustafa...please do post the code somewhere where we can trial it....

I have created a project

mustafau's picture

I have created a project page on drupal.org.

http://drupal.org/project/feedapi_node_discussion

I will upload the code sometime today.

Looking forward to checking out

alex_b's picture

Looking forward to checking out this project. Looks very promising.

likewise

bonobo's picture

This has the potential to address a series of issues --

Looking forward to seeing this.

Cheers,

Bill


FunnyMonkey
Tools for Teachers

Hi Mathew, this sounds very

alex_b's picture

Hi Mathew, this sounds very interesting.

The news tracker we're working on (http://www.managingnews.com/) is doing similar things by providing people with tools to track keywords and automatically tagging content.

Some questions:

How and on what level would the meme tracker work? Would it be a content analyzing layer that kicks in on content creation plus a UI for browsing memes? One method of identifying memes that you omitted in your blog post is tagging as practiced here on g.d.o. - although it's rather a crutch for this job as it relies entirely on the diligence of the author. Are you planning to use auto-tagging content analysis tools like OpenCalais?

I would love to see this work integrating with FeedAPI (http://www.drupal.org/project/feedapi). Aggregation + meme tracking could be an awesome dynamic duo.

Alex

great questions Alex

kyle_mathews's picture

Alex, you pretty much got the right idea.

The memetracker will assemble content from different sources (internal from Drupal and external through FeedAPI), then analyze the content to identify active memes. Then, using what the memetracker has learned from past click history, it will decide which memes to display, which to discard, and in which order to place the memes on the page (i.e. the most interesting memes will be at the top). GUI code will be written to display the memes in an easy-to-browse fashion.

So say a particular memetracker is tracking 100 sources (30 are blogs on the Drupal installation and 70 are various blogs and other news sources you are aggregating with FeedAPI). In the past two days, the 100 sources have created 300 pieces of content. (More recent news is more interesting. Part of the algorithm would be to decay the interestingness of a piece of content as it grows older.) The first pass through the 300 pieces of content will be to find memes. First it'll check for links between pieces of content -- this indicates they are discussing the same meme. Second, it'll perform textual analysis. There are a number of algorithms for determining how "close" two texts are to each other. Related tags will probably be used in identifying memes as well -- but I'll need to test to be sure which algorithm works best in practice. For my example, say the algorithm identifies 10 memes in the 300 pieces of content. 80 of the 300 pieces of content are part of those memes, leaving 220 standalone pieces not associated with any other content.
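To make the two-pass grouping concrete, here is a minimal Python sketch. It assumes Jaccard word overlap as the "closeness" measure and a union-find structure to merge groups; the proposal does not commit to either, and the `items` layout, the 0.5 threshold, and the function names are hypothetical.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of words longer than three characters (a crude stop-word filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_memes(items, threshold=0.5):
    """Group items that link to each other or whose text is similar.

    `items` maps an id to a dict with "text" and "links" (ids it links to).
    Returns only the clusters that have at least two members.
    """
    parent = {i: i for i in items}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Pass 1: links between pieces of content.
    for i, item in items.items():
        for j in item["links"]:
            if j in items:
                union(i, j)

    # Pass 2: textual similarity between every pair.
    bags = {i: tokens(item["text"]) for i, item in items.items()}
    for i, j in combinations(items, 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            union(i, j)

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


items = {
    1: {"text": "Drupal 6 released with new menu system", "links": []},
    2: {"text": "My thoughts on the release", "links": [1]},
    3: {"text": "New menu system released with Drupal 6", "links": []},
    4: {"text": "Gardening tips for spring", "links": []},
}
print(group_memes(items))  # [[1, 2, 3]]
```

Item 2 joins the cluster through its link and item 3 through text overlap, while item 4 stays a standalone piece. The pairwise comparison is quadratic in the number of items, which is fine for a few hundred pieces of content but would need rethinking at larger scale.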

You've set the memetracker to display only ~50 links at a time to avoid information overload for those browsing. This means the memetracker has to discard 250 links from the display page. It does this filtering by various means.

  • First, the memetracker is biased toward keeping memes. Remember the goal is to display the most interesting content -- if two people thought a meme worth writing about, odds are that meme is more interesting to the general community than a meme that only one person talked about. So the memetracker will weight links in meme clusters higher.
  • Second, it will use a form of authority ranking for the different sources. If one blogger consistently writes interesting content that many people click on to read, then the memetracker will rank any new content by that blogger higher than new content by a less interesting blogger.
  • Third, it'll filter out topics not interesting to the community. I'll use Bayesian logic to do this. Bayesian logic is also used for spam filtering. Just as spam filters learn that emails with "XXX" or "Hot chicks" are probably spam, if the community consistently clicks to read content about Drupal and not Plone, a new article about Drupal will be displayed and the Plone article won't.
  • Fourth, the memetracker will use click momentum. By this I mean it'll take a measure of how many times the content has been clicked on in the last while. If an article is being clicked on a lot, that suggests the article is more interesting to the community and should be moved up on the page.
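The third filter above can be illustrated with a tiny naive Bayes classifier. This is a sketch under assumptions, not the planned machine_learning_api: it treats clicked and ignored titles as the two classes, uses add-one smoothing, and the `InterestFilter` class is a name invented for the example.

```python
import math
import re
from collections import Counter


class InterestFilter:
    """Tiny naive Bayes classifier over two classes: "clicked" and "ignored"."""

    def __init__(self):
        self.word_counts = {"clicked": Counter(), "ignored": Counter()}
        self.doc_counts = {"clicked": 0, "ignored": 0}

    @staticmethod
    def words(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, label):
        self.word_counts[label].update(self.words(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """log P(label) plus the sum of log P(word | label), with add-one smoothing."""
        vocab = set(self.word_counts["clicked"]) | set(self.word_counts["ignored"])
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for word in self.words(text):
            count = self.word_counts[label][word]
            # The extra +1 in the denominator reserves a slot for unseen words.
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def is_interesting(self, text):
        return self.log_score(text, "clicked") > self.log_score(text, "ignored")


f = InterestFilter()
f.train("Drupal 6 theming guide", "clicked")
f.train("New Drupal module for feeds", "clicked")
f.train("Plone 3 release notes", "ignored")
f.train("Plone hosting roundup", "ignored")

print(f.is_interesting("Drupal module roundup"))  # True
print(f.is_interesting("Plone release roundup"))  # False
```

In practice the "ignored" class would have to be inferred, for example from items that were displayed but never clicked, which is an assumption the proposal does not spell out.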

But really the possibilities are endless for what can affect the memetracker's internal ranking of content for interestingness. The above are metrics I've picked up from studying Techmeme and from what I've read about memetracking in general. My plan at the moment is to create a hook which other modules can use to affect the ranking of content in any way that suits the needs of the particular community. The point again is to help online communities / individuals find web content that is most interesting to them by applying machine learning techniques which will learn the likes / dislikes of these groups and filter and group incoming content accordingly.
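The hook idea can be sketched as a simple registry of score-adjusting functions. Drupal hooks are PHP functions discovered by naming convention, so this Python version only mirrors the concept; `ranking_hook`, `rank`, and the `staff` flag are all hypothetical names.

```python
ranking_hooks = []


def ranking_hook(func):
    """Register a function that takes (item, score) and returns an adjusted score."""
    ranking_hooks.append(func)
    return func


def rank(items, base_score):
    """Score each item, let every registered hook adjust the score, and sort descending."""
    scored = []
    for item in items:
        score = base_score(item)
        for hook in ranking_hooks:
            score = hook(item, score)
        scored.append((score, item["title"]))
    return [title for score, title in sorted(scored, reverse=True)]


@ranking_hook
def boost_staff_posts(item, score):
    # Hypothetical site-specific rule: double the score of staff posts.
    return score * 2 if item.get("staff") else score


items = [
    {"title": "Community post", "clicks": 30},
    {"title": "Staff announcement", "clicks": 20, "staff": True},
]
print(rank(items, base_score=lambda item: item["clicks"]))
# ['Staff announcement', 'Community post']
```

Keeping the hooks as pure functions of an item and its current score means a site can stack several adjustments without the core ranking code knowing about any of them.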

So once the memetracker has grouped and filtered content down to the proper level, it'll pass some sort of multi-dimensional array to the theming code, which will output it to a page.
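One possible shape for that nested structure, loosely following the Techmeme layout of a main article with "discussion" and "related" sections, is sketched below. The key names and the plain-text `render` function are assumptions for illustration; the real module would hand a PHP array to Drupal's theme layer.

```python
memes = [
    {
        "headline": {"title": "Drupal 6 released", "url": "http://example.com/d6"},
        "discussion": [
            {"title": "First impressions", "url": "http://example.org/impressions"},
        ],
        "related": [
            {"title": "Upgrading your modules", "url": "http://example.net/upgrade"},
        ],
    },
]


def render(memes):
    """Flatten the nested meme structure into indented plain-text lines."""
    lines = []
    for meme in memes:
        lines.append(meme["headline"]["title"])
        for section in ("discussion", "related"):
            if meme[section]:
                lines.append("  " + section.capitalize() + ":")
                for link in meme[section]:
                    lines.append("    " + link["title"])
    return "\n".join(lines)


print(render(memes))
```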

Also, because all information generated will be stored in RDF, you'll be able to retrieve information in very sophisticated and very exciting ways. I gave some examples in my original write-up, but here's another idea that came to mind. A community website might want to preserve a historical memory. The website could have one block with a SPARQL query that retrieves the most popular discussions from three months ago and another block for those from a year ago.

OpenCalais does look very, very cool. I'll definitely experiment with the service to see if it helps achieve better results. If nothing else, all the metadata it adds will make any mashups you make from the memetrackers that much richer.

Managing News does look like a very cool app. Does DevelopmentSeed plan to sell it to companies as SaaS (or is it already in some finished state and being used)? Is there a public demo I can use to get a closer look at it?

Anyways, I hope that gives you a better idea of what I perceive memetracker becoming. Please ask any other questions you might have.

Thanks,
Kyle Mathews

Hi Kyle, (sorry for mixing

alex_b's picture

Hi Kyle,

(sorry for mixing up your second and first name before)

Thanks for the detailed response. I certainly have a better understanding now on how you're visualizing the memetracker. I am excited about the idea of making it pluggable - i. e. that you're planning

... a hook which other modules can use to affect the ranking of content ...

A meme tracking engine/machine learning engine that is independent from UI and other components can indeed be a very powerful tool to have.

As to Managing News: we do offer it as Software as a Service. ATM, we are starting our beta phase. Unfortunately, there is no public demo... if we set one up, I'll let you know.

Cheers,
Alex

This is a great writeup, and

bonobo's picture

This is a great writeup, and the conversation that has come up as a result of this proposal is (to me, anyways) an indication of the strength of this proposal.

The biggest obstacle I see now (and I point it out simply as something to be worked around, not as a reason to not go forward) is:

But really the possibilities are endless for what can affect the memetracker's internal ranking of content for interestingness.

While this is a great problem to have, it's also the best recipe I know of for scope creep :) So, as you see it now, where does this project end?

Given that the end of the SoC doesn't mean the end of the project, what will be the deliverable by the end of the SoC? The four bullets you lay out here provide a great starting point, and for the SoC, on this project, I would recommend targeting a solid framework upon which additional functionality can be built. Really, the FeedAPI provides a great model for this kind of structure.

Two additional points:

  1. Will any other web services be supported? For reasons of scope creep, I'm not necessarily sure that they should, but this should be a choice.

  2. In your writeup, you are focusing on the age of content as a prime factor in weighing the relevance of it --

more recent news is more interesting. Part of the algorithm would be to decay the interestingness of a piece of content as it grows older

While it's probably a little premature to start getting into the algorithm for determining the relevance of content, this doesn't always hold true.

Anyways -- I'm looking forward to seeing how this proposal (and hopefully project) develops -- it's a strong proposal.

Cheers,

Bill


FunnyMonkey
Tools for Teachers

These are my planned deliverables

kyle_mathews's picture

These are my planned deliverables: the machine learning algorithms, code to read/write data to RDF, a rudimentary UI for administering the memetracker, a simple UI for browsing the memes, and a few hooks to make memetracker easily extendable.

Basically, by the end of the summer, I want the memetracker to meet the needs of my most important use case, that of a community-type website: the admin can install the two modules, tell them what types of Drupal nodes to use, set the URL for the page to browse memes, add a link to the menu, and presto, they have a working memetracker.

I agree with your recommendation, Bill, that a solid foundation be set so that others (and myself) can continue to work off that. My ideas for applications should be taken more as pie-in-the-sky ideas of what the module will be able to do eventually, not what I realistically think will be working at the end of the summer. So when I say the possibilities are endless, I mean that using the hooks, a site will be able to tweak the value of different content in any way they might want to -- not that I'll be doing lots more work. One example: say a commercial social community site sold items as well as supported group discussions. They could track via web analytics which user-generated content leads to the most sales. This data they could feed into the memetracker to heavily weight in favor of content that leads to more sales.

Where I'll concentrate my work is getting the machine learning algorithms working well and internal data structures correct as these two things will be by far the most tricky / important components (in my view) to get working correctly.

About supporting other web services: My feeling is that RDF is by far the best choice for a web service. I hadn't really considered though using other web services. What use cases do you see that would require a different web service? There will, of course, be RSS feeds provided.

Regarding age of content. You are correct that the age of content is not always important in calculating its relevance. But having that as an option shouldn't be a problem. Each piece of content will be scored on a number of different measures and the scores will be stored separately. Only when you decide to display the content will you combine the different scores into one and then (if you choose) weight the results toward new content. So it'd be easy to create a page which displayed the best memes from all time.
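A small sketch of combining separately stored scores only at display time, with age weighting as an option. The weighted sum, the exponential half-life decay, and every name and number here are assumptions chosen for illustration; the proposal leaves the actual formula open.

```python
def combined_score(scores, weights, age_hours=None, half_life_hours=24.0):
    """Weighted sum of stored per-measure scores, optionally decayed by age.

    With age_hours=None the result ignores age entirely ("best of all time").
    """
    total = sum(weights[name] * value for name, value in scores.items())
    if age_hours is not None:
        # Exponential decay: the score halves every `half_life_hours`.
        total *= 0.5 ** (age_hours / half_life_hours)
    return total


# Hypothetical weights and stored scores for one piece of content.
weights = {"cluster": 2.0, "authority": 1.0, "interest": 1.0, "momentum": 1.5}
scores = {"cluster": 1.0, "authority": 0.5, "interest": 0.5, "momentum": 2.0}

print(combined_score(scores, weights))                # 6.0
print(combined_score(scores, weights, age_hours=48))  # 1.5
```

Because the per-measure scores are kept apart, the same stored data can back both a "what's hot now" page and an all-time page just by changing the arguments.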

An idea of what's possible (but I'll definitely not be writing this -- this would definitely be scope creep) is to write a module that interfaces the views module with memetracker. You could call the module smart_views or something like that. The module would take nodes output by the view and use memetracker to sort / group content as it makes sense. I'm not entirely sure how well it'd work but it's an interesting idea.

Kyle Mathews

There is already a

scor's picture

There is already a vocabulary tailored for doing exactly what you need: SIOC. You can describe people and the content they post across several sites. You should definitely have a look at it.

Very cool

kyle_mathews's picture

Very cool -- thanks for suggesting it. That looks like it might fit my needs. BTW, I'd love to have one of my mentors (assuming my application is accepted) be an expert in semantic web technologies. PHP, Drupal, machine learning, aggregation I understand pretty well but semantic web stuff is all new to me.

Kyle Mathews

We are more than 100

scor's picture

We are more than 100 researchers in the Semantic Web here in DERI Galway (http://www.deri.ie/), and it's where the SIOC project started in 2004. I'm sure we can find some ways to help you out! Contact me personally if you're interested.

Nice work!

nommo's picture

I have been playing with Drupal and social networking for a while now. I came up with a concept called meemaps which would pull in content from multiple sources. I tried installing SIOC a while ago but it killed my Drupal install somehow, so I haven't managed to give that a whirl. I think the missing element is time... the ability to track changes over time and/or flip to a snapshot of a different time frame. I couldn't help thinking of an ajaxy type mind-map visual interface that allows you to view chronologically or by other taxonomic means (tags/keywords or others). I also like the idea of an open-source community management view - the ability to slurp all your flickr/facebook/myspace/blogspot photos/buddies etc... that was my commercial angle - charging for permanent storage ;-)

I am always reassured when I see people with development skills working on ideas and concepts in the same 'noosphere' tangent as me, as I am not a developer to speak of, unfortunately! Nor do I have time to do another start-up - I am pretty risk averse these days and busy with my day job.

I have been making time recently to work on a drupal based project that aims to be a social networking space for turning ideas into projects into real-world social enterprise solutions. It's about playing with ideas - so this meme tracker would be very useful!

I look forward to hearing more,

Cool idea

kyle_mathews's picture

I like your ajaxy mind-map visual interface idea. I think that's one direction that Exhibit is attempting to go. You could think of my module as a sort of RDF preprocessor for community-generated content. With info brought in from OpenCalais and the machine learning processing thrown in as well, you'll have a very rich soup of data to work with for creating very interesting visualizations of content on any number of attributes.
Kyle Mathews

twitter pattern

SamRose's picture

I like your "twitter pattern" idea above. I think that user_relationship module comes close to the "follow/follow" function. Anyone know if the people behind http://edmodo.com have released their work (my guess is not)?

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

Edmodo

rryan's picture

No they haven't yet. I've used their product in beta and it's thought out very nicely.

Reggie

Posted revised proposal

kyle_mathews's picture

Thanks all for your critiques and suggestions. I've written a revised and expanded proposal at http://groups.drupal.org/node/10276. It includes the information above and more, all rolled into a semi-finished form. If anyone has a bit of time, I'd greatly appreciate some feedback.

Thanks!
Kyle Mathews

Subscribing

vkr11's picture

subscribing
