New memetracker built using Drupal / OpenCalais

Posted by kyle_mathews on July 1, 2008 at 6:22pm

I ran across a new memetracker built on Drupal/OpenCalais and custom code. It looks very cool. Check it out at Polymeme.com

They're tracking some 25,000 blogs and apparently running them all through OpenCalais as part of the process for finding "memes," along with some human editorial touch.

The creator of polymeme writes more about the site on his blog:
http://evgenymorozov.com/blog/?p=397

BTW, if anyone involved with polymeme is around, I'd invite you to head over to the Memetracking and Content Recommendation group and help us build memetracking technology everyone can use. It looks like you'll have a lot of great ideas.

Comments

Looks cool. I wonder how he

Posted by Doktor.Science on July 1, 2008 at 7:30pm

Looks cool. I wonder how he built it.

Hello everybody! Site is

Posted by meramo on July 1, 2008 at 9:40pm

Hello everybody!
The site is based on a custom script that aggregates feeds from thousands of blogs (added both manually and automatically); it also discovers previously unknown blogs and adds them to the database. Custom algorithms determine which memes are most popular (these algorithms are fairly traditional: the number of links pointing to the article, and so on).
Then Drupal comes on stage. Feed API imports the popular memes, and then they go through OpenCalais, which automatically adds tags using the Reuters database.
Most of the feeds are relevant enough, but some editorial work is still needed for now. For example, memes in Chinese do not meet our information needs :) In the future we plan to avoid editorial work completely. Finally, Node Queue is used by editors to publish memes.
This is very schematic, but that is how it is :)
We'll write a more detailed study later. For now, feel free to ask questions :)
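
For readers following along, the "number of links pointing to the article" idea can be sketched roughly as below. This is only an illustration in Python, not the Polymeme script (which is separate, custom code). The function names, the URL normalization rule, and the input shape (each aggregated post paired with the links found in it) are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial variants count as one article.

    Hypothetical rule: drop the scheme, a leading "www.", the query string,
    and any trailing slash.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def rank_by_inbound_links(posts, top_n=10):
    """Rank articles by how many distinct posts link to them.

    posts: iterable of (post_url, [urls linked from that post]).
    Returns a list of (article, inbound_link_count), most linked first.
    """
    counts = Counter()
    for post_url, links in posts:
        source = normalize(post_url)
        # A post linking to the same article twice counts once,
        # and a post linking to itself is ignored.
        targets = {normalize(link) for link in links} - {source}
        counts.update(targets)
    return counts.most_common(top_n)
```

Calling `rank_by_inbound_links(posts)` on one crawl's worth of posts gives a candidate list that editors (or a later scoring step) could work from.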

Hello, MeRamo, Thanks for

Posted by bonobo on July 1, 2008 at 10:53pm

Hello, MeRamo,

Thanks for joining in here!

The site looks great -- I'll be looking through it in more detail, and then ~~flooding this thread with questions~~ asking a question or two :)

Cheers,

Bill

FunnyMonkey
Tools for Teachers


First, the site looks great!

Posted by Doktor.Science on July 2, 2008 at 12:34am

First, the site looks great! Are you using Panels?

Second, so you don't use FeedAPI to pull in the content, aggregate it, and find the memes within Drupal? Does your original aggregator work faster or better than Drupal? It sounds like you are using two databases: the first for the original aggregation and meme finding, and the other is Drupal's database?

feedapi vs. custom

Posted by glebon on July 2, 2008 at 10:53am

The meme-finding script is custom; FeedAPI seemed like overkill for that. We'll be improving the algorithm constantly, and it just made sense to keep everything in a small, lean, extensible script while we experiment with it. FeedAPI comes in a bit later, importing the results from that script into our main site's editing queue.

We have been following the meme-tracking discussion closely, but since this is our first serious effort at this, we haven't contributed there yet. We'll post more info about our findings and observations there later.

P.S. We've also submitted an announcement to Digg; all friendly diggs would be appreciated :)

Does the editor determine

Posted by Doktor.Science on July 2, 2008 at 12:40am

Does the editor determine which link should be placed at the top? For instance, at http://polymeme.com/node/44723, the top link goes to a particular blog. Does the editor decide that this blog belongs at the top, or is it there because it was the most "popular," with the most incoming links on the topic?

yes, we track incoming links

Posted by glebon on July 2, 2008 at 9:28am

yes, we track incoming links to individual blog posts.

Thanks for responses! Yes,

Posted by meramo on July 2, 2008 at 9:07am

Thanks for the responses!
Yes, we are using Panels almost everywhere: for the front page, section pages, sidebar, etc. And Mini Panels, of course, for tabs.
Second, yes, we are not using Drupal for finding memes, just for publishing and analyzing with OpenCalais.
And I think our "technical guys" will explain our script and algorithms in more detail :) I'm mostly just answering for the Drupal part.

more questions. . .

Posted by kyle_mathews on July 2, 2008 at 7:14pm

First off, I'm very impressed with what y'all have created. When Memetracker is more mature, I hope there will be 1000s of sites very similar to yours covering every topical area under the sun.

How do you scrape content from pages you don't pull in via a feed? For example, if a blog links to another post, how do you go and grab the content? It's easy to get the title, but the content, it seems, would require a custom screen scraper for each site. The use case I'm imagining is, say, a memetracker tracking the conversations of 200 or so members of an organization. One writer links to a post on an external site and discusses the post. Ideally, memetracker would grab that post and place it as the headline within the meme, and the internal post would be listed as part of the discussion.
I'd love more details about how you find memes! Do you only use interlinking to find memes, or do you use some other method in addition for calculating how "close" different posts are to each other? I'm using the fulltext search built into MySQL, as suggested by FiReaNG3L, which works very well.
What language is your other script written in? PHP? I can see some advantages to offloading aggregation and meme finding: a script can run as a daemon, constantly running in the background, much like what Development Seed is doing with managing news as described here. Plus, moving aggregation off to a different database would reduce clutter in the Drupal database. With 25,000 blogs tracked, you must be pulling in 50-100 thousand posts a day!
How do you determine the interestingness of memes? Figuring this out is my next major task now that the clustering of content is (more or less) working. I'd actually love to read the script you've written. I've thought of a few methods, such as counting incoming links but also using a popularity score for the source, i.e. how many times in the past have articles from this source been clicked on? Also something I'm calling click momentum: what percentage of clicks did this meme receive since the last cron run? So if the articles that make up a meme received 15% of the clicks since the last cron run, the meme would get a score of 1. Memes with 7.5% would get a score of 0.5, and so on.

These different interestingness scores would be weighted by the admin to meet different needs.
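
To make the click momentum idea concrete, here is a rough Python sketch. It assumes one reading of the 15% / 7.5% example: each meme's share of clicks is scaled against the meme with the largest share, so the top meme scores 1 and a meme with half its share scores 0.5. The function names, the signal names, and the weighted-average combination are illustrative assumptions, not existing Memetracker code.

```python
def click_momentum(clicks_by_meme):
    """Score each meme by its share of clicks since the last cron run.

    Assumption: shares are scaled so the meme with the largest share
    scores 1.0 (e.g. 15% -> 1.0 and 7.5% -> 0.5 when 15% is the top share).
    """
    total = sum(clicks_by_meme.values())
    if total == 0:
        return {meme: 0.0 for meme in clicks_by_meme}
    shares = {meme: clicks / total for meme, clicks in clicks_by_meme.items()}
    top = max(shares.values())
    return {meme: share / top for meme, share in shares.items()}


def interestingness(signals, weights):
    """Combine several per-meme signals into one score.

    signals: {signal_name: {meme: score between 0 and 1}}
    weights: {signal_name: admin-chosen weight}
    Returns {meme: weighted average of its signal scores}; a meme missing
    from a signal counts as 0 for that signal.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    memes = set()
    for scores in signals.values():
        memes.update(scores)
    return {
        meme: sum(
            weight * signals.get(name, {}).get(meme, 0.0)
            for name, weight in weights.items()
        ) / total_weight
        for meme in memes
    }
```

An admin who cares mostly about links could pass something like `weights={"momentum": 1, "inbound_links": 3}`, while a site that wants to surface fast-rising items would weight momentum more heavily.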

Anxiously awaiting answers / full writeup,
Kyle Mathews


Hi, thanks for all the

Posted by glebon on July 9, 2008 at 11:43am

Hi, thanks for all the questions, and sorry about the radio silence - we're going through the first days of the site's public existence and you know how it is - fixing and preventing problems around the clock :)

First off, the site is not yet a general-purpose meme-tracker - it is a hybrid thing, where editors may pick and choose stories placed on the front page. We are not focused on meme-tracking (i.e. tracking the most linked-to articles and posts) per se - those may tend to be a result of hype and not objective "interestingness" in the areas we're focused on. So we identify the potentially most interesting items through the number of Google Blog Search links and del.icio.us bookmarks (both are easy to get through their APIs), and then editors choose the items that merit attention.

Also, Drupal is used mainly for presentation purposes and editorial tasks - distributing posts between sections, blocks, panes, scheduling publishing, etc. We use nodequeues, which work really well with our panels setup, and a custom CCK field called "publish_date" for scheduled publishing (the views have a filter to show only nodes which have a "publish_date" in the past). The Media embed module came in handy for managing Flickr image display. We've reworked the "taxonomy browser" module to provide the "mypolymeme" functionality - it's not an ideal solution yet, and we will update it soon. The "related links" display is done through a custom module too, which happened to be based on a WordPress plugin, FirstRss.
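
The scheduled-publishing filter amounts to a simple rule: show a node only once its "publish_date" has passed. On the site this is done with a Views filter, not with code like the following; the Python sketch below only illustrates the rule, and the dictionary shape of a node and the function name are assumptions.

```python
from datetime import datetime, timezone


def visible_nodes(nodes, now=None):
    """Return the nodes whose publish_date is set and not in the future.

    nodes: list of dicts with a timezone-aware "publish_date" (or None
    for items that have not been scheduled yet). Order is preserved.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [
        node for node in nodes
        if node.get("publish_date") is not None and node["publish_date"] <= now
    ]
```

With this rule an editor can queue a story in advance, and it simply starts appearing in listings once its date arrives.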

Second, the criteria aren't set in stone yet: we're experimenting, and it seems the criteria have to be different for each topic. Law blogs differ from economics or science blogs in the way people link to their favorite pieces, so a degree of flexibility has to be built in for the script to be effective - we're still working out the best algorithm for this.

And yes, we used PHP for the meme-tracking script, which is completely separate from Drupal.

What we will do next is release the code for both the meme-tracking script and the publishing part of the site as a package (a Drupal distribution). We will be more than happy if people can reuse and improve our code - although it may be a bit specific to this site's purposes. In any case, we are preparing a full write-up - it should be ready in the next week or so.
Meanwhile, I'll be posting updates on our team blog - http://feeds.feedburner.com/polymemeteamblog.