Memetracker architecture

kyle_mathews

V1 of UML Drawing
V2 of UML Drawing - added color to Drupal specific code, added database access layer, added content_array parameter to memetracker->get_memes()

Memetracker will be built as a PHP library, and the Drupal memetracker module will integrate with it. This is so memetrackers can be trivially ported to other PHP CMSs such as WordPress or Joomla, or even serve as the basis for a standalone memetracker by integrating the library with an RSS library such as SimplePie. By building the memetracker as a standalone library, memetracker will attract a much larger audience of developers, and it'll be easier to write tests for and to maintain.

How the memetracker module will use the memetracker library:

On cron, the memetracker module calls check_new_content(). That function runs through all its content_sources, calling get_content() on each. Each content_source knows how to check for new Drupal content, whether that is internal nodes, new content brought in by feedapi, or content from some other source. For each piece of content created since the last cron run, a new content object is created and stored.
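A minimal sketch of the cron flow described above, written in Python only for brevity (the library itself is planned in PHP). The names check_new_content() and get_content() come from the design; the Content fields, the StaticSource stand-in, and the `since`/`now` timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Content:
    content_id: int
    title: str
    created: int  # Unix timestamp (assumed field)


class ContentSource(Protocol):
    def get_content(self, since: int) -> list[Content]:
        """Return content created after `since`."""
        ...


class StaticSource:
    """Hypothetical stand-in for a source such as Drupal nodes or feedapi items."""

    def __init__(self, items: list[Content]) -> None:
        self.items = items

    def get_content(self, since: int) -> list[Content]:
        return [c for c in self.items if c.created > since]


class Memetracker:
    def __init__(self, sources: list[ContentSource]) -> None:
        self.sources = sources
        self.stored: list[Content] = []
        self.last_run = 0

    def check_new_content(self, now: int) -> int:
        """Poll every source, store what is new, and return the number added."""
        added = 0
        for source in self.sources:
            for content in source.get_content(self.last_run):
                self.stored.append(content)
                added += 1
        # Assumes no content carries a timestamp later than `now`.
        self.last_run = now
        return added
```

Each cron run would then call check_new_content() once with the current time; only content newer than the previous run is stored.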

When the memetracker module wants to refresh its memebrowsing page, it calls memetracker->get_memes(), which in turn calls get_memes() in the classifier, passing it an array of content objects. The classifier uses various machine learning algorithms to identify memes and rate their "interestingness." The classifier's get_memes() method returns an array of memes. Memetracker then passes this array of memes to view code, which generates a new memebrowsing page that is cached and displayed until the next time the memes are refreshed.
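The refresh path could look roughly like the following sketch (Python for brevity, not the planned PHP). get_memes() is the method named in the design; TitleLengthClassifier is a deliberately silly placeholder rather than a real machine learning algorithm, and MemeBrowsingPage is a hypothetical stand-in for the view code and its cache.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Content:
    content_id: int
    title: str


@dataclass
class Meme:
    headline: Content
    interestingness: float


class Classifier(Protocol):
    def get_memes(self, content_array: list[Content]) -> list[Meme]: ...


class TitleLengthClassifier:
    """Placeholder: scores by title length so the plumbing can be exercised."""

    def get_memes(self, content_array: list[Content]) -> list[Meme]:
        memes = [Meme(c, float(len(c.title))) for c in content_array]
        return sorted(memes, key=lambda m: m.interestingness, reverse=True)


class Memetracker:
    def __init__(self, classifier: Classifier) -> None:
        self.classifier = classifier

    def get_memes(self, content_array: list[Content]) -> list[Meme]:
        # The library only delegates; swapping the classifier swaps the algorithm.
        return self.classifier.get_memes(content_array)


class MemeBrowsingPage:
    """Hypothetical view layer: renders once, then serves the cached page."""

    def __init__(self, tracker: Memetracker) -> None:
        self.tracker = tracker
        self.cached = ""

    def refresh(self, content_array: list[Content]) -> str:
        memes = self.tracker.get_memes(content_array)
        self.cached = "\n".join(m.headline.title for m in memes)
        return self.cached
```

Because Memetracker depends only on the Classifier interface, a different algorithm can be dropped in without touching the view code.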

I'm also looking at integrating memetracker with nodequeue, as suggested by catch. I've only briefly looked at nodequeue, but it seems that creating a new smartqueue module for memetracker might be a good plan.

Machine learning algorithms will be given sane defaults so most users won't need to change anything. For advanced users, the machine_learning module will provide an admin screen to tweak the machine learning settings.

Please discuss. What weaknesses/strengths do you see with my making memetracker a library not a native Drupal module? What other questions do you have? I'd particularly like to hear your thoughts about integrating memetracker with nodequeue.

Comments

Regarding drupal module

jgraham

Regarding Drupal module versus PHP library: I think that, as a long-term goal, having a separate library for the machine learning bits is a good plan. Given that this is a GSoC project, I think writing it as one standalone module and then abstracting out a library later may be a more solid approach. This will help to ensure that the project doesn't get bogged down in abstraction details and produces something functional by the end. As long as you code with this in mind, I don't see breaking out the machine learning bits at a later date as an issue.

Alternatively, keeping the machine learning portions abstracted from the start can help to clarify the interface to the machine learning bits. This may help the long-term extensibility of the machine learning portions of the project. The use of multiple different algorithms for calculating memes will be useful and allow for much experimentation. Having this abstracted and exposed early may help. The machine learning interface will need appropriate versioning to ensure clear migration paths from older to newer versions. If you do approach it as a standalone library for PHP, are there any needs for internal data storage? By integrating now and separating later you can avoid some of these issues. Or have I misinterpreted the diagram/design, e.g., does the machine learning bit need to store any data, or is that left to the Drupal module?

Balancing the short-term versus long-term benefits of both approaches needs careful consideration. A very clear and well-documented API for the machine learning portions is critical to long-term success and to the quality of the memes returned, so I understand the interest in abstracting early.

Have you identified potential algorithms or methods for the machine learning bits yet? Have you (re)searched whether there are existing machine learning APIs or code that would be useful for this portion? I would suggest identifying one or two algorithms and moving forward with those. I think the machine learning algorithm portions will be the final decider regarding the quality of the meme tracker, but the algorithms need a solid architecture to sit on top of if this project as a whole is going to be successful. Changing an algorithm, or adding new ones later, should be as easy as possible. Regardless of whether it is one Drupal module or a Drupal module and a PHP library, I would strongly suggest identifying or implementing only one algorithm (initially) and ensuring the architecture is solid and ready for implementing different algorithms at a later date. Said differently, a solid working architecture is more important now, and in the long term, than the initial quality of the machine learning algorithms.

get_content() needs to have a well-abstracted, modular interface to allow for "drop in" content types or methods later. There are several other modules that are abstracted well; these should be used as guides.

Just to clarify the diagram above, everything except for "<> machine_learning_api" is provided via the proposed Drupal module, correct?

Can you explain a bit more how "meme" and "<> content" interact or will be used?

As long as get_memes() has a well-defined object and/or return type, the abstraction process should be fairly straightforward, i.e., as long as your input data structure to get_memes() is well defined. Keep in mind that different algorithms may need different bits of data.

It seems that nodequeue is a great way to implement/present the memes. However, the presentation layer, time permitting, should be abstracted to a pluggable layer.

I hope that all made sense.

Since nodequeues are exposed

catch

Since nodequeues are exposed to Views, I reckon this handles making the presentation pluggable by itself, and almost for free in terms of developing the project. Via Views 2, you'll get memes as RDF, widgets etc. via the other extant GSoC projects :)

Allowing the flexibility to have memes be pushed into a simple block or some future implementation down the line would probably be good too though.

nodequeue

kyle_mathews

Yeah, the more I think about it, nodequeues seem the way to go. They seem the simplest (and most logical) way to push memes into Views (and from there all your Drupal memetracker dreams will come true).

I have a few concerns about nodequeue -- is it possible to create a meme queue which holds the headline and then two subqueues, one for discussion nodes and one for related nodes?

Also, how flexible is nodequeue theming? Would it be possible to create an interface similar to techmeme.com using nodequeues?

For testing, I'm just going to create a very simple view that'll spit out the memes in a simple manner. But down the road when I start refining the presentation layer, answers to the questions above will be important.

Kyle Mathews

responses

kyle_mathews

Thanks for the great feedback Jeff. I'll run through your questions.

First, I didn't explain this originally (fixed now -- added a note in the diagram), but blue is Drupal-specific code and tan is the PHP Memetracker Library. I've changed colors around some to clarify things. When a developer ports the library to another CMS, they'll need to write new implementations of the interfaces that are specific to their CMS. So the Drupal memetracker module instantiates the memetracker class, then uses methods there to generate memes. Once the memetracker class returns an array of memes to the Drupal memetracker module, the module figures out how to display it.

The machine learning classes and content classes will need to store data. I've added a database_access_layer interface. I haven't thought too much about database access, so I'm not really sure what's the best way to proceed there. Any ideas?

I like your idea of focusing on the architecture first then algorithms later. That seems like a wise plan.

I have looked at a number of algorithms. Two simple algorithms I could get up and running quickly are naive Bayes and the clustering algorithm FiReaNG3L implemented in his Eureka! project (FiReaNG3L, do you think I could borrow your code as a basis for my clustering code?). Using those two algorithms, I'd have something resembling a memetracker -- close enough in any case to work out kinks in the architecture.

On the difference between the meme class and the content interface:
Classes that implement content_source create content objects. So content_source_drupal_nodes knows how to detect when new Drupal nodes are created and creates new content objects for them. A meme is composed of a headline, content that discusses the headline article (i.e., articles that link to the headline), and content that is related to the headline (i.e., articles that discuss the same topic as the headline but don't link directly to the headline article). Look at http://techmeme.com -- I borrowed my terminology from there.

So when an array of content objects is passed into the get_memes() method and a new meme is created, the content object that is selected as the headline of a meme is stored in the meme along with the content objects that discuss the headline or are related to the headline.
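As a rough sketch of that composition (Python for brevity, not the planned PHP): a meme holds one headline plus two lists of content objects. The `url` and `links` fields and the build_meme() helper are assumptions for illustration; deciding which articles share a topic is left to the classifier and is simply taken as input here.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    content_id: int
    url: str
    links: list[str] = field(default_factory=list)  # outbound links (assumed field)


@dataclass
class Meme:
    headline: Content
    discussion: list[Content] = field(default_factory=list)  # links to the headline
    related: list[Content] = field(default_factory=list)     # same topic, no link


def build_meme(headline: Content, same_topic: list[Content]) -> Meme:
    """Split on-topic content into discussion and related, per the definitions above."""
    meme = Meme(headline)
    for content in same_topic:
        if content.content_id == headline.content_id:
            continue  # the headline is not its own discussion
        if headline.url in content.links:
            meme.discussion.append(content)
        else:
            meme.related.append(content)
    return meme
```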

Kyle Mathews

"The machine learning

jgraham

"The machine learning classes and content classes will need to store data. I've added a database_access_layer interface. I haven't thought too much about database access, so I'm not really sure what's the best way to proceed there. Any ideas?"

This is partly why I suggested writing it all as a Drupal module first and then, at a later date, looking at abstracting the library functions to be usable in other CMS platforms. This avoids some technical details that should probably be placed out of scope for this iteration of the project.

The diagram key helps clarify things considerably. What is the text that is jumbled up supposed to say?

Regarding the get_memes() method, have you considered metadata that the requesting code may be interested in but that has little or no value to the meme AI stuff? For instance, the Drupal module may want the nids returned, but those will have little, if any, value to the memetracker. If I'm understanding your explanation correctly: Drupal will hand off an array of nodes, content objects will be created reflecting these nodes, and a new meme will be created using the content objects. This has not addressed the situation where the Drupal module will want to point back to the originating node. Will content objects store metadata that the originating CMS may want associated with them? Or am I missing something?

I think one way around the

jgraham

I think one way around the storage issues is the following:

You have two separate Drupal projects, machinelearningapi and memetracker.

The machinelearningapi module can be a reference implementation of the machine learning PHP library. Being a Drupal module, it can handle the storage routines within Drupal. This will provide an abstracted interface to the memetracker Drupal module that allows the backend PHP library to change while keeping the Drupal-facing interface(s) constant and keeping storage details isolated to the appropriate locations. By itself, the machinelearningapi module will not actually do anything other than install some tables.

Hope that helps.

content_id = node id

kyle_mathews

When a content object is created, it'll just use the node id as its content id. Another CMS that implements the Memetracker library would use its equivalent of Drupal's node id when creating content objects. But when Drupal gets a content object, it doesn't necessarily have to know what node the content object is associated with. Each content object has a get_title() method, a get_body() method, etc. The advantage of this is that you could write an implementation of content_source and content for non-Drupal systems that would still integrate seamlessly into Drupal. For a Drupal intranet, for example, you could write a class that pulls in documents from another system as they are created. The Drupal memetracker module wouldn't know the difference.
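To make that concrete, here is a small sketch (Python for brevity, not the planned PHP). get_title() and get_body() are from the design; get_id(), the two implementing classes, the plain dict standing in for a node, and teaser() are hypothetical names used only for illustration.

```python
from typing import Protocol, Union


class Content(Protocol):
    def get_id(self) -> Union[int, str]: ...
    def get_title(self) -> str: ...
    def get_body(self) -> str: ...


class DrupalNodeContent:
    """Wraps a node-like dict; the content id is simply the node id."""

    def __init__(self, node: dict) -> None:
        self.node = node

    def get_id(self) -> int:
        return self.node["nid"]

    def get_title(self) -> str:
        return self.node["title"]

    def get_body(self) -> str:
        return self.node["body"]


class ExternalDocumentContent:
    """Content pulled from some other system, e.g. an intranet document store."""

    def __init__(self, doc_id: str, title: str, text: str) -> None:
        self.doc_id, self.title, self.text = doc_id, title, text

    def get_id(self) -> str:
        return self.doc_id

    def get_title(self) -> str:
        return self.title

    def get_body(self) -> str:
        return self.text


def teaser(content: Content, length: int = 20) -> str:
    # Calling code sees only the interface, never the backing system.
    return f"{content.get_title()}: {content.get_body()[:length]}"
```

The id also answers the point about linking back: a Drupal implementation can use get_id() to find the originating node, while the library itself never needs to interpret it.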

The jumbled text isn't really anything. I'm using a UML editor on Linux that only sorta works. Intelligent spacing of elements apparently hasn't been implemented yet. The jumbled text is the labels for the lines from the meme class to the content interface. A meme stores its content as content objects, hence the connecting lines indicating composition.

Kyle Mathews

Just adding a link to your

jgraham

Just adding a link to your application so there is enough background information. GSoC Application

Using existing tools

jasonwhat

This sounds amazing, very exciting project, and these are some great first steps.

I'm wondering about the decision to develop your own PHP library vs. using existing tools. For example, could Yahoo! Pipes or AidRSS do some of the heavy lifting? What about geocoding? Does it happen in Drupal, in the PHP library, or somewhere else? Also, have you looked at the method being used for Managing News? Here is a good post about their Python daemon.

Existing tools don't build

kyle_mathews

Existing tools don't build memes. Yahoo Pipes and AidRSS do cool stuff but they don't implement the machine learning algorithms necessary to create and rank memes from content.

I'm thinking I might need to build some sort of daemon to process things. Downloading feeds and creating memes will place a big load on the server, and I'm afraid (as the Development Seed folks wrote) that cron might croak with all it'll have to do each run. With a daemon, things can be processed constantly in the background.
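A very rough sketch of the idea (in Python, like the Managing News daemon mentioned above, though nothing here is taken from their code): instead of doing everything in one cron run, a worker drains a job queue in small batches. run_worker(), its parameters, and the use of an in-memory deque are all assumptions for illustration; a real daemon would loop indefinitely and read jobs from persistent storage.

```python
import time
from collections import deque
from typing import Callable, Optional


def run_worker(
    jobs: deque,
    handle: Callable[[str], None],
    batch_size: int = 2,
    max_batches: Optional[int] = None,
    pause: float = 0.0,
) -> int:
    """Process queued jobs in small batches and return how many were handled."""
    done = 0
    batches = 0
    while jobs and (max_batches is None or batches < max_batches):
        for _ in range(min(batch_size, len(jobs))):
            handle(jobs.popleft())  # e.g. download one feed
            done += 1
        batches += 1
        time.sleep(pause)  # yield between batches to spread the load
    return done
```

With max_batches left as None the loop runs until the queue is empty; capping it, as cron would have to, leaves the remaining jobs for the next pass.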

I haven't looked at geocoding yet. At first thought, though, it doesn't seem too relevant to most memetracking tasks. It might be a nice add-on at some point.

Back to your point about leveraging existing tools. I thought the other day about using Technorati (or one of the other news-aggregating sites) to do the hard work of finding new content about the topic of a particular memetracker. If a memetracker was following, say, real estate news, you could do a quick search at Google News for "real estate" and add the RSS feed from that search to your memetracker.

Kyle Mathews
