Recommender Module Performance Enhancement & Drupal for Data-intensive Computing

danithaca's picture

The Proposal

This student proposal has two parts. The first part is to enhance performance for the recommender modules via Apache Mahout integration.

The Recommender API module and its helper modules were developed as a GSoC 2009 project. These modules enable Drupal sites to provide content recommendation services based on users' browsing history, Fivestar ratings, product purchasing history, and so on, similar to what http://amazon.com offers. However, I have received a lot of feedback from users about the performance and scalability of those modules: for sites with >1k nodes or users, the modules won't work (using 4GB RAM, running for >10 hours). This is simply not acceptable, because the sites with lots of nodes/users are the ones that need the recommendation service the most.

The performance issue is due to the following reasons:

  • The recommendation algorithm involves complex matrix computation, which by nature requires lots of CPU/RAM. For those who are interested in the algorithm, please read "Amazon.com recommendations: item-to-item collaborative filtering". (Note: there are other algorithms that require fewer resources, e.g., the Slope One algorithm, but none of those are used by Amazon, Netflix, Pandora, iTunes Genius, Facebook friend recommendations, etc.)
  • Recommender API was implemented in PHP, which is not optimized for high performance matrix computation.
  • Recommender API runs locally with the Drupal web server, which usually has limited RAM (<2GB) and CPU time.
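To make the cost concrete, here is a minimal sketch of item-to-item collaborative filtering with cosine similarity. It is written in plain Python for brevity (the proposal itself targets Java and Mahout), and the function name and toy ratings are illustrative assumptions, not the Recommender API's actual code.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(ratings):
    """Compute cosine similarity between pairs of items.

    `ratings` maps user -> {item: score}. Returns {(item_a, item_b): sim}
    with item_a < item_b, only for pairs co-rated by at least one user.
    """
    dot = defaultdict(float)   # sum of score products per item pair
    norm = defaultdict(float)  # sum of squared scores per item
    for user_ratings in ratings.values():
        items = sorted(user_ratings)
        for i, a in enumerate(items):
            norm[a] += user_ratings[a] ** 2
            # The nested loop over item pairs is what makes this expensive.
            for b in items[i + 1:]:
                dot[(a, b)] += user_ratings[a] * user_ratings[b]
    return {
        (a, b): d / (sqrt(norm[a]) * sqrt(norm[b]))
        for (a, b), d in dot.items()
    }


# Toy data (hypothetical): three users rating three nodes.
ratings = {
    "alice": {"node1": 5, "node2": 3},
    "bob": {"node1": 4, "node2": 4, "node3": 1},
    "carol": {"node2": 2, "node3": 5},
}
sims = item_similarities(ratings)
```

The work grows with the number of item pairs each user has rated, so in the worst case it is quadratic in the number of items, which is consistent with the RAM/CPU pressure described in the list.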

To fix the performance issue, the general idea is to outsource the recommendation computation to another program (written in Java/Python/C) running on a local or remote machine. I evaluated three approaches:

  • Outsourcing only the matrix computation part to a 3rd party program, such as Matlab, R, Octave, NumPy, or Java.
  • Outsourcing all recommendation computation to Apache Mahout using database direct access (via JDBC/JPA).
  • Outsourcing all recommendation computation to Apache Mahout via REST/WebServices.

Most users and I favor the 2nd approach. Specifically, it consists of these sub-tasks:

  • I'll write a Java program that uses Apache Mahout to do the recommendation computation. The Java program can run either on the local Drupal server or on a remote computer with better CPU/RAM capacity. It uses JDBC and the Java Persistence API (JPA) to directly access the required Drupal database tables, and should work with most JDBC-compliant databases.
  • I'll also write a Drupal module so that users can issue commands to the Java program through the Drupal interface, and then the Java program will pick up those commands and execute accordingly.
  • All the nitty-gritty communication between Drupal and the Java program is handled by Recommender API, and the helper modules (Browsing History Recommender, Fivestar Recommender, etc) just use Recommender API to calculate the recommendations.
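One way to picture the command hand-off described in these sub-tasks is a shared database table that Drupal writes to and the worker polls. The sketch below is only an illustration under stated assumptions: it is in Python rather than Java, it uses an in-memory SQLite database as a stand-in for the Drupal database, and the `recommender_command` table, its columns, and the function names are all hypothetical.

```python
import sqlite3


def create_command_table(conn):
    # Hypothetical schema: the proposal does not specify the real table layout.
    conn.execute(
        "CREATE TABLE recommender_command ("
        "id INTEGER PRIMARY KEY, command TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )


def issue_command(conn, command):
    """What the Drupal module would do: queue a command for the worker."""
    conn.execute("INSERT INTO recommender_command (command) VALUES (?)", (command,))
    conn.commit()


def run_pending(conn, handlers):
    """One polling pass of the worker: run each pending command in id order."""
    rows = conn.execute(
        "SELECT id, command FROM recommender_command "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for command_id, command in rows:
        handler = handlers.get(command)
        if handler is None:
            status = "failed"  # unknown command name
        else:
            handler()
            status = "done"
        conn.execute(
            "UPDATE recommender_command SET status = ? WHERE id = ?",
            (status, command_id),
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_command_table(conn)
issue_command(conn, "compute_recommendations")
executed = []
processed = run_pending(
    conn, {"compute_recommendations": lambda: executed.append("ran")}
)
```

A long-running daemon would call `run_pending` in a loop with a sleep between passes; the handler is where the actual recommendation computation would go.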

End users of the Recommender modules need to do the following:

  • First, install the Recommender API module and helper modules on the Drupal server.
  • Second, install the Recommender-Java program and Apache Mahout library on either the local Drupal server or a remote computer, and configure database access so that it can read certain Drupal database tables.
  • Third, run the Recommender-Java program as a daemon service.
  • Finally, from Drupal, site admins can issue commands to the Recommender-Java program to compute recommendations, which are written back to the Drupal database for display in the Drupal site.

In addition to the performance enhancement, I'll also work on two highly requested issues: adding Views support, and taking "accesslog" as input for the Browsing History Recommender module.

The proposed work targets Drupal 7, with a Drupal 6 backport.

Extension

The second part of the proposal is to extend the "Apache Mahout integration" idea to a broader usage scenario -- "Drupal for data-intensive computing".

Drupal is awesome for website building, but not so great at data-intensive computing. Computing recommendations is one example. Other examples could be:

  • For e-commerce sites, to analyze historical sales data and make predictions on future sales.
  • For news sites, to use machine learning algorithms to categorize news articles.
  • For biology research sites, to use scientific programs (Matlab, SPSS, etc.) to analyze gene data in the Drupal database.

In all these scenarios, Drupal is used to organize and display information, and 3rd party software/script is used for data-intensive computing. What's missing is a framework to help the 3rd party software exchange data with Drupal.

I have proposed such a framework in the first part of the proposal. Here I would generalize that framework so that it can be used not only with Apache Mahout, but also with other programs (Matlab, R, NumPy, etc.). Specifically, I propose to do the following tasks:

  • I'll write a generic Java class that does the following:
    1) establish connection pooling to the Drupal database via JDBC
    2) data object mapping (via Java Persistence API, so that developers can use nodes, users, etc as objects)
    3) "paging" for slow connections
    4) multi-threading
    5) run as "daemon" service
    6) output CSV file to be fed into Matlab, R, SPSS, etc.
  • Developers can write Java/Jython/JRuby scripts for Drupal data-intensive computing tasks simply by sub-classing that generic Java class, or output CSV files to work with Matlab, R, SPSS, etc.
  • I'll write a Drupal module so that site admins can issue commands to the data-intensive computing programs within the Drupal interface.
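The sub-classing idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it is in Python rather than the proposed Java, it uses SQLite in place of a JDBC connection, and the class names, the `page_size` attribute, and the two-column `node` table are hypothetical simplifications. Connection pooling, JPA object mapping, multi-threading, and the daemon mode are left out.

```python
import csv
import io
import sqlite3


class DrupalDataTask:
    """Hypothetical base class: pages through a table and hands rows to a subclass."""

    table = "node"   # table to read; subclasses may override
    key = "nid"      # column used for a stable ordering between pages
    page_size = 2    # rows fetched per query ("paging")

    def __init__(self, conn):
        self.conn = conn

    def pages(self):
        """Yield lists of rows, one page at a time, using LIMIT/OFFSET."""
        offset = 0
        while True:
            rows = self.conn.execute(
                f"SELECT * FROM {self.table} ORDER BY {self.key} LIMIT ? OFFSET ?",
                (self.page_size, offset),
            ).fetchall()
            if not rows:
                return
            yield rows
            offset += len(rows)

    def process(self, rows):
        raise NotImplementedError("subclasses implement the computation")

    def run(self):
        for rows in self.pages():
            self.process(rows)


class CsvExport(DrupalDataTask):
    """Example subclass: write every row as CSV for Matlab, R, SPSS, etc."""

    def __init__(self, conn, out):
        super().__init__(conn)
        self.writer = csv.writer(out)

    def process(self, rows):
        self.writer.writerows(rows)


# Simplified stand-in for the Drupal node table (the real one has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
out = io.StringIO()
CsvExport(conn, out).run()
```

A developer would only override `process` (and optionally `table`, `key`, and `page_size`); the base class takes care of reading the data in pages.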

Contribution to the Drupal community

For the first part of the proposal, "Recommender modules performance enhancement", I have received many requests from users, so I know it would add value to the Drupal community. My hope is to give Drupal a recommender system that is competitive with similar services offered by Amazon, Netflix, etc.

For the second part of the proposal, "Drupal for data-intensive computing", I know there are some initiatives within the community to build machine learning and artificial intelligence capabilities into Drupal, e.g., D.A.I.L., Machine Learning API, etc. Again, performance is a big issue if they are implemented in PHP. It also doesn't make much sense to rewrite lots of algorithms in PHP, given that many of them are already implemented in Java/Python/Matlab/R. I think the most viable solution is to write a framework, as I proposed, that makes it easy to use 3rd party programs/scripts for data-intensive computation. I hope this framework would facilitate more innovation in data-intensive computation with Drupal, such as sales prediction, text categorization, etc.

About me

I'm a PhD student at the University of Michigan School of Information. My expertise is in recommender systems and machine learning algorithms. I served/will serve as a Program Committee member for the ACM Recommender System Conference 2010 and 2011. It is my goal to build a cutting-edge recommender system into Drupal.

In terms of Drupal involvement, I participated in GSoC 2009 to develop the initial version of the Recommender modules. I have also collaborated with the Drupal.org infrastructure team and implemented the "Related projects" page on drupal.org module pages. My Drupal account is at http://drupal.org/user/112233

Time line

  • May 24 - 31 - Get familiar with the Apache Mahout library, set up the development environment
  • June 1 - 21 - Develop the Java program for recommendation computation with Mahout
  • June 21 - July 12 - Develop the Drupal module that issues commands to the Java program; add Views support; take "accesslog" as input for the Browsing History Recommender module.
  • July 12 - submit midterm
  • July 16 - August 2 - Generalize the "Drupal for data-intensive computing" framework to go beyond Apache Mahout
  • August 3 - August 16 - D6 back port.
  • August 17 - August 20 - polish up and submit final report

Comments

Support for Proposal

aruna.kulatunga's picture

The solution proposed by Daniel is very much an "immediate" need for Drupal sites with customer recommendation functionality. We will be willing to co-support this proposal.

Warm regards,

Aruna Kulatunga
Comunicamos.EU

Great proposal

jolos's picture

I hope this gets accepted. This would indeed be a great addition. I have a bit of experience with Mahout and recommendations: I'm currently using it for my master's thesis, though only the distributed part, which is based on Hadoop. It's certainly one of the best open source recommender engines.

Some thoughts about the proposal itself:
1. I think it's important to keep in mind that not all websites will use MySQL to store the data. Larger sites will often use something like MongoDB (and larger sites are probably more likely to use recommendations), so it should be fairly easy to swap the type of data storage.
2. I would only focus on Drupal 7. Even without the Drupal 6 backport this already looks like quite a bit of work.

Eager to see how this turns out!

Drupal 6 backport

aruna.kulatunga's picture

I agree on point 1. However, for the module to be accepted at D.O, it will need MySQL at a minimum. On point 2, many large sites are still on Drupal 5.0 if not 6.0. Unlike for smaller sites and newcomers, the cost of upgrading from 6.0, especially if you have custom code, is prohibitive. Our own sites will not migrate until next year, when Drupal 7 is more stable and the roadmap for Drupal 8 is clear. For us, a Drupal 6 backport will be essential.

mysql is not a minimum. If

greggles's picture

MySQL is not a minimum. If you think it is, show some documentation to that effect. It would be great if MySQL were supported to gain wide adoption, of course.

I think GSOC should focus exclusively on Drupal 7 this summer. It is the current main focus of development for everyone and we should support that.

thanks for the suggestions.

danithaca's picture

@jolos, aruna.kulatunga, greggles: Thanks for the suggestions. I'll use the Java Persistence API to access any JDBC-compliant database. I'll make a personal commitment to do the D6 backport. If I'm not able to get it done during the summer, I'll continue to work on it after GSoC.
Thanks.

Great Proposal!

angusmccloud's picture

I am 110% in support of this expansion, as it would fix the exact issue our site is currently faced with (computations for 1500 users and 4k nodes currently take about 5 hours, and that is growing exponentially).

We've come up with a few hack-around solutions, but this is what we've truly been looking for.

The one other thing I'd like to see added (which I can continue to create my own code for if necessary) is "incremental updates", or at least "predictions for new users". I don't want to have to rebuild the whole table to get predictions for a new user.

Again -- I'm 100% behind this project!
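As a rough illustration of the "predictions for new users" idea: if item-to-item similarities are already stored, a new user's unseen items can be scored from their own ratings without rebuilding the similarity table. The sketch below is in Python, and the function name, the data layout, and the sample similarities are hypothetical assumptions, not the module's actual storage format.

```python
def recommend_for_new_user(user_ratings, item_sims, top_n=3):
    """Score unseen items from a stored item-item similarity table.

    `user_ratings` maps item -> score for the new user.
    `item_sims` maps (item_a, item_b) -> similarity, stored once per pair.
    Returns up to `top_n` (item, predicted_score) pairs, best first.
    """
    totals = {}
    weights = {}
    for (a, b), sim in item_sims.items():
        # Use each stored pair in both directions: rated item -> candidate item.
        for rated, candidate in ((a, b), (b, a)):
            if rated in user_ratings and candidate not in user_ratings:
                totals[candidate] = totals.get(candidate, 0.0) + sim * user_ratings[rated]
                weights[candidate] = weights.get(candidate, 0.0) + abs(sim)
    # Similarity-weighted average of the user's ratings for each candidate.
    scores = {
        item: totals[item] / weights[item] for item in totals if weights[item] > 0
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical precomputed similarities and a brand-new user's ratings.
item_sims = {
    ("node1", "node2"): 0.9,
    ("node1", "node3"): 0.1,
    ("node2", "node3"): 0.5,
    ("node2", "node4"): 0.8,
}
recs = recommend_for_new_user({"node1": 5, "node2": 1}, item_sims)
```

Only the new user's row is computed here; the similarity table itself would still need a periodic full or incremental rebuild as ratings accumulate.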

Great Proposal

g089h515r806's picture

I totally support this proposal.
Recommender API will be another great module like Apachesolr integration module if we can solve the performance issue.
There are many drupal sites, many commerce drupal sites that need a better recommender.

My Drupal blog: Think in Drupal

Full Support!

nikkubhai's picture

I fully support this proposal.
Enhancing performance for the recommender modules via Apache Mahout integration will be of great use.

This looks very promising. I

scott falconer's picture

This looks very promising. I support the proposal.

Great Proposal - my full support

zuriwest's picture

Very promising. I fully support the proposal.

I both support this and am

coderintherye's picture

I both support this and am willing to commit time to review code or provide mentorship if needed.

Drupal evangelist.
www.CoderintheRye.com

Thanks nowarninglabel! I

danithaca's picture

Thanks nowarninglabel! I don't have a mentor now. It'll be great if you agree to be my mentor for this proposal!

This sounds like a great

laura s's picture

This sounds like a great project. (It also overlaps quite well with the goals of the CRE idea I posted last week.)

Laura Scott
PINGV | Strategy • Design • Drupal Development

Recommender engine in PHP / use Batch API

allisterbeharry's picture

I know one common perception about interpreted languages like PHP is slow performance, but have you tested your assumptions about your algorithm's performance under PHP, say by writing a C version and benchmarking it against the PHP version? Could it be that your algorithm itself needs optimization, and that its performance does not depend so much on the underlying language? I have limited knowledge about this, but AFAIK when running pure data algorithms PHP and Python can/should be just as fast as compiled code. It's primarily when doing I/O that interpreted languages, which can't talk to devices at a system level, are at a disadvantage. The one reason why a Java implementation should be faster than a PHP implementation of an algorithm would, I guess, be the 'hotspot' optimizations the JVM does, but couldn't you accomplish the same thing in PHP using an opcode cache like APC? If someone knows about this stuff they can correct me.

At any rate what about writing a PHP extension which implements your algorithm in C that can be called from your module? The advantage is that you don't require Java to be running anywhere nor are you tied to any particular database. You could then step from a PHP extension that runs in-process to an external daemon that wraps the library that can run on a separate machine. So if somebody really needs to scale out they can run the recommender engine on a separate server - this is the approach Solr and other search engines take AFAIK.

So implementing the algorithm as a C library that can be called in-process as a PHP extension and wrapped by a daemon that can be called across the network can allow your engine to scale but then it's not really a Drupal project totally.

Instead of venturing into another language/platform, since recommendations need not be generated in real time, you could keep your implementation in PHP and refactor the module to run using the new D7 API for long-running tasks. (I've never used the Batch API, so forgive my ignorance if you're already using it.) The engine could then be scheduled to run overnight to process the day's transactions.

Thanks for the comment.

danithaca's picture

Thanks for the comment. Recommender API currently uses the Batch API and Drush, but I received user feedback saying that it took >10 hours to run and finally ended up getting killed by the system. So the Batch API and Drush don't solve the problem.

Using a C implementation is an interesting idea. Four considerations, though:

  1. Even with a C implementation, the algorithm still takes lots of RAM/CPU, like any other algorithm based on matrix computation. In Mahout, it'll require >2G RAM. At Amazon.com, they used clusters to do this type of work. Considering that many Drupal sites are hosted on VPS/shared hosting with RAM constraints, it might be better to offload the computation to another server (or even a desktop). Besides, Mahout supports Hadoop clusters, so a Mahout-based solution could potentially run on clusters too.

  2. I'm a little reluctant to reinvent the wheel, considering there is open source code in Mahout already.

  3. The second part of my proposal would help other 3rd party scripts/programs use data in Drupal for data-intensive computing beyond calculating recommendations. A C extension would be limited to computing recommendations.

  4. Writing a C extension would introduce platform dependency.

For those reasons, I still prefer the "Mahout integration" approach.

Great proposal

Eli Baskin's picture

This is a very good proposal and I fully support it. It will help establish Drupal as the number 1 solution for e-commerce websites.

One point, though: it is important to make sure this module works for Drupal 6, since most e-commerce websites run on Drupal 6.

The GSoC this summer is

danithaca's picture

The GSoC this summer is targeted to D7. So it has to be D7. But I'll take D6 back port as a personal commitment, and will make sure it works for D6 at some point (if not this summer).

E-commerce systems on Drupal 6

alippai's picture

There are no E-Commerce systems on Drupal 6. Ubercart and E-Commerce must be a joke for a real player, who uses Mahout. Let's concentrate on Drupal Commerce integration for D7, it's just more mature and not f*cked up from the ground :P

Mahout rocks. :)

I fully support this proposal

ann_meredith's picture

I fully support this proposal and look forward to the D7 integration.

Is this project happening?

angusmccloud's picture

Dan,

Just curious if this project is indeed happening over the summer? My group is hoping it is since we want to be able to keep using this module -- but if it's not that's not the end of the world and we'll start our hunt for a new recommendation algorithm.

Thanks!
-Connor

Google Summer of Code 2011
