FeedAPI Add-on: FeedAPI keyword filter module – is such an add-on necessary?

milosh's picture

Many people ask for a way to filter feed items by keyword so that unwanted items are excluded from processing (or possibly by some other criterion, such as timestamp). This issue has been raised for most aggregator modules, and such a feature request is in the pipeline of the FeedAPI project as well:

FeedAPI:
http://drupal.org/node/160692

The usual recommendation for such a problem is to rely on standard syndication and use the Views module afterwards for filtering.

While this gives the required end result, it may not always be desirable, because it leaves a lot of feed items stored in the database that will never be used on the site. To give an example: my aggregator_item table holds 12,630 feed items (and I do clean this table up regularly), but only 18 of them are actually used on the site at the moment. So less than 0.15% of the stored feed items are actually used!

I personally would rather not store unnecessary items in the database in the first place than use the Views module to pick the necessary items out of a huge database. I would prefer a rough filtering of the feed items before they are stored in the database (possibly with Views-based fine-tuning afterwards).

Therefore my proposal is to have an official add-on module for FeedAPI that handles this keyword filtering scenario.

As this seems to be a basic feature, maybe someone else has already written such a module?

Anyway, as I haven’t (yet) found such a module, I have written a small FeedAPI add-on module (very alpha) for keyword filtering for my own purposes; it follows the developer’s suggestion to use the ‘feedapi_after_parse’ hook. (The initial code is here).
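
As a rough illustration of the idea, and not the module's actual code, the sketch below drops parsed items whose description contains a blocked keyword before anything is stored. It is written in Python purely for readability (FeedAPI itself is PHP), and the function name `filter_after_parse`, the item layout and the sample data are all assumptions made for the example.

```python
def filter_after_parse(items, blocked_keywords):
    """Return only the items whose description contains none of the keywords.

    Hypothetical stand-in for work done in an after-parse hook; `items` is
    assumed to be a list of dicts with a 'description' key.
    """
    blocked = [kw.lower() for kw in blocked_keywords]
    kept = []
    for item in items:
        text = item.get("description", "").lower()
        # Case-insensitive substring match against every blocked keyword.
        if not any(kw in text for kw in blocked):
            kept.append(item)
    return kept


# Made-up sample data standing in for a freshly parsed feed.
parsed_items = [
    {"title": "Council meeting", "description": "Budget vote on Tuesday"},
    {"title": "Match report", "description": "Football results from the weekend"},
]
to_store = filter_after_parse(parsed_items, ["football"])
```

Only the items left in `to_store` would go on to be saved, so rejected items never reach the database.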

If the community feels such a module is necessary (and if no such tool is available already), I am willing to take my initial script further (or contribute to an existing tool, if one exists). If such a module is not desirable, I will keep this code as my own custom module and won’t upload it, so as not to confuse others.

More specifically about the proposal: if the module is wanted, I would rework my initial code into something more general, so that people could store and reuse filters and apply several pre-stored filters to a feed simultaneously, in weighted order. Probably (but not certainly) I would generalize the code even further to provide a hook_filter_feed(), so that filtering could be carried out not only by keywords but by any imaginable criterion that can be applied to a feed item (like timestamps, see here). On the other hand, it seems reasonable to me to limit the module’s duties to basic after_parse filtering only, because most item modifications can be carried out by FeedAPI processors.
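
To make the "several filters in weighted order" idea concrete, here is a minimal sketch in Python (for illustration only; FeedAPI is PHP). The callback signature, the `weight`/`callback`/`settings` keys and both example filters are assumptions, not the proposed hook's real interface.

```python
from datetime import datetime, timedelta


def keyword_filter(item, settings):
    """Keep the item only if its description contains none of the keywords."""
    text = item.get("description", "").lower()
    return not any(kw.lower() in text for kw in settings["keywords"])


def max_age_filter(item, settings):
    """Keep the item only if it is no older than the configured maximum age."""
    return settings["now"] - item["timestamp"] <= settings["max_age"]


def apply_filters(items, configured_filters):
    """Run every configured filter on each item, lowest weight first."""
    ordered = sorted(configured_filters, key=lambda f: f["weight"])
    kept = []
    for item in items:
        # all() stops at the first filter that rejects the item.
        if all(f["callback"](item, f["settings"]) for f in ordered):
            kept.append(item)
    return kept


now = datetime(2008, 6, 1, 12, 0)
filters = [
    {"weight": 5, "callback": max_age_filter,
     "settings": {"now": now, "max_age": timedelta(days=7)}},
    {"weight": -15, "callback": keyword_filter,
     "settings": {"keywords": ["football"]}},
]
items = [
    {"description": "Budget vote", "timestamp": now - timedelta(days=1)},
    {"description": "Football results", "timestamp": now - timedelta(days=1)},
    {"description": "Old budget vote", "timestamp": now - timedelta(days=30)},
]
kept = apply_filters(items, filters)
```

Because evaluation stops at the first rejecting filter, giving cheap filters a lower weight means the more expensive ones run on fewer items.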

What are people’s thoughts on the subject?

Comments

Great that you stepped

alex_b's picture

Great that you stepped forward with a solution for a frequently asked for feature. Is there any reason why you could not publish the feedapi add on module from http://drupal.org/node/160692#comment-830849 as an independent project? Much like Feed element mapper (http://www.drupal.org/project/feedapi_mapper) is an independent add on module for feedapi.

As for extensibility: I'd say that's important. Introducing a hook_filter_feed() sounds like the way to go.

Alex

Is there any reason why you

milosh's picture

Is there any reason why you could not publish the feedapi add on module from http://drupal.org/node/160692#comment-830849 as an independent project?

No, there is no reason not to do that. :)

However, I would like to hear people's opinions on whether such a module is necessary before rushing into it -- my full-time job is not related to IT at all, and starting a new Drupal project, i.e. becoming a project maintainer, would mean taking on a huge responsibility.

+1 for this module

Boris Mann's picture

Sounds great, and making it a separate add on module is the right way to go. Rather than "just" keyword filtering, it could be the basis for pre-filtering in general.

Let's get this module started and then we can figure out other features from there :P

Working on beta

milosh's picture

Just to note that I have written a beta of the module.

The filtering is implemented by hook_itemfilter(), and several different types of filters can be run on top of each other. Currently, though, there is only a keyword filter, which runs on the [description] field.

I am currently testing the module and will hopefully upload it soon :)

The module is uploaded to Drupal

milosh's picture

The module is uploaded to Drupal.

See it here: http://drupal.org/project/feedapi_itemfilter

Commenting some questions

milosh's picture

I have copied a question from http://drupal.org/node/160692, as the discussion seems more relevant here:


eyecon - May 28, 2008 - 18:17

This looks like a great idea. I have a few questions or concerns. These will help me to make some decisions.

Currently, I am filtering with an external script that runs after each cron run is completed. All feed items are unpublished. The script publishes the items that match. I am passing RegEx directly to mysql. I have update turned off. My assumption is that mysql is faster than php.

The downside is obvious. With a large number of feeds, I am storing a large quantity of unpublished items that are of no interest. The upside (I think) is that I am only downloading the items once. Would I be correct that the filtering module will continue to download, parse and filter the same unwanted items until they expire from the feed?

Have you tested this against a large number of feeds to determine if it will cause the cron run to time out?

Good questions! :)

For the first part, the honest answer is that I don't know whether MySQL is faster than PHP. But I would agree with you that MySQL is probably faster. My own problem was that I have a very specific interest in the news, so only about 0.15% of all the feed items were shown on the site. This caused the MySQL database to grow, and personally I didn't like it -- it was quite painful to download the database for backup purposes. But this was a personal dislike of unnecessary items in the database rather than an efficiency-based decision.

For the second part -- the feed is always downloaded and parsed during the cron job (actually even when the feed is just modified without refreshing the items). The FeedAPI Item Filter module does not require additional parsing by itself; it just runs immediately after the feed is parsed. So in both cases all the items will be downloaded by FeedAPI until they expire. But it is true that with the FeedAPI Item Filter module there is an additional filtering routine on the items every time the feed is parsed, and this requires additional resources. Assuming that your custom script also runs every time the cron job runs, in my view this part of the question reduces to the first question -- which is more efficient, MySQL or PHP.

(Edit: BTW, what did you mean by "update turned off"? If this turns off FeedAPI's feed parsing, then it will probably turn the FeedAPI Item Filter off for updates as well. I'll try to find out; how "update off" affects FeedAPI needs a bit of debugging. But the point is taken -- maybe hook_after_parse() is not a good place for running filters on feed items. I might suggest a better place for hooking filters into the FeedAPI module after some testing.)

For the third question -- the module has coped pretty well on my site with 12 feeds, but heavy load remains to be tested.
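
For comparison, the store-everything-then-publish approach from the question can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, registers its own REGEXP function because SQLite does not ship one, and the `feed_item` table and its columns are invented for the example.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's `value REGEXP pattern` operator with Python's re module."""
    return value is not None and re.search(pattern, value, re.IGNORECASE) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    "CREATE TABLE feed_item ("
    "id INTEGER PRIMARY KEY, description TEXT, published INTEGER DEFAULT 0)"
)
# Every downloaded item is stored unpublished, whether it is wanted or not.
conn.executemany(
    "INSERT INTO feed_item (description) VALUES (?)",
    [("Budget vote on Tuesday",), ("Football results",), ("School budget cuts",)],
)

# After the cron run: publish only the rows that match the pattern.
conn.execute(
    "UPDATE feed_item SET published = 1 WHERE description REGEXP ?",
    (r"\bbudget\b",),
)

published = conn.execute(
    "SELECT COUNT(*) FROM feed_item WHERE published = 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM feed_item").fetchone()[0]
```

All rows stay in the table and only the publish flag differs, which is the storage cost discussed above. Filtering before storage avoids that cost but repeats the filtering on every parse.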

ag_prasanna's picture

Can you please let us know, as we are using 6.x and this would be very helpful to us.
Thanks

I also need a keyword filter for D6.X

lameei's picture

I also need a keyword filter for D6.X. Would somebody please help me?

how to use views to filter?

lameei's picture

milosh@drupal.org

You have written that:

"The usual recommendation for such a problem is to rely on standard syndication and use the Views module afterwards for filtering."

I need to filter RSS feeds right now and I can't wait for the module to come out. Would you please tell me how I can filter them using Views?

Views option

Matt V.'s picture

This is probably too late to be of much help, but the idea of using Views as a filter would mean you'd let FeedAPI create nodes for each feed item, but then use the filtering functionality of the Views module to only display the nodes you want viewable by end users.

The advantage I see of filtering up front, at import time, is that you save database space by not storing feed item nodes that you don't plan on displaying anyway.

spawinter's picture

Hi,

Thanks for providing this great news filtering module. I have not understood why there are 2 submodules when I install "feedapi_itemfilter-6.x-1.1":
1) FeedAPI items filter.
2) FeedAPI keyword filter.

Well, the second submodule is clear: it allows keywords to be entered in the feed's setup. But how is the first submodule supposed to operate? Do I need to install some additional modules for it to work?

A second question: could you please explain what exactly the purpose of "weight" is (Control the execution order. Filters with lower weights are called before filters with higher weights.)? It now shows -15, but I do not see any list of choices to assign this "weight" to.

Thanks.

Does feedApi filter

fabrizioprocopio's picture

Does the FeedAPI filter module
work with the Feeds module too?

Need this module for Feeds

kiev1.org's picture

please

keyword filter module for Feeds

gtkakega's picture

Does anyone know if a similar keyword filter module exists for the new Feeds module? I want to be able to filter out feed items based on keyword. For example, I want to set up an RSS feed and only process feed items containing the keyword 'education' in the feed item's title or body.
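
The matching logic asked for here, keeping only items that mention a keyword in the title or body, could look roughly like the sketch below. It is not a Feeds plugin, just the logic in Python for illustration. The function name `matches_keyword`, the `title`/`body` keys and the sample items are assumptions.

```python
import re


def matches_keyword(item, keyword, fields=("title", "body")):
    """Return True if the keyword appears as a whole word in any listed field."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return any(pattern.search(item.get(field, "")) for field in fields)


# Made-up sample items standing in for a parsed RSS feed.
feed_items = [
    {"title": "Education budget approved", "body": "The council voted today."},
    {"title": "Weather", "body": "Rain expected all week."},
    {"title": "New program", "body": "A grant for adult education was announced."},
]
wanted = [item for item in feed_items if matches_keyword(item, "education")]
```

Matching on word boundaries keeps the filter from firing on longer words that merely contain the keyword.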