QueryPath - just the job for scraping

Posted by budda on July 11, 2009 at 5:05pm

If you're still scraping content from other sites by fetching a page over HTTP and picking it apart with a mixture of regular expressions and string searches, then you should check out the QueryPath library!

With a bit of fiddling I've managed to scrape forum posts and extract usernames, dates and content in a small number of lines, without any complex regex.
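To give a rough idea of what that looks like, here is a minimal sketch rather than the code I actually used. It assumes a recent QueryPath installed through Composer, and the CSS classes (.post, .username, .date, .content) and the scrape_posts() name are made up for illustration, so swap in whatever the forum you're scraping really uses.

```php
<?php
// Assumption: QueryPath was installed with Composer (querypath/querypath).
require 'vendor/autoload.php';

/**
 * Pull the username, date and body out of every forum post in a page.
 *
 * The selectors below are hypothetical; inspect the target forum's
 * markup and adjust them to match.
 */
function scrape_posts($html) {
  $posts = array();

  // htmlqp() is QueryPath's entry point for real-world, often invalid HTML.
  foreach (htmlqp($html, 'div.post') as $post) {
    // branch() searches from a copy, so $post keeps pointing at the post.
    $posts[] = array(
      'username' => trim($post->branch('.username')->text()),
      'date'     => trim($post->branch('.date')->text()),
      'content'  => trim($post->branch('.content')->text()),
    );
  }

  return $posts;
}
```

You would call it with something like scrape_posts(file_get_contents($url)), where $url is the forum page you want to read.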

There's a handy getting-started tutorial by the QueryPath author published over at the IBM developerWorks site.

Once you've extracted your values into PHP variables you can use drupal_execute() to create nodes from your fresh content, or generate an RSS feed from the data outside of Drupal, which is what I've been doing.
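For the RSS route, something along these lines should do. It's a sketch using PHP's built-in DOM extension, not my exact script. The build_rss() name and the shape of the $posts array (username, date, content keys) are assumptions for the example, and the date is assumed to be in a format strtotime() understands.

```php
<?php
/**
 * Build an RSS 2.0 feed from an array of scraped posts.
 *
 * Each post is assumed to be an array with 'username', 'date' and
 * 'content' keys (a hypothetical structure for this example).
 */
function build_rss(array $posts, $title, $link) {
  $doc = new DOMDocument('1.0', 'UTF-8');
  $doc->formatOutput = true;

  $rss = $doc->appendChild($doc->createElement('rss'));
  $rss->setAttribute('version', '2.0');
  $channel = $rss->appendChild($doc->createElement('channel'));

  // Adds <name>value</name> under $parent; createTextNode() handles
  // escaping of &, < and > in the scraped text.
  $add = function ($parent, $name, $value) use ($doc) {
    $el = $parent->appendChild($doc->createElement($name));
    $el->appendChild($doc->createTextNode($value));
  };

  // The three elements every RSS 2.0 channel must have.
  $add($channel, 'title', $title);
  $add($channel, 'link', $link);
  $add($channel, 'description', 'Posts scraped from ' . $link);

  foreach ($posts as $post) {
    $item = $channel->appendChild($doc->createElement('item'));
    $add($item, 'title', 'Post by ' . $post['username']);
    $add($item, 'description', $post['content']);
    // Assumes the scraped date string is parseable by strtotime().
    $add($item, 'pubDate', date(DATE_RSS, strtotime($post['date'])));
  }

  return $doc->saveXML();
}
```

Print the returned string with an application/rss+xml Content-Type header and feed readers should pick it up.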

It can also be a great way to migrate a site if you don't have access to the database behind it. Just scrape away!
