Web Scraping


Welcome! This group is devoted to "web scraping", which I will define as capturing data out of more static web pages. This group is secondarily concerned with importing any scraped data into Drupal.

PHP Web Scraping Libraries

Posted by lolandese on February 29, 2024 at 2:12pm
Last updated by lolandese on Fri, 2024-03-01 09:45

When writing custom or contrib modules aimed at web scraping, it makes the most sense to use a library rather than reinventing the wheel. Furthermore, it is advisable to first write a generic module that handles scraping requests and can map the results into fields of a specific content type, much as the Feeds module does. Targeting a specific site with specific selectors could then build on that, either through a UI or in a separate module's code.
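The "generic first, site-specific second" idea can be sketched in a few lines. The example below is only an illustration in Python using the standard library, not Drupal or Feeds code: the class name FieldExtractor, the function scrape, and the CSS class names in the mapping are all hypothetical. The generic part knows nothing about any site; the site-specific part is just the mapping passed in.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Generic part: collect text from elements whose CSS class is mapped to a field."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field  # site-specific mapping (hypothetical names)
        self.fields = {}
        self._open = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next(
            (self.class_to_field[c] for c in classes if c in self.class_to_field),
            None,
        )
        self._open.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every mapped element that is currently open.
        for field in self._open:
            if field is not None:
                self.fields[field] = self.fields.get(field, "") + data


def scrape(html, class_to_field):
    """Return {field name: whitespace-normalized text} for one page."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


# Site-specific part: only this mapping changes from one target site to another.
page = (
    '<div><h1 class="headline">Hello <em>World</em></h1>'
    '<div class="story"><p>Full text here.</p></div></div>'
)
print(scrape(page, {"headline": "title", "story": "body"}))
```

In a real module the mapping would come from configuration (a UI or a separate module), and the returned dictionary would be handed to whatever creates the node.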

Looking for a tech co-founder/cofounder for social impact project

Posted by socialnicheguru on October 31, 2022 at 11:45am

Hi,

I am reaching out to see if there are any front or back end devs who love Drupal as much as me who might want to team up.

I have developed a platform but need a tech co-founder who wants to help build a social impact platform.

Drop me a note and let's talk more.

I am in San Francisco, Boston, and Atlanta.

You can be anywhere on the planet if the fit is right.

Social Niche Guru

Good source for web scraping help?

Posted by speretz on November 7, 2013 at 6:07pm

I currently have a hosted Drupal site, and somehow they are able to take an RSS feed and pull it into Drupal as a post, but with the entire contents of the news story instead of just the syndicated content. Any clues how this is happening, or can anyone point me in the right direction? The page is at www.marimoninc.com/newsroom and the feed source is http://www.usa.canon.com/cusa/pressReleaseRss.action, but it goes through FeedBurner first.

cupcake.js

Posted by mikejabber on March 31, 2013 at 5:22pm

A new, efficient JavaScript library for managing HTML5 Web Storage is out now. It is called cupcake.js and comes with a rich set of features. You can check it out at: http://www.rivindu.com/p/cupcakejs.html

Example Web Scraper - Feature

Posted by mitchell on February 15, 2011 at 7:36pm

An example web scraper is now available. It ties together the modules that twistor and I released recently in a (hopefully) easy-to-understand demonstration. The whole suite is similar to SIMILE's Piggy Bank and Solvent in its workflow and modular architecture, in that users develop their queries in their browser and then run their configurations in a web service. It's entirely Drupal-based (except for browser add-ons), so it should be an improvement on writing custom scripts that upload data to Drupal.

QueryPath - just the job for scraping

Posted by budda on July 11, 2009 at 5:05pm

If you're still scraping content from other sites using a mixture of regular expressions and string searches on an HTTP page load, then you should check out the QueryPath library!

With a bit of fiddling I've managed to scrape forum posts and extract usernames, dates and content in a small number of lines without any complex regex.
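To give a feel for the selector-based approach, here is a rough sketch. It is not QueryPath and does not use its API; it is a Python stand-in built on the standard library's ElementTree, which only handles well-formed markup. The markup, class names, and the function name extract_posts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum markup. Real pages usually need a more
# forgiving HTML parser than ElementTree.
PAGE = """<div id="forum">
  <div class="post">
    <span class="author">alice</span>
    <span class="date">2009-07-10</span>
    <div class="content">First post body.</div>
  </div>
  <div class="post">
    <span class="author">bob</span>
    <span class="date">2009-07-11</span>
    <div class="content">A reply.</div>
  </div>
</div>"""


def extract_posts(markup):
    """Select each post by path expression and read its fields, no regex needed."""
    root = ET.fromstring(markup)
    posts = []
    for post in root.findall(".//div[@class='post']"):
        posts.append(
            {
                "author": post.findtext("span[@class='author']"),
                "date": post.findtext("span[@class='date']"),
                "content": post.findtext("div[@class='content']"),
            }
        )
    return posts


for post in extract_posts(PAGE):
    print(post["author"], post["date"], post["content"])
```

The point is the shape of the code: one selector per piece of data, instead of one fragile regular expression per piece of data.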

Web Scraper Recommendations - Extract Mailing Address Data?

Posted by EvanDonovan on February 18, 2009 at 6:42pm

We at UrbanMinistry.org would like to create a directory of church mailing addresses, with Google Maps/Earth capability. To facilitate this, we're looking for a Web scraping program that would extract mailing address data from Web sites. Does anyone have any suggestions?

Import an event email into a content type

Posted by socialnicheguru on December 22, 2008 at 12:00am

I would like to find a way to have an event email sent to me via a specific email address and have it upload to a specific content type, myEvent.

It would be great if the event date, time, and place could be automatically extracted from the email.

Has anyone done anything like this?

Chris
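One possible shape for the extraction step, as a minimal sketch only: it assumes a single-part plain-text email whose body contains labeled lines such as "Date:", "Time:" and "Place:", which real event emails may well not have. The function name parse_event_email and the sample message are hypothetical, and the code is Python rather than a Drupal module.

```python
import re
from email import message_from_string

# Labels we assume appear at the start of a line in the body (an assumption).
LABELS = ("date", "time", "place")


def parse_event_email(raw):
    """Turn a raw plain-text email into a dict of event fields."""
    msg = message_from_string(raw)
    # For a non-multipart message get_payload() returns the body as a string;
    # multipart messages would need to be walked part by part instead.
    body = msg.get_payload()
    event = {"title": msg["Subject"]}
    for label in LABELS:
        match = re.search(
            rf"^{label}:[ \t]*(.+)$", body, flags=re.IGNORECASE | re.MULTILINE
        )
        event[label] = match.group(1).strip() if match else None
    return event


RAW = (
    "From: organizer@example.com\n"
    "Subject: Drupal meetup\n"
    "\n"
    "Date: 2009-01-15\n"
    "Time: 18:30\n"
    "Place: Community Hall\n"
)
print(parse_event_email(RAW))
```

The resulting dictionary is what would be mapped onto the fields of the myEvent content type; fields that cannot be found come back as None so they can be filled in by hand.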

Module for mass importing an existing site?

Posted by nedjo on July 12, 2006 at 5:39am

I've been musing about the possibility of building a module for mass importing contents from existing sites, as an aid for migration to Drupal. I came across a potential model, an import utility for the CMS SAPID, http://sapid.sourceforge.net/en/doc/import/, code at http://prdownloads.sourceforge.net/sapid/migrator.tar.gz?download. The approach combines web crawling with regular expression-based data extraction. Users input the wrapping code patterns to search for, e.g., main content might be wrapped in and or maybe a div with a given class name.
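The extraction half of that approach might look roughly like the sketch below. This is not SAPID's migrator code; it is a Python illustration in which the user-supplied wrapping patterns are treated as literal start and end markers. The function names extract_between and extract_fields, and the markers in the example, are hypothetical. Crawling is left out.

```python
import re


def extract_between(html, start_marker, end_marker):
    """Return the text between the first start marker and the next end marker."""
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker),
        re.DOTALL,  # let the content span multiple lines
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


def extract_fields(html, wrappers):
    """Apply one (start, end) marker pair per field to a fetched page."""
    return {
        field: extract_between(html, start, end)
        for field, (start, end) in wrappers.items()
    }


# Hypothetical page and user-supplied wrapping patterns.
page = (
    '<html><body><div class="nav">menu</div>'
    '<div class="main">Body text</div><!-- end main -->'
    "</body></html>"
)
wrappers = {"body": ('<div class="main">', "</div><!-- end main -->")}
print(extract_fields(page, wrappers))
```

Escaping the markers keeps the user from having to write regular expressions by hand, at the cost of breaking whenever the surrounding markup changes even slightly.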

MIT's Piggy Bank, Solvent for web scraping

Posted by dado on July 9, 2006 at 6:31pm

Has anyone seen these products from MIT's Simile project?

Piggy Bank - "Piggy Bank is an extension to the Firefox Web browser that turns it into a “Semantic Web browser,” letting you make use of existing information on the Web in more useful and flexible ways not offered by the original Web sites."
Solvent - "Solvent is a Firefox extension that helps you write Javascript screen scrapers for Piggy Bank."

I have only begun to check this out but it is looking very cool.
