Welcome! This group is devoted to "web scraping", which I will define as capturing data from mostly static web pages. This group is secondarily concerned with importing any scraped data into Drupal.
PHP Web Scraping Libraries
Last updated by lolandese on Fri, 2024-03-01 09:45
When writing custom or contrib modules aimed at web scraping, it makes the most sense to use a library rather than reinvent the wheel. Furthermore, it is advisable to first write a generic module that handles scraping requests and can map the results into the fields of a specific content type, much as the Feeds module does. Targeting a specific site with specific selectors could then build on that, either through a UI or in a separate module's code.
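As a rough illustration of the "generic mapping first, site-specific selectors second" idea, here is a minimal sketch in Python using only the standard library. It is not Drupal or Feeds code: the field names, the tag/class pairs in FIELD_MAP, and the helper scrape_fields are all hypothetical, and a real module would more likely rely on a proper selector library.

```python
from html.parser import HTMLParser

# Hypothetical site-specific mapping: field name -> (tag, CSS class).
# The generic scraper below knows nothing about any particular site.
FIELD_MAP = {
    "title": ("h1", "post-title"),
    "author": ("span", "username"),
    "body": ("div", "post-body"),
}

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldScraper(HTMLParser):
    """Collect the text of the first element matching each (tag, class) pair."""

    def __init__(self, field_map):
        super().__init__()
        self.field_map = field_map
        self.fields = {}
        self._current = None  # name of the field being captured, if any
        self._depth = 0       # nesting depth inside the captured element
        self._chunks = []     # text pieces seen inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.field_map.items():
            if field not in self.fields and tag == want_tag and want_class in classes:
                self._current, self._depth, self._chunks = field, 1, []
                return

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the pieces and collapse runs of whitespace.
            self.fields[self._current] = " ".join(" ".join(self._chunks).split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def scrape_fields(html, field_map=FIELD_MAP):
    """Return a dict of field name -> extracted text for one page."""
    parser = FieldScraper(field_map)
    parser.feed(html)
    parser.close()
    return parser.fields
```

Targeting a different site would then only mean supplying a different mapping, for example scrape_fields(page, {"title": ("h2", "headline")}), while the generic part stays untouched.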
Looking for a tech co-founder/cofounder for social impact project
Hi,
I am reaching out to see if there are any front or back end devs who love Drupal as much as me who might want to team up.
I have developed a platform but need a tech co-founder who wants to help on a social impact platform
Drop me a note and let's talk more.
I am in San Francisco, Boston, and Atlanta.
You can be anywhere on the planet if the fit is right.
Social Niche Guru
Good source for web scraping help?
I currently have a hosted Drupal site, and somehow they are able to take an RSS feed and pull it into Drupal as a post containing the entire news story instead of just the syndicated content. Any clues as to how this is happening, or can anyone point me in the right direction? The page is at www.marimoninc.com/newsroom and the feed source is http://www.usa.canon.com/cusa/pressReleaseRss.action but it goes through feedburner first.
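One plausible way to get this effect (an assumption, not a description of what the hosting provider actually does) is to read each item's link from the feed and then scrape the full story from the linked page. The first half of that can be sketched in Python with the standard library; the function name item_links is hypothetical.

```python
import xml.etree.ElementTree as ET


def item_links(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items that carry no link to the full story
            pairs.append((title, link))
    return pairs
```

Each returned link would then be fetched and the story body extracted from the page, which is the actual scraping step and depends on the markup of the target site.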
cupcake.js
A new, efficient JavaScript library for managing HTML5 Web Storage is out now. It is called cupcake.js and comes with a rich set of features. You can check it out at: http://www.rivindu.com/p/cupcakejs.html
Example Web Scraper - Feature
An example web scraper is now available. It ties together the modules that twistor and I released recently in an easy to understand (hopefully) demonstration. The whole suite is similar to SIMILE's Piggy Bank and Solvent workflow and modular architecture, in that users develop their queries in their browser and then run their configurations in a web service. It's entirely Drupal based (except for browser addons), so it should be an improvement on writing custom scripts that upload data to Drupal.
QueryPath - just the job for scraping
If you're still scraping content from other sites using a mixture of regular expressions and string searches on an HTTP page load, then you should check out the QueryPath library!
With a bit of fiddling I've managed to scrape forum posts and extract usernames, dates and content in just a few lines, without any complex regex.
Web Scraper Recommendations - Extract Mailing Address Data?
We at UrbanMinistry.org would like to create a directory of church mailing addresses, with Google Maps/Earth capability. To facilitate this, we're looking for a Web scraping program that would extract mailing address data from Web sites. Does anyone have any suggestions?
Import an event email into a content type
I would like to find a way to have an event email sent to me via a specific email address and have it upload to a specific content type, myEvent.
It would be great if the event date, time and place could be automatically extracted from the email.
Has anyone done anything like this?
Chris
Module for mass importing an existing site?
I've been musing about the possibility of building a module for mass importing contents from existing sites, as an aid for migration to Drupal. I came across a potential model, an import utility for the CMS SAPID, http://sapid.sourceforge.net/en/doc/import/, code at http://prdownloads.sourceforge.net/sapid/migrator.tar.gz?download. The approach combines web crawling with regular expression-based data extraction. Users input the wrapping code patterns to search for, e.g., main content might be wrapped in and or maybe a div with a given class name.
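To make the wrapper-pattern idea concrete, here is a minimal sketch in Python. It is not the SAPID migrator's code; the function names and the example patterns are hypothetical, and the start and end patterns are whatever the user supplies for a given site.

```python
import re


def extract_between(html, start_pattern, end_pattern):
    """Return the text between the first start/end wrapper match, or None."""
    regex = re.compile(
        "(?:%s)(.*?)(?:%s)" % (start_pattern, end_pattern),
        re.DOTALL | re.IGNORECASE,
    )
    match = regex.search(html)
    return match.group(1).strip() if match else None


def extract_fields(html, rules):
    """Apply a dict of field name -> (start_pattern, end_pattern) to one page."""
    return {
        field: extract_between(html, start, end)
        for field, (start, end) in rules.items()
    }
```

A user-supplied rule set might look like {"body": (r'<div class="main">', r'</div>')}. Because the match is non-greedy, it stops at the first closing tag, so a wrapper that contains nested elements of the same type would be cut short; that is the usual weakness of regex-based extraction compared with a real HTML parser.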
MIT's Piggy Bank, Solvent for web scraping
Has anyone seen these products from MIT's Simile project?
- Piggy Bank - "Piggy Bank is an extension to the Firefox Web browser that turns it into a “Semantic Web browser,” letting you make use of existing information on the Web in more useful and flexible ways not offered by the original Web sites."
- Solvent - "Solvent is a Firefox extension that helps you write Javascript screen scrapers for Piggy Bank."
I have only begun to check this out but it is looking very cool.