PHP Web Scraping Libraries


When writing custom or contrib modules aimed at web scraping, it makes the most sense to use an existing library rather than reinvent the wheel. It is also advisable to first write a generic module that handles scraping requests and can map the results into the fields of a specific content type, much as the Feeds module does. Targeting a specific site with specific selectors can then build on that, either through a UI or in a separate module's code.
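
As a rough illustration of the generic selector-to-field mapping idea, here is a minimal sketch that uses only PHP's built-in DOM extension (DOMDocument and DOMXPath). The function name scrape_fields(), the sample markup, and the field names are hypothetical and chosen only for this example; a real module would fetch the page over HTTP and save the values into entity fields instead of printing them.

```php
<?php

/**
 * Hypothetical helper: extracts one text value per field from an HTML string.
 *
 * @param string $html
 *   The raw HTML to scrape.
 * @param array $mapping
 *   Field name => XPath expression (assumed mapping format).
 *
 * @return array
 *   Field name => trimmed text of the first match, or NULL if nothing matched.
 */
function scrape_fields(string $html, array $mapping): array {
  $doc = new DOMDocument();
  // Real-world markup is often invalid; collect libxml errors quietly
  // instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($doc);
  $fields = [];
  foreach ($mapping as $field => $expression) {
    $nodes = $xpath->query($expression);
    $fields[$field] = ($nodes !== FALSE && $nodes->length > 0)
      ? trim($nodes->item(0)->textContent)
      : NULL;
  }
  return $fields;
}

// Sample input standing in for a fetched page (assumption for the example).
$html = '<html><body><h1 class="title">Example article</h1>'
  . '<div class="body"><p>First paragraph.</p></div></body></html>';

// Site-specific part: which selector feeds which field.
$mapping = [
  'title' => '//h1[@class="title"]',
  'body' => '//div[@class="body"]/p',
  'author' => '//span[@class="author"]',
];

$result = scrape_fields($html, $mapping);
print_r($result);
```

Keeping the mapping as plain data means the site-specific part (the selectors) can live in configuration or in a separate module, while the extraction code stays generic.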

Here is a list of useful links. When adding links, make sure they include working PHP examples. Also capture each link in the Internet Archive's Wayback Machine, so that if the original URL is removed it can be replaced with the archived snapshot. In the open source spirit, solutions that work without a subscription (API key) are preferred, although when the target site has effective anti-bot measures in place, a subscription service is almost inevitable.
