Last updated by lolandese on Fri, 2024-03-01 09:45
When writing custom or contrib modules aimed at web scraping, it makes most sense to use a library rather than reinvent the wheel. Furthermore, it is advisable to first write a generic module that handles scraping requests and can map the results into the fields of a specific content type, much like the Feeds module does. Targeting a specific site with specific selectors could then build on that, either through a UI or in a separate module's code.
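The generic mapping idea above can be sketched with nothing but PHP's built-in DOM extension: a function that takes raw HTML (in Drupal this would typically be fetched with Guzzle) and an array mapping field names to selectors. The function name, field names, and sample markup below are all hypothetical illustrations, not an existing module's API.

```php
<?php

// Minimal sketch of a generic selector-to-field mapping, assuming the
// calling code has already fetched the raw HTML of the target page.
// $mapping pairs a (hypothetical) field machine name with an XPath selector.
function scrape_fields(string $html, array $mapping): array {
  $doc = new \DOMDocument();
  // Suppress warnings caused by imperfect real-world markup.
  @$doc->loadHTML($html);
  $xpath = new \DOMXPath($doc);
  $values = [];
  foreach ($mapping as $field => $selector) {
    $node = $xpath->query($selector)->item(0);
    // Store the first match's text, or NULL when the selector finds nothing.
    $values[$field] = $node ? trim($node->textContent) : NULL;
  }
  return $values;
}

// Usage: map two content-type fields to selectors on a sample page.
$html = '<html><body><h1>Example title</h1><div class="byline">Jane Doe</div></body></html>';
$fields = scrape_fields($html, [
  'title' => '//h1',
  'field_author' => '//div[@class="byline"]',
]);
print_r($fields);
```

A site-specific module or UI would then only have to supply its own mapping array, keeping the request handling and field mapping generic.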
Here is a list of useful links. If adding links, make sure they include working PHP examples. Furthermore, capture links in the Internet Archive's Wayback Machine so that if a URL is removed it can be replaced with the web archive's snapshot. In the Open Source spirit, solutions that work without a subscription (API key) are preferred, although when effective anti-bot measures are in place on the target site, a paid service is often almost inevitable.