Module for mass importing an existing site?

nedjo's picture

I've been musing about the possibility of building a module for mass importing content from existing sites, as an aid for migration to Drupal. I came across a potential model, an import utility for the CMS SAPID, http://sapid.sourceforge.net/en/doc/import/, with code at http://prdownloads.sourceforge.net/sapid/migrator.tar.gz?download. The approach combines web crawling with regular expression-based data extraction. Users input the wrapping code patterns to search for; e.g., main content might be wrapped in a known pair of opening and closing tags, or maybe in a div with a given class name.

Another available PHP tool is the Snoopy class, http://sourceforge.net/projects/snoopy/, which has some useful methods. See also the tutorial and functions at http://www.jjwdesign.com/data_mining_functions.html (some of which mirror stuff we already have in Drupal).
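
To make the pattern-based extraction idea concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the SAPID migrator's code, and it is not PHP). The function name `extract_between` and the comment markers in the sample page are made up for illustration. It assumes the user supplies one regular expression for the start of the wrapper and one for the end.

```python
import re


def extract_between(html, start_pattern, end_pattern):
    """Return the text between the first start match and the next end match.

    Both patterns are user-supplied regular expressions (hypothetical inputs).
    Returns None when the wrapper is not found on the page.
    """
    regex = re.compile(
        "(?:" + start_pattern + ")(?P<body>.*?)(?:" + end_pattern + ")",
        re.DOTALL | re.IGNORECASE,  # DOTALL lets the content span several lines
    )
    match = regex.search(html)
    return match.group("body").strip() if match else None


# Sample page with made-up comment markers around the main content.
page = """<html><body>
<div class="nav">menu</div>
<!-- begin content -->
<h1>About us</h1><p>Hello.</p>
<!-- end content -->
</body></html>"""

content = extract_between(
    page, r"<!--\s*begin content\s*-->", r"<!--\s*end content\s*-->"
)
print(content)
```

The non-greedy `.*?` stops at the first end marker, which works for comment-style markers but not for nested tags such as a div inside a div; that limitation is one reason a real parser may be a better fit.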

Comments

nedjo,

dado's picture

Sounds like what you need is addressed by dman's Import HTML module:
http://drupal.org/project/import_html

I believe this module combines web crawling with XSLT/XPath.
dman is a good resource and could likely help you get started.

Thanks!

nedjo's picture

That looks great and is exactly what I'm looking for. (Reminder to self: look through recent module additions first!)

OnlineHonesty.com

mjolley's picture

I had an HTML site that I ported to Drupal. I tried various import modules, and they didn't do much for me.

I had an existing forum using phpBB which I successfully imported with phpBB2Drupal or whatever. That worked.

The original format of the site was a 3-column blog. I wrote it as HTML tables using Nvu, and I wish I had known about CMSs before I started on that project.

So I had dozens of HTML pages that I needed to import into my new Drupal replacement site. Here's what I ended up doing:

I found that none of the import modules did the presentation justice. I ended up copying content from my old HTML pages and pasting it into the TinyMCE editor in my Drupal site. As a programmer, I hate to do stuff like this manually, but I simply couldn't find anything more efficient.

Why not a jQuery-style selector?

chadj@drupal.org's picture

It seems like scraping most sites could be done more easily with a CSS/XPath selector, as in jQuery.

I just tried the import module and it's hopelessly complex. This should be a simple JavaScript application. You just provide a domain, a sitemap URL and the name of the main container DIV (usually "main").

It should be a matter of fetching each page, grabbing its body and metadata, then adding nodes with the correct path alias and content.
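
As a rough illustration of that workflow, here is a sketch in Python using only the standard library (not JavaScript, and not any existing Drupal module). Everything named here is an assumption for illustration: `MainContentParser`, `build_node`, `path_alias`, the `"home"` fallback alias, and the shape of the returned dict. Actually saving the result as a Drupal node is left out.

```python
import html
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentParser(HTMLParser):
    """Collect the title, named meta tags and inner HTML of one container."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.title = ""
        self.meta = {}
        self.body_parts = []
        self._depth = 0          # 0 means we are outside the container
        self._in_title = False

    def _outside(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            self.body_parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self._depth += 1
        elif attrs.get("id") == self.container_id:
            self._depth = 1
        else:
            self._outside(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self._depth:
            self.body_parts.append(self.get_starttag_text())
        else:
            self._outside(tag, dict(attrs))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth:      # skip the container's own closing tag
                self.body_parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self._depth:
            # The parser decodes entities, so re-escape text for HTML output.
            self.body_parts.append(html.escape(data, quote=False))
        elif self._in_title:
            self.title += data


def path_alias(url):
    """Derive an alias from the URL path, e.g. /about/team.html -> about/team."""
    stem, _ext = posixpath.splitext(urlparse(url).path.strip("/"))
    return stem or "home"        # "home" is an arbitrary fallback


def build_node(url, page_html, container_id="main"):
    parser = MainContentParser(container_id)
    parser.feed(page_html)
    parser.close()
    return {"title": parser.title.strip(), "body": "".join(parser.body_parts),
            "path_alias": path_alias(url), "meta": parser.meta}


def fetch(url):
    """Download one page; call this for each URL listed in the sitemap."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


sample = """<html><head><title>Our Team</title>
<meta name="description" content="Who we are"></head>
<body><div id="nav">menu</div>
<div id="main"><h1>Team</h1><p>Fish &amp; chips<br>daily</p></div>
<div id="footer">bye</div></body></html>"""

node = build_node("http://example.com/about/team.html", sample)
print(node)
```

For a live site, `build_node(url, fetch(url))` would be called once per sitemap URL. The depth counter assumes reasonably well-formed markup; badly nested HTML is where the extra cleanup steps questioned below (HTML Tidy and so on) tend to come in.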

XSL? HTML Tidy? Folder paths? Exclusion lists? Funky Perl scripts? Why?

ChadJ


