Module for mass importing an existing site?

Posted by nedjo on July 12, 2006 at 5:39am

I've been musing about the possibility of building a module for mass importing content from existing sites, as an aid for migration to Drupal. I came across a potential model, an import utility for the CMS SAPID: http://sapid.sourceforge.net/en/doc/import/, with code at http://prdownloads.sourceforge.net/sapid/migrator.tar.gz?download. The approach combines web crawling with regular expression-based data extraction. Users input the wrapping code patterns to search for; e.g., main content might be wrapped in a particular pair of opening and closing markers, or maybe in a div with a given class name.
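To make the wrapper-pattern idea concrete, here is a minimal sketch in Python (not the SAPID migrator's actual code, which I haven't reproduced here). The function name `extract_between`, the sample page, and the two patterns are all made up for illustration; the only assumption is that the user supplies one regular expression for the text just before the content and one for the text just after it.

```python
import re


def extract_between(html, start_pattern, end_pattern):
    """Return the text between the first match of start_pattern and the
    next match of end_pattern, or None if the wrappers are not found."""
    regex = re.compile(
        "(?:%s)(?P<payload>.*?)(?:%s)" % (start_pattern, end_pattern),
        re.DOTALL | re.IGNORECASE,  # let '.' span newlines; ignore tag case
    )
    match = regex.search(html)
    return match.group("payload").strip() if match else None


# Hypothetical page from the old site.
page = """<html><body>
<div class="nav">Home | About</div>
<div class="content">
<h1>Welcome</h1>
<p>Hello world.</p>
</div>
<div class="footer">Copyright</div>
</body></html>"""

# User-supplied wrapper patterns: content starts after the opening div
# and ends where that div closes just before the footer.
body = extract_between(
    page,
    r'<div class="content">',
    r'</div>\s*<div class="footer">',
)
print(body)
```

Anchoring the end pattern on the footer rather than on a bare `</div>` is what keeps nested divs inside the content from cutting the match short; that kind of per-site tuning is probably where most of the effort would go.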

Another available PHP tool is the Snoopy class, http://sourceforge.net/projects/snoopy/, which has some useful methods; see also the tutorial and functions at http://www.jjwdesign.com/data_mining_functions.html (some of which mirror stuff we already have in Drupal).

Comments

nedjo,

Posted by dado on July 13, 2006 at 1:09pm

sounds like what you need is addressed by dman's Import HTML module
http://drupal.org/project/import_html

I believe this module combines web crawling with XSLT/XPath.
dman is a good resource and could likely help you get started.

Thanks!

Posted by nedjo on July 13, 2006 at 4:15pm

That looks great and is exactly what I'm looking for. (Reminder to self: look through recent module additions first!)

OnlineHonesty.com

Posted by mjolley on December 28, 2006 at 12:38am

I had an HTML site that I ported to Drupal. I tried various import modules, and they didn't do much for me.

I had an existing forum using phpBB which I successfully imported with phpBB2Drupal or whatever. That worked.

The original format of the site was a 3-column blog. I wrote it as HTML tables using Nvu, and I wish I had known about CMSs before I started on that project.

So I had dozens of HTML pages that I needed to import into my new Drupal replacement site. Here's what I ended up doing:

I found that none of the import modules did the presentation justice. I ended up copying content from my old HTML pages and pasting it into the TinyMCE editor in my Drupal site. As a programmer, I hate to do stuff like this manually, but I simply couldn't find anything more efficient.

Why not a JQuery-style selector?

Posted by chadj@drupal.org on July 11, 2007 at 3:56am

It seems like scraping most sites could be done more easily with CSS/XPath selectors, the way jQuery does it.

I just tried the import module and it's hopelessly complex. This should be a simple JavaScript application. You just provide a domain, a sitemap URL and the name of the main container DIV (usually "main").

It should be a matter of fetching each page, grabbing its body and metadata, then adding nodes with the correct path alias and content.
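The per-page step could look roughly like the sketch below. It is written in Python rather than JavaScript and is only an illustration under some loud assumptions: the page is well-formed XHTML with no namespace and no HTML-only entities (real pages would likely need cleaning first), `page_to_node` and the dictionary keys are made-up names rather than any Drupal API, and fetching is left out so the example runs without a network. The selector uses the limited XPath support in Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def page_to_node(url, xhtml, container_id="main"):
    """Build a node-like dict (title, path alias, metadata, body) from one
    well-formed XHTML page, or return None if the container is missing."""
    root = ET.fromstring(xhtml)
    title = root.findtext(".//title", default="").strip()
    # Collect <meta name="..." content="..."> pairs.
    meta = {
        m.get("name"): m.get("content", "")
        for m in root.iter("meta")
        if m.get("name")
    }
    # XPath-style selector for the main container DIV.
    container = root.find(".//div[@id='%s']" % container_id)
    if container is None:
        return None
    # Inner markup: leading text plus each child serialized with its tail.
    body = (container.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in container
    )
    # Reuse the old URL path as the alias so existing links keep working.
    alias = urlparse(url).path.strip("/")
    return {"title": title, "path_alias": alias, "meta": meta, "body": body.strip()}


# Hypothetical page and URL.
sample = """<html>
  <head>
    <title>About us</title>
    <meta name="description" content="Who we are" />
  </head>
  <body>
    <div id="nav">Home</div>
    <div id="main"><h1>About</h1><p>We build sites.</p></div>
  </body>
</html>"""

node = page_to_node("http://example.com/about/team.html", sample)
print(node)
```

In a full run, the list of URLs would come from the sitemap and each page would be fetched before being passed in; turning the resulting dict into an actual Drupal node is outside what this sketch shows.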

XSL? HTML Tidy? Folder paths? Exclusion lists? Funky Perl scripts? Why?

ChadJ

Free Site Monitor
Keyword Marketing Ladder
