White boarding


This discussion is for whiteboarding ideas on how to build a module that maps all types of data. You are welcome to comment if you have new information to add. Please read the older comments before posting so that you know what has already been discussed and do not repeat the same ideas. Some ideas have already been discussed at http://drupal.org/node/191197, and this discussion group was created to move that off-topic discussion to a more logical place. The ideas already discussed in that ticket should be annotated here.

Comments

Controlling resource use

earnie@drupal.org's picture

One of the bigger issues with large datasets is resource use. Most users have only a small amount of CPU and memory available, so creating the data through Apache requests with PHP is a resource drain. I have been able to bootstrap Drupal from the command line, which lets me use the Drupal API to create nodes in an automated fashion and removes the Apache resource usage. Still, if your resources are limited you do not want to overwhelm the server instance, or your hosting service may need to take drastic action. What is good for one site may not be good for another. We need to establish some common ground and show others how to tune their hosting service for maximum performance and how to minimize the impact of importing the data they need.
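
To make the idea of limiting the impact concrete, here is a minimal sketch in Python of importing in small batches with a pause between them. It is not Drupal code: `create_node` is a hypothetical callback standing in for whatever actually saves a node, and the batch size and pause length are assumed values that would need tuning per host.

```python
import time
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def throttled_import(rows, create_node, batch_size=100, pause_seconds=1.0,
                     sleep=time.sleep):
    """Create one node per row, pausing between batches to limit server load.

    `create_node` is a hypothetical callback (an assumption of this sketch),
    not a Drupal function. `sleep` is injectable so the pause can be tested.
    """
    created = 0
    for index, chunk in enumerate(batches(rows, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # give the server room between batches
        for row in chunk:
            create_node(row)
            created += 1
    return created
```

Reading the rows lazily keeps memory use flat even for very large files, and the pause is the knob to turn when a shared host starts to struggle.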

Import/Export API: useful resources

frank ralf's picture

I added some (hopefully) useful resources for the Import/Export API to the original issue queue: http://drupal.org/node/191197#comment-1403922

hth
Frank

May be good enough to build

summit's picture

This may be good enough to build a module for Google Summer of Code 2009.
Module goal: build a node import/export solution that uses cron and keeps taxonomy relations, for D5/D6 and D7?
Just a thought.
Greetings,
Martijn

Import/Export API may be a

earnie@drupal.org's picture

Import/Export API may be a start, but there is a whole lot more to the issue than just importing the data. There is also controlling the data and knowing when to remove it. Using FeedAPI for this is the most advantageous route since it is already coded. It will depend on one's needs for the data, but FeedAPI also lets you choose never to remove a node. I'll admit that I haven't reviewed Import/Export API in a while, and it may have things in common with FeedAPI that would help us merge the two modules. I just know that FeedAPI has been extended with add-on modules that can map other fields in the RSS to CCK fields.

RSS, RDF, XML, CSV, TSV, etc

earnie@drupal.org's picture

These formats, and maybe more, can be handled by FeedAPI. Searching Google with http://www.google.com/search?q=feedAPI+site%3Adrupal.org%2Fproject+-site... I find lots of nice modules that can help us map our large data sets into nodes, CCK fields and taxonomy. Of major importance may be the Feed Element Mapper module. This may be just the place to start with mapping datasets in a controlled fashion. Once we have the mapping down, executing the data import will be the easy part.
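
As a rough illustration of what "having the mapping down" could look like, here is a small Python sketch that maps source columns onto node fields through a lookup table. It does not reproduce how Feed Element Mapper works; the column names and target field names (`field_price` and so on) are made up for the example.

```python
def map_record(record, mapping):
    """Return a node-like dict built from `record` according to `mapping`.

    `mapping` pairs a source key with a target field name. Source keys
    missing from the record are skipped rather than filled with guesses.
    """
    node = {}
    for source_key, target_field in mapping.items():
        if source_key in record:
            node[target_field] = record[source_key]
    return node


# Hypothetical mapping for a merchant product feed (all names are assumptions).
PRODUCT_MAPPING = {
    "name": "title",
    "description": "body",
    "price": "field_price",
    "category": "taxonomy",
}

# One source row; "sku" is unmapped and therefore dropped.
example_node = map_record(
    {"name": "Widget", "price": "9.99", "sku": "W-1"}, PRODUCT_MAPPING
)
```

Keeping the mapping as plain data means each data source only needs its own table, not its own import code.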

Taxonomy

earnie@drupal.org's picture

One of my hobbies is publishing entry pages for merchants' products. The merchants provide bulk data in various ways, which is why I'm so interested in mapping data. One of the most difficult things is controlling the taxonomy terms. Every merchant has its own ideas about how to categorize its products. Some merchants do a good job while others are just plain ignorant about it. Recently, however, I found a wonderful module that helps manage the taxonomy based on the title and text of the node. This allows me to control the taxonomy with a static list of terms. Check out Taxonomy Autotagger for yourself. It may just be the golden module for handling taxonomy when bulk importing data.
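
The general idea of tagging from a static list can be sketched in a few lines of Python. This is only an illustration of the technique under simple assumptions (case-insensitive, whole-word matching); it does not claim to reproduce how Taxonomy Autotagger itself decides on terms.

```python
import re


def autotag(title, body, terms):
    """Return the terms from a fixed vocabulary that occur in the title or body.

    Matching is case-insensitive and on whole words, so "Cat" does not
    match "Catalog". Terms are returned in vocabulary order.
    """
    text = f"{title} {body}"
    matched = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matched.append(term)
    return matched
```

Because the vocabulary is fixed, whatever category names a merchant supplies never leak into the site's taxonomy.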

Formatting the data into nodes

earnie@drupal.org's picture

One of the first questions I had was how to import data into a node. That is actually fairly easy to do, but what about controlling the data that enters the node system? Will the data stay forever, or will it need to be removed or unpublished at some point in the future? I have recently started using the FeedAPI module in D6. I find that it formats the nodes properly from the RSS feeds I give it and removes the old data correctly. The problem at the moment is that I don't always have an RSS feed, so I need to convert the data into RSS format. I will begin looking harder at the FeedAPI module to learn its API. Maybe mapping to RSS isn't what I need, and using the Feed Element Mapper is.
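
For data that does not arrive as RSS, the conversion step might look something like the following Python sketch, which turns CSV rows into a minimal RSS 2.0 document using only the standard library. The column names (`id`, `title`, `link`, `description`) and the sample data are assumptions for illustration, not anything FeedAPI requires.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def rows_to_rss(rows, channel_title, channel_link, published):
    """Build an RSS 2.0 document from dict rows and return it as a string.

    Each row is expected to have "id", "title", "link" and "description"
    keys; those column names are assumptions of this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    for row in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["title"]
        ET.SubElement(item, "link").text = row["link"]
        ET.SubElement(item, "description").text = row["description"]
        # The source row id becomes the guid so repeated imports can be matched.
        ET.SubElement(item, "guid", isPermaLink="false").text = row["id"]
        # RSS dates use the RFC 822 style that format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(published)
    return ET.tostring(rss, encoding="unicode")


# Made-up sample data for the sketch.
SAMPLE_CSV = (
    "id,title,link,description\n"
    "1,Widget,http://example.com/1,A <b>fine</b> widget\n"
)
sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sample_feed = rows_to_rss(
    sample_rows, "Products", "http://example.com/",
    datetime(2009, 3, 1, tzinfo=timezone.utc),
)
```

Building the document with an XML library rather than string concatenation means markup inside descriptions is escaped correctly.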

Hi Earnie, I have done

summit's picture

Hi Earnie,

I have done almost the same, I think, with the help of a programmer friend. He built a mapper for the Feed Element Mapper that takes geo-data from the feed and writes it into the location database tables.
Until now I would not have thought of FeedAPI for importing nodes from another Drupal instance, but it could also be a starting point.
Maybe there is already a module that can make RSS feeds from node data; we could then import this newly made RSS feed through FeedAPI. Import/export would then be arranged like this: export through an RSS feed builder from the nodes we want to export, with the attached taxonomies as categories in the RSS feed, and import through FeedAPI. Is this the scenario you are thinking of too?

Greetings,
Martijn

Drupal core already has an

earnie@drupal.org's picture

Drupal core already has an RSS aggregator module for its nodes, taxonomy and other content. Using FeedAPI to gather that data is merely a matter of creating the feed for the RSS URL.

I'm used to creating nodes from an external source and know what I'm doing, but I don't know that I have the best scenario for it yet. I process 100K rows of data from several files and create and control the existence of nodes in batch style. FeedAPI, however, takes care of that control using the date fields of the RSS. This is why I wanted to use the RSS formatting for creating the nodes. Controlling the external data then becomes a matter of asking: did I receive a row with the same id in the last file? If yes, update its last-updated date; otherwise do nothing. Using feed_element_mapper may or may not be good for all data.
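
The "same id in the last file" check could be sketched roughly like this in Python. This is not FeedAPI's logic: the explicit expiry step and the 30-day cut-off are assumptions added to show how rows that stop appearing might eventually be removed.

```python
from datetime import date


def sync_rows(existing, incoming_ids, today, max_age_days=30):
    """Refresh, add and expire tracked ids against the latest file.

    `existing` maps a row id to the date it was last seen. Ids present in
    the latest file get `today` as their last-seen date; ids not seen for
    more than `max_age_days` (an assumed cut-off) are returned for removal.
    The input dict is left unmodified.
    """
    updated = dict(existing)
    for row_id in incoming_ids:
        updated[row_id] = today  # new ids are added, known ids refreshed
    expired = sorted(
        row_id for row_id, last_seen in updated.items()
        if (today - last_seen).days > max_age_days
    )
    for row_id in expired:
        del updated[row_id]
    return updated, expired
```

Rows missing from a single file are left alone; they only drop out once they have been absent longer than the cut-off.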

Data comes in various forms, but nodes have more or less a set format. CCK fields are the "less of a set format" part, and populating them from RSS becomes less doable. Feed_element_mapper is geared toward fields that are standard in RSS, so we may need to extend it to take on pure XML.

Regards,
Earnie