Create a modular, extensible, API-based import and export framework for Drupal

jonathan.morgan

As I have worked with different content management and middleware solutions, one problem I have encountered numerous times, and that I think needs a better solution in Drupal, is the import and export of data and the provision of web services. I think it would be good for Drupal to have an import/export framework that lets developers build reusable, sharable import and export modules by implementing PHP objects against a few common interfaces and then registering each new import/export type with a central, common import/export module.

This is especially relevant for modern corporate newspapers, which need to integrate content from different legacy systems, produce that content, and then send it off to a number of different web publishing systems, all while facing the prospect of their systems changing if they are bought or sold. The framework outlined in the design below could also be a means of implementing REST services (SOAP or XML-RPC would probably require another layer on top of this, though you could probably do it in this framework).

In analyzing what a modern corporate newspaper needs in a middleware product, here are some requirements I've seen:

  • needs to be able to import from pagination systems and other systems as well, to allow flexibility in pulling the results of existing workflows online without requiring the people who create the content to change what they do.
  • needs to be able to export to multiple different content management systems and repositories, including news archives, content management systems, search indexes, text taggers, etc.
  • needs a way to add and develop these imports and exports so that they can be swapped in and turned on or off by adding plugins to an import/export framework module.
  • wants to be able to develop import and export modules against an API, so you can implement a class with certain methods and be guaranteed that it will work within the import/export framework.
  • common delivery and receipt mechanisms should be abstracted out and included in the module, so that you don't have to re-implement a directory file sweeper every time you need one for inbound XML files, and so that an HTTP or FTP client need not be implemented each time you need to send data over HTTP or FTP a file.
  • the flow of an imported or exported document should be abstracted out, so that only the pieces of the workflow that differ for a given import or export require implementation when adding a new import or export type.
  • modules supporting a given import or export format should be easy to add to the framework, so that once someone implements a great NITF import, for example, you can easily grab it and re-use it.
  • probably needs some object-oriented way to represent Drupal nodes, too, so that we can take imported data and place it in these Drupal objects, allowing us to then abstract out the processes for placing data into and retrieving data from Drupal.

An import or export type could be defined by four pieces (sketched in code after this list):

  • method for identifying a file as one the type is designed to process
  • inbound transformation
  • implementation of import/export
  • result transformation
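
A minimal sketch of what such a type definition could look like as a PHP 5 interface. This is purely illustrative; the interface and method names are mine, not an existing API:

  <?php
  // Hypothetical interface for an import/export type. Names are
  // illustrative only, not part of any existing Drupal API.
  interface ImportExportType {
    // Return TRUE if this type knows how to process the given raw
    // file contents (the identification step above).
    public function matches($raw_contents);

    // Transform incoming data into the framework's common format,
    // e.g. by running an XSLT stylesheet or a PHP Transformation.
    public function inboundTransform($raw_contents);

    // Perform the actual import or export on the transformed data
    // and return a raw result.
    public function process($transformed_contents);

    // Transform the raw result before it goes back to the submitter.
    public function resultTransform($result);
  }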

The inbound flow would look something like this:

  • a receipt mechanism accepts incoming data (to start, I would probably build a directory sweeper and an HTTP/HTTPS receipt page).
  • figure out the type of the incoming file. For the directory sweeper, you could have one directory per type to start, so the directory where the file is dropped tells you its type. For HTTP, you could allow an HTTP parameter to contain the import type. You would probably also want a hook for each import type, so you can pass the contents of an incoming file to a predefined method in each import type's implementation that returns a boolean: true if it is a document or file that type is able to import. You might then create a master function that accepts a file, walks through these functions, and returns the type (see the sketch after this list).
  • based on the type, call the inbound transform, then pass the result to the object for the selected import type that implements the ImportExportProcessor interface. The easiest initial options for the import transformation would be XSLT and a PHP class that implements a Transformation object interface (the same interface for import and export, in case you ever need to re-use one).
  • depending on the type, deliver the response that the ImportExportProcessor returns to the submitter (if HTTP/HTTPS and not asynchronous, return the result as the response to the request). You could have a result transformation as well, and add a hook in an import type for transforming the result too, so you could re-use ImportExportProcessor implementations and just change the import and export transformations.
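
To make that dispatch concrete, here is a rough sketch of the inbound pipeline, building on the hypothetical ImportExportType interface sketched earlier (the function names are invented):

  <?php
  // Hypothetical master function: ask each registered import/export
  // type whether it can handle the raw file contents. $types is an
  // array of ImportExportType objects keyed by type name.
  function importexport_detect_type($raw_contents, array $types) {
    foreach ($types as $name => $type) {
      if ($type->matches($raw_contents)) {
        return $name;
      }
    }
    return FALSE;
  }

  // Hypothetical inbound pipeline: detect the type, transform the
  // input, process it, and transform the result for the submitter.
  function importexport_handle_inbound($raw_contents, array $types) {
    $name = importexport_detect_type($raw_contents, $types);
    if ($name === FALSE) {
      return FALSE;
    }
    $type = $types[$name];
    $transformed = $type->inboundTransform($raw_contents);
    $result = $type->process($transformed);
    return $type->resultTransform($result);
  }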

The outbound flow would be similar:

  • a receipt mechanism accepts data for export (this could be a directory sweeper, PHP code that lets a process look up a service and submit directly, or simply the re-used HTTP/HTTPS listener page, letting it do the lookup and processing).
  • figure out the type of the file, probably using the same method as above.
  • for that type, figure out how to transform and which ImportExportProcessor instance to use to do the export, then transform, export, and file the result. You would probably want HTTPImportExportProcessor, HTTPSImportExportProcessor, and FTPImportExportProcessor implementations, so that you could define many different export types that are just different transformations followed by an HTTP or FTP request (see the sketch after this list).
  • return the result to the process that invoked the export, be it a PHP function call return value, an HTTP response, or something else (you could even add a way to send an asynchronous confirmation via HTTP or FTP, but abstract it out so that it is part of the overall framework: something you define by setting properties in each type's definition, not something you have to implement in each type).
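
To illustrate the transport-specific processors mentioned in the list above, an HTTP exporter might boil down to something like this sketch, which leans on Drupal's drupal_http_request(); the class itself is invented:

  <?php
  // Hypothetical processor that delivers an already-transformed
  // export payload over HTTP POST via drupal_http_request().
  // Different export types could reuse it with different target
  // URLs and different upstream transformations.
  class HTTPExportProcessor {
    protected $url;

    public function __construct($url) {
      $this->url = $url;
    }

    // Send the payload; return the response body, or FALSE on error.
    public function process($payload) {
      $headers = array('Content-Type' => 'text/xml');
      $response = drupal_http_request($this->url, $headers, 'POST', $payload);
      return isset($response->error) ? FALSE : $response->data;
    }
  }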

The framework would be configured through either an XML file or database tables that hold the type definitions. You would probably want to build an easy way to add, edit, or remove a type.
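
For instance, a type definition in the XML variant might look roughly like this (all element, class, and file names here are invented for illustration):

  <importexport-type name="nitf_story_import">
    <match class="NITFMatcher"/>
    <inbound-transform style="xslt" file="nitf-to-node.xsl"/>
    <processor class="NodeImportProcessor"/>
    <result-transform style="xslt" file="result-to-receipt.xsl"/>
  </importexport-type>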

Thoughts? I have implemented this before in Java, and I think implementing the framework and a few test import/export types would be at most a summer-long project using PHP 5.

Comments

A really smart student did

moshe weitzman

A really smart student did just such a thing, the importexportapi, as a summer project, and we ended up with software so complex that no one uses it. I am sceptical that software can be as functional as you describe, given how flexible Drupal is, and still be simple enough to grok in an hour.

simplicity

jonathan.morgan

I implemented something similar to this framework once, in Java, for a computer consulting company in Virginia. It wasn't necessarily something that could be grokked in an hour, but it was the result of looking at the common parts of processes that occur in integrations and abstracting out the things that are re-used, and it turned out to be grokkable in a day or two, which was pretty good given the complexity of integrations.

We ended up using this framework to create a number of re-usable integrations. One tied the Moodle learning management system into a university management system (it took a month or so to implement initially, but now that the underlying code is in place, it can be installed and configured without any programming). Another took data from multiple point-of-sale systems that speak a number of different proprietary export languages, used transformations to convert them to a single XML dialect, and then implemented importing of messages in that single dialect, so the different systems could all take advantage of the same set of APIs to interact with university financial aid systems to see if a student could buy books with their financial aid (this is analogous to newspaper imports: build transformations to convert any incoming story format to one format you settle on, then implement logic to import that format).

While there is initial development time for a given integration path, you can then re-use the modules: you implement an NITF-to-Drupal object once, and after that you only have to implement transformations to NITF from other formats, then pass the NITF to the NITF-to-Drupal object.
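
In code, that reuse might look as simple as this (a sketch; both objects are hypothetical):

  <?php
  // Hypothetical reuse: one NITF-to-Drupal processor, many
  // format-to-NITF transformations in front of it.
  $nitf = $ap_to_nitf_transform->transform($incoming_ap_story);
  $node_result = $nitf_processor->process($nitf);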

It would be a component that underlies things, though, and not necessarily something many end users would deal with, other than perhaps to enable some of these pre-built modules and transformations.

And in relation to the importexportapi, this is at a much higher level. It isn't intended as an API for defining data entities; it is just a means of re-using and tying together transformations and the code that implements importing and exporting. For this to be useful, you'd probably also have to tie in a way to show the pre-defined node types that the transformation code knows how to work with.

I have built it before, will probably build it again myself if I end up using drupal, but figured I'd throw it out there to see if anyone else is interested, or has seen anything like it. Thanks for the thoughts, and for pointing me to that module!

how to rephrase

kvantomme

Moshe, you are one of the experts in this field, since you even started a company for data import. If the project as described is too big a bite, what could Jonathan do that builds on the last SoC project and would make a difference in this field?

--

Check out more of my writing on our blog and my Twitter account.

I suggest reviewing the

moshe weitzman

I suggest reviewing the issue queues for one of the projects that jpetso mentions below and completing one of the meatier items there.

Thoughts

jpetso

I will need an import/export framework as well, and will probably start working on this in one or two months. The above proposal doesn't look bad; I'll just throw my thoughts in here.

Overall, I think you're on the right track with an extensible, API-based approach. The most important part, in my opinion, is to avoid creating a completely new import/export ecosystem from scratch. This was tried in a previous Summer of Code (2006) with the Import/Export API, which is now slowly rotting away because of the high effort of maintaining a complete import/export framework, plugins included. So your key question must be how to share as much code as possible with the community and how to cooperate with people who need stuff like this anyway.

You absolutely need to investigate different existing solutions and have an eye on reusing concepts and/or code. Here's a (possibly incomplete) list of prior art:

  • The above-mentioned Import/Export API. A generic framework much like the one that you propose, but out of maintenance, and unlikely to be resurrected without major external contributions. You might want to improve on it, or take its good ideas and avoid its pitfalls.
  • Node Import imports files into nodes, but doesn't have anything to do with export. Actively maintained, and a good solution for a relatively narrow set of requirements.
  • FeedAPI (created during last year's Summer of Code) and Feed Element Mapper, both actively maintained by Development Seed. It's not formally a generic import framework, but it has a lot of features that push it in this direction: pluggable import parsers (called processors), mapping of external data fields to nodes and CCK fields, and modular element mappers. It also doesn't have anything to do with export.
  • The Data Architecture Design Sprint group on groups.drupal.org, which recently sat together to discuss and flesh out a vision for fields as the building block of data in Drupal (as opposed to nodes). This sprint resulted (amongst other ideas) in the concept of nodes being composed of multiple fields, where each field is an object and encapsulates reading and writing from external systems (legacy databases, for example) on the fly. Or something. Do read the posts in this group; it's highly informative.
  • Arto Bendiken's RDF API (further reading: here and here) seems to be the next best thing in Drupal data interoperability and is one of my favorite approaches at the moment. What's really promising here is Dries' intention to push RDF for Drupal, which means that the RDF movement currently has quite a bit of momentum and might even get into Drupal core. Given that each import/export framework needs one single central format as its pivotal point, RDF triples might be it. I think that basing your import/export framework on the RDF API instead of some custom node object format would multiply the community's interest in this project and probably attract other contributors as well, making this thing much more maintainable.
  • CoCKTaiL by Raincity Studios' Djun Kim. Still in the works, I believe.
  • Lastly, Jeff Eaton's node rendering patch has been identified by Dries as probably the next step for Drupal core in regards to node export. It was quite idle for a while, but is currently gaining speed again after being split up into "bitesize" tasks. Nevertheless, this is more of a long-term undertaking and less a pursuit of a generic import/export framework.

So you'll need to find ways to work together and integrate with at least one of those (maybe more), make it possible for any of them to utilize your framework (or for your framework to utilize some of them), and communicate a lot with the maintainers of those solutions. If you work on your project alone, without support from other people with similar needs, you've already lost.

The second large issue

...is to consider use cases different from the newspaper workflow that you're coming from. I'll just add our requirements; perhaps that can help in generalizing the proposed architecture and allowing further extensibility.

  • One of our use cases is importing job data that comes in through some XML format. There's an XML Schema for that format, and we want to map elements of that schema to CCK elements in our Drupal nodes (see the sketch after this list).
  • Also, we're using pageroute to distribute the data of a single applicant over multiple nodes / content types. So a solution that works for us needs to be able to take one input element and map it to multiple target nodes / content types. (Perhaps with the help of a proxy processor that makes use of single-node importers/exporters.)
  • We want to export existing data (possibly composed of multiple nodes, as above) into a file or database, and re-import it using a different mapping. That way, we could alter the structure of our existing content types without writing tedious upgrade procedures.
  • In the end, an import/export framework could make for a nice conversion service: import data and throw it back out in another format.
  • Map fields either by example (as is currently done by Feed Element Mapper) or by a predefined schema (Relax-NG, XML Schema, ...).
  • And of course, the standard CSV/text-file import/export for whatever data someone might need.
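
For the schema-to-CCK mapping in the first use case, the configuration might boil down to something like this sketch (the paths, content types, and field names are all invented):

  <?php
  // Hypothetical mapping from XML Schema elements to CCK fields for
  // the job-import use case. One input element can also fan out to a
  // second content type, as in the pageroute use case above.
  $mapping = array(
    '/job/title'       => array('type' => 'job_posting', 'field' => 'title'),
    '/job/description' => array('type' => 'job_posting', 'field' => 'field_description'),
    '/job/salary'      => array('type' => 'job_posting', 'field' => 'field_salary'),
    '/job/contact'     => array('type' => 'applicant',   'field' => 'field_contact'),
  );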

FeedAPI has the requirement of fetching feeds from a URL; Node Import gets them from an uploadable text file, as far as I can see. Feed Element Mapper already has a cool mapping interface; it might be worthwhile to extend and generalize it so that it accommodates generic field mappings for both import and export. (Improving the existing project is, of course, preferable to forking it.)

The third big issue

...is to keep the API as minimal and extensible as possible, so that further maintenance of the project is not one large piece but can be split up into a set of independent, smaller modules. That way, if you or anyone else loses interest in maintaining parts of the framework, those parts can be left behind without slowing down the whole project. This is the fundamental problem that befell the Import/Export API; you'd do well to avoid large maintenance responsibilities. (That's especially true for a Summer of Code project, where it's not clear that the student will keep the motivation and/or funding to continue working hard on the project.)

Other stuff

I think you might want to decouple the import/export processor (XML, RDF, CSV, RSS, Atom, YAML, ...) from the transport (file import/export, HTTP(S), FTP), as those are really two different pairs of shoes. Also, figuring out the file type is probably something that a single processor should be responsible for; I don't think it belongs in the core of the whole framework and workflow.
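
Decoupled that way, the framework could be built around two small, independent interfaces, roughly like this (a sketch; the names are mine):

  <?php
  // Hypothetical split between format handling and delivery, so a
  // CSV or RDF processor can ride over file, HTTP, or FTP transports
  // unchanged, and vice versa.
  interface FormatProcessor {
    public function parse($raw);       // external format -> internal data
    public function serialize($data);  // internal data -> external format
  }

  interface Transport {
    public function fetch($location);         // pull raw data in
    public function deliver($location, $raw); // push raw data out
  }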

Finally, avoid thinking big. Despite the many use cases mentioned here, start small, with a well-defined set of tasks that work on their own. When coded with an eye on the big picture, the framework can always be extended later on, but an overengineered framework that tries to do everything will die a painful death sooner or later. Do it step by step; that's crucial for a thing like this to succeed.

Ok, that's pretty much the conclusion of my last month of thinking about stuff like this. I appreciate your proposal and hope it gets somewhere, but this is dangerous ground. Beware.

Spot On!

houndbee

A project from last year's proposals which I thought showed a lot of promise was "Dublin Core integration into CCK" (http://groups.drupal.org/node/3344). The RDF API makes me very happy.

jpetso, you've done an amazing job with your feedback here. I'd like to second the thoughts about keeping it small and decoupling the "processors" from the "transport".

What I would personally like to see in the near future is a semantic format to access all CCK data.

This is a great subject, it so needs to be done!

striky2

I read all the explanations above, and I feel that at this point this proposal might be in danger of not being taken up. However, it also seems very important for the community to push open the gates of a web services entry point for CCK data types.

NB: I will only talk about CCK in this comment, because I truly believe it is not Drupal's job to deal with entries other than content (which already covers a lot...). Building another "web services entry for all table types" is going a bit far for now. Let's be realistic: there are other frameworks that deal with manipulating all table and data types; Drupal is the one that makes life easier for community content.

We can help; we already started on that!

I have to say that a team at my company worked on that specific area last summer, and they released some code enabling entry of the Yahoo and Amazon APIs into CCK node fields (using an administrative binder from an XML sample to a set of CCK fields). Unfortunately, I don't have a demo, because we have changed a lot of our ideas since last summer and this functionality was broken (there is still some code I can provide, though; it isn't stylish, but it almost did what I describe just below).

Use case #1

Anyway, I'm sure one of the use cases we want to reach can be described very clearly and pragmatically for everyone. Let's go back to our Amazon example. There is a guy called the "admin guy" who creates a CCK node type "cck_music_album" with the fields title, artist, and image, for instance. He uses one of the modules (to be written, or an already existing one from the community) to say: "here is an XML sample of an Amazon music album". Then the admin part lets him choose which XML field should go to which specific CCK field (basically, this is the mapper, whichever one we use).
Once that is done, suppose the end user now wants to create a cck_music_album. He basically has two choices (so he'll see a pre-form choice before the actual add form):

  • create the node using the traditional form.
  • make a search through Amazon (we don't really care about this part in this specific discussion), get an XML element in return, and save what we need the way the admin guy configured it.

NB: the "save part" could be handled in different ways:

  • Way #1: the XML data is copied onto the CCK add form (field by field), so that the user can alter the data through the traditional add form (as default values) before it's actually saved to the CCK tables (this might require either smart use of default values or a lot of JavaScript widgets developed and declared in the binder by the admin; for example, a date in the XML would give birth to another date format on the CCK add form).
  • Way #1bis: a variant of this proposition is to use the (hook_execute thing) if we don't want the user to save the form.
  • Way #2: there is no copying or alteration of the XML data onto the CCK add form; it goes directly into the database (I don't know how this works for FeedAPI, but it should be very similar, right?). See the sketch after this list.
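
For Way #2, the direct save would roughly amount to building a node object from the mapped XML values and handing it to node_save(). A sketch, assuming a SimpleXML element; $album_xml and the field names are invented:

  <?php
  // Hypothetical direct save for Way #2: copy the mapped Amazon XML
  // values straight into a CCK node and save it, bypassing the form.
  $node = new stdClass();
  $node->type = 'cck_music_album';
  $node->title = (string) $album_xml->title;
  $node->field_artist[0]['value'] = (string) $album_xml->artist;
  $node->field_image[0]['value'] = (string) $album_xml->image_url;
  node_save($node);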

Architecture

First, I won't pretend this is a global use case; obviously there are plenty of other things to do. But this one has a clear advantage: it's easy to understand and easy to use (an Amazon service consumer already exists, for instance), and we can see what the possible layers might be:

  • Layer 1: the data source itself. Why not start with Amazon, which is one of the most famous existing web services and is already well used in Drupal?
    What can be extended later? The "many service sources" thing, and many services used at the same time to create mixed content... why not? But not for today.
  • Layer 2: the data source admin mapping (low-level mapping: pure XML data to database, field by field; or high-level mapping: XML data presented on the CCK form, then using CCK form processing to actually save the data into the database?).
    What can be extended later? The "XML thing". Already existing modules have been perfectly pointed out in a former post, so I have nothing to add on that part.
  • Layer 3: the data processing. It can be low level (save XML to the database; there are obviously some modules dealing with that same issue) or high level (just save the CCK form).

A big issue to be solved in this first model:

I would say that I'm no longer a huge fan of the high-level solution (going through JavaScript / form_execute-like processing), even though we started with this option.
Our major issue was pictures. If you take a picture coming from Amazon, you basically have only one URL, and the CCK image widget should display the image at that URL. But if the user saves his own picture instead, it should be displayed using imagefield. This duality was not solved in our solution, but solving it is now mandatory, for us at least.

Proposition & management:

It has been said before; there is a strategic development battle plan to follow: make use and reuse of what has been done in other modules, be driven by pragmatic use cases, and separate things into layers, each of them being a "working corpus".

My company is willing to sponsor some "before GSoC" work, leading to a first pragmatic release of something like an Amazon binding to CCK (solving the image issue), which would be a great starting point for the "growing and abstracting" project during the actual GSoC.

So we are completely open to any proposal that makes this project go forward!

--
Laurent Pipitone
BiliB Co-Founder


A user's perspective

renders

Hi all. I am not a programmer but a user. We are a radio station that could use this sort of functionality in our operation. Currently, we use a closed news editing system to create newscasts for on-air use. The news content is stored in a file that is mostly HTML. We use a VBScript to extract the text and push it into the MySQL database for presentation on the website. It works fine... BUT I am keen on moving from a custom-coded PHP site to Drupal. If this module existed, the move would be quite doable.

Thanks

Rob

doobiedoo

jpetso

I have now tried something like this with a module called Transformations. Anyone who stumbles upon this page might find it worth a look.
