As I have worked with different content management and middleware solutions, one problem I have encountered numerous times, and one that I think needs a better solution in drupal, is the import and export of data and the provision of web services. I think it would be good for drupal to have an import/export framework that lets developers build re-usable, sharable import and export modules by implementing PHP objects against a few common interfaces and then registering each new import/export type with a central, common import/export module.
This is especially relevant for modern corporate newspapers, which need to integrate content from different legacy systems, process that content, send it off to a number of different web publishing systems, and also face the prospect of their systems changing if the company is bought or sold. The framework outlined in the design below could also be a means of implementing REST services (SOAP or XML-RPC would probably require another layer on top of this, though you could probably do it in this framework).
In analyzing what a modern corporate newspaper needs in a middleware product, here are some requirements I've seen:
- needs to be able to import from pagination systems and other systems, to allow flexibility in pulling the results of existing workflows online without requiring the people who create the content to change what they do.
- needs to be able to export to multiple different systems and repositories, including news archives, content management systems, search indexes, text-taggers, etc.
- needs a way to add and develop these imports and exports so that they can be swapped in and turned on or off by adding plugins to an import/export framework module.
- wants import and export modules developed against an API, so you can implement a class with certain methods and be guaranteed that it will work within the import/export framework.
- common delivery and receipt mechanisms should be abstracted out and included in the module, so that you don't have to re-implement a directory file sweeper every time you need one for inbound XML files, and so that an HTTP or FTP client need not be re-implemented every time you need to send data over HTTP or FTP a file.
- the flow of an imported or exported document should be abstracted out, so that only the pieces of the workflow that differ for a given import or export require implementation when adding a new import or export type.
- modules supporting a given import or export format should be easy to add to the framework, so that once someone implements a great NITF import, for example, you can easily grab it and re-use it.
- probably need some object-oriented way to represent drupal nodes, too, so that we can place imported data into these drupal objects, allowing us to abstract out the processes for moving data into and out of drupal.
An import or export type could be defined as:
- a method for identifying a file as one the type is designed to process
- an inbound transformation
- an implementation of the import/export logic
- a result transformation
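The type definition above implies a couple of interfaces. Here is a hypothetical sketch; the names Transformation and ImportExportProcessor come from the proposal itself, but everything else (method names, the sample class) is an illustrative assumption, not an existing drupal API:

```php
<?php
// Sketch of the interfaces an import/export type would implement.
// Method names and the sample class are assumptions for illustration.

interface Transformation {
  // Convert data from one format to another; the same interface is
  // used for inbound transformations and result transformations.
  public function transform($data);
}

interface ImportExportProcessor {
  // Return TRUE if this type recognizes the given raw file contents.
  public function accepts($data);
  // Perform the actual import or export, returning a result document.
  public function process($data);
}

// A trivial Transformation, just to show the shape of an implementation.
class PassthroughTransformation implements Transformation {
  public function transform($data) {
    return $data;
  }
}
```

Any class implementing these interfaces could then be dropped into the framework without the framework knowing anything about the specific format it handles.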
The flow in would look something like this:
- the receipt mechanism accepts incoming data (to start, this would probably be a directory sweeper and an HTTP/HTTPS receipt page).
- figure out the type of the incoming file. For the directory sweeper, you could have one directory per type to start, so the directory where the file is dropped tells you its type. For HTTP, you can allow an HTTP parameter to name the import type. You probably also want a hook for each import type: pass the contents of an incoming file to a predefined method in each import type's implementation that returns a boolean, true if it is a document or file that type is able to import. You might then create a master function that accepts a file, runs through these functions, and returns the type.
- based on the type, call the inbound transform, then pass the result to the object for the selected import type, which implements the ImportExportProcessor interface. The easiest initial options for the import transformation would be XSLT and a PHP class that implements a Transformation interface (the same interface for import and export, in case you ever need to re-use one).
- depending on the type, deliver the response the ImportExportProcessor returns to the submitter (if HTTP/HTTPS and not asynchronous, return the result as the response to the request). You could have a result transformation as well, with a hook in each import type for transforming the result, so you could re-use ImportExportProcessor implementations and just change the import and export transformations.
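The "master function" for type detection described in the steps above could be sketched like this. The NitfProcessor class and its crude content-sniffing rule are made-up examples, not real modules:

```php
<?php
// Sketch of type detection: each registered type gets a chance to
// claim the incoming file. All names here are illustrative assumptions.

interface ImportExportProcessor {
  public function accepts($data);
  public function process($data);
}

class NitfProcessor implements ImportExportProcessor {
  public function accepts($data) {
    // Crude sniff: NITF documents carry an <nitf> root element.
    return strpos($data, '<nitf') !== FALSE;
  }
  public function process($data) {
    // A real implementation would parse the NITF and build a node.
    return 'imported NITF story';
  }
}

// Ask each registered type whether it can handle the data; return the
// name of the first type that claims it, or NULL if none do.
function importexport_detect_type($data, array $types) {
  foreach ($types as $name => $processor) {
    if ($processor->accepts($data)) {
      return $name;
    }
  }
  return NULL;
}
```

A directory sweeper or HTTP receipt page would call the detection function once per incoming file and then dispatch to the matching processor.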
The flow out would be similar:
- the receipt mechanism accepts data for export (could be a directory sweeper, PHP code that lets a process look up a service and submit directly, or a re-use of the HTTP/HTTPS listener page to do the lookup and processing).
- figure out the type of the file, probably using the same method as above.
- for that type, figure out how to transform the data and which ImportExportProcessor instance to use for the export, then transform, export, and file the result. You would probably want HTTPImportExportProcessor, HTTPSImportExportProcessor, and FTPImportExportProcessor implementations, so that you could define many different export types that are just different transformations followed by an HTTP or FTP request.
- return the result to the process that invoked the export, be it a PHP function call return value, an HTTP response, or something else (you could even send an asynchronous confirmation via HTTP or FTP, but abstract it out so that it is part of the overall framework: something you define by setting properties in each type's definition, not something you implement in each type).
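Both flows share the same shape: inbound transformation, processing, result transformation. A minimal sketch of that shared pipeline might look like the following; the concrete classes are invented purely to make the example runnable:

```php
<?php
// Sketch of the shared import/export flow. Only the three plugged-in
// parts differ between types; the flow itself is owned by the
// framework and never re-implemented. All names are assumptions.

interface Transformation {
  public function transform($data);
}

interface ImportExportProcessor {
  public function process($data);
}

// Toy implementations, for illustration only.
class UppercaseTransformation implements Transformation {
  public function transform($data) {
    return strtoupper($data);
  }
}

class EchoProcessor implements ImportExportProcessor {
  public function process($data) {
    return "processed: $data";
  }
}

// The framework's job: run inbound transform, processor, then result
// transform, and hand the result back to whatever invoked the flow.
function importexport_run($data, Transformation $in,
                          ImportExportProcessor $processor,
                          Transformation $out) {
  return $out->transform($processor->process($in->transform($data)));
}
```

Defining a new import or export type then amounts to supplying the three pluggable pieces; the delivery and receipt mechanisms wrap around this function.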
The framework would be configured through either an XML file or database tables that hold the type definitions. You would probably want to build an easy way to add, edit, or remove a type.
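As one possible shape for the XML variant of that configuration, a type definition might look like this; every element and attribute name here is purely illustrative, not a real schema:

```xml
<importexport-type name="nitf-import">
  <!-- method for identifying a file as one this type should process -->
  <detector class="NitfDetector"/>
  <!-- inbound transformation: an XSLT stylesheet or a PHP class -->
  <inbound-transform xslt="nitf-to-canonical.xsl"/>
  <!-- the ImportExportProcessor implementation to invoke -->
  <processor class="NitfImportProcessor"/>
  <!-- optional result transformation applied to the response -->
  <result-transform class="PassthroughTransformation"/>
</importexport-type>
```

The database-table variant would hold the same four pieces of information per type, with the admin UI editing rows instead of XML.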
Thoughts? I have implemented this before in Java and I think implementing the framework and a few test import/export types would be at most a summer-long project using PHP 5.
Comments
A really smart student did
A really smart student did such a thing, the importexportapi, as a summer project, and we ended up with software so complex that no one uses it. I am skeptical that software can be as functional as you describe, given how flexible drupal is, and still be simple enough to grok in an hour.
simplicity
I implemented something similar to this framework once, in Java, for a computer consulting company in Virginia. It wasn't necessarily something that could be grokked in an hour, but it is the result of looking at the common parts of integration processes and abstracting out the things that get re-used, and it turned out to be grokkable in a day or two, which was pretty good given the complexity of integrations.
We ended up using this framework to create a number of re-usable integrations. One tied the moodle learning management system into a university management system (it took a month or so to implement initially, but now that the underlying code exists, it can be installed and configured without any programming). Another took data from multiple Point of Sale systems that speak a number of different proprietary export languages, used transformations to convert them to a single XML dialect, then implemented importing of messages in that single dialect, so the different systems could all take advantage of the same set of APIs to interact with university financial aid systems to see if a student could buy books with their FA. (This is analogous to newspaper imports: write transformations to convert any incoming story format to one format you settle on, then implement logic to import that format.)
While there is initial development time for a given integration path, you can then re-use the modules (so you implement an NITF-to-drupal object once, then you just have to implement transformations to NITF from other formats and pass the NITF to the NITF-to-drupal object).
It would be a component that underlies things, though, and not necessarily something many end users would deal with, other than maybe to enable some of these pre-built modules and transformations.
And in relation to the importexportapi, this is at a much higher level. It isn't intended as an API for defining data entities; it is just a means of re-using and tying together transformations and the code that implements importing and exporting. For this to be useful, you'd probably also have to tie in a way to show the pre-defined node types that the transformation code knows how to work with.
I have built it before, will probably build it again myself if I end up using drupal, but figured I'd throw it out there to see if anyone else is interested, or has seen anything like it. Thanks for the thoughts, and for pointing me to that module!
how to rephrase
Moshe, you are one of the experts in this field, since you even started a company for data import. If the project as described is too much of a bite, what could Jonathan do that builds on last year's SoC project and would make a difference in this field?
I suggest reviewing the
I suggest reviewing the issue queues for one of the projects that jpetso mentions below and completing one of the meatier items there.
Thoughts
I will need an import/export framework as well, and will probably start working on this in one or two months. The above proposal doesn't look bad; I'll just throw my thoughts in here.
Overall, I think you're on the right track with an extensible, API-based approach. The most important part, in my opinion, is to avoid creating a completely new import/export ecosystem from scratch. This was tried in a previous Summer of Code (2006) with the Import/Export API, which is now slowly rotting away because of the high effort of maintaining a complete import/export framework including plugins. So your key question must be how to share as much code as possible with the community, and how to cooperate with people who need stuff like this anyway.
You absolutely need to investigate different existing solutions and have an eye on reusing concepts and/or code. Here's a (possibly incomplete) list of prior art:
So, you'll need to find ways to work together and integrate with at least one of those (maybe more), make it possible for any of them to utilize your framework (or for your framework to utilize some of them), and communicate a lot with the maintainers of those solutions. If you work on your project alone, without support from other people with similar needs, you've already lost.
The second large issue
...is to consider use cases different from the newspaper workflow that you're coming from. I'll just add our requirements, perhaps that can help in generalizing the proposed architecture and allowing further extensibility.
FeedAPI has the requirement of fetching feeds from a URL; Node Import gets them from an uploadable text file, as far as I can see. Feed Element Mapper already has a cool mapping interface; it might be worthwhile to extend and generalize it so that it accommodates generic field mappings for import and export. (Improving the existing project is of course preferable to forking it.)
The third big issue
...is to keep the API as minimal and extensible as possible, so that further maintenance of the project is not one large piece but can be split up into a set of independent, smaller modules. That way, once you or anyone else loses interest in maintaining parts of the framework, those parts can be left behind without slowing down the whole project. This is the fundamental issue that hit the Import/Export API; you'd do well to avoid large maintenance responsibilities. (That's especially true for a Summer of Code project, where it's not clear that the student will keep the motivation and/or funding to continue working hard on the project.)
Other stuff
I think you might want to decouple the import/export processor (XML, RDF, CSV, RSS, Atom, YAML, ...) from the transport (file import/export, HTTP(S), FTP), as those are really two different pairs of shoes. Also, figuring out the file type is probably something a single processor should be responsible for; I don't think it belongs in the core of the whole framework and workflow.
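The decoupling suggested here could be sketched as two separate interfaces that the framework composes, so any transport can be paired with any processor. All class and method names below are assumptions for illustration:

```php
<?php
// Sketch of decoupling format processing from transport.
// Names are illustrative, not an existing drupal API.

interface FormatProcessor {
  // Parse raw data in one format (XML, CSV, RSS, ...) into rows/fields.
  public function parse($raw);
}

interface Transport {
  // Fetch raw data from a source (local file, HTTP(S), FTP, ...).
  public function fetch($source);
}

class CsvProcessor implements FormatProcessor {
  public function parse($raw) {
    // One str_getcsv() call per line of input.
    return array_map('str_getcsv', explode("\n", trim($raw)));
  }
}

class FileTransport implements Transport {
  public function fetch($source) {
    return file_get_contents($source);
  }
}

// The framework composes the two: swap FileTransport for an HTTP or
// FTP transport without touching the processor, and vice versa.
function importexport_import(Transport $t, FormatProcessor $p, $source) {
  return $p->parse($t->fetch($source));
}
```

With this split, adding HTTPS fetching or YAML parsing is one new class each, rather than one new class per format/transport combination.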
Finally, avoid thinking big. Despite the many use cases mentioned here, start small with a well-defined set of tasks that work on their own. When coded with an eye on the big picture, the framework can always be extended later, but an overengineered framework that tries to do everything will die a painful death sooner or later. Do it step by step; it's crucial for a thing like this to succeed.
Ok, that's pretty much the conclusion of my last month of thinking about stuff like this. I appreciate your proposal and hope it gets somewhere, but this is dangerous ground. Beware.
Spot On!
A project from last year's proposals which I thought showed a lot of promise was "Dublin Core integration into CCK" (http://groups.drupal.org/node/3344) - the RDF API makes me very happy.
jpetso, you've done an amazing job with your feedback here. I'd like to second the thoughts about keeping it small and decoupling the "processors" from the "transport".
What I would personally like to see in the near future is a semantic format to access all CCK data.
This is a great subject, it so needs to be done!
I read all the explanations above, and I feel that at this point this proposal might be in danger of not being pursued. However, it also seems very important for the community to open the gates of a web services entry point for CCK data types.
NB: I will only speak about CCK in this comment, because I truthfully believe it is not Drupal’s job to deal with entries other than content (which already covers a lot...). Building another “ws entry for all table types” goes a bit far for now. Let’s be realistic: there are other frameworks for manipulating all table/data types; Drupal is the one that makes life easier for community content.
We can help, we also started that!
I have to say that a team in my company worked in that specific area last summer, and they released some code enabling entry of the Yahoo API & Amazon API into CCK node fields (using an administrative binder from an XML sample to a set of CCK fields). Unfortunately, I don’t have a demo, because we have changed a lot of our ideas since last summer and this functionality broke (there is still some code I can provide, not stylish, but code that almost did what I describe just below).
Usecase #1
Anyway, I’m sure one of the use cases we want to reach can be described very clearly and pragmatically for everyone. Let’s go back to our Amazon example. A guy called the “admin guy” creates a CCK node type “cck_music_album” with the fields title, artist & image, for instance. He uses one of the modules (to be written, or an existing one from the community) to say: hey, “here is an XML sample of an Amazon music album”. Then the admin part lets him choose which XML field should go to which specific CCK field (basically, it’s the mapper, whichever one we use).
Once that’s done, suppose the end user now has to create a cck_music_album. He basically has two choices (so he’ll get a pre-form choice before the actual add form)
NB: the “save part” might be handled with different solutions:
Architecture
First, I won’t pretend this is a “global use case”; obviously there are plenty of other things to do. But this one has a clear advantage: it’s easy to understand and easy to use (because an Amazon service consumer already exists, for instance), and we can see what the possible layers might be:
What can be extended later? The “many service sources” thing, and many services used at the same time to create mixed content... why not? But not for today.
What can be extended later? The “xml thing”. Already-existing modules have been perfectly pointed out in a former post, so I’ve nothing to add on that part.
Big issue to be solved in this first model:
I would say that I’m no longer a huge fan of the high-level solution going through javascript/form_execute, even though we started with this option.
Our major issue was pictures. If you take a picture coming from Amazon, you basically have only one URL, and the CCK image widget should display the image from that URL. But if the user instead saves his own picture, it should use imagefield to be displayed. This duality was not solved in our solution, but solving it is now mandatory, for us at least.
Proposition & management:
It has been said before: there is a strategic development battle plan to follow. Make use and reuse of what has been done in other modules, be driven by pragmatic use cases, and separate things into layers, each of them being a “working corpus”.
My company is willing to sponsor some “before-gSoC work”, leading to a first pragmatic release of something like Amazon binding to CCK (solving the image issue), which would be a great starting point for dealing with the “growing and abstracting” project during the actual gSoC.
So we are completely open for any proposal which makes this project go forward!
--
Laurent Pipitone
BiliB Co-Founder
A user's perspective
Hi all. I am not a programmer but a user. We are a radio station that could use this sort of functionality in our operation. Currently, we use a closed news editing system to create newscasts for on-air. The news content is stored in a file that is mostly HTML. We use a VBScript to extract the text and push it into the MySQL database for presentation on the website. It works fine. BUT, I am keen on moving from a custom-coded PHP site to drupal. If this module existed, the move would be quite doable.
Thanks
Rob
CKNX Radio
doobiedoo
I have now tried something like this with a module called Transformations. Someone who stumbles upon this page might find it worth a look.