Import/Export API: +5

dado

I have been a recent contributor to node_import, and I just published a new module in this space, scraper, which is an HTML scraping tool. I definitely think this project is needed.

This API potentially has major implications for Drupal core, and should probably some day be a part of Drupal core. I have been surprised to find how much Drupal core is tied to the form for doing node operations. For example, I found that node validation is almost totally hard-coded to the node form submission process. I found this to be a major challenge for permitting node_import to validate nodes it is about to import. Perhaps you will find better workarounds to validating a node as you import it.
I encourage you to check with Robrecht Jaques, who is working on an overhaul of the node_import module. He might have solved some of these problems.

Here is my wish list for this API

  • hooks/methods of feeding an array in for import. That is, data to be imported need not be first converted to XML
  • Perhaps the API could support a methodology for converting nodes from 1 type to another. Since converting types effectively involves exporting (and deleting) a node of 1 type and importing the node of the other type, I would think this API could do it.
  • support complex rules for determining whether the node to be imported already exists (e.g. user/client can specify which field or fields identify uniqueness for a particular node type)
  • when importing, permit some discovery of similar nodes
  • support multiple actions when a node to be imported already exists (e.g. user/client can specify which new fields overwrite which old fields in the already-existing node)
  • implementing modules could support multiple options for the import process: import without question, show a form containing every node to be imported, with ability to check off/on each node. in other words, hopefully you would expose hooks into the middle of the import process to permit intervention by user.

I look forward to seeing your developments!

    Comments

    node validation

    moshe weitzman

    Adrian is working on FAPI 2.0, where programmatically creating nodes is easier. 4.7 is in an intermediate state where it is quite hard to do validation, as you've noted. Fear not, we will get this right. In fact, programmatically "submitting" any form will be quite easy.

    Feedback

    Jaza

    Thanks very much for your feedback, dado! Please see my comments below...

    I have been surprised to find how much Drupal core is tied to the form for doing node operations. For example, I found that node validation is almost totally hard-coded to the node form submission process. I found this to be a major challenge for permitting node_import to validate nodes it is about to import. Perhaps you will find better workarounds to validating a node as you import it.

    As moshe explained, the validation issue is caused by the fact that 4.7/FAPI 1.0 is an unfinished solution that will certainly be cleaned up with 4.8/FAPI 2.0. If you download Drupal 4.6, you will see that validation was previously tied directly to the various _save() functions (e.g. node_save, user_save). This is one spot where 4.6 was actually "better" than 4.7 (although pretty much anything else form-related was worse in 4.6). The idea is that we're going one step backward (with validation) in 4.7, but we'll be going two steps forward in 4.8. =)

    Also, I am very lucky to have Adrian mentoring me on this project, and he has thrown around the idea of the import / export API (or the data definition part of it, at least) eventually becoming the foundation of FAPI 2.0. I still need to discuss this more with Adrian, as I'm not sure exactly what he's got in mind at the moment. However, if this is the case, then the whole validation callback system will be moved to the data definition layer that is currently being developed within my module.

    I encourage you to check with Robrecht Jaques, who is working on an overhaul of the node_import module. He might have solved some of these problems.

    I have communicated a little with Robrecht. He too has been troubled by the problems with programmatically creating nodes in 4.7 (mainly the validation issue). He has also indicated that he supports the development of the import / export API, and that he will consider rewriting node_import to be built on top of it when the API is complete.

    Here is my wish list for this API
    # hooks/methods of feeding an array in for import. That is, data to be imported need not be first converted to XML

    The API will definitely cater to this. Every import or export is explicitly separated into a 'get' operation (e.g. get an XML string, turn it into a structured array), and a 'put' operation (e.g. take a structured array, 'put' its values into the database). Modules will be able to handle getting the data on their own, if they wish, in which case they can just run the 'put' operation directly.
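
    The get/put split described above can be sketched roughly as follows. This is an illustration in Python rather than Drupal's PHP, and every name in it (get_from_xml, put_records, the list standing in for the database) is hypothetical, not part of the actual API.

```python
import xml.etree.ElementTree as ET


def get_from_xml(xml_string):
    """'get' step: turn an XML string into a list of plain dicts."""
    root = ET.fromstring(xml_string)
    # Each child of the root is one record; each grandchild is one field.
    return [{field.tag: field.text for field in item} for item in root]


def put_records(records, store):
    """'put' step: write structured records into a store.

    `store` is a plain list standing in for the database (an assumption
    made only to keep the sketch self-contained).
    """
    for record in records:
        store.append(dict(record))
    return len(records)


store = []

# Full import: 'get' from XML, then 'put'.
xml_data = "<nodes><node><title>First</title><type>story</type></node></nodes>"
put_records(get_from_xml(xml_data), store)

# A module that already has an array can skip 'get' and call 'put' directly.
put_records([{"title": "Second", "type": "page"}], store)
```

    The point of the separation is that the 'put' side never needs to know where the array came from, so no XML round trip is required.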

    # Perhaps the API could support a methodology for converting nodes from 1 type to another. Since converting types effectively involves exporting (and deleting) a node of 1 type and importing the node of the other type, I would think this API could do it.

    It would certainly be possible to write a module to export a node, change its type, and re-import it. But I think that this is a bit of a convoluted solution to a simple problem. Modules can just run a simple query, such as UPDATE {node} SET type = 'newtype' WHERE nid = 'x', to do this. Perhaps this belongs in CCK?
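
    As a rough, self-contained illustration of that query (Python with an in-memory SQLite table standing in for Drupal's {node} table; the change_node_type helper and the simplified schema are assumptions, not Drupal code):

```python
import sqlite3

# In-memory stand-in for the {node} table, reduced to three columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT, title TEXT)")
conn.execute("INSERT INTO node (nid, type, title) VALUES (1, 'story', 'Example')")


def change_node_type(conn, nid, new_type):
    """Change a node's type in place; returns the number of rows updated."""
    # Placeholders keep the values out of the SQL string itself.
    cur = conn.execute("UPDATE node SET type = ? WHERE nid = ?", (new_type, nid))
    conn.commit()
    return cur.rowcount


changed = change_node_type(conn, 1, "page")
```

    Note that this only touches the type column. Any type-specific data held in other tables would presumably need separate handling, which may be why the question of where this belongs comes up.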

    # support complex rules for determining whether the node to be imported already exists (e.g. user/client can specify which field or fields identify uniqueness for a particular node type)

    Yep - I plan for much of this to be built-in.
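
    One way such a uniqueness rule could look, purely as a sketch: the caller names the fields that identify a node of a given type, and the importer compares incoming records against existing ones on those fields only. The code is Python for brevity, and find_existing, UNIQUE_KEYS, and the sample data are all made up for illustration.

```python
def find_existing(record, existing, key_fields):
    """Return the first existing record that matches `record` on every
    key field, or None if there is no match."""
    key = tuple(record.get(f) for f in key_fields)
    for candidate in existing:
        if tuple(candidate.get(f) for f in key_fields) == key:
            return candidate
    return None


# Hypothetical per-type configuration: which fields identify uniqueness.
UNIQUE_KEYS = {
    "event": ["title", "date"],
    "page": ["title"],
}

existing = [
    {"type": "event", "title": "Meetup", "date": "2006-01-01", "body": "old"},
]
incoming = {"type": "event", "title": "Meetup", "date": "2006-01-01", "body": "new"}

# Matches on title + date even though the body differs.
match = find_existing(incoming, existing, UNIQUE_KEYS[incoming["type"]])
```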

    # when importing, permit some discovery of similar nodes
    # support multiple actions when a node to be imported already exists (e.g. user/client can specify which new fields overwrite which old fields in the already-existing node)
    # implementing modules could support multiple options for the import process: import without question, show a form containing every node to be imported, with ability to check off/on each node. in other words, hopefully you would expose hooks into the middle of the import process to permit intervention by user.

    These things seem like more advanced features to me, and I haven't really considered them as yet. But I'll give them some thought.

    Jeremy Epstein - GreenAsh


    Importing from Word, Open Office

    jasonwhat

    I'm curious whether this import/export API is only for importing several nodes from data sets (like Excel), or whether it will also help with things like importing from Word or Open Office. Word 2007 allows importing XML schemas for exporting Word docs.

    I think a module exists that brings in content from Macromedia contributor, but I think adding functions like "Import from Word" could make Drupal much more business-friendly and more widely adopted. As much as the RTEs and a cleaned-up node submission form help, I still find people are just more comfortable doing what they know (in my case I often have them email submissions), and unfortunately, what most people know is Word.

    Errr

    Jaza

    This is certainly not something that I'll be working on personally - but the API allows module developers to write their own import and export engines for different formats. So once the API is ready, someone could write a module (that depends on the API) that implements, for example, an import engine for the native OpenOffice Writer XML format. I don't know about importing a Word document directly - Word documents are stored in a closed-source binary format, so that would be a lot harder.

    The API is certainly geared much more towards importing large sets of data, from a nicely structured plain-text listing. Its main aims are things such as easily facilitating data structures that have relationships (i.e. references) between each other, and allowing custom mappings for field names. Once a Word document is converted to HTML, and the HTML is tidied up, the whole document would simply become the body of a node. This is just one field, in one table, in the database. So, if you wanted to import a Word document, you'd find that the API's features don't really help with it, as it's a very different problem to the ones that the API is trying to solve.
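
    To make the contrast concrete, here is a small sketch of the kind of problem meant by "custom mappings for field names" and "references": source fields are renamed to destination fields, records get new IDs on import, and references between records are rewritten to point at the new IDs. It is Python rather than PHP, and apply_mapping, import_with_references, and the sample field names are all hypothetical.

```python
def apply_mapping(record, field_map):
    """Rename source fields to destination fields; unmapped fields keep their names."""
    return {field_map.get(key, key): value for key, value in record.items()}


def import_with_references(records, field_map, ref_field):
    """Assign new IDs, then rewrite each reference from old source ID to new ID.

    Two passes are used so that a record may refer to one that appears
    later in the input.
    """
    id_map = {}
    imported = []
    for new_id, source in enumerate(records, start=1):
        mapped = apply_mapping(source, field_map)
        id_map[mapped["id"]] = new_id  # remember old ID -> new ID
        mapped["id"] = new_id
        imported.append(mapped)
    for record in imported:
        if record.get(ref_field) is not None:
            record[ref_field] = id_map[record[ref_field]]
    return imported


source_rows = [
    {"ID": 10, "Name": "alice", "Parent": None},
    {"ID": 20, "Name": "bob", "Parent": 10},
]
field_map = {"ID": "id", "Name": "title", "Parent": "parent"}

result = import_with_references(source_rows, field_map, ref_field="parent")
```

    A single Word document has neither multiple fields to map nor references to resolve, which is why none of this machinery would help with it.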

    Jeremy Epstein - GreenAsh
