DocImport API module

agentrickard's picture

[Note title change to reflect discussion below.]

Moved to official ideas list at http://drupal.org/node/236461

I did a little searching and found that the issue keeps being raised.

Attitudes regarding Microsoft aside, I think it would be a worthwhile project to create a true Word-to-Node import module. Its basic features would be:

-- An upload page that lets the user select a Word (or similar OpenOffice) document for upload/import.
-- A setting (both default and user controlled) for mapping the document to a node type.
-- On upload, strips the character data from the document, cleans it, and creates a new node.
-- Takes the user to the node edit page for the newly created node.

The conversion could rely on an external library for the text transformation rather than a Drupal-native parser.

And, no, I do not think that TinyMCE is a solution to this issue.

The module should also be extensible to other types of "foreign" content conversion. Notably PDF, Excel, PowerPoint.

Comments

Yes please

gdd's picture

This would be a great addition, especially if it were written in such a way that you could plug in the converters you wanted. Everyone with a non-technical editorial team fights this battle constantly. +1

I am one of those who raised

Tistur's picture

I am one of those who raised it recently. I toyed with the idea of submitting a (pre-)proposal based on it (I'm a university student), but could not figure out how I would implement it. As you can see from my post (also linked above), I played with wvWare. Is that too limited for a general module? I keep going back and forth on it - it's GPL, and is probably already on many shared hosts (it's on at least one of my hosts), so maybe I'm worrying too much?

I've done Word to Drupal

KarenS's picture

I've done Word to Drupal conversion with help from modules like http://drupal.org/project/htmltidy to clean up the code. There are probably other modules that would help, so incorporating or extending or improving on them could be a part of a solution. MS Word html is so ugly that even HTML tidy can't completely do the job, though. You probably need to do some clean up before HTML Tidy will work.

The wvWare html is ugly too,

Tistur's picture

The wvWare html is ugly too, but not nearly as ugly as Word html (is anything?). Maybe a combination would work.

We used the same + a JS "on

michaelfavia's picture

We used the same, plus a JS "on paste" plugin for TinyMCE or FCKeditor to strip out complex, useless Word markup and for WYSIWYG functionality. This was still a troubled path for us, though, by nature.

Not HTML

agentrickard's picture

@KarenS

I am not talking about working with Word's HTML export.

I want to press a button, suck in a Word document, and have Drupal create a node based on the content.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

I agree that this would be a

greggles's picture

I agree that this would be a great way to get content into Drupal for non technical users and also to get legacy content from "that windows file share that everyone uses as the intranet" into Drupal. However, if the importer loses their formatting along the way then it will not be useful. In order to retain their formatting I think that a doc2html solution will be required and then an html to clean html solution as well.

So the flow is something like

word -> html
html -> drupal_variable
drupal_variable -> cleaned_drupal_variable
cleaned_drupal_variable -> node

There are doc2html converters that can help with this, but they rely on command line. There are probably php based doc2html converters as well, though what I found in a quick search was not inspiring.
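The four-step flow above can be sketched as a chain of small functions. This is only an illustration in Python, not Drupal code: the `Node` class, the function names and the stub converter are all hypothetical, and a real first stage would call an external doc2html tool rather than wrap lines of text.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a Drupal node: a type, a title and a body."""
    type: str
    title: str
    body: str


def doc_to_html(raw_text: str) -> str:
    """Stage 1 (word -> html). Stub: wraps each non-empty line in a
    Word-style paragraph. A real converter would be an external tool."""
    lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    return "".join(f'<p class="MsoNormal">{html.escape(ln)}</p>' for ln in lines)


def clean_html(markup: str) -> str:
    """Stage 3 (variable -> cleaned variable): drop class/style attributes."""
    return re.sub(r'\s+(?:class|style)="[^"]*"', "", markup)


def html_to_node(markup: str, node_type: str = "page") -> Node:
    """Stage 4 (cleaned variable -> node): first paragraph becomes the title."""
    first = re.search(r"<p>(.*?)</p>", markup, flags=re.S)
    title = html.unescape(first.group(1)) if first else "Untitled import"
    return Node(type=node_type, title=title, body=markup)


def import_document(raw_text: str) -> Node:
    markup = doc_to_html(raw_text)   # word -> html (held in a variable)
    cleaned = clean_html(markup)     # variable -> cleaned variable
    return html_to_node(cleaned)     # cleaned variable -> node
```

Keeping each stage separate means the converter and the cleaner can be swapped independently, which is the point of splitting the flow in the first place.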

--
Open Prediction Markets | Drupal Dashboard

It seems like a combination

gdd's picture

It seems like a combination of wvWare and Tidy could be brought to bear on the problem. wvWare does the doc->html and Tidy could then clean it up. You could also (I assume anyways) extract a bunch of the metadata from the word doc and use it for the node's fielded data - author, original create date, title, extract embedded images and do something with them (either inline them in the html or turn them into imagefields). I'm stoked about the prospects.

I TOTALLY agree

zoon_unit's picture

Although I think this would be best implemented as a button in Drupal 7's wysiwyg editor.

A module would be a great start though. The editor could make a call to the module, but then users would have the choice to use the module by itself or in conjunction with the editor.

do you mean WLW (windows live writer) ?

garthee's picture

I think it does the job.
Click a button (you can even choose the terms) and it will create the content. The resulting HTML is tidy enough.

I would rather have an enhanced Movable Type API (xmlrpc) that provides functionality to choose styles and do other chores.

Garthee
http://www.theebgar.net

XMLRPC

Matt V.'s picture

I haven't tried it myself, but I saw a blog post a while back about "Using Microsoft Word to Blog Posting with Drupal" via XMLRPC. Based on that post, it sounds like the technique is limited to blog posts and to MSWord 2007, but maybe this project could expand on that basic idea.

WLW is somewhat better :)

garthee's picture

I have been using WLW to post all ordinary articles (with images and taxonomy terms) to my site (www.theebgar.net), and it is done through xmlrpc (you have to select the Movable Type API in WLW).

It is not only for blogs but for any content type, and the best thing about it is that you don't have to upload the images: once you set the FTP location, it will create a folder for each post within it, upload the images, and make the link in the post. An excellent tool from Microsoft for free (unexpected from MS, huh :) ), and of course MS Word will need a license.

I would suggest improving xmlrpc.php to support more features. In my opinion this is trivial for an SoC project; if not, I would love to take that job myself. And perhaps on the client side, if an API or plugin support is available (which I am not sure about), we could extend the compatibility and features of WLW, which would make it an excellent tool for non-technical users (WLW has to be configured once).

XMLRPC and the non-technical

agentrickard's picture

Right, the key here is "non-technical users."

The push-button mechanism described in the original post is ideal because it makes sense to end-users. Perhaps the form processing could be routed through existing XMLRPC handling, but the end-user should never know that this is the case.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

I might suggest hooking it

dldege's picture

I might suggest hooking it up with the Services module instead of the core XMLRPC. I think the services API is more robust and I wouldn't be surprised to see some form of services moving into core in the future.

Dan DeGeest
Software Developer
Somewhere or Another

What is wrong with just

Tistur's picture

What is wrong with just uploading via http? Could someone explain to me what benefits using XMLRPC/SOAP/REST/Services Module would have over simple uploading?

Sorry for being dense.

Nothing

agentrickard's picture

I do not think that anything is wrong with HTTP. I believe that people are saying they already solved part of the problem using XMLRPC.

So leveraging existing code would be a good thing.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

Services would abstract out

dldege's picture

Services would abstract out that layer and also let you switch servers if that was desired. What I was suggesting was using Services XMLRPC vs. the core XMLRPC (which Services uses under the hood). But since you write it for services the API calls work with all the other server formats (AMFPHP, SOAP, JSON, etc.).

Just an idea - I don't totally get the whole scope of this project, so take it as more information to use as needed.

Dan DeGeest
Software Developer
Somewhere or Another

Scope

agentrickard's picture

The scope is quite simple, it's the implementation that's hard.

Use-case:

  • User has created a Word file (or similar document, but Word for now).
  • User goes to Drupal, clicks link to 'add content from Word file.'
  • User is presented an upload form with minimal instructions.
  • User selects and uploads file.
  • File is converted (by whatever means) to decent text/html content.
  • Content is placed in the body of a new node.
  • User is taken to the edit page for the new node.

How the magic happens between steps 4 and 5, I do not care to prescribe.
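The use case above, minus the magic, could be wired together roughly like this. It is a sketch in Python under stated assumptions: `NodeStore` and `handle_upload` are hypothetical names, the converter is passed in as a plain function so that the handler does not prescribe how conversion happens, and `node/<nid>/edit` is used as the familiar Drupal edit path.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict


@dataclass
class NodeStore:
    """Hypothetical in-memory replacement for saving nodes."""
    nodes: Dict[int, dict] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def save(self, node: dict) -> int:
        nid = next(self._ids)
        self.nodes[nid] = node
        return nid


def handle_upload(
    filename: str,
    data: bytes,
    convert: Callable[[bytes], str],
    store: NodeStore,
    node_type: str = "page",
) -> str:
    """Convert an uploaded file, create a node, return the edit path."""
    body = convert(data)                  # "by whatever means"
    title = filename.rsplit(".", 1)[0]    # default title: file name without extension
    nid = store.save({"type": node_type, "title": title, "body": body})
    return f"node/{nid}/edit"             # where the user is sent next
```

The user only ever sees the upload form and the edit page; everything in between is hidden behind `convert`.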

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

as long as we're dreaming,

greggles's picture

as long as we're dreaming, what about batch imports based on a directory local to the server?

i.e.

  • User chooses directory on the server where the files have already been uploaded (FTP, SCP, windows file share)
  • File is converted (by whatever means) to decent text/html content.
  • Content is placed in the body of a new node.
  • Use Batch API if necessary
  • User is shown a list of these newly imported nodes with links to view or edit them
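As a rough illustration of the batch idea, the loop below walks a server-local directory and imports matching files in fixed-size chunks, which is approximately what Batch API would do across requests. It is a Python sketch, not Drupal code; `batch_import`, the `convert` and `save` callbacks, and the default extension list are assumptions.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Sequence, Tuple


def batch_import(
    directory: str,
    convert: Callable[[bytes], str],
    save: Callable[[str, str], int],
    extensions: Sequence[str] = (".doc",),
    batch_size: int = 10,
) -> Iterator[List[Tuple[str, int]]]:
    """Yield one list of (file name, node id) pairs per processed batch."""
    files = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    for start in range(0, len(files), batch_size):
        batch = []
        for path in files[start:start + batch_size]:
            # Title defaults to the file name without its extension.
            nid = save(path.stem, convert(path.read_bytes()))
            batch.append((path.name, nid))
        yield batch
```

The yielded pairs are what the final step needs in order to show the list of newly imported nodes with view and edit links.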

--
Open Prediction Markets | Drupal Dashboard

nice!

catch's picture

Could save thousands of people trying to build home-grown intranets on the cheap from RSI that one.

Nice but...

agentrickard's picture

I would call that scope creep. Batch processing is a great goal for a phase two.

I spent much of last summer controlling scope on my project (FeedAPI) so that the student could finish the first cut at a valuable contribution.

If we pile too many requests onto a SoC project, we run the risk of failure. Worse, we could alienate the student from the process.

So my point here is this, the requirement should be:

-- Architect the solution to support extensions
-- Be aware that one extension would be batch processing of file directories

If time permits, certainly, a project should have a list of back-up features to implement. But batch processing by itself is perhaps too much to ask for in the first pass.

Now, you might make a case that batch processing of file directories into nodes is a separate SoC project, but you should post that in another thread...

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

For the magic - our company

dldege's picture

For the magic - our company is currently working on this exact same problem but with PowerPoint files. We have written a backend server application in .NET that handles the PowerPoint import (it splits the PowerPoint into deck and slide nodes that are related to each other via some additional custom tables). In our implementation the import process reads from and writes to the Drupal database directly, but this is the part I was suggesting could be moved out to Services. Some back-end process watches for new "word" nodes, or some table where a Word doc (the upload) is related back to a new node (which is still more or less empty). The application opens the document, does the necessary conversion of the content, and uses Services to do a node.save of the existing node, updating the body content and whatever else with the converted text. A lot more than just body text could be implemented. Since the application uses Services in this model, you have more flexibility in which OS, language, technology, etc. you use for the import processor - but the main point is that it doesn't have to be a Drupal/PHP module. In addition, the document conversion could take place client side in a Flash/Flex application (Silverlight might be appropriate here for the .NET/Word angle), with the converted content put into a node submit form that way or sent via Services.

Anyway, more food for thought.

Dan DeGeest
Software Developer
Somewhere or Another

I agree that this would be useful

echoditto_tom's picture

Something as simple as an input filter that applies these regexes would help quite a bit.
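The specific regexes referred to here did not survive in this thread, so the following is a stand-in rather than the original list: a small Python sketch of a filter that strips a few kinds of markup Word's HTML export commonly contains (conditional comments, `<o:p>` tags, `Mso*` classes, `mso-*` inline styles). The pattern list and the function name are assumptions, and regexes like these will not cope with every nesting case.

```python
import re

# Each entry: (pattern, replacement). Order matters: attributes are removed
# first so that the final pattern can unwrap spans left without attributes.
_WORD_PATTERNS = [
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),  # conditional comments
    (re.compile(r"</?o:p[^>]*>", re.I), ""),                       # <o:p> namespace tags
    (re.compile(r'\s+class="?Mso\w+"?', re.I), ""),               # class=MsoNormal etc.
    (re.compile(r'\s+style="[^"]*mso-[^"]*"', re.I), ""),         # styles using mso-* properties
    (re.compile(r"<span>(.*?)</span>", re.S | re.I), r"\1"),       # now-bare spans
]


def strip_word_markup(markup: str) -> str:
    """Apply each cleanup pattern in turn and return the result."""
    for pattern, replacement in _WORD_PATTERNS:
        markup = pattern.sub(replacement, markup)
    return markup
```

A pass like this is best seen as the "clean up before HTML Tidy" step mentioned earlier in the thread, not as a replacement for a real HTML parser.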

One approach that I've run into: Microsoft CMS offers a Word plugin that provides templated "print-to-CMS" functionality. This is superficially similar to the XMLRPC method outlined above, but I believe that creating a Word plugin would allow substantially more control (as well as permitting the component to be packaged in a tidy installer that users could easily deploy). Some related reading (I have to admit it's been a pleasantly long time since I've touched Visual Studio):

Word and COM Add-ins

Working with Office command bars in VBA

In .NET you can do this

dldege's picture

In .NET you can do this without much or any knowledge about COM via the interop assemblies for Word (or all of Office).

http://www.windowsdevcenter.com/pub/a/windows/2006/04/18/programming-wor...

Another potential option is to use OpenOffice as the back-end converter - we tried that with PowerPoint without much luck, but it's definitely possible.

Dan DeGeest
Software Developer
Somewhere or Another

Is it just because WORD is more popular

garthee's picture

I still can't understand what all the fuss about Word is here. Why can't it be WLW with more features (after all, it is free, unlike Word, and plugin support exists)? Is it just because WLW is new and we haven't used it much? An unfamiliar tool is not inferior to a more familiar but sluggish one merely for being unfamiliar.

Word may provide more features, but some of them are inappropriate or impossible to transfer to a node, such as complex design work with text boxes, borders and box layouts.

Drupal is not SharePoint! All we are looking for is a way to let users without technical knowledge post from a Word-like environment with a BUTTON CLICK, and to let them pick (manually or automatically) the parameters relevant to Drupal (menu / term / published / authored on). I am trying to work around that by enabling as many options as possible, and currently the menu is the only part failing. As for the technical configuration, we could propose a plugin or script that provides auto-configuration for Drupal, as exists for WordPress (perhaps the WordPress community readily welcomes WLW).

I am really sad to see an awesome tool that is becoming really popular in the blogging community being overlooked here at Drupal!

wlw wtf?

jpetso's picture

For unixoid people like me who had no idea what WLW is (WLW 700 AM radio station? WLW b2b directory?) - that means Windows Live Writer. Which is free, but still under a proprietary license. Also, it supports a number of blog APIs including Blogger API, MetaWeblog API and Movable Type API, all of which are also supported by Drupal core's Blog API, so blogging from WLW should pose no problem as far as I can see. And in case something goes wrong there, you can still use Ecto, KBlogger, Gnome Blog or whatever. So I'd say blogging from desktop clients is a solved problem.

The issue here is not support for a specific application but support for importing document formats. I can see the business need to do Word documents as those are in wide usage all over the company landscape, and supporting blog APIs doesn't solve the issue of importing and managing those documents that are used throughout company workflows.

However, I find it disturbing that Drupal, as an open source project, pays so little attention to supporting open standards. The original proposal mentions "or some similar OpenOffice document", which is wrong per se as OpenOffice documents don't exist anymore; that format has been replaced by the OpenDocument format, which is used across office suites (most prominently in OpenOffice, KOffice and Google Apps) and has been granted ISO standard status for quite some time now (as opposed to various other microsoftish formats, although that might still change).

It is my opinion that open source, open standards and open knowledge projects should work together towards the goal of openness everywhere. I can't help but to be seriously annoyed every time when people preach open source and open standards for a single domain (which would be Drupal, HTML and JavaScript in this case) but have no problems whatsoever for passionately supporting proprietary stuff, like MS Word, iPods and iPhones, or Flash, where it doesn't directly affect their business plan.

Imho, "Word import" as title for a Summer of Code project is seriously wrong. It should much rather be "Office format import" with either .doc or OpenDocument formats as possible implementations of those office formats. There's a lot of good reasons to love the Drupal community, but the lack of sense for the importance of an overall free ecosystem (with "free" not as in free beer but rather free speech and open source) sometimes makes me cringe.

Thanks for the reminder...

Shai's picture

+1 for your proposal for a name/scope change to the project to bring it more in line with an open source ethic. And those changes you propose wouldn't stop the project from solving most people's requirements regarding a solution for importing Word docs.

Regarding the Drupal culture issues. Part of what I like about the Drupal community is that the general tone of the community is so practical: roll-up-your-sleeves, let's get work done solving real problems. I think the general lack of a rigid, unified, intense ideological approach to things is something that makes the community more welcoming (and thus bigger and stronger). So I'm not "disturbed" that, for the moment, we are falling short on playing a stronger role in supporting open standards. That said, I think people are really open to the kind of criticism that you bring. Like when I read your post I thought, "Oh yeah, right."

In a community like this people play the role of leader and follower at different times. I think on this issue you are certainly a leader. Thank you.

Shai Gluskin
Content2zero

Agreed

agentrickard's picture

How about DocImport API as a project title?

What I really would like to see is a flexible import API that supports multiple formats. From an immediate business case, Word, proprietary though it is, seems an absolute necessity to me.

However, the project itself should be architected as you suggest -- an open format process that supports pluggable document translators. By default it should ship with open-source format supports -- and I would support that as a project.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

Word would be really useful

dldege's picture

Word would be really useful in an enterprise situation where users know two pieces of software - Word and Outlook. That's just how big corporate environments are. WLW looks interesting as a blogging tool, but I think the use case for Word might be a little different. Also, think about legacy documents and information that companies are probably keeping in Word docs in some network shared folder, where it can't benefit from any of the community aspects of Drupal like comments, tagging, collaboration.

Anyway, the technical side of this is interesting to me, not the one specific technology. Why not open a WLW proposal separately?

Dan DeGeest
Software Developer
Somewhere or Another

Also google docs and google

dldege's picture

Also, Google Docs and Google search have handling for Word documents, so you can view them as HTML and/or import them into gDocs. It might be interesting to find out how they do this, and/or maybe build some sort of module that can work with Google Docs with respect to Word files. I see that someone has already figured out how to use the blog API and Google Docs - http://driesknapen.net/blog/google-docs-drupal.

Dan DeGeest
Software Developer
Somewhere or Another

doh! -

Dan DeGeest
Software Developer
Somewhere or Another

Building a proposal...

bradfordcp@drupal.org's picture

I am interested in creating a proposal based on this idea, any pointers or ideas I should specifically target?

~Chris

I think a big part of it is

dldege's picture

I think a big part of it is doing a little research to figure out how, or the best way, to do the Word (or other format) conversion on the server side. After that's done, the part of creating nodes and so forth is pretty well known. I'd look at OpenOffice, Word interop in .NET, Google Docs, and anything else that might be doing something with Word conversion.

Another option would be a more generic and extensible module that abstracts out the actual format of the file being uploaded and allows for pluggable modules to handle specific content type conversions. Word could be a proof-of-concept choice for the first extension to the more generic "upload converter" module.

Just a couple of ideas off the top of my head.

Dan DeGeest
Software Developer
Somewhere or Another

Yes

agentrickard's picture

Another option would be a more generic and extensible module that abstracts out the actual format of the file being uploaded and allows for pluggable modules to handle specific content type conversions. Word could be a proof-of-concept choice for the first extension to the more generic "upload converter" module.

This extensibility is part of the scope and is integral to the design. See the last line of the initial proposal.

The module should also be extensible to other types of "foreign" content conversion. Notably PDF, Excel, PowerPoint.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

No wonder I thought it was

dldege's picture

No wonder I thought it was such a good idea ;) Thanks for pointing that out.

Dan DeGeest
Software Developer
Somewhere or Another

Matt V.'s picture

A recent Joel on Software post gives some background on the Office file formats, along with some ideas on importing and exporting them. I thought it might inspire some ideas here.

Introduction to my proposal

bradfordcp@drupal.org's picture

Here is the introduction to my proposal, a working copy can be found at http://soc-2008.deviantsolutions.net/bradfordcp-proposal. I am currently fleshing out the rest of the proposal, but I wanted to get some initial feedback as soon as possible.

The idea of a document importer for Drupal has many advantages, one of the foremost being users' experience with a desktop word processor. Switching from using a word processor to simply entering text into a page, or even using something like TinyMCE, can be daunting. The goal of this project is to allow users to use the already familiar desktop word processing application of their choice to create content to be used in Drupal. The process, as I see it, is as follows:

  1. User creates content using their word processor of choice
  2. User saves file within word processor and uploads to a Drupal powered website
  3. A Drupal module parses the uploaded document turning it into a node (or collection of nodes).

The module should NOT be word processor specific. Ideally the module would be able to detect file types / formats, calling the appropriate functions to parse that file. Using this methodology, file types would not be limited to word processor files, but could include presentation and spreadsheet files as well. My proposal has two key components: the first is a "router" which, upon seeing a file being uploaded, routes the contents of the file to the appropriate parsing functions. The second is the actual parsing routines called by the router. This method would allow maximum flexibility and extensibility for the project.

For instance, a user uploads a document in Word XML format. The router detects that it is XML, checks an internal registry of which parsing modules support XML input, then sends the data to each parsing module which supports XML. The modules return whether or not the XML is in a format they support (the spreadsheet parser would return false, while the document parser would return true). After the appropriate parser is found, its hooks are called to convert the supplied data (XML in this case) into HTML. Finally, the router routes the parsed data into a node (or collection of nodes). Following the routing approach, any number of modules may be created to parse the uploaded data. Should a format not be supported, it is merely a matter of creating the appropriate parser to handle the data. All file management is handled by the router, not the parser, keeping the code separated and clean.

The main module will be the router, with a set of hooks defined for the parsers. I realize the explanation above can be a bit confusing; I am not rolling the Word importer into one module. Over the course of the summer I plan to implement quite a few parsers using the hooks provided by the router.
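To make the router/parser split concrete, here is a minimal sketch in Python rather than Drupal hooks. The names `Parser`, `Router`, `supports` and `to_html` are hypothetical and simply mirror the two questions the router asks each parser: "is this data in a format you handle?" and "convert it to HTML".

```python
from typing import Callable, FrozenSet, List, NamedTuple, Optional


class Parser(NamedTuple):
    """What a parser module would register with the router."""
    name: str
    formats: FrozenSet[str]            # broad formats it accepts, e.g. {"xml"}
    supports: Callable[[bytes], bool]  # does it understand this particular file?
    to_html: Callable[[bytes], str]    # the actual conversion


class Router:
    """Owns the registry; parsers never touch file management."""

    def __init__(self) -> None:
        self._parsers: List[Parser] = []

    def register(self, parser: Parser) -> None:
        self._parsers.append(parser)

    def route(self, fmt: str, data: bytes) -> Optional[str]:
        """Return HTML from the first parser that claims the data, else None."""
        for parser in self._parsers:
            if fmt in parser.formats and parser.supports(data):
                return parser.to_html(data)
        return None
```

In the Word XML example, both a spreadsheet parser and a document parser would be registered for `"xml"`; only the one whose `supports` check returns true gets to convert the file.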

~Chris

Sounds good

jpetso's picture

Suggestion: Don't duplicate file type parsing, but rather rely on standard mimetypes that can even be retrieved by dopry's MimeDetect module already. Current mimetype databases are pretty complete, I'm pretty sure they already include recognition patterns for all of the binary MS Office formats (.doc, .xls, .ppt, ...), the XML office formats (.docx, .xlsx, .pptx, ...) and the OpenDocument formats (.odt, .ods, .odp, ...). Recognizing mimetypes is a solved problem, you shouldn't come up with a new solution from scratch here.
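For readers wondering what content-based detection involves, the sketch below shows the general idea in Python. It is not how MimeDetect or any mimetype database is implemented, and the recommendation above still stands: use an existing detector. The function name is hypothetical, the part names in `OOXML_MAIN_PARTS` are the conventional ones rather than guaranteed, and `application/x-ole-storage` is used only as a label for "some binary Office file".

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # binary .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                       # OOXML and OpenDocument are ZIP packages

OOXML_MAIN_PARTS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def sniff_office_type(data: bytes) -> str:
    """Guess a mimetype from file contents instead of the file extension."""
    if data.startswith(OLE2_MAGIC):
        return "application/x-ole-storage"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
                if "mimetype" in names:
                    # OpenDocument packages carry their type in this entry.
                    return archive.read("mimetype").decode("ascii").strip()
                for part, mimetype in OOXML_MAIN_PARTS.items():
                    if part in names:
                        return mimetype
        except zipfile.BadZipFile:
            pass
        return "application/zip"
    return "application/octet-stream"
```

Note that the OLE2 signature alone cannot tell .doc from .xls or .ppt, which is one more reason to lean on a maintained mimetype database.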

Also, I think joining forces with existing parsers instead of writing new parsers or XSL stylesheets might get us better HTML output and fewer maintenance issues. Remember, other people have similar issues and are working on them as well! For example, a quick Google search throws up an ODT to XHTML stylesheet/converter (and other ODT stuff), as well as the OOXML/ODT converter that also works with stylesheets and might make it possible not to touch the awkward XML of the new Office formats at all, but rather to transform it into ODT and from there into XHTML.

Lastly, I think that the original .doc and .xls files will probably be the most requested formats to import, as OOXML is still far from being as widely used and deployed as the old binary formats. Especially for those, you need to rely on external converters; parsing them manually is suicide. I remember AbiWord having a flexible command line interface, which could be a feasible transformation method for Unix servers. No idea about how to transform the .xls and .ppt formats, but I guess there's at least some simple converter lying around somewhere. I don't think doing .doc and .xls will be feasible for shared hosting where you can't install non-PHP external converters, but I might be wrong.
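Relying on an external command line converter could look roughly like the following Python sketch. Nothing here is specific to AbiWord or any other tool: the command is passed in as an argument list with `{src}` and `{dst}` placeholders, and both the function name and the placeholder convention are assumptions made for illustration. The availability check matters on shared hosting, where the converter may simply not be installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Sequence


class ConverterUnavailable(RuntimeError):
    """Raised when the external converter is not installed on this host."""


def convert_with_external_tool(
    command: Sequence[str],
    source: Path,
    target: Path,
    timeout: float = 60.0,
) -> str:
    """Run an external converter and return the HTML it wrote to `target`."""
    if shutil.which(command[0]) is None:
        raise ConverterUnavailable(command[0])
    # Substitute the placeholders; the file names are never passed through a shell.
    argv = [
        part.replace("{src}", str(source)).replace("{dst}", str(target))
        for part in command
    ]
    subprocess.run(argv, check=True, timeout=timeout, capture_output=True)
    return target.read_text(encoding="utf-8")
```

Whatever converter is chosen, its real command line would have to be checked against that tool's own documentation before filling in the template.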

Some notes...

bradfordcp@drupal.org's picture

I can see mimetypes being a much faster / efficient method of determining which parser to use. When I mention parsers I don't just mean rolling my own parser. It could also be an interface for an external library. The router calls the parser's hooks and the parser makes the appropriate calls to the external library.
~Chris

I like the approach taken

gdd's picture

I like the approach taken here of being able to plug in various parsers. I agree with all of jpetso's comments.

In addition to the styles you listed, I would like to see some investigation into what kind of metadata can be brought into the nodes from the document formats - author, original create date, etc. This will vary by document type but the more data we can get the better. Extracting embedded images from documents would be really helpful, although quite possibly outside the scope of a SoC project. Are you going to just be supporting the base node fields, or are you looking to put some type of field mapping in place?

There is an interesting project I just discovered called Docvert which, when installed, provides a web service for converting Word to HTML. You post documents to it. It requires AbiWord or OpenOffice to be installed on the server, which is not ideal, but it could be worth looking at:

http://holloway.co.nz/docvert/

Keep up the good work.

I agree that you should

dldege's picture

I agree that you should design a schema for tables that your module could provide - it would be fairly generic and some columns might not apply to all import types. But we should capture more than just a title and body text. Author, keywords, etc. are good examples of the type of metadata. Another idea, which we use in our PowerPoint importer: we keep the original file data in a BLOB in the database, and the physical file is not stored in the file system. This gives the ability to download the original document, among other use cases. This could be optional if db size is an issue.

A simple schema might just be

table import_module_data
nid, vid, original_filename, author, keywords, filedata, filedatalength .....

and so on.
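Spelled out, the table above might look like this. The sketch uses Python with SQLite purely so it can be run as-is; the column types, the primary key and the two helper functions are assumptions, since only the column names were given.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE import_module_data (
    nid               INTEGER NOT NULL,   -- node id
    vid               INTEGER NOT NULL,   -- node revision id
    original_filename TEXT    NOT NULL,
    author            TEXT,
    keywords          TEXT,
    filedata          BLOB,               -- the original uploaded file
    filedatalength    INTEGER,
    PRIMARY KEY (nid, vid)
)
"""


def store_import(
    conn: sqlite3.Connection,
    nid: int,
    vid: int,
    filename: str,
    data: bytes,
    author: Optional[str] = None,
    keywords: Optional[str] = None,
) -> None:
    """Record the original file and its metadata for one node revision."""
    conn.execute(
        "INSERT INTO import_module_data "
        "(nid, vid, original_filename, author, keywords, filedata, filedatalength) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (nid, vid, filename, author, keywords, data, len(data)),
    )


def load_original(conn: sqlite3.Connection, nid: int, vid: int) -> Optional[bytes]:
    """Return the stored original file, or None if nothing was imported."""
    row = conn.execute(
        "SELECT filedata FROM import_module_data WHERE nid = ? AND vid = ?",
        (nid, vid),
    ).fetchone()
    return None if row is None else bytes(row[0])
```

Keying on both nid and vid keeps a copy of the source document per revision, which is one way to support the "download the original" use case.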

I also think the router should be mimetype to converter - not an XML converter that then subcontracts out different XML formats to other converters. I like the one-to-one format/converter setup (even if a single converter gets assigned to more than one mimetype).

What type of node would be created? A new node type - imported_content - or would an existing node type be associated with it, i.e., it creates a page node and adds the title/body? Or completely user configurable?

Finally, I agree that the binary office formats will probably be the best use case initially - any plans now to convert those?

I like what you have so far keep up the great work - this would be a very useful module.

Dan DeGeest
Software Developer
Somewhere or Another

Interesting points...

bradfordcp@drupal.org's picture

I think there was a bit of confusion when I was discussing XML. That was simply a use case; because of the flexibility, it should be simple to plug in any parser for any mimetype. I agree that the binary formats are more popular and as such should be the target of this summer. I was using XML as an example, not the end goal. I think storing the original data is a good idea, but given size constraints (I have seen some pretty big office files) we should store the original file on the server and keep the path in the database.

I was thinking of just using a standard page node as the node type, but the more I think about it, this could be configurable. After the file is parsed and the node is saved, the user would be presented with an edit screen for the node. Metadata from the file could be displayed off to the side with a drop-down list for any empty fields. The user would simply map the data to the field and save the node. Earlier today heyrocker suggested using something similar to feedapi_mapper to map metadata to node fields. I will be updating the proposal later on today with some of the suggestions from here, including storing the original data and mapping metadata to node fields.

~Chris

There are a lot of advantages

dldege's picture

There are a lot of advantages to BLOBs, so don't rule them out entirely:

  1. no name collision or file system issues like that.
  2. easy to replicate the db

Also, Ken gave a talk on this, you aren't limited to just the Drupal DB, you could store all the file data in a secondary database or a remote service.

I don't want to complicate the task -which should be convert/import.

So if file system file storage is the easiest go with that - maybe keep in mind a design that allows alternatives.
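One way to keep the storage choice open is to hide it behind a small interface, so the importer does not care whether originals live on disk or in a BLOB. The Python sketch below shows the shape of that design; the class and table names are hypothetical, and SQLite stands in for whichever database would actually be used.

```python
import sqlite3
from pathlib import Path
from typing import Protocol


class OriginalFileStorage(Protocol):
    """What the importer needs from any storage backend."""
    def put(self, nid: int, filename: str, data: bytes) -> None: ...
    def get(self, nid: int) -> bytes: ...


class FileSystemStorage:
    """Stores each original on disk, named by node id to avoid collisions."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _path(self, nid: int) -> Path:
        return self.root / f"{nid}.orig"

    def put(self, nid: int, filename: str, data: bytes) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        self._path(nid).write_bytes(data)

    def get(self, nid: int) -> bytes:
        return self._path(nid).read_bytes()


class BlobStorage:
    """Stores each original as a BLOB row keyed by node id."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS original_files "
            "(nid INTEGER PRIMARY KEY, filename TEXT, filedata BLOB)"
        )

    def put(self, nid: int, filename: str, data: bytes) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO original_files VALUES (?, ?, ?)",
            (nid, filename, data),
        )

    def get(self, nid: int) -> bytes:
        row = self.conn.execute(
            "SELECT filedata FROM original_files WHERE nid = ?", (nid,)
        ).fetchone()
        if row is None:
            raise KeyError(nid)
        return bytes(row[0])
```

Starting with the file system backend and adding others later would then be a matter of writing one more class, without touching the convert/import code.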

Dan DeGeest
Software Developer
Somewhere or Another

Application is up!

bradfordcp@drupal.org's picture

My application is in, good luck to others who are applying! To view what I am sure is the first revision, there is a copy posted at http://soc-2008.deviantsolutions.net/bradfordcp-application

~Chris

Mapping fields

jpetso's picture

-Map any metadata to the correct node fields

That's essentially what FeedAPI's Feed Element Mapper does - it can map feed properties to CCK fields or taxonomy terms, and supports a number of CCK fields to be written to.

How are you planning to implement this feature for your project? Create something new from scratch, fork Feed Element Mapper, or factor out this functionality so that it can also be used for your router module?

Addendum

I agree, it sounds like this

bradfordcp@drupal.org's picture

I agree, it sounds like this would be a quick refactor into a parser module using the new hooks provided by a router.

~Chris

Leveraging existing code is

bradfordcp@drupal.org's picture

Leveraging existing code is very much an option. Prior to actual coding, during the "April 14 - May 26" phase, I will look at Feed Mapper and Node Import to see how they map data to a node's fields. Should that logic fit the needs of this project (it sounds like it will), I will factor it out and reuse it. If it will not do what I am looking for, I will create a mapper from scratch. The table structure will most likely look like:

table import_module_node_maps
node_type   mimetype   field_name   meta_name

By using an SQL statement and a loop, the metadata could be loaded into the appropriate node fields.
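Roughly, the "SQL statement and a loop" could look like the sketch below. It is Python with SQLite for illustration only; the table and column names come from the structure above, while `build_node_fields`, the sample rows, and the sample metadata are made up.

```python
import sqlite3


def build_node_fields(conn, node_type, mimetype, metadata):
    """Return {field_name: value} for every mapped metadata key the file provided."""
    rows = conn.execute(
        "SELECT field_name, meta_name FROM import_module_node_maps "
        "WHERE node_type = ? AND mimetype = ?",
        (node_type, mimetype),
    )
    fields = {}
    for field_name, meta_name in rows:
        # Skip mappings whose metadata key was not present in this document.
        if meta_name in metadata:
            fields[field_name] = metadata[meta_name]
    return fields


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_module_node_maps ("
    "node_type TEXT, mimetype TEXT, field_name TEXT, meta_name TEXT)"
)
# Illustrative mappings for Word documents imported as page nodes.
conn.executemany(
    "INSERT INTO import_module_node_maps VALUES (?, ?, ?, ?)",
    [
        ("page", "application/msword", "title", "Title"),
        ("page", "application/msword", "field_author", "Author"),
    ],
)

extracted = {"Title": "Quarterly report", "Author": "Chris", "Pages": "12"}
node_fields = build_node_fields(conn, "page", "application/msword", extracted)
print(node_fields)
```

Here "Pages" has no mapping row, so it is simply left out of the node fields.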

On another note, I think the import module should maintain a table where metadata is stored for a node. For instance, if my document has 12 metadata fields that were extracted and I only use 3 in my node, what should happen to the other 9? There should be a table that stores a copy of the extracted metadata, to make it readily available to the user without having to re-parse the document.
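One possible shape for that metadata table, again sketched in Python with SQLite. The table name `import_module_node_metadata`, its columns, and both helper functions are assumptions for illustration, not part of the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per extracted metadata key, per node.
conn.execute(
    "CREATE TABLE import_module_node_metadata ("
    "nid INTEGER, meta_name TEXT, meta_value TEXT, "
    "PRIMARY KEY (nid, meta_name))"
)


def save_metadata(conn, nid, metadata):
    """Keep a copy of everything the parser extracted, mapped or not."""
    conn.executemany(
        "INSERT OR REPLACE INTO import_module_node_metadata VALUES (?, ?, ?)",
        [(nid, name, value) for name, value in metadata.items()],
    )


def unused_metadata(conn, nid, used_names):
    """Return the stored metadata that was not mapped to a node field."""
    rows = conn.execute(
        "SELECT meta_name, meta_value FROM import_module_node_metadata WHERE nid = ?",
        (nid,),
    )
    return {name: value for name, value in rows if name not in used_names}


save_metadata(
    conn,
    42,
    {"Title": "Quarterly report", "Author": "Chris", "Company": "Example Co"},
)
leftover = unused_metadata(conn, 42, {"Title"})
print(leftover)
```

The unused values stay available for later mapping without touching the original file again.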

~Chris

Try to use feedapi and what about indexing?

alex_b's picture

Thoughts on your proposal:

1) The router is what FeedAPI does. Ken mentioned this module further down the thread. Consider using FeedAPI as the basis for this. I can help you with that; I am really interested in exploring FeedAPI's boundaries. FeedAPI has a parser/processor architecture that could be very suitable for what you will be building. One problem that I can think of is that FeedAPI creates a node to represent a data feed in the system. We could find a way around this, though. The exciting part here is that ideally, Aron will be working on Aggregator for D7 as a SoC project, so insights that we gain here could go directly into the next-generation aggregator.

2) Keep this solution as modular as possible. I think that's what you're doing, but there is one use case that I'm not seeing covered at the moment: indexing documents without actually turning them into nodes (those documents could be attached to e.g. blog posts).

FeedAPI

agentrickard's picture

You are right about parsers. Take a look at last summer's FeedAPI project for an example of a pluggable parsing architecture.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

A big what if

skessler's picture

Could we turn this into an offline editing project for Open Office? What if you could download node templates to Open Office and then upload the content that way? The node template could store information on CCK fields and maybe even be tagged for taxonomy. I KNOW I am dreaming here. Word content can be opened in Open Office and could easily be moved into the Open Office document without degradation of formatting, or not much, that is.

Just dreaming,

Steve

Owner and Lead Consultant
Denver DataMan

Dreams are nice, but

jpetso's picture

-1 to scope creep. Let's get the basics right, stuff like this sounds like a much more specialized solution that should probably not be in this SoC project but rather should be done afterwards by someone who cares a lot for this.

While this sounds like a

bradfordcp@drupal.org's picture

While this sounds like a good idea, I think it is a bit too specialized for the scope of this project. The module needs to be generic enough to work with many different file types, not just those supported by Open Office (although it supports quite a few).

~Chris

Yes!

SamRose's picture

Consider me signed on to support this. Unfortunately, most of the people I work with need this kind of support. HTML Tidy has been a so-so solution, but full support would be fantastic.

I guess I am not yet clear on exactly what direction was decided upon here (upload import?), but would love to help out.

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

My suggestion.

ruchiram4@drupal.org's picture

I believe I can do this in the following way.

The user will be prompted to select a specific file. (We can restrict the selection to a certain range of valid file types.)
Once the user selects a file for upload, Drupal will import that file and start reading through it. For each of the values within the file, Drupal would generate an XML node with the specific value.

Note: I tried to use the general notation for a tag in the XML, but it did not display correctly. So I'm using [] brackets instead of the general tag notation.

(The XML schema needs to be designed so that we can store the file information in a convenient way. E.g., we can store the data of a certain line within a line tag and, for a specific style, add different tags like:

[line]
[b][i]Hello World![/i][/b]
[/line].
If there is a table, we can store the values like :
[table]
[row]
[column]Hello[/column]
[/row]
[row]
[column]World![/column]
[/row]
[/table]
)
So after reading through the complete document, Drupal will eventually have an XML document which would contain the necessary data in the file.

Then, Drupal would have a display function which would read through the XML and display it in the web app.
The reason for generating an XML document is that we can extend the same application to other file formats. The XML creator can deal with the various file types but generate one general XML schema for all of them, so the display part is decoupled from the differences between input file types.
In other words, the front-end XML creator needs to know how to deal with each file type, and the display part is the same for all of them.
In this way, we can extend the types of files we accept without touching the display code. I believe this would result in a flexible design that is open to extension with other file types.
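To make the decoupling concrete, here is a small Python sketch using the standard `xml.etree.ElementTree` module. The two front ends (`plain_text_to_xml`, `csv_to_xml`), the `document` root element, and the tag-to-HTML mapping are all invented for illustration; only the line/table/row/column/b/i tag names come from the example above. A real Word front end would be far more involved.

```python
import xml.etree.ElementTree as ET


def plain_text_to_xml(text):
    """Hypothetical front end: one <line> element per line of plain text."""
    root = ET.Element("document")
    for raw in text.splitlines():
        ET.SubElement(root, "line").text = raw
    return root


def csv_to_xml(text):
    """Hypothetical front end: naive comma-separated rows become a <table>."""
    root = ET.Element("document")
    table = ET.SubElement(root, "table")
    for raw in text.splitlines():
        row = ET.SubElement(table, "row")
        for cell in raw.split(","):
            ET.SubElement(row, "column").text = cell
    return root


# The display side only knows the shared schema, never the source format.
TAG_MAP = {
    "document": "div",
    "line": "p",
    "table": "table",
    "row": "tr",
    "column": "td",
    "b": "b",
    "i": "i",
}


def render_html(element):
    """Recursively translate a schema element into the matching HTML element."""
    html_element = ET.Element(TAG_MAP[element.tag])
    html_element.text = element.text
    html_element.tail = element.tail
    for child in element:
        html_element.append(render_html(child))
    return html_element


def to_html(root):
    return ET.tostring(render_html(root), encoding="unicode")


print(to_html(plain_text_to_xml("Hello World!")))
print(to_html(csv_to_xml("Hello\nWorld!")))
```

Adding a new file type means writing another front end that emits the same tags; `render_html` does not change.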

Cheers!

using xml to store the meta

dldege's picture

Using XML to store the metadata and content of any given file format does provide a fairly generic and extensible way to implement the storage on the Drupal side. However, it has several drawbacks that would favor storing content and metadata in the database.

With the imported data in the database as rows and columns in an efficient schema, it will be much easier to tie into Drupal search, field mapping will be easier IMO, and there are other advantages to not having to read and parse XML every time you want to get at the document data. Also (correct me if I'm wrong), Drupal 6 doesn't really ship with a nice DOM-based XML parser, XPath, etc., meaning you immediately have an outside dependency on an XML library or PHP 5-only support.

I think meta data can be stored in a pretty simple table that is akin to an xml document that would define such a schema.

Dan DeGeest
Software Developer
Somewhere or Another

XML vs. No XML

SamRose's picture

I can see merits in both ideas, both for short term and long term.

XML would be a great way to parse and deal with many, many data types coming in (spread sheets, ms access databases, etc etc)

But, I can also see the point that Drupal does not have any native way to parse XML (or does it?)

I think that XML is worth exploring, because so many desktop and web applications have the ability to output XML files, and the direction I think ruchiram is going in is to be able to let Drupal do something with many different types of files, which could be very useful.

But, I can see the point that we don't want to necessarily require people to add extra libraries to their servers to be able to handle XML. Personally it's not a problem for me, but could be limiting to others.

What would be really cool is if we could get everyone on the same page, and come up with one really useful core basic way to import files and turn them into nodes. This way, we won't suffer the "cambrian explosions" of the past, where many people come up with many different ways to solve some parts of the same problem, only to see many of them die off over time, and one emerge to be the widely supported way. I mean, I like complex adaptive evolution. But, we can guide it a bit if we really try :-)

This is kind of related to http://groups.drupal.org/node/8938 and it could be cool to merge efforts, because a lot of people are now thinking in this direction. Would be cool to make some kind of "glue API" that can serve as a widely-supported core engine going forward. Maybe then XML could be an option, and so could meta data to database. But this way the core data-importing functionality is universal to all efforts. Just an idea

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

He's suggesting storing xml

dldege's picture

He's suggesting storing xml in the database if I read it correctly. I'm saying that some table schema could capture the same information and be as flexible. If we want to export out xml or anything else that would be doable from a good database design.

Dan DeGeest
Software Developer
Somewhere or Another

XML Parser issue

ruchiram4@drupal.org's picture
I was thinking of storing the data as a string of XML in the database and then using XPath to filter the data. But now I see that there would be an issue if Drupal does not come with an XML parser. So would someone kindly let me know whether Drupal has such a parser, as I couldn't find one myself? If not, would it be OK to add an external library for the project?
Anyway, I did some reading on converting an XML schema to a database schema and found that I could instead use a combination of database tables to represent the same thing. So in both cases, I can come up with a solution.
Looking forward to your comments.
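For reference, the "XML string in the database, filtered with XPath" approach can be sketched as follows. This is Python with SQLite and the standard `xml.etree.ElementTree` module, which supports only a subset of XPath; the `imported_documents` table and the `document` root element are assumptions, and the markup reuses the line/table example from the earlier post.

```python
import sqlite3
import xml.etree.ElementTree as ET

xml_text = (
    "<document>"
    "<line><b><i>Hello World!</i></b></line>"
    "<table>"
    "<row><column>Hello</column></row>"
    "<row><column>World!</column></row>"
    "</table>"
    "</document>"
)

conn = sqlite3.connect(":memory:")
# Hypothetical table: the whole converted document is stored as one XML string.
conn.execute(
    "CREATE TABLE imported_documents (nid INTEGER PRIMARY KEY, body_xml TEXT)"
)
conn.execute("INSERT INTO imported_documents VALUES (?, ?)", (1, xml_text))

(stored,) = conn.execute(
    "SELECT body_xml FROM imported_documents WHERE nid = ?", (1,)
).fetchone()

# The string has to be parsed on every read before any XPath query can run.
root = ET.fromstring(stored)
cells = [cell.text for cell in root.findall("./table/row/column")]
bold_italic = [el.text for el in root.findall(".//line/b/i")]
print(cells, bold_italic)
```

The parse-on-every-read step is the cost raised earlier in the thread; the table-based alternative trades it for a more elaborate schema.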

No XML parser needed

jpetso's picture

PHP 5 comes with SimpleXML, and any new modules should make use of it unless they specifically have PHP 4 compatibility requirements. Drupal doesn't need its own XML parser; better to make use of what's already there.

Based on how I am perceiving

bradfordcp@drupal.org's picture

Based on how I am perceiving things, could this module grow to include exporting as well? Thus making nodes available in a variety of formats (doc, pdf, odt, ...). It sounds as though it wouldn't be too difficult (as it appears some external libraries will be involved in parsing anyway). Does it sound like this could be part of the scope or not?
~Chris

Export

agentrickard's picture

There are already some export handler modules in contrib -- including a Word/Excel export for Views. Import and cleanup is the bigger issue.

--
http://ken.therickards.com/
http://savannahnow.com/user/2
http://blufftontoday.com/user/3

It would be

SamRose's picture

It would be interesting and informative to compare what these modules do, or propose to do, to see how different or similar their patterns really are.

Also, thanks for reference above to SimpleXML.

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

What about doing it this way:

maciek's picture

Give the user an "import office document" form, and then allow the document to be processed as a Drupal node in the Drupal editor?

IMO this could support at least both ISO formats (ODF and Office OpenXML) as well as the old binary Word format: there should be a general interface for uploading, a general interface for editing, and a set of OFFICE->DRUPAL converters (sharing one API, so that the system will be fully extensible).
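A shared converter API of this kind could be as small as a registry keyed by mimetype. The sketch below is Python and entirely hypothetical: `CONVERTERS`, the `converter` decorator, and `import_document` are invented names, and the only converter shown handles plain text rather than a real office format.

```python
import html

# Registry of converters, keyed by mimetype; each takes bytes and returns HTML.
CONVERTERS = {}


def converter(mimetype):
    """Decorator that registers a function as the converter for one mimetype."""
    def register(func):
        CONVERTERS[mimetype] = func
        return func
    return register


@converter("text/plain")
def plain_text_to_html(data):
    """Stand-in converter: blank-line-separated paragraphs become <p> elements."""
    paragraphs = data.decode("utf-8").split("\n\n")
    return "".join(
        "<p>" + html.escape(p.strip()) + "</p>" for p in paragraphs if p.strip()
    )


def import_document(mimetype, data):
    """General entry point: pick the converter and return a node body."""
    try:
        convert = CONVERTERS[mimetype]
    except KeyError:
        raise ValueError("no converter registered for " + mimetype) from None
    return convert(data)


body = import_document("text/plain", b"First paragraph.\n\nSecond & last.")
print(body)
```

An ODF or Word converter would register itself the same way, and the upload and edit interfaces would only ever call `import_document`.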

This is the single biggest

WorldFallz's picture

This is the single biggest thorn in my side right now. Until a true wysiwyg editor with user friendly image handling for Drupal comes along (and oh btw it has to look and work exactly like word because OMG if people have to learn a different icon it's a problem)-- I won't be able to gain mass acceptance for my drupal information management solution. It doesn't matter what modules I add or how amazingly I configure it, enterprise users simply want to import word docs. period.

And there's no training, discussion, configuration, argument, proof, cajoling, begging, etc, that can stem that desire. I could design a site that would serve people coffee at their desks every morning and afternoon-- and the only question they would have for me after the demo would be "Can I import my word docs?".

Plus, as a side benefit, it fixes the composing content offline problem. Just compose away and import when you get online.

I'm no developer, but I'm HAPPY to assist this effort in any way I can-- usability, documentation, testing, etc. Just name it.

Accepted for SoC 2008!

bradfordcp@drupal.org's picture

This project has been accepted for SoC 2008. Check out the wiki page at http://groups.drupal.org/node/10890

I will be keeping notes and updates there as well as on my test site.

~Chris

Awesome! But i get "access

WorldFallz's picture

Awesome! But I get "access denied" when I click on that link....

Sorry, I set it to private

bradfordcp@drupal.org's picture

Sorry, I set it to private accidentally, it should be available now.
~Chris

similar XWiki project

Matt V.'s picture

I was browsing some of the non-Drupal SoC projects and noticed an XWiki project titled "XWiki Office import" that sounds very similar to this one. XWiki is Java-based, so it's unlikely there would be overlapping code, but I thought it might be good to know about for the possibility of comparing notes.

There are quite a few

SamRose's picture

There are quite a few existing tools in Java for importing MS Office stuff. Some of it is also employed in Alfresco CMS, too, if I am not mistaken.

I don't know how much of it could directly inform what is discussed here, although it is indeed similar in that XWiki and the other Java tools are parsing XML, which is also the solution being discussed here.

I think there may be some synchronization with Feed API folks, that will help give drupal the ability to parse and deal with different XML. Again, this is also discussed here http://groups.drupal.org/node/8938

Feed API seems like it could be the "XML" parsing engine for Drupal, potentially giving Drupal basically many of the same capabilities as XWiki

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

Input

perandre-gdo's picture

I would love to see this as a tool for easy publishing of scientific papers. Two useful features would be some kind of integration with the book module, and the possibility to import footnotes (CSS/JS solutions for showing footnotes exist...out there).

This is a very exciting project! I still get Access Denied for the wiki page, though.

You should be able to see

bradfordcp@drupal.org's picture

You should be able to see the wiki page now, sorry for the confusion.

~Chris

Academic/Scientific Papers

SamRose's picture

perandre,

Tell us more about your ideas/use case for academic/scientific papers. I have quite a few people connected to P2P Foundation who would be interested in this as well. I may be able to get some folks interested in this (possibly even interested in future funding...)

You are looking for a way to import Word docs that are scientific papers, complete with footnotes, etc.? How about other formats? DocBook? LaTeX?

(I worked with others a while back to create a Perl-based wiki that does a nice job of academic paper layout, all through CSS and a footnote wiki syntax; see http://socialsynergyweb.net/cgi-bin/wiki/GrantWriting/FrontPage for an example. Of course, this does not import the documents.)

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

@ samrose: Sorry my late

perandre-gdo's picture

@ samrose: Sorry for my late reply; I need to figure out the notification system :)

To start with, I would be very happy to see a module that could handle Word's paging, headings, indents and footnotes in a meaningful way. My target is primarily master's students. Ideally, a (non-techie) user should be able to upload a document and wait for Drupal to do the magic. An easy-to-understand form could be included if helpful.

Your work with presentation is very helpful! I sure hope this will get some attention...and that a lot more papers out there will be more googlable :)

Yes Please!

chrism2671's picture

This (hypothetical) module is exactly what I am looking for. I have been considering writing it but don't have the time to do it at present, though I have done some research.

At present, it's quite tricky to convert server side. Before I go any further, and seeing as this is a Google Summer of Code project, I'd like to point out that Google have successfully implemented this feature in Google Documents and it works perfectly. I would like to see Google open up document conversion as part of its API.

As for making it work at present, I have the following:

  • Looking at it from a server-side upload perspective, you can use ooconvert in conjunction with OpenOffice to do command-line conversions of documents in about 120 formats, including spreadsheets and presentations, to any other format, including HTML and PDF. Sadly, this requires OpenOffice to be installed (100s of MB), which in turn requires an X server to be installed, and, well, this isn't really the solution we wanted.
  • Using TinyMCE, Word documents can be pasted in, and the 'cleanup HTML' button then makes the document format perfectly. This is a client-side JS solution, but seeing as it works, it could be hacked to work server side.

Sadly, neither solution is particularly great. I would propose one of the following:

  • Google exposes its document conversion system as an API we can all benefit from.
  • Somebody persuades one of the OpenOffice guys to make a more friendly and compact API for this purpose.

Either would be acceptable, with the second one being acceptable even for security-conscious organisations and intranets.

Word is Word for a reason

Starbuck's picture

I'm sort of surprised that there is so little recognition of exactly why Word is the popular product that it is. Let's set religion pro or against Microsoft aside, acknowledge and loathe their questionable business practices and aspirations to world domination, blah blah blah. Like it or not, they've created or assimilated products that are on millions of desktops, so interoperability with these products is a good thing for any platform like Drupal.

As to why it's a good idea to target Word as one of the DocImport API sources: On one hand, I'll reiterate the mantra that probably 80% of people only use 20% of the features, so most people who use Word still don't get what Word is. On the other hand, Word indeed is much more than a bloated TinyMCE. The HTML it exports is horrible, but there are many features worth capturing, and HTML isn't even the path to doing so. One of the key features I see applying to this project involves Word's ability to organize content into Chapters and Sections, and to annotate text for a Table of Contents, Index, Footnotes, etc. It would be really great for those of us who use Word for writing real books to be able to SaveAs and have the data exported into a Drupal Book (really a more complex node type) without losing the structures.

I've done a lot of work with the DOM in Word, Excel, and Outlook, and I've written a product that generates Excel XML according to the SpreadsheetML schema. Documents conforming to that schema can be read by OpenOffice and Google Docs. The point here is that by targeting these open schemas, you can import and export a large amount of the complex functionality - not just text but rich text and document organization, and not just CSV in XML, but rich document formatting which can be transformed from basic columns and rows to something more appropriate in the context of browser-based navigation.

I see from some of the comments here that there is support for Windows Live stuff and discussion about blogs and other specific applications. But I see nothing wrong with a project like this DocImport API which addresses the "other" side of Drupal - the side focused on Drupal as a content management system, any kind of content, and not just another tool for the limited application of blogging. So it's good to see the scope isn't being limited to Word or Blogging, and it's also good to see it's not being opened up via "feature creep" to importing and exporting all kinds of content from anywhere. From what I've read I like the scope of this project and I hope it doesn't change too much.

Ching Ching - 2 worthless cents from my ever-devaluating dollar...

No updates for a while on

justin.hopkins's picture

No updates for a while on this node, but see http://drupal.org/project/docapi. Thank you, bradford!