Document Import Module


Posted by bradfordcp@drup... on April 21, 2008 at 11:15pm
Last updated by bradfordcp@drup... on Fri, 2008-08-08 14:21

Project information

Project page on drupal.org: http://drupal.org/project/docapi

Current status: Coding Workflow

Description

The goal of this project is to create a plugin-based import module for Drupal that allows the upload of office suite file formats, which are then parsed into Drupal nodes. This lets even novice CMS users generate content using a familiar office productivity suite. Another use case is an organization adopting Drupal for the first time and migrating existing data into it from a library of office files.

Project schedule

April 14 - May 26
-Create the specifications for the router.
-Generate the table / registry structure and determine which hooks will be needed for parsers.
-Determine which triggers and actions to provide for automatic creation
-Get more familiar with document types to be implemented over the course of the summer.
-Get community feedback on which file formats and which styles to support.

May 26 - June 3
-Code the router
-Test the router, verifying that the appropriate hooks and parsers are called for any given file
-Stub out parsers being created over the course of the summer

June 3 - July 7
-Code Word Processor parsers
-These should focus on Office Word XML 2003 & 2007 as well as ODT
-All styles listed above should be implemented at the very least
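
To give a feel for the ODT side of this milestone, here is a minimal sketch in Python (not Drupal code, and not the module's actual parser). It relies on two facts about the format: an ODT file is a ZIP package, and the document body lives in content.xml using the OpenDocument text namespace. The function name `extract_odt_paragraphs` is hypothetical, and styles are ignored entirely.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespace for text elements such as <text:p> and <text:h>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odt_paragraphs(odt_file):
    """Return the plain text of each heading and paragraph in an ODT file.

    `odt_file` is a path or a binary file-like object. Styles are ignored;
    this only shows where the text lives inside the package.
    """
    with zipfile.ZipFile(odt_file) as package:
        root = ET.fromstring(package.read("content.xml"))

    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    paragraphs = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text nested in spans and links.
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

A real parser would also have to read styles.xml and the automatic styles in content.xml to recover bold, italic, and list formatting.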

July 7 - August 18
-Code Spreadsheet parsers
-These should focus on Excel formats as well as ODS
-At the very least tables should be returned
-If a cell contains a function, the appropriate value should replace it in the HTML table.
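
The last item above can be sketched as follows, in Python rather than Drupal code. It assumes the parser can read the cached result that spreadsheet files generally store next to a formula, so nothing is recalculated. The `Cell` class and `rows_to_html_table` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Cell:
    value: str                     # cached/display value stored in the file
    formula: Optional[str] = None  # e.g. "=A1+B1"; None for plain cells


def rows_to_html_table(rows):
    """Render rows of Cell objects as an HTML table.

    Formulas are never emitted; only each cell's stored value is shown.
    """
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell.value)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [
    [Cell("2"), Cell("3")],
    [Cell("5", formula="=A1+B1"), Cell("a & b")],
]
table_html = rows_to_html_table(rows)
```

Keeping the formula on the cell object, even though it is not rendered, leaves room for a later option that shows it as a tooltip or note.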

Status updates

First Report

2008-04-21

Task summary:

Began work on SoC prerequisites

To the drawing board

2008-05-14

Task summary:

Framework design has begun; look for a more in-depth update in the next week.

Humble Beginnings

2008-06-03

Task summary:

Set up a new dev environment (I switched to my new dev machine and had to get used to a new OS; all is good now)
Began coding the router from my designs (I will try to upload some of these ideas soon)
Almost finished an ERD for the module's table structure. It seems pretty solid now, but I am still working out the last few kinks before uploading it.
Catching up on D6 changes / researching mimedetect and the upload system

Todo:

I am a bit off schedule right now, but I should get everything squared away this week.

Clear guide rails make for a smooth ride

2008-06-11

Task summary:

Stubbed out plugin module
- Created data structures for various sections of the parser (these will be universal across parsers).
- Fixed namespace errors in parser
- Fixed data structure errors (using objects instead of arrays)
Began work on the module's settings page (main) as well as the plugin settings page.
Began coding the workflow of the module.

Todo:

Code the workflow; this, along with the hooks, is key to making this module usable. A LOT of time will be spent on UI and interaction with Drupal.

2008-06-25

Task summary:

Started work on workflow; realized that I need to have plugin detection in place before this is possible
Started work on plugin detection; added a new field to the DB corresponding to the module that provides the plugin implementation
Added update logic (I have not used this before) to the module so it will not break across versions.

Todo:

Finish plugin detection and workflow

2008-07-01

Task summary:

Working on workflow with JPEG skeleton
Finished plugin detection. Simply enable the plugin's module and it will be picked up on the next cron run (or when detection is called manually)
Added update logic (I have not used this before) to the module so it will not break across versions.

Todo:

Finish workflow

2008-07-16

Task summary:

Rewrote file handling to use file.inc methods
Retooled a couple of DB tables, since the information they stored mirrored what Drupal already keeps for files; this led to reworking a lot of the validate function
Began work on mapping node fields to document fields (metadata)

Todo:

Finish field mapping

2008-08-08

Task summary:

Successfully mapping document fields to their associated metadata
Fixed DB tables; I was missing a couple of fields
Working on creating nodes from documents in the library

Todo:

Finish generating nodes from library
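
The field mapping described in the last two updates can be illustrated with a small Python sketch. It is not the module's code: `map_document_to_node`, the metadata keys, and the node field names are all hypothetical, and a plain dict stands in for a Drupal node.

```python
def map_document_to_node(metadata, field_map, defaults=None):
    """Build a node-like dict from document metadata.

    `field_map` maps document metadata keys to node field names.
    Metadata keys without a mapping are ignored.
    """
    node = dict(defaults or {})
    for doc_field, node_field in field_map.items():
        if doc_field in metadata:
            node[node_field] = metadata[doc_field]
    return node


# Hypothetical metadata as a parser might report it.
metadata = {"title": "Quarterly report", "creator": "Chris", "page_count": 12}
field_map = {"title": "title", "creator": "author"}
node = map_document_to_node(metadata, field_map, defaults={"type": "page"})
```

Keeping the mapping as data rather than code is what would let an admin settings page edit it per node type.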

Discussion Link

http://groups.drupal.org/node/9929
This page is where the initial idea was proposed along with initial thoughts and discussion.

Other Information

For now this is a place to pull together some information from various sources. I will organize this a little later; for now I just want to have a central location for information related to this module.

From my app:

PROJECT DETAILS:
Features to Implement:
-The Router module
  -Create a set of hooks other modules implement to parse the various data types
  -Allow the parsers to specify mime types that they are interested in and keep an internal registry of those types
  -Handle files put to the server routing the data appropriately.
  -Create the appropriate node(s) for the uploaded data
  -Map any metadata to the correct node fields
-The Parser Modules
  -Microsoft Office
    -Parse the .doc file type provided by various versions of Word.
    -Parse the .xls file type provided by various versions of Excel.
  -Open Document Support
    -Support the various file types under Open Document
    -Implement parsers for ODT, ODS, and possibly ODP
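
The router idea in the feature list above (parsers declare the MIME types they handle, the router keeps a registry and dispatches uploads) can be sketched in a few lines of Python. This is only an illustration of the pattern, not the module's hook API; `Router`, `register`, `route`, and `parse_plain_text` are hypothetical names.

```python
class Router:
    """Keeps a registry of MIME type -> parser and dispatches uploads."""

    def __init__(self):
        self._parsers = {}

    def register(self, parser, mime_types):
        """Called by a parser plugin to declare the MIME types it handles."""
        for mime_type in mime_types:
            self._parsers[mime_type] = parser

    def route(self, mime_type, data):
        """Hand the uploaded bytes to the parser registered for mime_type."""
        parser = self._parsers.get(mime_type)
        if parser is None:
            raise LookupError(f"no parser registered for {mime_type}")
        return parser(data)


def parse_plain_text(data):
    # Stand-in parser: first line becomes the title, the rest the body.
    text = data.decode("utf-8")
    title, _, body = text.partition("\n")
    return {"title": title, "body": body}


router = Router()
router.register(parse_plain_text, ["text/plain"])
document = router.route("text/plain", b"Hello\nFirst paragraph.")
```

In Drupal the registration step would more likely happen through a hook that each parser module implements, with the registry stored in a table.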

Styles to Support:
These should be the minimum styles supported by any parser
-Lists with the appropriate formatting / styles, (ul, ol, dl)
-Bold text (strong, b)
-Italicized text (em, i)
-Underlined text (u)
-Paragraphs and breaks
-Basic symbols, certain symbols should be supported (and encoded) by default (&, copyright, tm, etc)
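
One way a parser could turn these minimum styles into markup is sketched below in Python. It assumes the parser has already split a paragraph into runs of text, each with a set of style names; that run structure, `STYLE_TAGS`, and `render_runs` are hypothetical and not taken from the module.

```python
from html import escape

# Style name -> HTML tag, following the minimum style list above.
STYLE_TAGS = {"bold": "strong", "italic": "em", "underline": "u"}


def render_runs(runs):
    """Render (text, styles) runs as one HTML paragraph.

    Text is escaped so that symbols such as & are encoded; styles are
    applied in a fixed order so the nesting is always well-formed.
    """
    parts = []
    for text, styles in runs:
        html = escape(text)
        for style, tag in STYLE_TAGS.items():
            if style in styles:
                html = f"<{tag}>{html}</{tag}>"
        parts.append(html)
    return "<p>" + "".join(parts) + "</p>"


paragraph = render_runs([
    ("Fish & chips ", set()),
    ("now", {"bold", "italic"}),
])
```

Lists and line breaks would need structure above the run level, so they are left out of this sketch.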

DELIVERABLES:
1. The router module with appropriate hooks and documentation to aid parser developers
2. Core parser modules, these should handle .doc, .xls, .odt, & .ods. I believe this will provide core functionality as well as a point of reference for any further parser development.

** EXTRAS **
Should there be extra time I would like to implement the following:
-Code Presentation parsers
-Explore more formats other than Microsoft Office and Open Office (PDF?)
-Create a batch processor allowing the upload of many documents (possibly an archive) to be parsed all at once.

From dldege:

I'm just suggesting the document import and node creation can be two different steps.

Here is a possible workflow for manual mode.

1. User uploads a new word.doc file.
2. Document is parsed and data is stored into the document repository.
3. In the admin settings the user has specified that the node type 'page' is allowed to map fields from imported documents.
4. User goes to node/add/page
5. This node add form is altered by your module to present the user with document selection and field mapping form, ui, whatever.
6. User creates the node using a mixture of newly typed in information and/or information brought in from one or more documents
7. Submit - a new page node exists.

Automatic mode
1. User uploads a new word.doc file.
2. Document is parsed and data is stored into the document repository.
3. New document upload trigger is fired
4. The user has configured that on this trigger a new page node should be created with the title set to the document title and the body should contain the imported document content.
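
The two modes dldege describes share the same import step and differ only in whether an action is attached to the upload trigger. A minimal Python sketch of that split, using hypothetical names (`DocumentRepository`, `on_upload`, `create_page_node`) and a plain list in place of Drupal's node storage:

```python
class DocumentRepository:
    """Stores parsed documents and fires callbacks when one is added."""

    def __init__(self):
        self.documents = []
        self._on_upload = []

    def on_upload(self, callback):
        """Register an action for the 'new document uploaded' trigger."""
        self._on_upload.append(callback)

    def add(self, parsed_document):
        """Import step: store the document, then fire the trigger."""
        self.documents.append(parsed_document)
        for callback in self._on_upload:
            callback(parsed_document)


nodes = []  # stand-in for Drupal's node storage


def create_page_node(document):
    # Automatic mode: title and body come straight from the document.
    nodes.append({"type": "page",
                  "title": document["title"],
                  "body": document["body"]})


repository = DocumentRepository()
# Manual mode: the document is only imported; no node is created yet.
repository.add({"title": "Draft", "body": "Imported for later use."})
# Automatic mode: once an action is configured, uploads create nodes.
repository.on_upload(create_page_node)
repository.add({"title": "Minutes", "body": "Imported content."})
```

Because import and node creation are separate, the manual node/add form can later pull from any document already in the repository.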

From jpetso:

Don't duplicate file type parsing, but rather rely on standard mimetypes that can even be retrieved by dopry's MimeDetect module already. Current mimetype databases are pretty complete, I'm pretty sure they already include recognition patterns for all of the binary MS Office formats (.doc, .xls, .ppt, ...), the XML office formats (.docx, .xlsx, .pptx, ...) and the OpenDocument formats (.odt, .ods, .odp, ...). Recognizing mimetypes is a solved problem, you shouldn't come up with a new solution from scratch here.
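
As an illustration of leaning on an existing MIME database instead of writing detection from scratch, here is a short Python sketch using the standard library's `mimetypes` module. Note the assumption: this looks up the type from the file name only, whereas a module such as MimeDetect may inspect file contents. The OpenDocument types are added explicitly so the result does not depend on the Python version, and `detect_mime_type` is a hypothetical helper name.

```python
import mimetypes

# A private MimeTypes instance, so lookups use Python's built-in table
# rather than whatever MIME database the host system provides.
mime_db = mimetypes.MimeTypes()
mime_db.add_type("application/vnd.oasis.opendocument.text", ".odt")
mime_db.add_type("application/vnd.oasis.opendocument.spreadsheet", ".ods")


def detect_mime_type(filename):
    """Return the MIME type guessed from the file name, or None."""
    mime_type, _encoding = mime_db.guess_type(filename)
    return mime_type
```

The returned string is what a router would use as the key when looking up a registered parser.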

Also, I think joining forces with existing parsers instead of writing new parsers or XSL stylesheets might get us better HTML output and less maintenance issues. Remember, other people have similar issues and are working on them as well! For example, a quick Google search throws up an ODT to XHTML stylesheet/converter (and other ODT stuff), as well as the OOXML/ODT converter that also works with stylesheets and might make it possible not to touch the awkward XML of the new Office formats at all, but rather transforming it into ODT and from there into XHTML.

Lastly, I think that the original .doc and .xls files will probably be the most requested format to import, as OOXML is still far from being as widely used and deployed as the old binary formats. Especially for those, you need to rely on external converters, parsing those manually is suicide. I remember Abiword to have a flexible command line interface, that could be a feasible transformation method for Unix servers. No idea about how to transform the .xls and .ppt formats, but I guess there's at least some simple converter lying around somewhere. I don't think doing .doc and .xls will be feasible for shared hosting where you can't install non-PHP external converters, but I might be wrong.
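
A rough Python sketch of the external-converter approach follows. It is hedged in several ways: the `--to=html` flag and the output location are assumptions about the installed AbiWord build and should be verified before use, and `build_convert_command` and `convert_to_html` are hypothetical helper names. The check for a missing binary corresponds to the shared-hosting concern raised above.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, converter="abiword"):
    """Return the argv list for converting `source` to HTML.

    The `--to=html` flag is an assumption about the installed AbiWord
    build; verify it before relying on this.
    """
    return [converter, "--to=html", str(Path(source))]


def convert_to_html(source, converter="abiword"):
    """Run the converter, or return None when it is not installed."""
    if shutil.which(converter) is None:
        return None  # e.g. shared hosting without external binaries
    subprocess.run(build_convert_command(source, converter), check=True)
    # Assumption: the converter writes its output next to the source file.
    return Path(source).with_suffix(".html")
```

Passing the command as a list rather than a shell string avoids quoting problems with uploaded file names.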

From alex_b:

The router is what FeedAPI does. Ken mentioned this module further down the thread. Consider using FeedAPI as basis for this. I can help you with that, I am really interested in exploring FeedAPI's boundaries. FeedAPI has a parser/processors architecture that could be very suitable for what you will be building. One problem that I can think of is that FeedAPI creates a node to represent a data feed in the system. We could find a way around this though. The exciting part here is that ideally, Aron will be working on Aggregator for D7 as SoC project. So insights that we win here could go directly into the next generation aggregator.

From CrashTest:

http://drupal.org/project/search_attachments there may be some functionality here

Attachment	Size
docapi_erd.pdf	27.6 KB
DocImport_Process2.png	151.82 KB

Comments

Thoughts...

Posted by SamRose on April 25, 2008 at 3:48pm

Also, I think joining forces with existing parsers instead of writing new parsers or XSL stylesheets might get us better HTML output and less maintenance issues. Remember, other people have similar issues and are working on them as well! For example, a quick Google search throws up an ODT to XHTML stylesheet/converter (and other ODT stuff), as well as the OOXML/ODT converter that also works with stylesheets and might make it possible not to touch the awkward XML of the new Office formats at all, but rather transforming it into ODT and from there into XHTML.

Hmmm...I'd rather pitch in and help create a way to directly parse MS Office XML than to try and convert to ODT, convert to XHTML.

Main reason being that MS Office to ODT conversion has a spotty record for producing desired results. The people this is targeted towards won't really understand why sometimes it "doesn't work" (because of Office>ODT>XHTML transformations). (that is, if I am understanding this correctly)

I am all for re-using and saving time and energy. The point here is there is nothing really good out there that does direct MS Office to valid XHTML conversion. I think the XML route proposed above has promise. I don't think he should throw a round-about, half-working solution into the mix if a better one can be had by building it.

Lastly, I think that the original .doc and .xls files will probably be the most requested format to import, as OOXML is still far from being as widely used and deployed as the old binary formats. Especially for those, you need to rely on external converters, parsing those manually is suicide.

This makes sense. It would be worthwhile to at least investigate what exists for parsing OOXML.

No idea about how to transform the .xls and .ppt formats, but I guess there's at least some simple converter lying around somewhere. I don't think doing .doc and .xls will be feasible for shared hosting where you can't install non-PHP external converters, but I might be wrong.

I think you are mostly right about this. It's generally not very easy, and in many cases not possible, to install many of them without root access. Perl and Python modules are an exception, as there usually is a way to install them.

I think there have been at least a few people who've taken a crack at PHP Word-to-XHTML converters. We should probably look around and see what is out there...

The router is what FeedAPI does. Ken mentioned this module further down the thread. Consider using FeedAPI as basis for this. I can help you with that, I am really interested in exploring FeedAPI's boundaries. FeedAPI has a parser/processors architecture that could be very suitable for what you will be building. One problem that I can think of is that FeedAPI creates a node to represent a data feed in the system. We could find a way around this though. The exciting part here is that ideally, Aron will be working on Aggregator for D7 as SoC project. So insights that we win here could go directly into the next generation aggregator.

I hope that this does work.

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

Sam Rose
Hollymead Capital Partners
P2P Foundation
Social Media Classroom

Post from mailing list

Posted by allisterbeharry on May 26, 2008 at 2:16am

Maybe you saw this already: http://lists.drupal.org/pipermail/development/2008-May/029856.html

I have currently unpublished PHP code I've been using for a few years to
access OOo spreadsheets and load them into Drupal. However, it was developed
"ad hoc" to import product data, and is currently not fit for general use
since it is limited to spreadsheets and only goes from OOo to Drupal, not
the other way round. Maybe if others are interested in working on this we
could turn it into a general use module.

whole thread is about OOo integration

I didn't see that one,

Posted by bradfordcp@drup... on May 27, 2008 at 12:03pm

I didn't see that one, thanks for the link!

~Chris

Worth looking at this module

Posted by catch on June 6, 2008 at 9:39am

Worth looking at this module as well: http://drupal.org/project/fileframework

So, anything new to report

Posted by SamRose on September 26, 2008 at 5:01pm

So, anything new to report here? Plugins working yet? Need help with them?

Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation

Sam Rose
Hollymead Capital Partners
P2P Foundation
Social Media Classroom

I was rushing towards the

Posted by bradfordcp@drup... on September 26, 2008 at 6:57pm

I was rushing towards the end of the project and made a few critical mistakes while generating the nodes. The actual API bit where the information is passed to and from the parsers is solid; the mapping is where the work really needs to be done. I will have a lot more time to contribute to my modules starting next week, most notably to make the mapping system usable and to generate usable nodes. If anyone would like to work on parsers (I know I want to, but the framework needs to be done first), I would be happy to answer any questions / put together basic documentation.

~Chris

Parser?

Posted by beckyjohnson on October 6, 2008 at 9:29pm

Hi there,
I just installed this module and don't know what to do, because on the upload page I don't have anything to select to parse a document or image that I upload. I just get the option Automatically Detect. I uploaded Simplecore and Zend because it looks like they parse, but they don't work for this. So, what can I do to fix this?
I read the documentation but I can't find anything useful that explains what's going on.

Thanks,
Becky
