Last updated by bradfordcp@drup... on Fri, 2008-08-08 14:21
Project information
Project page on drupal.org: http://drupal.org/project/docapi
Current status: Coding Workflow
Description
The goal of this project is to create a plugin-based import module for Drupal that allows office suite file formats to be uploaded and parsed into Drupal nodes. This would allow even novice CMS users to generate content using a familiar office productivity suite. It would also help an organization that is using Drupal for the first time and needs to migrate existing data from a library of office files.
Project schedule
April 14 - May 26
-Create the specifications for the router.
-Generate the table / registry structure as well as determine which hooks will be needed for parsers.
-Determine which triggers and actions to provide for automatic creation
-Get more familiar with document types to be implemented over the course of the summer.
-Get community feedback on which file formats and which styles to support.
May 26 - June 3
-Code the router
-Test the router, verify the appropriate hooks and parsers are being called for any given file
X-Stub out parsers being created over the course of the summer
June 3 - July 7
-Code Word Processor parsers
-These should focus on Office Word XML 2003 & 2007 as well as ODT
-All styles listed above should be implemented at the very least
July 7 - August 18
-Code Spreadsheet parsers
-These should focus on Excel formats as well as ODS
-At the very least tables should be returned
-If a cell contains a function, the appropriate value should replace it in the HTML table.
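The last spreadsheet goal above can be illustrated with a minimal sketch. It assumes the parser can read the cached result that spreadsheet files normally store next to a formula, so nothing has to be recalculated. The `Cell` structure and `rows_to_html` function are hypothetical names used for illustration only; they are not part of the module.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional


@dataclass
class Cell:
    """One parsed spreadsheet cell (hypothetical structure, not the module's)."""
    value: str                      # cached display value read from the file
    formula: Optional[str] = None   # e.g. "=SUM(A1:B1)"; None for plain cells


def rows_to_html(rows: List[List[Cell]]) -> str:
    """Render rows as an HTML table, always emitting the value, never the formula."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell.value)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


sheet = [
    [Cell("2"), Cell("3")],
    [Cell("5", formula="=SUM(A1:B1)"), Cell("n/a")],
]
table_html = rows_to_html(sheet)
```

Keeping the formula on the cell but rendering only the value means a later parser could still expose the formula as metadata if that turned out to be useful.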
Status updates
First Report
2008-04-21
Task summary:
- Began work on SoC prerequisites
To the drawing board
2008-05-14
Task summary:
- Framework design has begun; look for a more in-depth update in the next week.
Humble Beginnings
2008-06-03
Task summary:
- Set up new dev environment (I switched to my new dev machine and had to get used to a new OS; all is good now)
- Began coding router from my designs (I will try and upload some of these ideas soon)
- Almost finished an ERD for the module's table structure. It seems pretty solid now, but I am still working out the last few kinks before uploading it.
- Catching up on D6 changes / researching mimedetect and the upload system
Todo:
I am a bit off schedule right now, but I should get everything squared away this week.
Clear guide rails make for a smooth ride
2008-06-11
Task summary:
- Stubbed out plugin module
- Created data structures for various sections of the parser (these will be universal across parsers).
- Fixed namespace errors in parser
- Fixed data structure errors (using objects instead of arrays)
- Began work on the module's settings page (main) as well as the plugin settings page.
- Began coding the workflow of the module.
Todo:
Code the workflow; this, along with the hooks, is key to making this module usable. A LOT of time will be spent on UI and interaction with Drupal.
2008-06-25
Task summary:
- Started work on workflow, realized that I need to have plugin detection in place before this is possible
- Started work on plugin detection; added a new field to the db corresponding to the module which has the plugin implementation
- Added update logic (I have not used this before) to the module so it will not break across versions.
Todo:
Finish plugin detection and workflow
2008-07-01
Task summary:
- Working on workflow with JPEG skeleton
- Finished plugin detection. Simply enable the plugin's module, and the plugin is detected on the next cron run (or when detection is called manually)
- Added update logic (I have not used this before) to the module so it will not break across versions.
Todo:
Finish workflow
2008-07-16
Task summary:
- Rewrote file handling to use file.inc methods
- Retooled a couple of DB tables, as the information stored was a mirror of Drupal's own file data; this led to reworking a lot of the validate function
- Began work on mapping node fields to document fields (metadata)
Todo:
Finish field mapping
2008-08-08
Task summary:
- Successfully mapping document fields to their associated metadata
- Fixed DB tables, I was missing a couple of fields
- Working on creating nodes from documents in the library
Todo:
Finish generating nodes from library
Discussion Link
http://groups.drupal.org/node/9929
This page is where the initial idea was proposed along with initial thoughts and discussion.
Other Information
For now this is a place to pull together some information from various sources. I will organize this a little later, for now I just want to have a central location for information related to this module.
From my app:
PROJECT DETAILS:
Features to Implement:
-The Router module
-Create a set of hooks other modules implement to parse the various data types
-Allow the parsers to specify mime types that they are interested in and keep an internal registry of those types
-Handle files put to the server routing the data appropriately.
-Create the appropriate node(s) for the uploaded data
-Map any metadata to the correct node fields
-The Parser Modules
-Microsoft Office
-Parse the .doc file type provided by various versions of Word.
-Parse the .xls file type provided by various versions of Excel.
-Open Document Support
-Support the various file types under Open Document
-Implement parsers for ODT, ODS, and possibly ODP
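The router features listed above (parsers declare the MIME types they are interested in, the router keeps a registry and hands uploaded data to the matching parsers) can be sketched roughly as follows. This is a language-neutral illustration in Python, not the module's Drupal code; `Router`, `register`, `route`, and the toy `plain_text_parser` are all hypothetical names.

```python
from typing import Callable, Dict, List

# A parser takes raw file bytes and returns parsed fields (hypothetical shape).
Parser = Callable[[bytes], Dict[str, str]]


class Router:
    """Keeps a registry of MIME types and the parsers interested in them."""

    def __init__(self) -> None:
        self._registry: Dict[str, List[Parser]] = {}

    def register(self, mime_types: List[str], parser: Parser) -> None:
        """Called by a parser plugin to declare the MIME types it handles."""
        for mime in mime_types:
            self._registry.setdefault(mime, []).append(parser)

    def route(self, mime_type: str, data: bytes) -> List[Dict[str, str]]:
        """Pass the uploaded data to every parser registered for its MIME type."""
        parsers = self._registry.get(mime_type)
        if not parsers:
            raise LookupError(f"no parser registered for {mime_type}")
        return [parser(data) for parser in parsers]


def plain_text_parser(data: bytes) -> Dict[str, str]:
    """Toy parser: the first line becomes the title, the rest the body."""
    title, _, body = data.decode("utf-8").partition("\n")
    return {"title": title, "body": body}


router = Router()
router.register(["text/plain"], plain_text_parser)
results = router.route("text/plain", b"Hello\nFirst paragraph.")
```

Raising an error for an unregistered type is one possible choice; a real router might instead store the file untouched and report that no parser was available.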
Styles to Support:
These should be the minimum styles supported by any parser
-Lists with the appropriate formatting / styles, (ul, ol, dl)
-Bold text (strong, b)
-Italicized text (em, i)
-Underlined text (u)
-Paragraphs and breaks
-Basic symbols, certain symbols should be supported (and encoded) by default (&, copyright, tm, etc)
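As a rough illustration of the minimum style set above, the sketch below wraps styled text runs in the listed tags and encodes the basic symbols. The style names (`bold`, `italic`, `underline`), the run format of `(text, styles)` pairs, and the function names are assumptions made for this example; only the HTML tags and symbols come from the list.

```python
from html import escape

# Style name -> HTML tag; the style names are illustrative, the tags come from the list above.
STYLE_TAGS = {"bold": "strong", "italic": "em", "underline": "u"}

# Symbols to encode beyond what html.escape already handles (&, <, >).
SYMBOL_ENTITIES = {"\u00a9": "&copy;", "\u2122": "&trade;"}


def encode_text(text: str) -> str:
    """Escape markup characters first, then replace the extra symbols."""
    encoded = escape(text, quote=False)
    for symbol, entity in SYMBOL_ENTITIES.items():
        encoded = encoded.replace(symbol, entity)
    return encoded


def render_run(text: str, styles: list) -> str:
    """Wrap an encoded text run in one tag per style, first style outermost."""
    html = encode_text(text)
    for style in reversed(styles):
        tag = STYLE_TAGS[style]
        html = f"<{tag}>{html}</{tag}>"
    return html


def render_paragraph(runs: list) -> str:
    """Join (text, styles) runs into a single <p> element."""
    return "<p>" + "".join(render_run(text, styles) for text, styles in runs) + "</p>"


paragraph = render_paragraph([
    ("Terms & conditions \u00a9 ", []),
    ("Acme\u2122", ["bold", "italic"]),
])
# paragraph == "<p>Terms &amp; conditions &copy; <strong><em>Acme&trade;</em></strong></p>"
```

Escaping before substituting the symbol entities matters: doing it the other way round would turn `&copy;` into `&amp;copy;`.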
DELIVERABLES:
1. The router module with appropriate hooks and documentation to aid parser developers
2. Core parser modules, these should handle .doc, .xls, .odt, & .ods. I believe this will provide core functionality as well as a point of reference for any further parser development.
EXTRAS:
Should there be extra time I would like to implement the following:
-Code Presentation parsers
-Explore more formats other than Microsoft Office and Open Office (PDF?)
-Create a batch processor allowing the upload of many documents (possibly an archive) to be parsed all at once.
From dldege:
I'm just suggesting the document import and node creation can be two different steps.
Here is a possible workflow for manual mode.
1. User uploads a new word.doc file.
2. Document is parsed and data is stored into the document repository.
3. In the admin settings the user has specified that the node type 'page' is allowed to map fields from imported documents.
4. User goes to node/add/page
5. This node add form is altered by your module to present the user with document selection and field mapping form, ui, whatever.
6. User creates the node using a mixture of newly typed in information and/or information brought in from one or more documents
7. Submit - a new page node exists.
Automatic mode
1. User uploads a new word.doc file.
2. Document is parsed and data is stored into the document repository.
3. New document upload trigger is fired
4. The user has configured that on this trigger a new page node should be created with the title set to the document title and the body should contain the imported document content.
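The automatic mode described above might look roughly like this: storing a parsed document fires a trigger, and a configured action maps document fields onto a new page node. This is a sketch in Python rather than Drupal code. `Repository`, `Document`, `Node`, `PAGE_MAPPING`, and `create_page_node` are hypothetical names, and the mapping is hard-coded where the real module would read it from the admin settings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Parsed document as stored in the repository (fields are illustrative)."""
    title: str
    content: str


@dataclass
class Node:
    type: str
    title: str
    body: str


@dataclass
class Repository:
    """Document store that fires a 'new document' trigger after each import."""
    documents: List[Document] = field(default_factory=list)
    listeners: List[Callable[[Document], None]] = field(default_factory=list)

    def store(self, document: Document) -> None:
        self.documents.append(document)   # step 2: store the parsed data
        for listener in self.listeners:   # step 3: fire the trigger
            listener(document)


nodes: List[Node] = []

# Field mapping configured by the user: node field -> document field.
PAGE_MAPPING: Dict[str, str] = {"title": "title", "body": "content"}


def create_page_node(document: Document) -> None:
    """Step 4: action that builds a 'page' node from the mapped document fields."""
    values = {node_field: getattr(document, doc_field)
              for node_field, doc_field in PAGE_MAPPING.items()}
    nodes.append(Node(type="page", **values))


repository = Repository()
repository.listeners.append(create_page_node)
repository.store(Document(title="Quarterly report", content="<p>Imported text</p>"))
```

Because the repository only knows about listeners, the manual mode could reuse the same store without any action attached, which keeps import and node creation as two separate steps.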
From jpetso:
Don't duplicate file type parsing, but rather rely on standard mimetypes that can even be retrieved by dopry's MimeDetect module already. Current mimetype databases are pretty complete, I'm pretty sure they already include recognition patterns for all of the binary MS Office formats (.doc, .xls, .ppt, ...), the XML office formats (.docx, .xlsx, .pptx, ...) and the OpenDocument formats (.odt, .ods, .odp, ...). Recognizing mimetypes is a solved problem, you shouldn't come up with a new solution from scratch here.
Also, I think joining forces with existing parsers instead of writing new parsers or XSL stylesheets might get us better HTML output and fewer maintenance issues. Remember, other people have similar issues and are working on them as well! For example, a quick Google search throws up an ODT to XHTML stylesheet/converter (and other ODT stuff), as well as the OOXML/ODT converter that also works with stylesheets and might make it possible not to touch the awkward XML of the new Office formats at all, but rather to transform it into ODT and from there into XHTML.
Lastly, I think that the original .doc and .xls files will probably be the most requested format to import, as OOXML is still far from being as widely used and deployed as the old binary formats. Especially for those, you need to rely on external converters, parsing those manually is suicide. I remember Abiword to have a flexible command line interface, that could be a feasible transformation method for Unix servers. No idea about how to transform the .xls and .ppt formats, but I guess there's at least some simple converter lying around somewhere. I don't think doing .doc and .xls will be feasible for shared hosting where you can't install non-PHP external converters, but I might be wrong.
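jpetso's first point, leaning on an existing MIME type database instead of writing new detection code, can be illustrated with a small sketch. It uses Python's standard `mimetypes` module as a stand-in; it looks up types by file extension only, which is an assumption of this example and not a description of how MimeDetect works. The `OFFICE_TYPES` table and `detect_mime` function are hypothetical names.

```python
import mimetypes
from typing import Optional

# Office types this sketch cares about; registered explicitly so the result
# does not depend on which system MIME database happens to be installed.
OFFICE_TYPES = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
}

detector = mimetypes.MimeTypes()
for extension, mime_type in OFFICE_TYPES.items():
    detector.add_type(mime_type, extension)


def detect_mime(filename: str) -> Optional[str]:
    """Return the MIME type for a filename, or None if it is unknown."""
    mime_type, _encoding = detector.guess_type(filename)
    return mime_type
```

The detected type is what a router would then use to look up the interested parsers, so the detection step and the parser registry stay independent of each other.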
From alex_b:
The router is what FeedAPI does. Ken mentioned this module further down the thread. Consider using FeedAPI as the basis for this. I can help you with that; I am really interested in exploring FeedAPI's boundaries. FeedAPI has a parser/processors architecture that could be very suitable for what you will be building. One problem that I can think of is that FeedAPI creates a node to represent a data feed in the system. We could find a way around this though. The exciting part here is that ideally, Aron will be working on Aggregator for D7 as an SoC project. So insights that we gain here could go directly into the next generation aggregator.
From CrashTest:
http://drupal.org/project/search_attachments there may be some functionality here
| Attachment | Size |
|---|---|
| docapi_erd.pdf | 27.6 KB |
| DocImport_Process2.png | 151.82 KB |
Comments
Thoughts...
Hmmm...I'd rather pitch in and help create a way to directly parse MS Office XML than to try and convert to ODT, convert to XHTML.
Main reason being that MS Office to ODT conversion has a spotty record for producing desired results. The people this is targeted towards won't really understand why sometimes it "doesn't work" (because of Office>ODT>XHTML transformations). (that is, if I am understanding this correctly)
I am all for re-using and saving time and energy. The point here is there is nothing really good out there that does direct MS Office to valid XHTML conversion. I think the XML route proposed above has promise. I don't think he should throw a round-about, half-working solution into the mix if a better one can be had by building it.
This makes sense. It would be worthwhile to at least investigate what exists for parsing OOXML.
I think you are mostly right about this. It's generally not very easy, and in many cases not possible, to install many of them without root access. Perl and Python modules are an exception, as there usually is a way to install them.
I think there have been at least a few people who've taken a crack at PHP Word to XHTML converters. We should probably look around and see what is out there...
I hope that this does work.
Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation
Sam Rose
Hollymead Capital Partners
P2P Foundation
Social Media Classroom
Post from mailing list
Maybe you saw this already: http://lists.drupal.org/pipermail/development/2008-May/029856.html
whole thread is about OOo integration
I didn't see that one,
I didn't see that one, thanks for the link!
~Chris
Worth looking at this module
Worth looking at this module as well: http://drupal.org/project/fileframework
So, anything new to report
So, anything new to report here? Plugins working yet? Need help with them?
Sam Rose
Social Synergy
Open Source Ecology
P2P Foundation
Sam Rose
Hollymead Capital Partners
P2P Foundation
Social Media Classroom
I was rushing towards the
I was rushing towards the end of the project and made a few critical mistakes while generating the nodes. The actual API bit, where the information is passed to and from the parsers, is solid; the mapping is where the work really needs to be done. I will have a lot more time to contribute to my modules starting next week, most notably to make the mapping system usable and to generate usable nodes. If anyone would like to work on parsers (I know I want to, but the framework needs to be done first), I would be happy to answer any questions and put together basic documentation.
~Chris
Parser?
Hi there,
I just installed this module and don't know what to do, because on the upload page I don't have anything to select to parse a document or image that I upload. The only option I get is Automatically Detect. I uploaded the Simplecore and Zend because it looks like they parse, but they don't work for this. So, what can I do to fix this?
I read the documentation but I can't find anything useful that explains what's going on.
Thanks,
Becky