While trying to close one of the oldest open issues on the xapian module (IIRC), Index uploaded files, I ended up figuring out that it would be useful to have a way to generate a plain text representation of any field.
So I have started a sandbox with enough code to get a plain text version of file fields: Plain.
I have seen other modules that do this kind of conversion, but they are either not available for D7 (PDF To Text, HTML2Text) or take a completely different approach (File Framework), so I started one from scratch.
It would be great if someone could point me to any other efforts I may have missed, or comment on this.
This could help search engines index content better. For example, on xapian I would only need to implement hook_node_update_index()
and use a single line from this module:
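<?php
/**
 * Implements hook_node_update_index().
 */
function xapian_files_node_update_index($node) {
  // Return the plain text version of the node's fields as extra text to index.
  return plain_entity_fields_plain_contents($node, 'node', $node->type);
}
?>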
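As a rough illustration of the kind of conversion involved, here is a minimal sketch that reduces an HTML fragment to plain text using only PHP built-ins. It is not code from the Plain sandbox, and example_html_to_plain() is a hypothetical name chosen for this example; a real converter would need to handle many more cases (malformed markup, encodings, binary file formats).

```php
<?php
/**
 * Hypothetical helper (not part of Plain): reduce an HTML fragment to text.
 */
function example_html_to_plain($html) {
  // Drop script and style elements together with their contents.
  $text = preg_replace('@<(script|style)\b[^>]*>.*?</\1>@si', ' ', $html);
  // Put a space before each tag so adjacent blocks do not run together.
  $text = str_replace('<', ' <', $text);
  // Remove the remaining tags, then decode HTML entities.
  $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
  // Collapse runs of whitespace into single spaces.
  return trim(preg_replace('/\s+/', ' ', $text));
}
```

For instance, passing '<p>Tom &amp; Jerry</p><p>End</p>' should give back the string 'Tom & Jerry End', which is the sort of output a search backend can index directly.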
Comments
marvil07,
Cool initiative! I am actually working on something similar called the Converter module, which has the side effect of being able to convert various documents to plain text for search indexing. My overall goal is to provide a solution that is backend agnostic so that core Search, Apache Solr, Search API, etc. can use it. In addition, it can convert documents to various formats in an effort to compete with platforms such as SharePoint. Would love to collaborate if interested.
Thanks,
Chris
definitely
It would be great to collaborate.
That was exactly my motivation for starting the Plain project instead of embedding that code in the xapian module.
I would like to know your opinion on this:
Converter module description states:
That means it is only about files.
Plain module description states:
Plain module is about converting any field (including file fields) into a plain text version.
So, we can solve this for example in one of these ways:
Do you have any other alternatives in mind?
marvil07,
Awesome! Look forward to speaking more about this. Although Converter is geared towards files, its underlying parsers do accept raw input so you can even convert HTML to plain text, an MS Word document, etc. I don't think I have coded anything to allow that, though.
I will make sure I download the plain module and try it out, look under the hood, etc.
Great initiative!
Chris