Plain text for fields

marvil07's picture

While I was trying to close one of the oldest open issues on the xapian module, IIRC (Index uploaded files), I ended up figuring out that it could be useful to have a way to generate a plain text representation of any field.

So, I have started a sandbox that has enough code to get a plain text version of file fields: Plain.

I have seen other modules that do that kind of conversion, but they are either not available for D7 (PDF To Text, HTML2Text) or take a completely different approach (File Framework), so I started one from scratch.

It would be great if someone could point me to any other efforts I may not have seen, or comment on this.

This could help search engines index content better. For example, on xapian I would only need to implement hook_node_update_index() and write a single line there using this module:

<?php
function xapian_files_node_update_index($node) {
  return plain_entity_fields_plain_contents($node, 'node', $node->type);
}
?>

Comments


cpliakas's picture

marvil07,

Cool initiative! I am actually working on something similar called the Converter module, which has a side effect of being able to convert various documents to plain text for search indexing. My overall goal is to provide a solution that is backend agnostic so that core Search, Apache Solr, Search API, etc. can use it. In addition, it can convert documents to various formats in an effort to compete with platforms such as SharePoint. Would love to collaborate if you are interested.

Thanks,
Chris

definitely

marvil07's picture

It would be great to collaborate.

My overall goal is to provide a solution that is backend agnostic so that core Search, Apache Solr, Search API, etc can use it

That was exactly my motivation for starting the plain project instead of embedding that code in the xapian module.

I would like to know your opinion on this:

Converter module description states:

This module provides an API for converting files to and from various formats. The method of conversion is pluggable, so different backends such as unoconv, Apache Tika, and others can be used with a consistent interface.

That means it is only about files.

Plain module description states:

This module provides a way to represent Drupal information (fields) as plain text.

Currently, the only generators (ctools plugins) available are:

  • file: Convert files to plain text using external tools (pdftotext, ps2pdf, catdoc, html2text, unrtf, catppt, xls2csv).
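To make the idea of a file generator concrete, here is a minimal sketch of how a file extension could be mapped to one of those external tools. The function names and the extension map are hypothetical and are not taken from the Plain sandbox; the sketch only assumes that the listed tools are installed and write their result to standard output.

```php
<?php
/**
 * Hypothetical helper: build the shell command that would turn a file into
 * plain text. Returns NULL when no tool is known for the file extension.
 */
function example_plain_file_command($path) {
  // Assumed mapping; each command prints the converted text to stdout.
  $commands = array(
    'pdf'  => 'pdftotext %s -',
    'doc'  => 'catdoc %s',
    'rtf'  => 'unrtf --text %s',
    'html' => 'html2text %s',
    'ppt'  => 'catppt %s',
    'xls'  => 'xls2csv %s',
  );
  $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
  if (!isset($commands[$extension])) {
    return NULL;
  }
  // Quote the path so spaces or shell metacharacters cannot break the command.
  return sprintf($commands[$extension], escapeshellarg($path));
}

/**
 * Hypothetical helper: run the matching tool and return its trimmed output,
 * or an empty string when the file type is unsupported or the tool fails.
 */
function example_plain_file_contents($path) {
  $command = example_plain_file_command($path);
  if ($command === NULL) {
    return '';
  }
  $output = shell_exec($command);
  return is_string($output) ? trim($output) : '';
}
```

Keeping the command building separate from the execution makes it possible to check the mapping without having any of the tools installed.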

The Plain module is about converting any field (including file fields) into a plain text version.

So, we could solve this, for example, in one of these ways:

  • Generalizing one of the two modules a little more:
    • Making the plain module able to convert any field into any output, instead of only into plain text (probably renaming it to field_converter).
    • Making the converter module able to convert any field into any output, instead of only converting files into any format.
  • Using the Converter module inside the plain module, instead of embedding all the conversion logic in the generator plugin.
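As a rough illustration of the last option, the generator could receive a conversion backend instead of calling the external tools itself. Everything in this sketch is hypothetical: the interface, the class, and the function names are invented for the example, and Converter's real API may look quite different.

```php
<?php
/**
 * Hypothetical contract for a conversion backend (not Converter's real API).
 */
interface ExampleTextBackend {
  public function toPlainText($path);
}

/**
 * Stand-in backend that only reads files that are already plain text.
 */
class ExamplePassthroughBackend implements ExampleTextBackend {
  public function toPlainText($path) {
    if (!is_file($path) || !is_readable($path)) {
      return '';
    }
    $contents = file_get_contents($path);
    return $contents === FALSE ? '' : trim($contents);
  }
}

/**
 * Hypothetical generator: collect plain text for a list of file paths by
 * delegating each conversion to whichever backend was passed in.
 */
function example_plain_generate(array $paths, ExampleTextBackend $backend) {
  $chunks = array();
  foreach ($paths as $path) {
    $text = $backend->toPlainText($path);
    // Skip files the backend could not convert.
    if ($text !== '') {
      $chunks[] = $text;
    }
  }
  return implode("\n", $chunks);
}
```

With this split, the plain module would keep the field handling, while the choice of unoconv, Apache Tika, or a command line tool would stay behind the backend.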

Do you have any other alternatives in mind?


cpliakas's picture

marvil07,

Awesome! Look forward to speaking more about this. Although Converter is geared towards files, its underlying parsers do accept raw input, so you could even convert HTML, an MS Word document, etc. to plain text. I don't think I have coded anything to allow that, though.

I will make sure I download the plain module and try it out, look under the hood, etc.

Great initiative!
Chris