Sphinxsearch integration

johsw@drupal.org's picture

Following Yelvington, I'm cross-posting this to the following groups: SOC2008, Knight Foundation, Newspapers on Drupal and Search

One thing that would be really cool is a module integrating http://www.sphinxsearch.com/ and Drupal. Our experience is that core search doesn't play nice when you have a lot of nodes (we have 150,000+). Indexing simply kills the server.

So instead we use Sphinx. It's REALLY fast, both when searching and when indexing. BUT every time we alter our content types we have to manually update the sphinx configuration. This is why I propose this as a module for SoC08 - a module that integrates Sphinx and Drupal.

The module could adjust the Sphinx configuration according to changes made to content types. And it could reindex on cron etc...
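To make the idea concrete, the piece that has to be kept in sync with content types is the source/index definition in sphinx.conf. A minimal, illustrative example of what such a module might generate (all names, credentials and paths here are made up):

```
source drupal_nodes
{
  type     = mysql
  sql_host = localhost
  sql_user = drupal
  sql_pass = secret
  sql_db   = drupal

  # One column per indexed field; this is the part that changes
  # whenever a content type is altered.
  sql_query = \
    SELECT n.nid, n.title, r.body, UNIX_TIMESTAMP(n.changed) AS changed \
    FROM node n JOIN node_revisions r ON n.vid = r.vid \
    WHERE n.status = 1

  sql_attr_timestamp = changed
}

index drupal_nodes
{
  source       = drupal_nodes
  path         = /var/lib/sphinx/drupal_nodes
  charset_type = utf-8
}
```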

With more and more projects depending on enterprise scalability, this could be a project that ensures Drupal has well-functioning search even with huge amounts of data.

I'm not sure I have the knowledge to be a mentor for the project, but I would love to help out in any way that I can.

More info about Drupal and Sphinx integration:
http://drupal4hu.com/node/129
http://www.sphinxsearch.com/forum/view.html?id=100

Comments

Seems like a good idea to

moshe weitzman's picture

Seems like a good idea to me.

I have already begun working

Rob Knight's picture

I have already begun working on a module that does this for a project that I am currently working on. The module should be in a usable state within the next few weeks.

Cool...

johsw@drupal.org's picture

...I'm really looking forward to seeing it. Perhaps you can show what you've made so far?

I heard at DrupalCon that

karens's picture

I heard at DrupalCon that there are plans for a code sprint to rework Drupal's search to allow for different pluggable back ends (like the Solr module). I don't know anything else about it, but it sounds like there should be some coordination with that effort. Not sure who is handling this, but amazon is setting up several code sprints, so maybe he knows more?

The search system is pretty

moshe weitzman's picture

The search system is pretty pluggable already. Namely, you can disable the search module, take over the search/node menu handler, add your own blocks, etc. There might be more to it, but not that much more.

Are you sure there is enough meat

chx's picture

Are you sure there is enough meat on this? We are using Sphinx and I have posted http://drupal4hu.com/node/129 -- what would such a module do? Figuring out the sphinx query automatically is not trivial, to say the least, and it really depends on what you want to search... the search itself is just a few lines of code... so what exactly would such a module do?

There are two important

Rob Knight's picture

There are two important things that a module could do:

1) Creating the sphinx config file (or, more specifically, the index definitions)
2) Integrating with existing Drupal systems, such as replacing the existing Drupal search, and Views integration (similar to Views Fast Search)

Those features would both make it easier to deploy Sphinx and would extend the usefulness of the service.

views integration

moshe weitzman's picture

Views integration like Views Fast Search would be really swell.

johsw@drupal.org's picture

Well, I was thinking something close to Views. But I haven't got a precise spec in my head, so I'm hoping for some support. I'll try to think out loud...

With Views you can print any field from any content type. A Sphinx integration module could be similar, but instead of choosing what fields to print and how, you could select which fields to index and how.

OR

We could make it fully automated - I know you say it's non-trivial, but it must be possible. It would be really cool if you could just check or uncheck which content types you wanted to index. A content type must know how and where its data is stored somehow, otherwise it wouldn't be able to print the contents. This is almost the same query we want for the indexing, right?

OR

We could make the really complex, combined one - fully automated but customizable.

Anyways my indexing sql looks like this:

SELECT content_type_avisartikel.nid, content_field_brdtekst.field_brdtekst_value, \
GROUP_CONCAT(DISTINCT term_data.name SEPARATOR ' '),  content_type_avisartikel.field_underrubrik_value,\
content_field_ekstern_skribent.field_ekstern_skribent_value, node.title, \
GROUP_CONCAT(DISTINCT profile_values.value SEPARATOR ' ' ), \
UNIX_TIMESTAMP( content_field_avisdato.field_avisdato_value ) AS avisdato \
FROM content_type_avisartikel  \
JOIN node FORCE INDEX(nid) ON content_type_avisartikel.nid = node.nid \
JOIN term_node ON node.nid = term_node.nid \
JOIN term_data ON term_data.tid = term_node.tid \
JOIN content_field_skribent ON content_field_skribent.nid = content_type_avisartikel.nid \
JOIN content_field_ekstern_skribent ON content_field_ekstern_skribent.nid = content_type_avisartikel.nid \
JOIN content_field_avisdato ON content_field_avisdato.nid = content_type_avisartikel.nid \
JOIN content_field_brdtekst ON content_field_brdtekst.nid = content_type_avisartikel.nid \
LEFT JOIN profile_values ON content_field_skribent.field_skribent_uid = profile_values.uid \
AND profile_values.fid =1 \
WHERE node.status = 1 \
GROUP BY content_type_avisartikel.nid

It's a killer to maintain. There's gotta be some way we can automate and interface this. Help me out :-)

What about using xmlpipe2?

markus_petrux's picture

...Figuring out the sphinx query automatically is not trivial...

Have you thought about using the Sphinx xmlpipe2 source type? Using this method it is possible to write a function that outputs what needs to be indexed in simple XML form; that function can build node content the same way Drupal's own search, the Apachesolr module, etc. do.
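For anyone who hasn't seen it, a minimal xmlpipe2 stream (per the Sphinx documentation) looks roughly like this; the field and attribute names here are just illustrative:

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="changed" type="timestamp"/>
  </sphinx:schema>
  <sphinx:document id="123">
    <title><![CDATA[Node title goes here]]></title>
    <content><![CDATA[Rendered node body, stripped of tags]]></content>
    <changed>1210000000</changed>
  </sphinx:document>
</sphinx:docset>
```

The document id would be the nid (plus an offset if other object types share the index), and each indexed field becomes one child element.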

Seems like we could add a

moshe weitzman's picture

Seems like we could add a new tab to content types page that looks like the Fields tab from CCK. But it has checkboxes where you indicate which fields should be in the index. Given that info, we can select the right fields and perform the right Joins.

It is true that this is a lot like Views. One could potentially ask the admin to make a View that contains all the fields he cares about, and then the Sphinx module would use the Views query builder to figure out the fields and joins.

I just posted up thoughts

BlakeLucchesi's picture

I just posted up thoughts from the DrupalCon BoF on my blog (http://boldsource.com/articles/advancing-drupal-search). Robert Douglass also discussed it recently here (http://robshouse.net/2008/03/05/event/drupalcon-boston-solr-bof). I think the thing that most people agreed with there was that Drupal search doesn't scale, doesn't perform as well as many others, and isn't as feature rich. This leads us to conclude that we should use other search implementations when available to do so (such as solr/sphinx).

The question then, is how do we make the system flexible enough for a site admin to implement any engine he/she has access to, and how can Drupal provide them with an easy way to configure and integrate it with their site so that Drupal doesn't limit the features that the search engine boasts?

Sounds like a couple different projects...

webchick's picture

Project #1 is writing a module to handle integration of Sphinxsearch and Drupal. @Rob, can you confirm/deny that this will already be done by the time SoC starts (May 26)? If so, this is not a good candidate as a SoC project.

Project #2 is abstracting search.module so that it can take pluggable back-ends. This is a completely different problem set. @Blake, are you going to be a SoC student again this year? If so, is this something you're interested in tackling? If not, would you want to mentor a student tackling this? :)

Yes, I can confirm that the

Rob Knight's picture

Yes, I can confirm that the Sphinx search module will be done before SoC starts.

I agree, it makes sense to treat the pluggable search engine system as a separate project, and I would update the Sphinx module to work with that if/when it is completed. If someone wishes to begin planning an architecture for that now, I can incorporate some of those ideas into my implementation of the Sphinx module.

As I noted above, there are

karens's picture

As I noted above, there are already plans for a code sprint to work on creating a pluggable search engine for Drupal, so you should find out more about that (or maybe participate in it) so you aren't duplicating effort. It is going to take place at the University of Minnesota (where the usability study was done), but I don't have any more information than that.

Another approach

yelvington's picture

This is a bit of a tangent but perhaps worth mentioning.

We're using FAST to index all of our newspaper websites, a territory that covers a wide range of data types implemented on a wide range of technologies. We simply disabled Drupal's search.module entirely and implemented forms that point to a small custom PHP layer on top of FAST, which handles all rendering and output for search queries, whether they involve Drupal, mdAutos, Spotted (photo galleries), or Yahoo Web data.

The real Drupal integration was in the area of indexing, to ensure that the FAST idea of what we have in Drupal remains in sync with what's really in Drupal. We implemented an extensible module that tells FAST what's new and what needs to be deleted. Custom extensions for each Drupal content type (including users, not just nodes) give FAST a clean XML representation of the object.

Integrated search examples:
http://search.onlineathens.com/
http://search.blufftontoday.com/
http://search.jacksonville.com/

Good idea to create a small

chx's picture

Good idea to create a small helper which pulls the Views query to use as a sphinx query. This is... five lines of code, I think. I still do not see what someone would spend three months on with sphinx and Drupal integration.

Agree with chx. As I said,

Rob Knight's picture

Agree with chx. As I said, I'm working on a Sphinx module already anyway, and I don't anticipate it taking anywhere near 3 months. If it takes another 3 weeks I'll be annoyed!

There might be a good case for a general search infrastructure overhaul, but that is being addressed already by the code sprint mentioned by KarenS.

Progress?

David Lesieur's picture

I think people will also want to discuss engines like Sphinx at the search sprint, so if there is "prior art" by then, it could provide an interesting basis for discussions. ;)

Any Progress?

freynder@drupal.be-gdo's picture

I would be interested in this module as well. Do you still plan on releasing something?

Thank you

Sphinx sample

markus_petrux's picture

You may want to take a look at this. Very helpful for getting into Sphinx (talk slides and a sample Drupal module).

http://mikecantelon.com/story/sphinx-talk-slides

I found it today googling :-)

I'm also interested in Sphinx. First for a Drupal-based site, then to index a message board with around 14 million posts. I just started to play with the sphinx configuration, creating sources, indexes and so on. For the moment, I'm using the sample PHP code to play with searches from the client box, with the server on a separate box.

I still have some doubts about the best method to set things up. For instance:

  • Not sure yet how to index comments. Maybe using the xmlpipe2 data source plus a little program that generates the XML in a similar way to how Drupal search works, rather than a simple query with the mysql data source.

  • Not sure how to setup all the UTF-8 stuff.

  • Too soon, but I think it's possible to implement facets or something...

Here's one sample sphinx.conf I'm using to play with Drupal, if that helps:

http://www.gamefilia.com/files/samples/sphinx.conf.txt

Thank you for those

freynder@drupal.be-gdo's picture

Thank you for those details.

I also agree that the indexer should not fetch data from the database directly. It would be good if the module could generate this xmlpipe2 data so that other modules can add keywords and additional attributes can be generated. A cron could be set up that would call the indexer with this data. The number of items could then also be limited to a maximum number of nodes at a time like it is in the current search module, in order not to impact the system.

I'm thinking about implementing this if nobody is working on it.

I'm playing with the XMLPipe approach right now

markus_petrux's picture

The "problem" here is that a single pipe needs to be associated with a single index, so everything has to be generated in one single step, or you have to think about several sources/pipes/indexes once a limit is reached; otherwise you end up with lots of indexes, which would probably impact sphinx performance. Ideally, only main+delta would be enough, though I'm not sure if a Drupal site can generate so much content that it reaches some kind of limit...

I'll post here the module I'm working on as soon as it reaches something useful.

For the moment, the xmlpipe generator performs database connections in chunks, processing 1000 (maybe 5000, I have to play with that) nodes at a time, closing the mysql connection, connecting again, and processing the next 1000 nodes while sending the XML with no output buffering, so that the other side, the sphinx indexer process, can do its job as soon as output is sent.

The content for each node is processed the way the search module does it, allowing other modules to include their stuff via the 'update index' hook, then stripping all HTML tags (while keeping word boundaries) and removing redundant whitespace to reduce the data transmitted.

The xmlpipe generator is almost done, I need to work a little bit more on it, and then provide a simple search form or something, so it all does something...

My environment is, more or less, several Apache boxes, separate server for sphinx, and separate server for MySQL.

Edited: Well, I'm having problems because of some kind of memory leak in Drupal, or maybe in my environment. I'm discussing it in the sphinx forums, here:

http://www.sphinxsearch.com/forum/view.html?id=1215#9699

The problem I'm having is that there's a loop where each node in the system is processed to generate the HTML so it can be sent to the XMLPipe, but memory usage in PHP gets higher with each iteration, until the PHP memory limit is reached.

I have no idea where this memory consumption comes from. I'm running the latest stable versions of everything (Apache, PHP, Drupal 5, etc.). I guess it is something implicit in how Drupal works. Hence I'm stuck at this point.

Here's the code I'm using to generate the XMLPipe. There are some snippets commented out that I'm using to make tests, but I hope it gives an idea of what I'm doing.

function sphinx_db_reconnect() {
  global $db_url, $db_type, $active_db;
  static $connect_url;
  if (!isset($connect_url)) {
    if ($db_type != 'mysql') {
      return;
    }
    if (is_array($db_url)) {
      $connect_url = $db_url['default'];
    }
    else {
      $connect_url = $db_url;
    }
  }
  mysql_close($active_db);
  $active_db = db_connect($connect_url);
}
function sphinx_xmlpipe() {
  $nodes_total = (int)db_result(db_query('SELECT COUNT(*) FROM {node} WHERE status = 1'));
//  $nodes_total = 2;
  $nodes_per_chunk = 500;
  // Make sure no output buffering is being used.
  if (ob_get_level()) {
    ob_end_clean();
  }

  print '<?xml version="1.0" encoding="utf-8"?>' ."\n";
  print '<sphinx:docset>' ."\n";

  for ($start = 0; $start < $nodes_total; $start += $nodes_per_chunk) {
    // nid IN (9601, 9602, 10452) AND
    $result = db_query_range('SELECT nid FROM {node} WHERE status = 1 ORDER BY nid DESC', $start, $nodes_per_chunk);
    $i = 0;
    while ($node = db_fetch_object($result)) {
      $i++;
      $node = node_load($node->nid);
/* Debug override used while testing: replace node content with counters.
$node->title = 'title-'. ($start + $i);
$node->body = $text = 'body-'. ($start + $i);
$node->uid = 0;
$node->changed = time();
*/

      $text = sphinx_get_node_text($node);

/* Debug instrumentation used while chasing the memory leak:
$globals_size = strlen(serialize($GLOBALS));
$text = "\n". 'node_count='. ($start + $i) .'/'. $nodes_total
  ."\n". 'memory_usage='. memory_get_usage()
  ."\n". 'globals_usage='. $globals_size
  ."\n". $text;
*/

      // finally, output the xml for this document
      print '<sphinx:document id="'. ($node->nid + SPHINX_DOCUMENT_ID_OFFSET) .'">' ."\n";
      print '<nid>' . $node->nid . '</nid>' ."\n";
      print '<uid>' . $node->uid . '</uid>' ."\n";
      print '<timestamp>' . $node->changed . '</timestamp>' ."\n";
      if (!empty($node->taxonomy) && is_array($node->taxonomy)) {
        foreach ($node->taxonomy as $term) {
          print '<tid>' . $term->tid . '</tid>' ."\n";
        }
      }
      // check_plain() also guarantees no ']]>' sequence can appear inside the CDATA section.
      print '<title><![CDATA[' . check_plain($node->title) . ']]></title>' ."\n";
      print '<content><![CDATA['. check_plain($text) .']]></content>' ."\n";
      print '</sphinx:document>' ."\n";
      unset($text, $node);
    }
    sphinx_db_reconnect();
  }
  print '</sphinx:docset>';
  exit;
}

function sphinx_get_node_text(&$node) {
  // Build the node body.
  $node = node_build_content($node, FALSE, FALSE);
  $node->body = drupal_render($node->content);

  // Allow modules to modify the fully-built node.
  node_invoke_nodeapi($node, 'alter');

  $text = check_plain($node->title) ."\n". $node->body;

  // Fetch extra data normally not visible
  $extra = node_invoke_nodeapi($node, 'update index');
  foreach ($extra as $t) {
    $text .= $t;
  }
  unset($extra, $t);

  // Strip control characters that aren't valid in xml.
  // See http://www.w3.org/International/questions/qa-controls
  $text = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]#S', ' ', $text);

  // Insert spaces around tags to keep word boundaries, then strip the tags.
  $text = str_replace(array('<', '>'), array(' <', '> '), $text);
  $text = strip_tags($text);
  // Remove redundant spaces and line breaks to reduce overall output size.
  $text = preg_replace('# +#', ' ', $text);
  $text = preg_replace("#(\s*\n)+#", "\n", $text);

  return $text;
}

The infrastructure team for

moshe weitzman's picture

The infrastructure team for drupal.org is very interested in alternate search solutions like this. For example, we just deployed the Xapian module for all drupal.org users. You might borrow code from that module's implementation. When this is in a good state, please notify the infra mailing list.

Glad to contribute, if possible, though

markus_petrux's picture

I think I have reached a point where this approach doesn't work. I can't generate all the content for the XMLPipe because of some kind of memory leak whose cause I'm unable to find.

Any help would be really appreciated; otherwise I think I'll pregenerate content in a separate table, and then use sphinx with a normal mysql source. That means having enough room for all the pregenerated content. Maybe that's enough for me now, since the site I'm working on has about 10000 nodes and 30000 comments. But drupal.org is a lot bigger...

Maybe sphinx will evolve in some way that makes it possible to perform live document updates in the index (not possible today), without the need to rely on main+delta indexes.

In summary: I'm not sure how to proceed, and maybe I'll choose a road where things only work for a limited number of nodes; time will tell...

This is a memory leak in

moshe weitzman's picture

This is a memory leak in Drupal? Please elaborate.

markus_petrux's picture

I believe it should work in any Drupal 5.x installation. All the code above needs to run is a temporary sphinx module: add hook_menu() to create a MENU_CALLBACK for the path 'xmlpipe' that invokes sphinx_xmlpipe(), then try something like this:

wget -O - -q -t 1 http://www.example.com/xmlpipe > xmlpipe.txt
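For completeness, the Drupal 5 menu hook he describes would look something like this sketch (module name `sphinx` assumed, and the wide-open access check is only for testing):

```php
<?php

/**
 * Implementation of hook_menu() (Drupal 5 style).
 */
function sphinx_menu($may_cache) {
  $items = array();
  if ($may_cache) {
    $items[] = array(
      'path' => 'xmlpipe',
      'callback' => 'sphinx_xmlpipe',
      // Open for testing only; restrict by role, key or IP in production.
      'access' => TRUE,
      'type' => MENU_CALLBACK,
    );
  }
  return $items;
}
```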

On my system with 10000 nodes it crashes because it reaches the PHP memory limit, which is 64M here.

Then, looking at the generated file, where I'm printing out the result of memory_get_usage() for every node, I can see how it grows and grows...

I'm also printing the size of the globals, but that doesn't grow. It must be something else, but I don't know how to see more than that. Something is using memory which is not released in the sphinx_xmlpipe() loop.

I don't know how to see if it is Drupal, PHP, or something else. :(

One thing I thought of, if it is caused by Drupal, is that Drupal doesn't use mysql_free_result(). Note that this code is processing all nodes in the system, so it may need thousands of queries.

It would be nice if someone could grab the code above into a temp module and test it. Any advice on how to trace memory usage would also be appreciated, or maybe this needs a different approach...

My system is PHP 4.4.8 running as mod_php in Apache 2.2.9, eAccelerator 0.9.5.3, MySQL 5.0.51a, running Drupal 5.8 (with a lot of modules and custom code). Latest versions of all components; dev and production run the same stack. The live site in production is here.

node_load() keeps a static

moshe weitzman's picture

node_load() keeps a static cache that grows and grows; that might be the culprit here. There is a parameter you can pass (the third argument, $reset) to avoid this.

Was able to reproduce in

freynder@drupal.be-gdo's picture

I was able to reproduce this in Drupal 6.3. Resetting the node cache for node_load() helps, but there is still additional memory consumed by calling node_load(). Furthermore, it seems node_build_content() somehow also increases memory usage.

hmm... static vars is a problem for xmlpipe

markus_petrux's picture

I'll try that parameter; I hadn't thought about it.

Anyway, I'm rewriting the code so it checks the amount of memory being used and stops when 90% is reached. The next problem is max_execution_time, because the whole process may take a while.

So... the xmlpipe will stop before anything like this breaks. That will create the main sphinx index; then it will need to run more times for delta sphinx indexes, merging into the main index until all nodes are indexed.
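A 90% guard like that can be written portably. This is my own sketch, not code from the module; it parses PHP's memory_limit setting and reports when usage gets close:

```php
<?php

// Return TRUE when current usage exceeds 90% of PHP's memory_limit.
function sphinx_memory_nearly_exhausted() {
  $limit = trim(ini_get('memory_limit'));
  if ($limit == '' || $limit == '-1') {
    return FALSE; // No limit configured.
  }
  // memory_limit is a number with an optional K/M/G suffix.
  $bytes = (float) $limit;
  $unit = strtolower(substr($limit, -1));
  if ($unit == 'g') {
    $bytes *= 1024 * 1024 * 1024;
  }
  elseif ($unit == 'm') {
    $bytes *= 1024 * 1024;
  }
  elseif ($unit == 'k') {
    $bytes *= 1024;
  }
  return memory_get_usage() > 0.9 * $bytes;
}
```

The main loop would call this once per node and break out cleanly (closing the docset element) when it returns TRUE.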

The process I'll try would be as follows:

1) Initialize the queue of nodes that need to be indexed.
wget http://www.example.com/sphinx-xmlpipe/init

This process will insert rows into a table, with just the nid. A nid in this table means that the node needs to be indexed. Nodes are queued by this process, and also when a node is created/updated, comments are posted, etc.

2) run sphinx indexer for main index with xmlpipe command like this:
wget http://www.example.com/sphinx-xmlpipe

This process will generate XMLPipe content for as many nodes as possible. The process will end when a) all nodes are processed, b) just before memory_limit is reached, or c) just before max_execution_time is reached. The goal is to always be able to send a complete, well-formed XMLPipe stream.

3) run sphinx indexer for delta index with xmlpipe command like this:
wget http://www.example.com/sphinx-xmlpipe

4) run sphinx indexer to merge delta into main.

5) Repeat steps 3+4 until all nodes are indexed.

6) Run the delta index from cron at regular intervals to process new and updated nodes.
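On the Sphinx side, steps 1-4 above map onto the stock indexer commands; a sketch, assuming index names `main` and `delta` and hypothetical URLs:

```sh
# 1) Queue every node for indexing (hypothetical init path from the plan above).
wget -q -O /dev/null http://www.example.com/sphinx-xmlpipe/init

# 2) Build the main index; sphinx.conf's xmlpipe_command points at the xmlpipe URL.
indexer --config /etc/sphinx/sphinx.conf main

# 3) Build the delta index from whatever is still queued.
indexer --config /etc/sphinx/sphinx.conf --rotate delta

# 4) Merge the delta into main without stopping searchd.
indexer --config /etc/sphinx/sphinx.conf --merge main delta --rotate

# 5) Repeat steps 3+4 until the queue is empty.
```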

Once I've checked that this method is able to process all my nodes, I'll post the code so anyone can test it. It would be great to get feedback on this, as well as on the approach it is taking. :-)

Yeah, reset in node_load() worked! :-)

markus_petrux's picture

It worked. Around 9 minutes to generate ~10000 nodes.

[user@hostname ~]# curl http://www.example.com/sphinx-xmlpipe > xmlpipe.tmp

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38.6M    0 38.6M    0     0  75233      0 --:--:--  0:08:59 --:--:-- 16107

Anyway, I have added controls in the loop so it stops execution before reaching memory_limit or max_execution_time. It also restarts DB server connection every configurable amount of nodes, so we do not get DB timeouts.

I have updated my previous comment to highlight how the process will work.

Now, it's time to complete these steps, clean up the code a bit, and work on the query form.

Ok, just sent a note to the infra mailing list

markus_petrux's picture

Let's see if someone there can find something usable in what I've been trying here. If not, then I'll post the code somewhere so it can be used, at least, as an example or something, as I found little information on this.

Ok, marked this idea as rejected

webchick's picture

Ok, marked this idea as rejected. Sorry!

Good news though is that it sounds like it'll be implemented soon. :D

@webchick Yes I am looking

BlakeLucchesi's picture

@webchick Yes, I am looking at applying to be an SoC student again this summer, and I feel like doing something to improve search would be a benefit to the community. My plan is to continue working with Robert Douglass and David Lesieur to see what can be extracted from their work so far on creating the more extensible search API. Whether this would be a good SoC project or not, I haven't quite decided.

I'd like to see those 5 lines of code

webrat-gdo's picture

I'd love to see those 5 lines of code to implement Sphinx search in Drupal. I'm new to Drupal - and I'm sure others are too - and have no clue about half of what you all are talking about. If Drupal is to become this user-friendly, take-over-the-world app, things like implementing other search engines will have to be as simple as installing and configuring a module. Perhaps this is an oversimplistic view of a complex matter.

Rob, rock on with your module for the better of us that are less capable programmers.

Any news on this module,

supered's picture

Any news on this module, Rob? A Drupal + Sphinxsearch integration is something that I badly need.

Btw, do you people know which tables I should use to build a Sphinx index? Or a reference? johsw@drupal.org has custom-built tables; they don't exist in common Drupal installations (or at least it is a different version of Drupal; mine is 5.7).

As far as I've seen, Drupal stores nodes in the 'node' table, node types in 'node_type' (is it right that the fields are indexed by varchars instead of integers?) and content in 'node_revisions'. How should I relate node_revisions to node? Any insight?

Thank you very much!


nodes and node_revisions

Garrett Albright's picture

node and node_revisions relate in that node_revisions stores all the revisions of a given node under the same nid, but each revision has a unique vid value. In other words, to find the most recent (current) version of a node with an nid of X, you query the node_revisions table with something like "SELECT * FROM {node_revisions} WHERE nid = X ORDER BY vid DESC LIMIT 1".

…Or, if you're not using Drupal's revisions feature and never plan to in the future, you could cheat and do a query like "SELECT * FROM {node_revisions} WHERE vid = X", because, in that case, the vid will always equal the nid, as there will only ever be one revision per node.
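Worth noting: the node table itself carries a vid column pointing at the current revision, so the usual way to pull the current body for indexing is a straight join, with no ORDER BY or revision-counting cheats needed:

```sql
-- Current revision of every published node, via node.vid.
SELECT n.nid, n.title, r.body, r.teaser
FROM node n
JOIN node_revisions r ON n.vid = r.vid
WHERE n.status = 1;
```

This works whether or not the site uses revisions, which makes it a safer base for a Sphinx sql_query.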

Thank you very much

supered's picture

Thank you very much Garrett.

By the way, by googling I also found this page, which shows some code. But where would that code go? Should it be in a module or should I change some PHP script?



Sorry, that I can't answer.

Garrett Albright's picture

Sorry, that I can't answer. I haven't actually used Sphinx on a project yet, so I'm not really sure how it works.

two sphinx modules?

drupaloSa's picture

There are two Sphinx integration modules on d.o, but neither of them has actually made the module available for download yet. I'm curious about how they will progress.



Yup!

johsw@drupal.org's picture

I'm the maintainer of the Sphinx module (http://drupal.org/project/sphinx), and I plan to release a 5.x beta version within a week. My focus will be the frontend and administration of available indexes, fields and attributes at first. Maybe later on (depending on the work on the module at http://drupal.org/project/sphinxsearch) I'll work on the indexing part.

Best,
/Johs.

Sphinxsearch code in CVS head now.

markus_petrux's picture

I just need to figure out how to create a development release... I may even create a stable release, since the new code is running in production here with no apparent issues, but I may wait a little because of integration issues or other differences that someone else may find while testing.

The hardest part was documenting all the relevant stuff in the readme and other places. Please forgive me if I made mistakes or something is difficult to understand. Help here is more than welcome. :-)

Then, future plans are described in the readme.

Cheers,
Markus

I got a development release ready

Another way of fighting Sphinx + Drupal problems

Taras_'s picture

Hmm...

We have two approaches to sphinx integration: indexing via xmlpipe (http://drupal.org/project/sphinxsearch) or direct indexing from the database (http://drupal.org/project/sphinx). I'm no expert in Drupal of any kind (just a newbie, not even a PHP programmer), so sorry for any mistakes. All thoughts here are from an admin's viewpoint :)

Short comparison:
1) Indexing via xmlpipe through wget+apache+php
- It's a completely disgusting way of integrating. It's bad not because of Sphinx, but because of the LONG-RUNNING apache+php+drupal+many-drupal-modules stack. We have a fair chance of hitting any possible bug in drupal/php (or their modules) that we've never seen before (because HTTP requests are normally short).
- The project seems a little abandoned :(
- No Drupal 6 release.
+ Easy integration without the need to write complex SQL queries; the ability to index all content (not absolutely sure).
+ Documentation.

2) native Sphinx indexing via SQL
- No documentation, just a very short description.
- Not such easy integration: we need to write SQL queries and change the Sphinx configuration to index a new type of content (not absolutely sure);
+ We are able to build very accurate indexes. We can also build different indexes for different types of content. We are very flexible.
+ pspell
+ Native SQL indexing is very, very well tested on different projects, not only Drupal. It's something we can rely on.
+ We have a D6 release.

So, a real man should rely on (2), the hard but stable way. But we live in the real world, and sometimes the small payoff doesn't justify the hard integration. So here is my simple solution to the various problems with approach (1) (http://drupal.org/project/issues/sphinxsearch?categories=All):

1) DON'T run "wget http://.../...xmlpipe.php..." from the Sphinx indexer!
2) Create only two indexes in Sphinx: main + delta. There is no need to create dozens of indexes in Sphinx to bypass Drupal/PHP problems!
3) Run the "wget http://../xmlpipe.php" calls from a little shell/perl/ruby/php script in SMALL pieces (~10-50 nodes) and forget about memory leaks or other "never quite important enough" bugs in the software;
4) Aggregate pieces into one big XML;
5) Feed this big XML to Sphinx.
6) PROFIT!!!
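That aggregation step can be sketched in a few lines of shell. This assumes a hypothetical xmlpipe callback that accepts from/count parameters and emits bare <sphinx:document> fragments (no docset wrapper), so the pieces stitch into one valid document:

```sh
# Fetch the xmlpipe content in small pieces and stitch them into one docset.
STEP=50
TOTAL=10000   # adjust to your node count
echo '<?xml version="1.0" encoding="utf-8"?>' >  all.xml
echo '<sphinx:docset>'                        >> all.xml
for OFFSET in $(seq 0 $STEP $TOTAL); do
  # Each HTTP request is short-lived, so PHP-side leaks never get to accumulate.
  wget -q -O - "http://www.example.com/xmlpipe?from=$OFFSET&count=$STEP" >> all.xml
done
echo '</sphinx:docset>' >> all.xml
# Point xmlpipe_command at `cat all.xml` and run the indexer as usual.
```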

P.S. Sorry for bad English.
