How can I get better free tagging performance?

bennos's picture

Hi

I am developing a larger site (D5) with a free tagging vocabulary. The vocabulary currently has 16,000 entries. I have searched Google and drupal.org, but I have not found any performance improvements such as caching or patches.

What do you do in this situation?

I doubt I am the first one with a large vocabulary. Some links or code snippets would be great.

thx
bennos

Comments

core issue

catch's picture

If you're using MySQL, you can patch core to remove LOWER() completely with no side effects; the query can then use indexes and be cached. There's a core issue to remove it in D7 here: http://drupal.org/node/277209 which needs revising now that the new database layer is in.
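
To illustrate why wrapping the column in LOWER() matters, here is a minimal sketch. It uses Python's built-in sqlite3 rather than MySQL, a simplified, hypothetical stand-in for the term_data table, and an equality lookup chosen for illustration rather than Drupal's actual query. The NOCASE collation is an assumption that mimics a case-insensitive MySQL collation. The general principle it shows: a function applied to the column stops the database from searching the plain index on that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified stand-in for Drupal's term_data table.
# COLLATE NOCASE mimics a case-insensitive MySQL collation (an assumption).
conn.execute(
    "CREATE TABLE term_data ("
    "tid INTEGER PRIMARY KEY, vid INTEGER, name TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX term_name ON term_data (name)")
conn.executemany(
    "INSERT INTO term_data (vid, name) VALUES (1, ?)",
    [(f"Job title {i}",) for i in range(16000)],
)


def query_plan(sql, params):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


SQL_WITH_LOWER = "SELECT tid FROM term_data WHERE LOWER(name) = LOWER(?)"
SQL_WITHOUT_LOWER = "SELECT tid FROM term_data WHERE name = ?"

plan_with_lower = query_plan(SQL_WITH_LOWER, ("job title 42",))
plan_without_lower = query_plan(SQL_WITHOUT_LOWER, ("job title 42",))

print("with LOWER():   ", plan_with_lower)     # a full scan
print("without LOWER():", plan_without_lower)  # an index search
```

Both statements return the same row here because the collation already ignores case. Check the collation of your own tables before relying on that.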

WOW! 16k vocabulary?!

PlayfulWolf's picture

WOW! 16k vocabulary?! Definitely watching this thread. I am currently developing a site with dictionaries of up to 200-500 words and thinking that is too much, but 16k...

drupal+me: jeweler portfolio

250k

catch's picture

At the EOL taxonomy sprint we were working with scientists who have sites with over 250,000 terms in structured vocabularies - although these need autocomplete for administration so have the same issues as described here (and many more).

Advcache

cirotix's picture

Have a look at the advcache module, which provides several performance patches. Among them is a taxonomy patch that caches a few parts of the taxonomy module. I don't know whether it will solve your issues, as you haven't told us what they are. The biggest taxonomy I have ever used is ~4000 terms, with no real performance issues.


Damien Cirotteau
http://www.rue89.com

Sphinxsearch for D5 :)

markus_petrux's picture

Hi,

You may also consider using http://drupal.org/project/sphinxsearch

It helped us quickly resolve complex freetagging queries. Example for the tags música y cine (music and cinema in Spanish).

Cheers

Why do you have 16k terms ?

jcisio's picture

Why do you have 16k terms? It may be a better idea to use short terms to avoid overlap and duplication. The Taxonomy module allows different terms to be combined. Use only two terms, "cinema" and "music", instead of a dozen: cinema, music, cinema and music, music and cinema, cinemas and music...

My current site will have about 8k nodes and 10k terms after the conversion from Joomla! and I don't see any problem. But, sure, I'll have to prune and clean them up! With about 10k terms or more, a full-text search could be better.

--
[vi] www.thongtincongnghe.com
Online news site about IT, telecommunications, electronics...

@jcisio

bennos's picture

@jcisio
The vocabulary belongs to a new job exchange; all entries are typical job titles.
No term is duplicated. I manage the vocabulary in read-only mode: new terms must be approved.

@cirotix
I found advcache, but the last I heard was that the taxonomy cache patch was broken. I will take a look at the new version (1.9).

@PlayfulWolf
Taxonomies with 500 entries are no problem. I have developed several sites with vocabularies of that size.

Something about my goal:
I want to develop a job exchange with a guided search rather than a conventional search.
Typically you search the full text, but this often does not work well. A lot of data is structured and can also be searched in a structured way.
When you use a search engine, the quality of the results page (SERP) lives and dies with the words you typed into the search box.
For a job exchange, it is bad to lose users because they failed to find a job five times with their chosen keywords, or because the jobs found do not really match.

With three words you can find anything on the web, but you must choose the right words. In my case the three types are:
1. job type
2. location (HS vocabulary)
3. job title (free tagging, read only)

I use Views for everything.

Currently I have 15,000 page impressions (PIs) a day. The server is a mid-range machine with an Athlon CPU, 8 GB RAM, Lighttpd, MySQL 5, Memcache (2 GB), and PHP 5.2.6 with APC. The load is very low. I make backups every hour, and only during backups does the load rise a little above 1.5.

bennos, I'm not sure if I

jcisio's picture

bennos,

I'm not sure if I get what you mean. An example, perhaps?

15k pageviews/day is nothing compared with what your server can handle. It can serve at least 10x more without any problem.


The performance of the

bennos's picture

The performance of the website is OK, and the setup is sized for a higher load and additional websites.

My only problem is the performance of free tagging searches. The usability of the free tagging search form is poor if the search for related terms is not fast enough; the first results should appear sooner. On Facebook, free tagging across all schools is really fast.

I will try the taxonomy cache patch in advcache. Then the tags should come faster from memcache, I hope.

Does anybody know how I can benchmark this?
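
One way to approach the benchmarking question is to time the lookup itself, outside the browser, with and without a cache in front of it. The following is only a rough sketch under stated assumptions: it uses Python with an in-memory SQLite table of 16,000 synthetic terms, the function names are hypothetical, and a plain dict stands in for memcache. It does not measure Drupal, PHP, or network overhead, so treat the numbers as relative, not absolute.

```python
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
# Synthetic stand-in for a 16,000-term free tagging vocabulary.
conn.execute(
    "CREATE TABLE term_data (tid INTEGER PRIMARY KEY, name TEXT COLLATE NOCASE)"
)
conn.executemany(
    "INSERT INTO term_data (name) VALUES (?)",
    [(f"job title {i}",) for i in range(16000)],
)


def autocomplete(fragment, limit=10):
    """Return up to `limit` term names containing `fragment`."""
    rows = conn.execute(
        "SELECT name FROM term_data WHERE name LIKE ? LIMIT ?",
        (f"%{fragment}%", limit),
    )
    return [row[0] for row in rows]


_cache = {}  # in-process dict used here as a stand-in for memcache


def cached_autocomplete(fragment, limit=10):
    """Same lookup, but repeated fragments are served from the cache."""
    key = (fragment.lower(), limit)
    if key not in _cache:
        _cache[key] = autocomplete(fragment, limit)
    return _cache[key]


def median_ms(lookup, fragments, repeats=20):
    """Median latency in milliseconds of `lookup` over all fragments."""
    samples = []
    for _ in range(repeats):
        for fragment in fragments:
            start = time.perf_counter()
            lookup(fragment)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


fragments = ["title 15", "999", "job", "xyz"]
print("uncached median ms:", median_ms(autocomplete, fragments))
print("cached median ms:  ", median_ms(cached_autocomplete, fragments))
```

The median is used instead of the mean so that a few slow outliers do not dominate the result. For the real site, the same idea applies to timing the autocomplete URL before and after applying a patch.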

rskhanna's picture

I am not a developer but a very hands-on subject matter expert.

Having been badly burnt by poor performance in the past, I am trying to pre-empt performance issues. Please suggest what is the ideal way to build the site: through taxonomy, content types (CCK) or what?

Expected load:
50-500 users logged in. Approximately 10 times as many anonymous. This may increase further.

Taxonomy:
1500 terms with ten or so cross references:

For example:
Vocabulary 1: all animals
Vocabulary 2: all countries
Vocabulary 3: mammals (so when we query mammals and all animals in the taxonomy, we get a list of mammals)
Vocabulary 4: Profession of user
Vocabulary 5: Type of User

when we query vocabulary 1, 2 and 3, we get all mammals in a particular country, etc.
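
The kind of combined query described above can be pictured as a set intersection. This is only a sketch in Python, not how Drupal's taxonomy module stores or queries terms; the function names, the "class" vocabulary, and the sample data are all hypothetical.

```python
from collections import defaultdict

# Inverted index: (vocabulary, term) -> set of node ids tagged with that term.
term_index = defaultdict(set)


def tag_node(node_id, terms):
    """Record that a node carries each (vocabulary, term) pair."""
    for vocabulary, term in terms:
        term_index[(vocabulary, term)].add(node_id)


def nodes_with_all(terms):
    """Return the ids of nodes tagged with every given (vocabulary, term)."""
    matches = [term_index.get(key, set()) for key in terms]
    return set.intersection(*matches) if matches else set()


# Hypothetical sample content.
tag_node(1, [("animals", "lion"), ("countries", "Kenya"), ("class", "mammal")])
tag_node(2, [("animals", "crocodile"), ("countries", "Kenya"), ("class", "reptile")])
tag_node(3, [("animals", "tiger"), ("countries", "India"), ("class", "mammal")])

# All mammals in a particular country.
print(sorted(nodes_with_all([("countries", "Kenya"), ("class", "mammal")])))
```

Each extra vocabulary in the query adds one more set to intersect, so the cost depends mostly on how many nodes carry each term rather than on the total number of terms.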

We will have almost no graphics. We are a text-based information site.

How would we like to present the information?
1. Based upon:
   1. The level of the user's knowledge - beginner, intermediate, advanced
   2. The profession of the user - physician, college professor, scientist, student, hobbyist
   3. The type of audience - member of the press, employer, HR professional
2. Reverse pyramid format - start with the least information needed and then expand based upon the user's needs. Do not overwhelm the user with info. Look at the principle of google.com: the centerpiece is a big search box, and info then expands from there. We should have something like that - a simple interface that expands and gets more complicated as the user's needs require.
3. Topical - like "All about Lions"

What would be best for performance: creating vocabularies, building content types through CCK, or something else? What do you all recommend?