As a site goes live and we do performance tweaking and benchmarking, one of the common outcomes is adding database indexes. I'm posting this to the CCK group because queries against CCK tables are a common source of slow queries. For example, a site with a block that queries a field table for information about that field on every page load can quickly make its way into the slow query log. Adding an index to the field is likely to be the best solution.
I'm not criticizing the CCK table structure, which I think is quite good, but rather asking for advice on how to manage indexes on database tables in general and on CCK tables specifically. Assuming the query is in custom code:
- Add an index via a hook_update_N in the module's .install file, with a comment explaining why we added it
- Add a comment near the query that says "If you change this query, check the index from hook_update_X to see if it still makes sense"
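A minimal sketch of step 1, in Drupal 5 update style. The module name, table, and field are examples only; the point is that the index and its reason travel together in version control:

```php
<?php
// mymodule.install -- "mymodule" and field_city are illustrative names.

/**
 * Add an index on content_field_city.field_city_value: the front-page
 * block filters on this column on every page load and was showing up
 * in the slow query log. If that query changes, revisit this index.
 */
function mymodule_update_1() {
  $ret = array();
  $ret[] = update_sql("ALTER TABLE {content_field_city} ADD INDEX field_city_value (field_city_value)");
  return $ret;
}
```

Because the update lives in the .install file, running update.php on test/production applies the same index that was developed locally, which covers the staging concern below.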
If the query is not in your code but in another module (e.g. Views), then step 1 probably stays the same, but I'm not sure where to document item 2.
I want the index to be easy to migrate from the development environment to test/production and also to be documented in a way that future admins can easily know where to look to see if it is still necessary.
Any feedback on this idea? What do other people do to manage indexes - aside from "just install them and forget them" ;)
Thanks!
PS yes, I know block cache is a decent solution to my original problem, but sometimes indexes are good too.

Comments
Needs a UI
Nice topic. A couple of thoughts.
create index checkbox
I love that idea. Devel has been invaluable for me, but I sometimes forget to create indexes ahead of time that I wouldn't have forgotten if this option had been available. Might save a little time.
Agreed that putting this in the UI
And as much as possible in the code would be fantastic.
On putting it in the UI could it integrate with devel's query logger to show that information (if devel is enabled), do you think?
~ ben melançon
Agaric Design Collective
http://AgaricDesign.com/
devel
well, devel would show any schema queries which create indexes. but on an ongoing basis, devel is not a schema browser. you will want to use the dba module for that.
I meant
for showing slow query or general query time information for queries involving CCK tables and fields.
benjamin, agaric
As moshe pointed out, the
As moshe pointed out, the difficult point (and probably the main reason why CCK doesn't already have this) is fields that are stored in multiple columns.
Sometimes, some of those columns are not relevant at all for indexing (e.g format for text fields).
Sometimes you (content.module or the field module author) cannot really tell in advance: it depends on the use site admins make of the data, mainly which filters / sorts they use in their views. I didn't really check, but I guess address fields or fullname fields fall into this. Plus: what kind of indexes should be created? One multicolumn index? Several single-column indexes?
So it's not a matter of a simple checkbox on the field's settings page :-) I agree that 'intelligent index creation by Views analysis' would be cool, but, er, just thinking about the code makes me nervous :-)
I'm not sure Views currently gives us the tools / API to inspect which filters / fields / sorts / etc. are used in which views (apart from parsing all views, I mean). I think we had / have the same kind of issue when views data needs to be refreshed because of a field change / deletion...
core fields
we could probably make very accurate guesses about indexes that are needed for core field types like text, number, user/node ref.
several of us have been emailing about defining and updating fields in code versus the GUI. perhaps we should expand the field definition to include indexes. the use case i think of is that someone could package up a 'classified ads' app that had all the right indexes for this application. the fields and content types for this app would be bundled in an install profile or module or some other "package." The point is that we need a way for fields to declare indices, based upon "expected" usage.
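One way to picture this: an exported field definition extended with an 'indexes' key. To be clear, this key is a hypothetical proposal, not an existing CCK API, and all names are illustrative:

```php
<?php
// Hypothetical: a field definition as a packaged "classified ads"
// profile might ship it, extended with a proposed 'indexes' key that
// declares indexes based on expected usage. CCK does not read this
// key today; the code that creates the field would have to act on it.
$field = array(
  'field_name' => 'field_ad_category',
  'type' => 'number_integer',
  'multiple' => 0,
  // Proposed extension: index name => array of columns.
  'indexes' => array(
    'field_ad_category_value' => array('field_ad_category_value'),
  ),
);
```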
maybe this "intelligent index creation by Views analysis'" should be a GHOP task :).
i'm emailing eaton/bjaspan a link to this thread since they have good ideas.
Indexes in code now, magic for later
As moshe says, as defining and updating CCK fields in code rather than through the UI improves, we could also add the ability to define indexes. It's better to have too many indexes than too few, right?
While this isn't the most efficient method, it would allow us to have indexes for CCK fields sooner rather than later. Since Views is still evolving, and its core will probably be included in Drupal 7 core, it might be better to wait until then to make the automagical indexes a reality.
too many too few
In moderation, yes. Indexes in a database are just like indexes in books. Imagine if for every chapter you had to add words not just to one index, but to 20 of them. The more indexes you add, the slower it gets for the database to add content.
knaddison blog | Morris Animal Foundation
Thanks for the invite,
Thanks for the invite, Moshe.
Giggles's (sorry, Greg, that's what I think when I read your username :-) question was about creating indexes in a staged way from dev to production and keeping track of which queries the indexes are for (why they were created) in case the queries change. Right now I use hook_update_N as you suggest.
In D6, Schema API has 'description' fields for tables and fields, and it could have them for indexes as well. You could have code that uses hook_schema_alter to add whatever extra indexes you want, along with a description saying what each is for. We could also create a module that lets admins create/drop indexes from the UI and store the description/reason why. Of course, that is non-stageable data in the database, not code.
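A sketch of the hook_schema_alter approach under Drupal 6, with illustrative module/table/column names. Note that altering the schema definition only documents the index; on existing sites it still has to be created once, e.g. in a hook_update_N:

```php
<?php
// mymodule.module -- declare the extra index so the live schema
// matches the in-code definition (and schema comparison tools agree).
function mymodule_schema_alter(&$schema) {
  if (isset($schema['content_field_city'])) {
    $schema['content_field_city']['indexes']['city_value'] = array('field_city_value');
  }
}

// mymodule.install -- actually create the index on existing sites.
function mymodule_update_6100() {
  $ret = array();
  db_add_index($ret, 'content_field_city', 'city_value', array('field_city_value'));
  return $ret;
}
```

A per-index 'description' key, as suggested above, would be an extension: Schema API documents descriptions for tables and fields, not indexes.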
On the topic of CCK creating indexes, first of all I definitely agree that fields ought to be able to specify indexes for themselves. For example, a nodereference field should definitely have an index on the field_name_nid column because everyone uses custom backreference queries.
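For instance, a typical backreference query looks like this (table and field names are illustrative; "field_related" stands for any nodereference field). Without an index on the nid column, every call scans the whole field table:

```php
<?php
// "Which articles reference this node?" -- joins the per-field table
// back to node on vid. The WHERE clause filters on field_related_nid,
// which is exactly the column that wants an index.
$result = db_query("SELECT n.nid, n.title
  FROM {node} n
  INNER JOIN {content_field_related} f ON f.vid = n.vid
  WHERE f.field_related_nid = %d", $node->nid);
```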
However, as others have said, just creating indexes on individual columns is often not the right approach. An index often needs to have the correct columns in the correct order to be useful for a query. I bet a lot of work exists on how to determine exactly which index is best for a given query, but building that logic into Drupal would certainly not be a short-term project.
Creating extra indexes willy-nilly on all CCK fields is not a good idea. Indexes take CPU and disk IO to keep up to date on all INSERTs and UPDATEs, so they are not free. If they are not being used, they are wasteful.
Now that I have blathered on uselessly for a bit, I'll make a useful suggestion: Moshe should invite David Strauss to this conversation. I suspect he knows more about this topic than the rest of us.
well put
Yes. I'm not sure about the idea of index checkboxes. Making indexes this easy to create will mean that people create too many of them. Some things should be hard to create precisely so they aren't overused, and I feel indexes fall into that group.
We should probably do some general analysis of the most common CCK fields to see if your point about the nodereference field really holds up in 100% of cases, or if there are situations where it doesn't make sense.
Also, I mailed David.
--
Knaddisons Denver Life | mmm Chipotle Log | The Big Spanish Tour
66% suffices
You don't have to be useful 100% of the time to be added, IMO. I think 66.6% of the time is sufficient justification for auto-creating an index. I agree with Barry that reference fields meet this criterion.
I guess the checkbox and index creation could be done by a Contrib module for now, until we get a sense for how useful it is. If that module had copious text to help admins decide when to use an index, that would be extremely valuable.
Good indexes
Indexes serve two purposes:
(1) Data integrity (reference and uniqueness)
(2) Performance
It is simply a matter of proper modeling data to achieve (1). Achieving (2) generally requires knowledge of how the system uses the data. Plus, CCK has a very poor data model for maximizing (2) because the columns are split into many tables. A database cannot create a cross-table index, so any query where the WHERE and ORDER BY clauses, combined, span tables may have performance problems that are impossible to solve without changing the data model.
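To make the cross-table limitation concrete, consider a query like this (field names are illustrative), where the filter comes from one per-field table and the sort from another:

```php
<?php
// The WHERE clause hits content_field_price while the ORDER BY hits
// content_field_date. No single composite index can cover both,
// because an index never spans tables -- the sort typically falls
// back to a filesort regardless of what indexes exist.
$result = db_query("SELECT n.nid
  FROM {node} n
  INNER JOIN {content_field_price} p ON p.vid = n.vid
  INNER JOIN {content_field_date} d ON d.vid = n.vid
  WHERE p.field_price_value < %d
  ORDER BY d.field_date_value", 100);
```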
don't sound so pessimistic
CCK only uses that many tables if you employ fields that are shared across multiple content types, or fields that use the 'multiple' feature. But vanilla fields are all grouped together in the same table and are joined to node. Perhaps we should focus on those fields in order to simplify the problem.
I've not been that dirty with CCK
I have barely unwrapped CCK. But would it not be faster to have CCK alter the node_revisions table rather than putting the data in another table that needs to be joined?
Which leads one to think
Which leads one to think that we might be better served if CCK altered the node_revisions table to add the data columns. That way we could add indexes to increase performance.
Adding a method for fields
Adding a method for fields to specify indexes would not be that hard -- some additional information provided in hook_field_info or hook_field_settings would work. We could do this to indicate any columns that should always be indexed (like the nodereference nid and userreference uid), but figuring out ahead of time what other indexes might be needed or useful no matter how or where the field is used starts to get pretty complicated.
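A sketch of that idea in D6 CCK terms. The 'database columns' operation of hook_field_settings is real; the per-column 'index' key is the proposed addition (it is essentially what the D5 patch linked later in this thread experiments with), so treat it as a proposal:

```php
<?php
// Sketch: a field module flags which of its columns should always be
// indexed. The 'index' => TRUE key is the proposed extension; the
// rest follows the existing hook_field_settings() pattern.
function nodereference_field_settings($op, $field) {
  switch ($op) {
    case 'database columns':
      return array(
        'nid' => array(
          'type' => 'int',
          'unsigned' => TRUE,
          'not null' => FALSE,
          // Proposed: always index the target nid, since backreference
          // queries filter on it.
          'index' => TRUE,
        ),
      );
  }
}
```

content.module would then add the flagged columns to the per-field table's index list when it creates or alters the table.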
There is an outstanding issue right now that addresses the fact that the content module only includes the first column of multiple-column fields in its Views tables, and it should provide a way to expose more than that for fields that have multiple columns. The proposal I like best is to create a way for field modules to identify exactly which other columns should also be exposed as Views fields (since not all columns are that significant). If we did that, a logical extension would be to add an index to the schema for any column added to the Views table. For most field modules, that would be their 'primary' column only.
Pointing out a long-standing CCK-Views performance issue
Many CCK fields and using views causing max_allowed_packet errors with MySQL query "UPDATE cache"
http://drupal.org/node/121390
Is there a possible relation? Should we be telling Views to use indexes, or not to cache, in certain situations?
benjamin, Agaric Design Collective
I have added a patch for D5
I have added a patch for D5 CCK that does a (simple) approach like this in http://drupal.org/node/241078, with the aim that we could get a more fine grained index control for fields in D6 CCK.
That's a nice patch. Anyone
That's a nice patch. Anyone available to review it?
Indexing Approach
As already noted above, a normal user is unlikely to be able to tell the difference; they may just select every field blindly, going by the general thinking that all indexing is good.
Two things I have seen about CCK tables from my limited use of Drupal so far. I have mainly used it to build forms using CCK Taxonomy fields or Content Taxonomy fields, i.e. mainly derived from taxonomy terms, and usually these taxonomy fields have poor cardinality (because of 2-4 option values etc.) and are not good index candidates anyway. Also, keep in mind that for these taxonomy fields the CCK table is only storing the term id, not the actual term value, so when the query hits the back end, the SQL is always doing additional joins with term_data/vocabulary; it is not a straightforward lookup to begin with. In any case, I would think that if you offer a checkbox to add indexes for a given CCK field at all, it should be a separate performance-enhancement module rather than something to check off during CCK form creation. Here are my reasons:
It would require collecting stats / running analyze (I'm not familiar with all DB types here) at the DB level as a prerequisite to derive the cardinality factor for the CCK table. (The abstraction layer, or some other way to interpret the results per DB type, will make this complex, I guess.) In other words, you need sufficient data volume to even see what makes sense, and you may need to re-evaluate later on, unless the user can predict ahead of time which columns to index based on the physics of the data. For a normal user there is a much higher risk of single-column index proliferation when they try to create indexes blindly in the beginning.
Theoretically, there could be a module that factors in index usage based on best practices and comes up with an algorithm that makes suggestions. However, what one perceives as a good index candidate may be easy to determine for the simplest queries; in complex queries with joins etc., the execution path may turn out quite differently, and the index, even if present, won't be used. Having said that, for a CCK table which already has a view defined with filters on fields (other than taxonomy types, maybe), here are my thoughts:
The module can show the user ALL views that use CCK field-type filters (filters only, NOT the views field selection; we don't care whether a field is selected in the view, because if there is no filter on it, it is not factored into the WHERE clause and gets no indexing benefit: it will be a full table scan whether you have an index or not). It shows these qualifying views on a form and asks the user to check which views they are interested in optimizing. Once the user selects the views and hits an analyze button, maybe, the following can take place:
(a) Ask the user to collect stats / analyze the CCK table that is part of the view (for simplicity of the example, let's leave out views fusion).
(b) Check the size or row count of the table; if it is below a certain threshold, suggest that no index is needed (the table is too small) and exit; otherwise continue.
(c) Build an array of fields (from the views filter list) in order of cardinality, stopping short of a minimum cardinality percentage threshold, below which it no longer makes sense (maybe around the 30% mark).
(d) Revise the cardinality array to remove fields whose operator is LIKE or CONTAINS etc. Put greater-than/less-than operator fields last; if there is more than one such field, keep only the one with the highest cardinality factor as the last key of the array and discard the rest. Also discard any other fields whose data type may not be desirable.
(e) Check whether there are any existing indexes on the CCK table, and if so, verify which of them have their first column within the array built in (c). If none, then no existing index will benefit the current view, and it breaks out of the loop to suggest a composite index with the columns in the order of the array from (c).
(f) If any existing indexes were selected in step (e), evaluate the first column of each to check where it stands in the ranking from step (c). If it matches the top position of the array in (c), that index is considered priority 1; if it matches the third position by cardinality, it gets priority 3, and so on. Repeat this for each subsequent column of the existing index, in the order of its composite key, assigning a similar priority number. At the end, concatenate these numbers to arrive at something like 132 for index A, 213 for index B, or 13 for index C (where A, B, C are the existing indexes). From these numbers, choose the group with the highest number of digits, and among those, choose the lowest value. It is basically saying: given a cardinality factor that meets an indexing standard, I prefer a composite index (including more columns) to really pinpoint my search, and within that category I prefer the one whose column order represents the highest selectivity of data as it traverses the b-tree. So index A would become the choice.
(g) Once out of that loop, it either recommends an index, with a statement to run (giving the user the option to run the CREATE INDEX statement, as opposed to doing it automatically), OR advises that the view in question already benefits from an index currently in place (the one arrived at in step (f)).
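The column-selection part of this, steps (b)-(d), can be sketched as a self-contained function. Cardinality here is a percentage (distinct values / rows); the 30% threshold and all field names are illustrative, and a real implementation would pull these numbers from the database's statistics:

```php
<?php
// Sketch of steps (c) and (d): given filter columns with a measured
// cardinality percentage and their operator, return a candidate
// composite-index column order.
function suggest_index_columns(array $filters, $min_cardinality = 30) {
  $equals = array();
  $range = array();  // the single best >/< column, which goes last
  foreach ($filters as $name => $info) {
    if ($info['cardinality'] < $min_cardinality) {
      continue;  // step (c): too few distinct values to be selective
    }
    if (in_array($info['op'], array('LIKE', 'CONTAINS'))) {
      continue;  // step (d): pattern matches are discarded
    }
    if (in_array($info['op'], array('>', '<', '>=', '<='))) {
      // step (d): keep only the highest-cardinality range column.
      if (empty($range) || $info['cardinality'] > $range['cardinality']) {
        $range = array('name' => $name, 'cardinality' => $info['cardinality']);
      }
      continue;
    }
    $equals[$name] = $info['cardinality'];
  }
  arsort($equals);  // step (c): descending cardinality order
  $columns = array_keys($equals);
  if (!empty($range)) {
    $columns[] = $range['name'];  // range column last
  }
  return $columns;
}

// Example: the body LIKE filter and the low-cardinality type field are
// dropped; the price range filter is kept and ordered last.
$columns = suggest_index_columns(array(
  'field_city_value'  => array('op' => '=',    'cardinality' => 80),
  'field_type_value'  => array('op' => '=',    'cardinality' => 10),
  'field_price_value' => array('op' => '<',    'cardinality' => 95),
  'field_body_value'  => array('op' => 'LIKE', 'cardinality' => 99),
));
// $columns is array('field_city_value', 'field_price_value')
```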
There is one more thing that would really help but I am tired and quite frankly not even sure if this post will be accepted as I think I may have crossed the word limit ...
If there were a way to generate an explain plan by running the view, to see whether the index is really being used, that would be even better.
Ping
I created some code that creates an index for every relationship & filter of all views
http://groups.drupal.org/node/57213#comment-167908
Feedback would be nice.