CiviCRM and Voter Files

Events happening in the community are now at Drupal community events on www.drupal.org.
Anonymous's picture

Recently went through an import of 2.4 million voter file records into CiviCRM. Went through a lot of testing, sharing some preliminary information here on things we discovered. The 'we' used in this document refers to me and other resources within my company, Trellon. Working with recent releases of CiviCRM 1.4.

First, default CiviCRM distros are not suitable for working with massive numbers of records. After importing the 2.4 million records, and before adding any custom fields, the main and advanced search interfaces became unusable. Trying to save information into a contact record became nearly impossible. The reasons for this were myriad and included database design, data access queries, as well as the hardware we were working with initially.

So, addressing these salient needs, we took the following steps -

1) Optimize the data model - We removed all foreign key restraints and changed all unique indexes to regular indexes to begin with. This lead to a speed increase using the default InnoDB tables and cleared the way to change the table types to memory resident or ndb clustered. More on the eventual deployment follows...

We also added numerous indexes throughout the database. For individuals, we indexed first name, last name, address information, and a variety of other fields specific to data access objects within the system.

2) Customize the data access interface - The default queries for accessing individual contact records in CiviCRM work well enough, but the default search interfaces blow up under scale. In some cases, we ended up with queries that would kill the server without load. To address this, we wrote custom search interfaces specific to the needs of the client, using name, address, action fields, contributions, and other fields. We tied these search interfaces into specific indexes we created in the database.

Writing custom search interfaces required us to learn smarty, which has advantages and disadvantages like all other development frameworks. DLobo was a huge help in getting us up to speed.

3) Throw Hardware at it - We put the database itself on a dual xeon 3.0 ghz machine with 4 GB of RAM. This lead to measurable performance increases over a single processor box with 512 MB of RAM. In terms of inserting records, we used a custom script hitting the CiviCRM API. On the smaller box, it took about 30 seconds per record to import new contact records after about 50k records were imported. On the bigger box, that number went down to about 3 seconds. For the majority of the custom searches we had written, there was about an 80% performance increase. This was with minimal changes to my.cnf.

One thing the hardware upgrades made possble was the consideration of moving the CiviCRM installation into memory resident tables by changing the table types. We did not take this step for the production deployment of the database, given the fact we had a known stable installation. I did put several tables into memory on a development server, however, and found it was not as productive as I would have hoped. While there were some performance gains, temp tables were still being written to disk on the larger queries. Perhaps there is a way to force mysql to do all temp tables in memory, but I did not find it. Another drawback was simply the size of the database when it is in memory - it took about 2.8 GB on the server to run all CiviCRM database tables in memory. Under load (like with 1000s of people hitting it) performance would degrade as mysql shifts to using disk and swap space when available memory runs out.

More specific details to follow as I process all the upsizing insanity. This model, operating on a large cluster, could result in a highly scalable system.

M

Comments

Large db's and CiviCRM ...

lobo's picture

First off all a huge huge thanx to michael, morbus and the folks at trellon for taking the lead and using CiviCRM :)

The CiviCRM team learnt quite a few lessons from this project and we have incorporated most of them with our upcoming 1.5 release. Here are a few thoughts and comments

  • using like with wildcards is awful and does not use indexes (i..e LIKE '%lobo%'). The new search interface will probably transition all this power to the user for most fields
  • mysql does not use multiple indexes when optimizing a query. So we needed to index sort_name and domain_id together in civicrm_contact to be effective
  • sometime specialization is a good thing :). The current CiviCRM 'name' search actually does a 'sort_name' and 'email' search simultaneously under the covers. With a large data set, the left join needed to do the above is way too compute intensive to be useful. Better to seperate the sort_name search from the email search then combine them
  • ORDER BY / DISTINCT are operators that will slow things down a lot especially on complex queries. if you need to use them simplify the query as much as u can
  • sometime multiple queries are better than 1 query. The general civicrm query model is:

    a. get the count of the records matching the parameters
    b. get the limit/offset
    c. get the data using a complex sql query using the LIMIT/ORDER operators

    for large db's the following is a better option (i think, the question is whats the threshold)

    a. get the count of the records matching the parameters
    b. get the id's of the records matching the parameters using the ORDER / LIMIT clause (note that this query is far simpler than query c in general)
    c. foreach of the id's in b, get the data needed (u could do this in 1 query or multiple queries)

  • This is the First

    techsoldaten's picture

    This is the first of a bunch of suggested improvements on the CiviCRM data model we are going to be putting out in the next few weeks. Basically, we have some users on the system already, but we are going to scale the system to address the needs of a statewide campaign in Oregon shortly.

    In terms of access, query optimization and data model engineering are going to do huge things for CiviCRM in terms of scalability. Also, addressing issues in terms of database optimization can go a huge way towards making the system ready for prime time with 1000s of users.

    We are about to perform some performance testing on CiviCRM with all the tables stored in pure memory. Well, maybe not all the tables, but the contact and address record tables certainly. While I have my doubts about how effective this will be, I will be posting the results of my testing along with a mysql upsizing tool for CiviCRM in this panel. It will include a basic script that changes the index and foreign key relationships to make them accessible to memory resident and NDB table types. There will be an advanced tool that also performs some of the database optimizations we found most useful.

    M

    forking?

    joe.murray's picture

    I have some political party clients that I want to convince to move to CiviCRM for voter tracking at some point from an old client server app. I notice Lobo talks of incorporating a number of performance improvements into 1.5, but techsoldaten is talking of making a script available for db optimization for large scale deployments. I'm hoping it will be possible to try to keep from forking CiviCRM. I imagine, for example, that what Trellon is doing in one state will be highly applicable in other jurisdictions for things like searches, etc. Can the high volume, lower functionality interface be configurable on install, or through the admin panel??? or something?

    Joe Murray

    Joe Murray

    Database Upgrades are not Forks

    techsoldaten's picture

    Database upgrades to CiviCRM, imho, are not forks, they are optimizations. Fundamentally, they affect how MySQL operates as they are changes to database table definitions and primary key structures (which, for those of you who are following, only serve to maintain database integrity that is already enforced at the application level). Thus, I would really only be worried about this from an administrative standpoint instead of a code maintainence standpoint.

    M

    CiviCRM

    Group organizers

    Group notifications

    This group offers an RSS feed. Or subscribe to these personalized, sitewide feeds: