Possible improvement with caches

markus_petrux's picture

Hi,

I would like to share some thoughts about caching in Drupal, and then see what you people think about it. Not sure if this is new though.

Sometimes, cached objects are created on demand (while a page request occurs) and use expiration times. For example, the cache_filter table usage in check_markup(). In these cases, the logic is more or less like this:

if ($cached = cache_get($cache_id, 'cache_filter')) {
  return $cached->data;
}
// object is not cached, so here we do a lot of stuff to
// build the thing, then cache it with an expiration time
cache_set($cache_id, $text, 'cache_filter', time() + (60 * 60 * 24));

Well, the problem here is that if several page requests come in at the same time, there will be several processes doing the same job and caching the same object concurrently. For high-traffic sites this might be a problem.

There's a small change that can minimize this effect. Using a variation of the example above, the code would look like this:

// Here we make sure the cached object exists AND is NOT expired.
if (($cached = cache_get($cache_id, 'cache_filter')) && $cached->expire > time()) {
  return $cached->data;
}
// If we got an expired object, push its expiration forward a few seconds so
// only a minimal set of concurrent requests do the same job that we're about
// to do to rebuild the data.
if (!empty($cached->data)) {
  cache_set($cache_id, $cached->data, 'cache_filter', time() + 30);
}

// Ok, so here we do a lot of stuff to
// build the thing, then cache it with an expiration time.
cache_set($cache_id, $text, 'cache_filter', time() + (60 * 60 * 24));

I believe the comments show what I mean. If we have a cached copy that has expired, we store the object again with a few more seconds of lifetime, so only a minimal number of concurrent requests end up doing the same job at the same time.

If we're using InnoDB, the cached record is in the buffer pool for sure, so the overhead of updating the expiration time is minimal compared to the cost of doing the same job X times concurrently. Here "the job" is output filtering, where the node could be complex or long. The same logic could be applied to cached pages.

I'm using this approach on a site that currently has around 20,000 page views a day. Not much, but enough to have noticed the benefits of this method. I'm using a particular way to cache pages: it works for any combination of user roles, and cached pages are cleared when related content is updated, which is something particular to the site implementation. Also, to save the space taken by unused cached objects, an expiration time of 5 minutes is used, so cron can keep the cache_page table at a reasonable size. Since this uses expiration times, this is where the method explained above does a great job. I'm using the same method to cache blocks and certain queries. The site can be found here if you want to take a look.

Well, not sure if this can be of any use, but I thought it would be interesting to share. Maybe the method outlined above could be applied in some places of Drupal core that use the cache with expiration times. I haven't found any report about it in the Drupal issue queue.

The second thing I would like to mention is that maybe cache_set() could be improved by using REPLACE INTO rather than the UPDATE/no-affected-rows/INSERT approach. Well, only for MySQL-enabled sites, since this is a MySQL extension, but when performance is a concern, I think something like this is worth it.

This might be a noticeable benefit, especially when caching pages or big chunks of data. With REPLACE, there's just one statement transmitted over the network to where the MySQL server lives, so the server can deal with these statements faster.
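To make the idea concrete, here's a minimal sketch of what a MySQL-only cache write using REPLACE INTO could look like (the function name is hypothetical and not from this post; the columns and placeholders follow the standard Drupal 5 cache table schema, with %b being Drupal's binary placeholder):

/**
 * Hypothetical sketch: write a cache entry with a single REPLACE INTO
 * statement instead of the UPDATE / check-affected-rows / INSERT pattern.
 * MySQL only.
 */
function mymodule_cache_set_replace($cid, $table, $data, $expire = CACHE_PERMANENT, $headers = NULL) {
  db_query("REPLACE INTO {". $table ."} (cid, data, created, expire, headers) VALUES ('%s', %b, %d, %d, '%s')",
    $cid, $data, time(), $expire, $headers);
}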

Here's a post in the MySQL Performance Blog about REPLACE INTO:

http://www.mysqlperformanceblog.com/2007/01/18/insert-on-duplicate-key-u...

Comments

markus_petrux's picture

Just wanted to mention another technique I'm using to improve the performance of the site. :-)

Ok, so I'm caching pages for non-anonymous users using a particular version of page_set_cache()/page_get_cache() that is invoked from hook_init/hook_exit in a module whose weight is set so that it loads last, so any other modules can do what they need for non-anonymous visits.

Keys for cached pages are derived from $_GET, the node id, and the user role. When a node is changed in any way, the related cached pages are cleared. When a cached page is retrieved, a couple of regular expressions are applied to change the information in the page that is particular to each user, like the logged-in nick, etc.

I analyzed the performance improvements by comparing the time to generate a page versus the time to serve a cached page, also comparing the number of queries and the memory used.

For instance, for this page, http://blogs.gamefilia.com/blogs , generating the whole page for a registered user takes:

  • Total execution time: 0.674910 seconds
  • Total SQL time: 0.395989 seconds
  • Total SQL queries: 312
  • Total memory used: 1,616.53 KB

Serving the same page once it was cached:

  • Total execution time: 0.045080 seconds
  • Total SQL time: 0.011370 seconds
  • Total SQL queries: 20
  • Total memory used: 236.76 KB

This is many times faster!

A note on the queries executed for cached pages: these include access control checks, session handling, visit counters and cache access. 11 milliseconds were enough in this example to execute them all.

All methods I've seen to cache whole pages only deal with anonymous users, so here's something that could be of interest to others.

As a side effect, there is also less network traffic on the MySQL side.


To sum up the method I'm using:

  • cache page processing takes place within hook_init/hook_exit for a module that loads/executes last.
  • cache keys are built so that differences in content between user roles are minimal.
  • some kind of preprocessing is needed before sending cached pages, to deal with per-user differences in the content.
  • it's a plus if you can clear cached pages when content changes.
  • pages are cached with an expiration time (5, 10 or 15 minutes is good, depending on the number of visits, etc.), so the cache table doesn't grow forever.

And that's it.

Cheers

Any chance of getting this

catch's picture

Any chance of getting this into CVS or the issue queue? Seems like enough of an improvement that it'd be worth trying to work up into a contrib module or core patch.

Not sure how

markus_petrux's picture

I'll show some code snippets here so you can get an idea. Note that this is based on D5, which is the version I'm using for the above-mentioned site.

You need a module with a weight high enough that it loads last. In this module we use the following:

function mymodule_init() {
  /**
   * Here we'll do something similar to
   * drupal page_get_cache() +
   * drupal_page_cache_header()
   */
  _mymodule_page_cache_init();
}
function mymodule_exit() {
  /**
   * Here we'll do something similar to
   * page_set_cache()
   * This module is executed last, so we're about
   * to finish normal Drupal page processing, which
   * ends at drupal_page_footer().
   */
  _mymodule_page_cache_exit();
}
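The "loads last" part means raising the module's weight in the {system} table. A sketch of how that could be done, not from the original post (hypothetical hook_install; the weight value of 100 is arbitrary, it only needs to be higher than any other enabled module's weight):

/**
 * Implementation of hook_install() (sketch).
 * Bump the module weight so its hook_init()/hook_exit() run after
 * every other module's.
 */
function mymodule_install() {
  db_query("UPDATE {system} SET weight = %d WHERE name = '%s' AND type = 'module'", 100, 'mymodule');
}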

The functions used in the above hooks use a helper function that looks like this:

/**
 * This helper function decides if the
 * current page can be cached or not.
 */
function _mymodule_get_cache_options() {
  global $user;

  // Default cache options.
  $cache_options = array(
    'key' => FALSE,
    'lifetime' => 300,
  );

  // We only deal with GET requests that have no particular
  // message that only belongs to the current page flow.
  if ($_SERVER['REQUEST_METHOD'] != 'GET' || count(drupal_set_message()) > 0) {
    return $cache_options;
  }

  // Decide if the page can be cached, depending on
  // $_GET['q'] or whatever else. Example:

  if (preg_match('#^node/([0-9]+)$#', $_GET['q'], $matches)) {
    if ((int)$matches[1] > 0) {
      /**
       * Note: when a node is changed, cache_clear_all() needs to be invoked
       * with the cache key prefix specified here.
       */
      $cache_options['key'] = 'my_cache_key_prefix_node:'. (int)$matches[1];
    }
  }
  else if (preg_match('#^(tagadelic|taxonomy)/.*$#', $_GET['q'])) {
    /**
     * For taxonomy related pages we'll expire pages based on the lifetime
     * specified here. In this case we don't care when content changes.
     */
    $cache_options['key'] = 'my_cache_key_prefix_taxonomy';
    $cache_options['lifetime'] = 900;
  }

  // If the page is to be cached, then complete the uniqueness
  // of the cache key.
  if ($cache_options['key']) {
    // Take user roles into account.
    $cache_options['key'] .= ':'. implode('.', array_keys($user->roles));
    // Use $_GET to make sure we have a unique key for this page.
    $cache_options['key'] .= ':'. md5(serialize($_GET));

    // Increase the cache lifetime for anonymous users.
    if ($user->uid == 0) {
      $cache_options['lifetime'] *= 3;
    }
  }
  return $cache_options;
}
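The comment above says cache_clear_all() needs to be invoked with the node key prefix whenever a node changes. A minimal sketch of that part, which was not in the original post, using Drupal 5's hook_nodeapi() and the wildcard flag of cache_clear_all():

/**
 * Implementation of hook_nodeapi() (sketch).
 * Wipe every cached page variant for this node (all role/GET combinations
 * share the same key prefix) when the node is saved or removed.
 */
function mymodule_nodeapi(&$node, $op) {
  if ($op == 'insert' || $op == 'update' || $op == 'delete') {
    cache_clear_all('my_cache_key_prefix_node:'. $node->nid, 'cache_page', TRUE);
  }
}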

Here's an example of what we can do when page processing exits.

function _mymodule_page_cache_exit() {
  global $user;

  $cache_options = _mymodule_get_cache_options();
  if (!$cache_options['key']) {
    return;
  }
  if (!($data = ob_get_contents())) {
    return;
  }
  ob_end_flush();

  if ($data) {
    // Filter page for current user dependencies.
    if ($user->uid != 0) {
      // perform regular expressions to deal with parts of the page
      // that depend on the current user, such as uid, name, etc.
    }
    // Compress page for DB storage if GZIP is available.
    if (function_exists('gzencode')) {
      $data = gzencode($data, 9, FORCE_GZIP);
    }
    cache_set($cache_options['key'], 'cache_page', $data, time() + $cache_options['lifetime'], drupal_get_headers());
  }
}
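The per-user filtering is only hinted at above. As a purely hypothetical illustration (the placeholder and function names are made up, and the original post uses regular expressions rather than str_replace()), one way is to store the page with a placeholder and substitute the real value when serving it:

// Before caching: replace the current visitor's name with a placeholder.
function _mymodule_tokenize_user($data, $account) {
  return str_replace(check_plain($account->name), '<!--mymodule-username-->', $data);
}
// Before output: swap the placeholder for the requesting user's name.
function _mymodule_personalize_user($data, $account) {
  return str_replace('<!--mymodule-username-->', check_plain($account->name), $data);
}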

Here's what we can do in our hook_init implementation:

/**
* Again, my implementation is a bit more complex, but this
* snippet helps to get an idea, hopefully.
*/
function _mymodule_page_cache_init() {
  global $user;

  // See if page cacheable.
  $cache_options = _mymodule_get_cache_options();
  if (!$cache_options['key']) {
    // Page is not cacheable.
    return;
  }

  // Get cached page if available.
  $cache = cache_get($cache_options['key'], 'cache_page');
  if (!$cache || empty($cache->data)) {
    // Page is cacheable, but not cached yet.
    ob_start();
    return;
  }

  // Page is cacheable, and already cached.
  $current_time = time();
  if ($cache->expire <= $current_time) {
    // However, cached object is stale (already expired) so we give it a stale time
    // for concurrent requests to send what's already cached while current request
    // will rebuild the page and cache it at exit.
    cache_set($cache_options['key'], 'cache_page', $cache->data, $current_time + 30);
    ob_start();
    return;
  }

  // Ok, here we deal with If-Modified-Since, ETags, etc.,
  // which we use for anonymous users.

  // For registered users we send headers so the page is not cached
  // by proxies or the user's browser.

  // Here we ungzip the cached page.

  // Alter page for user dependencies.
  if ($user->uid != 0) {
    // Regular expressions or whatever against $cache->data.
  }

  // Send page, close session and exit.
  print $cache->data;
  session_write_close();
  exit();
}
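For the header handling mentioned in the comments above, here's a hedged sketch of the If-Modified-Since part; this is an illustration of the idea, not the code actually running on the site, and it would slot in where the placeholder comments sit in _mymodule_page_cache_init():

  // Conditional GET for anonymous users: if the client already has this
  // copy, answer 304 and stop.
  $last_modified = gmdate('D, d M Y H:i:s', $cache->created) .' GMT';
  if ($user->uid == 0) {
    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] == $last_modified) {
      header('HTTP/1.1 304 Not Modified');
      session_write_close();
      exit();
    }
    header('Last-Modified: '. $last_modified);
  }
  else {
    // Registered users: make sure proxies and the browser don't keep a copy.
    header('Cache-Control: no-cache, must-revalidate');
  }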

I believe implementation of this method is highly dependent on how the site is to be used, content, user roles, etc.

Maybe this documentation is enough to help others implement this idea.

If I had to patch core, I would remove the need to do this job from hook_init/exit, and put some additional hooks here and there so an external module can take advantage of this particular method. However, as you can see it can be done without patching core.

Another possible approach to "support" this cache method would be to open some kind of "hook" in core, similar to what's done for fastpath related stuff, maybe... any other idea? Maybe I could adapt my code in a different way so it's easier to port the method.

I think this is an

Owen Barton's picture

I think this is an interesting technique that should work really well for sites with somewhat limited and well defined (i.e. preg_replaceable...) changes for auth users - and should be ideal for some very high traffic sites (e.g. a popular news site with lots of users and user profiles, blogging and commenting interactions only for auth users) - a contrib module would be a great contribution if you can figure out a way to generalize it!

For sites with more complex interaction with authenticated users (for example flagging new posts, organic groups or anything AJAX) I think this would get very hard and (if attempted) the rules would rapidly start approaching the same kind of work as Drupal does to build the pages anyway. For these sites I think a more fine grained approach (ideally one that makes smart enough decisions by default) to caching page elements themselves is likely to be more appropriate. I wrote up a basic proposal and patch at http://drupal.org/node/152901 - reviews are welcome :)

My caching improvements

eli's picture

markus-

I've spent a fair bit of time playing with caching in D5 myself.

Here are two of my patches:

Also, you might want to run show global status like '%tmp%'; in MySQL and make sure that the Created_tmp_disk_tables counter stays in check. I found problems with the Taxonomy module generating temp tables on disk (http://drupal.org/node/171685).

New db layer. Take a look

moshe weitzman's picture

New db layer. Take a look at the new DB layer in HEAD. cache_set() has been converted into a "merge query" which, in MySQL, does an INSERT ... ON DUPLICATE KEY UPDATE, which is more robust than our previous pattern.
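For reference, a rough sketch of how a cache write looks as a merge query with the new DB layer (hedged: this follows the merge API as it landed in HEAD, and details may differ or still change):

// Merge query: insert the row, or update it if the cid already exists.
db_merge('cache_page')
  ->key(array('cid' => $cid))
  ->fields(array(
    'data' => $data,
    'created' => time(),
    'expire' => time() + 300,
  ))
  ->execute();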

INSERT ODKU -vs- REPLACE INTO?

markus_petrux's picture

db_merge is a great thing, so this pattern can be used in many other places. The new DB layer is an impressive piece of work. :)

Though, I opted for REPLACE INTO because the statement itself is shorter than INSERT ODKU, and the 'data' field of cache tables may be big. Note that field values need to be specified in the INSERT itself, but also duplicated for the ODKU clause.

I assumed it would be cheaper in terms of PHP memory usage, network traffic between Apache/PHP and MySQL (think about max_allowed_packet), and probably less memory/CPU usage on the MySQL server.

Regarding cache management, do you know if they did any performance tests comparing REPLACE INTO vs. INSERT ODKU? I read the issue pointed to by the CVS commit where these changes were applied to HEAD, and if there was something about this, I missed it. :-/

Oops! I should have RTFM better before

markus_petrux's picture

Now that I have read the MySQL manual again, it looks like it is possible to use the values specified in the INSERT for the UPDATE part, i.e. something like this is possible:

INSERT INTO table (pkey,a,b) VALUES (1,2,3) ON DUPLICATE KEY UPDATE a=VALUES(a), b=VALUES(b);

So, I could very well have opted for INSERT ODKU. :-o

Our ODKU statements are

moshe weitzman's picture

Our ODKU statements are currently very verbose - perhaps you could submit a patch to make them briefer.

I just had the time to look at CVS HEAD

markus_petrux's picture

I just had the time to look at CVS HEAD, and it looks to me that field = VALUES(field) is not used for the ODKU clause.

Filed an issue here about it:
http://drupal.org/node/301501