nginx caching with selective page purging

brianmercer's picture

Last week a new module appeared on Drupal.org named Purge.
Purge is designed to selectively purge affected pages from a caching reverse proxy after changes are made to the site. The Purge module can be used to add this functionality to the nginx native cache.

The Purge module relies upon the Expire module. The Expire module appeared some time ago when Boost dev mikeytown2 ripped his cache expiration code from the Boost module and created the Expire module as an api for other cache modules to use. The Varnish integration module now uses Expire to perform selective purges from the Varnish cache.

When content is changed on a site, like a node or a user page, the Expire module checks to see if that content is present in any forum lists, taxonomy lists, the internal alias list, views lists and cck references and then creates an array of URLs that are affected by the change. For example, a change to http://mysite.com/node/1231 might then cause changes in the front page, http://mysite.com/, the forum page if the node was a forum node, http://mysite.com/forum, the specific forum, http://mysite.com/forum/12, and any aliases such as http://mysite.com/title-of-node or http://mysite.com/forum/general-discussion.
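To make the idea concrete, here is a rough sketch of how such a list of affected URLs could be assembled. This is not Expire's actual code; the function name and its parameters are made up for illustration, and the real module looks these relationships up in Drupal itself.

```python
def affected_urls(base_url, node_id, aliases=(), forum_id=None):
    """Return the URLs to purge when a node changes (illustrative only)."""
    # The front page and the node's own path are always affected.
    urls = [f"{base_url}/", f"{base_url}/node/{node_id}"]
    if forum_id is not None:
        # Forum nodes also change the forum index and their own forum listing.
        urls.append(f"{base_url}/forum")
        urls.append(f"{base_url}/forum/{forum_id}")
    # Every alias of the node is a separate cached page.
    urls.extend(f"{base_url}/{alias}" for alias in aliases)
    # Keep the original order but drop duplicates.
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    for url in affected_urls(
        "http://mysite.com",
        1231,
        aliases=["title-of-node", "forum/general-discussion"],
        forum_id=12,
    ):
        print(url)
```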

The Expire module then passes that list of changed URLs to the Purge module which uses php-curl to send specially crafted http requests to your web server or proxy. The Squid proxy supports this functionality natively. You need only send an http request using the "PURGE" method instead of "GET" to Squid and Squid will purge from its cache whatever URL you send. Varnish has its own text based administration interface that the Varnish integration module uses for receiving purge requests, but the Varnish Control Language can also be used to handle Squid-like http requests as well.
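For anyone who wants to see what such a request looks like outside of Drupal, here is a minimal Python sketch of a Squid-style purge request. The Purge module itself uses php-curl; the function names here are hypothetical, and sending the site's name in the Host header is my assumption about what a proxy that keys its cache on host plus URI would need.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def build_purge_request(proxy_url, page_url):
    """Build (but do not send) a PURGE request for page_url aimed at proxy_url."""
    page = urlsplit(page_url)
    path = page.path or "/"
    if page.query:
        path += "?" + page.query
    req = urllib.request.Request(proxy_url.rstrip("/") + path, method="PURGE")
    # Assumption: the proxy identifies the cached page by host + URI,
    # so the original site's name goes into the Host header.
    req.add_header("Host", page.netloc)
    return req


def send_purge(proxy_url, page_url, timeout=5):
    """Send the PURGE request and return the HTTP status code."""
    try:
        request = build_purge_request(proxy_url, page_url)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 404 or 405) arrive as exceptions.
        return err.code
```

Calling send_purge() with the proxy address and a page URL returns the status code, which is handy for checking by hand how a given proxy answers before pointing Drupal at it.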

Fortunately for us, nginx can also be configured to accept the http requests generated by the Purge module and use them to purge pages from the nginx cache.

First we need to compile nginx with a contrib module that adds the cache purge functionality. This feature does not exist in the stock nginx code. The module can be found at https://github.com/FRiCKLE/ngx_cache_purge/ and I've added it to the nginx package in my Ubuntu PPA: https://launchpad.net/~brianmercer/+archive/nginx

Then make sure you have the php-curl module installed on your server. (php5-curl for Debian/Ubuntu)

nginx configuration begins with a directive at the http level which creates the cache. Since I'm using php5-fpm for my backend, I define my cache for use through the fastcgi interface:

  fastcgi_cache_path /var/cache/nginx/mycache levels=1:2 keys_zone=mycache:1m inactive=30d max_size=2g;

The specific options are not the subject of this post, but in short, this creates a cache that stores files at /var/cache/nginx/mycache using a two-level directory structure (for more nodes you want a deeper structure to keep single directories from holding thousands of files). The cache zone is named "mycache" and keeps 1 megabyte of keys in a lookup table in RAM (more cached URLs need more RAM, but keep in mind these are just the hashed URLs and not the page data). Files that have not been requested for 30 days are removed regardless of page expiration, and the cache is capped at a maximum size on disk of 2GB.
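If you are curious where a given page ends up on disk, the layout can be reproduced by hand. nginx names each cache file after the MD5 hash of the cache key, and with levels=1:2 the first directory is the last character of that hash and the second directory is the two characters before it. The sketch below follows that rule; the function names are mine, and the example key assumes a cache key of $host$request_uri.

```python
import hashlib


def path_for_digest(cache_root, digest, levels=(1, 2)):
    """Build the cache file path for an already computed MD5 hex digest."""
    parts = []
    end = len(digest)
    # Each level takes its characters from the end of the hash, working backwards.
    for width in levels:
        parts.append(digest[end - width:end])
        end -= width
    return "/".join([cache_root.rstrip("/"), *parts, digest])


def cache_file_path(cache_root, cache_key, levels=(1, 2)):
    """Return the path where nginx would store the entry for cache_key."""
    digest = hashlib.md5(cache_key.encode("utf-8")).hexdigest()
    return path_for_digest(cache_root, digest, levels)


if __name__ == "__main__":
    # Assumes the cache key is host + request URI.
    print(cache_file_path("/var/cache/nginx/mycache", "mysite.com/node/1231"))
```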

Next we alter our server location that handles our Drupal requests. This will vary depending on your configuration, but here's mine:

  location = /index.php {
    include /etc/nginx/fastcgi_params;
    fastcgi_param SCRIPT_FILENAME /var/www/$host/drupal/index.php;
    fastcgi_hide_header X-Drupal-Cache; #optional
    fastcgi_hide_header Etag; #optional
    fastcgi_pass php;

    # Cache Settings
    set $nocache "";
    if ($http_cookie ~ SESS) { #logged in users should bypass the cache
      set $nocache "Y";
    }
    if ($request_uri ~ \? ) { # Purge doesn't handle query strings yet
      set $nocache "Y";
    }
    fastcgi_cache mycache;
    fastcgi_cache_key $host$request_uri;
    fastcgi_cache_valid 200 301 1d;
    fastcgi_ignore_headers Cache-Control Expires;
    fastcgi_cache_bypass $nocache;
    fastcgi_no_cache $nocache;
    add_header X-nginx-Cache $upstream_cache_status; #optional
    expires epoch;
  }
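The two "if" blocks above boil down to a simple decision, restated here in Python so it can be checked against sample requests. This only mirrors the logic of the config; the function name is made up and nginx does not run anything like it.

```python
def should_bypass_cache(cookie_header, request_uri):
    """Mirror the $nocache logic: True means skip the cache for this request."""
    # Logged in users carry a Drupal session cookie whose name contains SESS.
    if cookie_header and "SESS" in cookie_header:
        return True
    # Any URL with a query string is never cached, since it cannot be purged.
    if "?" in request_uri:
        return True
    return False
```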

Then we create a new server listening on an arbitrary unused port on the localhost interface:

  ## Cache purging
  server {
    listen 127.0.0.1:8888 default_server;

    access_log /var/log/nginx/caching.access.log;
    keepalive_timeout 0 0;
    error_page 405 $request_uri;

    location / {
      fastcgi_cache_purge mycache $host$request_uri;
      return 200;  #use until Purge module logs 404s better
    }
  }

The "error_page" directive is required to keep nginx from rejecting http requests that use the "PURGE" method with a 405 "Method Not Allowed" error. The "return 200" is required because nginx returns a 404 "Not Found" if the page is not in the cache, the Purge module currently logs that as an error, and we don't like those pink lines in our logs. Squid returns the same 404, so the real fix is for the Purge module to stop logging a cache miss as an error.

Then enable the Purge and Expire modules and set the proxy URL at admin/settings/purge to "http://127.0.0.1:8888".

This configuration can be used for multiple domains using the same cache.

If you use something like Firebug you can check the response headers, where the X-nginx-Cache header will show either HIT, MISS or BYPASS. The watchdog log will contain entries from the Expire module listing each URL purged.

Note that Boost does purge cached pages with query strings, but Expire does not. I see no solution at the moment other than to disable caching for any URL with a query string. Which brings us to the greatest limitation of nginx cache purging. Unlike Varnish, nginx does not perform wildcard purging.

Purge could handle query strings like http://test.brianmercer.com/forums/loquor-haero/camur-ibidem-quadrum?sor... if you could purge test.brianmercer.com/forums/loquor-haero/camur-ibidem-quadrum* using a wildcard at the end to purge any cached pages with query strings. But the nginx cache purge module doesn't support it and as far as I know has no plans to do so.

There are other limitations of Purge and Expire as well. As far as I know, Expire does not know how to purge Panels pages yet. It does have support for Views and cck.

Also, Purge and Expire are both at the dev stage of development and have no official releases. The Varnish module says that it uses Expire, however drupal.org stats show 1384 users for Varnish and only 71 users for Expire. I'm concerned that new development on expire code might be going on in Varnish without being ported to Expire.

Varnish can also purge the entire cache for a domain. nginx can only do this through something like "rm -rf /var/cache/nginx/mycache" and then if you want to limit it to a certain domain you'd have to create a different cache bucket for each domain.
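A rough workaround for both the wildcard and the per-domain problem is to scan the cache directory yourself. Each nginx cache file carries a "KEY: " line near its top holding the cache key, so a script can pick out the files whose key starts with a given prefix and delete only those. The sketch below is untested against a live server, the function names are hypothetical, and it assumes the $host$request_uri key used above and that nginx simply treats a deleted file as a cache miss.

```python
import os


def find_cached_files(cache_root, key_prefix):
    """Yield (path, key) for cache files whose key starts with key_prefix."""
    marker = b"KEY: "
    for dirpath, _dirs, files in os.walk(cache_root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                head = fh.read(4096)  # the key line sits near the top of the file
            start = head.find(marker)
            if start == -1:
                continue
            end = head.find(b"\n", start)
            if end == -1:
                continue
            key = head[start + len(marker):end].decode("utf-8", "replace")
            if key.startswith(key_prefix):
                yield path, key


def purge_prefix(cache_root, key_prefix, dry_run=True):
    """Delete matching cache files; with dry_run=True only report the keys."""
    matched = []
    for path, key in find_cached_files(cache_root, key_prefix):
        if not dry_run:
            os.remove(path)
        matched.append(key)
    return sorted(matched)
```

A prefix such as "mysite.com/" would cover a whole domain, while "mysite.com/forum/12" would cover one page together with every query string variant of it. The obvious cost is that every file in the cache has to be opened, so this is only reasonable for small caches.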

If you're going to try this configuration you should also use my solution for caching aggregated js and css files or you're going to run into problems for the reasons explained there. http://groups.drupal.org/node/124709

While this configuration has some issues, for simpler sites it can provide fast and easy reverse proxy caching with very little RAM overhead.

This is very much a work in progress and I'm hoping you guys can provide some feedback and spot the problems. Thanks.

Comments

Excellent stuff. Thanks for

omega8cc's picture

Excellent stuff.
Thanks for sharing!
Now let's test how it works :)

Well done

perusio's picture

Brian I'll look into it. I would prefer to avoid Boost altogether. It's another cron process and moving part at the application layer. I prefer to do it at the server level. It scales much more easily.

Expire

mikeytown2's picture

Expire does support cck node referrers; this is correct.

Expire is pretty dumb at the moment in comparison to boost. Making it smarter will require a database. If I have a database then I can record every query string that is used for a url and expire all of them without the need of a wildcard. With a database I can track views/panels and custom hook_menu implementations that use data in a node. Right now I'm still working on boost's logic with 404/403/301/302/503 as these are tricky to handle with caching; and as we all know, paths do change so handling this correctly is important. FYI, D7 makes handling non 200 server responses a lot easier. I'm still working on argument handling in boost/views and things like date fields in views filters. And the big one is performance of the site with the newly added database operations to record where node data is used. I will say this: because everything in Drupal is standardized around the node/entity, cache expiration logic is a lot simpler; that being said it is still a very hard problem to solve; BUT with Drupal it can be "solved". When things work correctly, we enjoy very long cache lifetimes of 10 days because when things change the correct paths get purged from the cache. This is still getting fine tuned; once I'm reasonably happy with boost I plan on bringing it over to expire; database performance is key.
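As a rough illustration of the "record every query string" idea (not Boost's or Expire's actual schema), a lookup table could be as small as the sketch below. The table and function names are invented, and SQLite stands in for whatever database the site uses.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) seen_urls table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls ("
        "path TEXT NOT NULL, query TEXT NOT NULL, "
        "PRIMARY KEY (path, query))"
    )
    return db


def record_request(db, path, query=""):
    """Remember that this path was served with this query string."""
    db.execute(
        "INSERT OR IGNORE INTO seen_urls (path, query) VALUES (?, ?)",
        (path, query),
    )


def urls_to_expire(db, path):
    """Return every recorded variant of path, so no wildcard is needed."""
    rows = db.execute(
        "SELECT query FROM seen_urls WHERE path = ? ORDER BY query", (path,)
    ).fetchall()
    return [path + ("?" + query if query else "") for (query,) in rows]
```

The extra write on every cacheable page view is exactly the database cost mentioned above.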

Thanks for the update,

brianmercer's picture

Thanks for the update, mikeytown2.

The combination of Expire and Purge at least brings us up to the level of Wordpress plugins like http://wordpress.org/extend/plugins/nginx-proxy-cache-purge/ and http://wordpress.org/extend/plugins/nginx-manager/.

There's no question that Drupal sites can get more complex than Wordpress and that the Expire module can't be expected to anticipate every possible cached page. Expire and Purge work great... if the site is simple enough.

Adding more php code and more db writes and lookups to overcome a limitation of the proxy may not be the best use of your time. Perhaps keep the Drupal side as light and efficient as possible and let the proxy carry more of the weight.

Looking at the Boost config

brianmercer's picture

Looking at the Boost config page, I think I need another module that doesn't exist. I need a way to exclude pages from the cache based on a whitelist (for special pages, pages with captchas, etc.) and if the page has messages on it. That function doesn't exactly belong in Expire or Purge. I could include the query check and the logged in user check in there as well.

The module just has to send a special header if any of those conditions exist. I'll have to take a look.

Yes

perusio's picture

One that takes a textarea where pages are listed and sends the X-Accel-Expires: 0 header to the proxy for them. Something along these lines?

Exactly.

brianmercer's picture

Exactly. Also maybe exclude ajax queries.
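Roughly, the decision such a module would make could look like the sketch below. No such module exists yet, so every name here is hypothetical; the only fixed part is that nginx does not cache a response carrying X-Accel-Expires: 0.

```python
from fnmatch import fnmatchcase


def no_cache_headers(path, excluded_patterns, has_messages=False, is_ajax=False):
    """Return the extra response headers that keep this page out of the cache."""
    # excluded_patterns would come from the textarea, one pattern per line.
    excluded = any(fnmatchcase(path, pattern) for pattern in excluded_patterns)
    if excluded or has_messages or is_ajax:
        return {"X-Accel-Expires": "0"}
    return {}
```

With patterns like "user/*" and "contact", a request for user/login gets the header while node/1 is cached as usual, unless it is showing a message or was requested via ajax.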

I'm thinking

perusio's picture

that perhaps an approach like cache actions might be one worth exploring. I've browsed the Expire module and most of the functionality seems to be available out of the box or can be implemented by extending the cache actions module; the same goes for Purge.

This is just an idea. I know that Rules is a "monster", but it is also extremely flexible and has a clean API that can be used to further extend the module.

Very True

mikeytown2's picture

One thing though, I like to be lazy when it comes to configuration (can't tell from boost right?). If the correct settings can automatically be set then why waste time configuring for that use case when one can program it to detect the correct settings for 99% of use cases. Having rules for expire does sound like the right step though; it allows for use cases I haven't coded yet or the 1% edge cases. Someone created a patch in Boost that I committed for rules integration; I should clean it up and bring it over.

Rules & Drush

mikeytown2's picture

Looking at the code in boost, I should be able to knock at least one of these out fairly quickly.
http://drupal.org/node/1054580
http://drupal.org/node/1054584

Rules is in

mikeytown2's picture

http://drupal.org/cvs?commit=496736

Example input

http://www.example.com/
node/400
<front>
contact-us

If the path given is a node it will load the node and expire it; this works with node aliases as well. Full URLs are passed through untouched.

This patch needs some follow up, code path could be better. Also expire needs to not make so many assumptions & it's pretty verbose right now.
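For readers who want to picture how input like the example above might be turned into URLs, here is a simplified sketch. It is not the committed Rules code: the function name is invented, it does no node loading or alias lookup, and treating <front> as the site root is the usual Drupal convention rather than something read from the patch.

```python
def normalize_expire_input(text, base_url):
    """Turn one-entry-per-line input into a list of absolute URLs."""
    base = base_url.rstrip("/")
    urls = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith(("http://", "https://")):
            # Full URLs are passed through untouched.
            urls.append(entry)
        elif entry == "<front>":
            # Drupal's placeholder for the front page.
            urls.append(base + "/")
        else:
            # Internal paths and aliases are made absolute.
            urls.append(base + "/" + entry.lstrip("/"))
    return urls
```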

Thanks

perusio's picture

mikeytown2. I will try these developments this next weekend. I want to expand my config to use Nginx proxy cache. I'm getting requests for reverse proxy support.

Like I said above, it is high time to have a suite of modules for Nginx. I fail to see the fascination with Varnish. Nginx provides all that Varnish provides, AFAIK, and more: load balancing, FastCGI cache, upstream IP hashing, &c.

be nice if you were on the

AntiNSA's picture

be nice if you were on the Barracuda/Octopus team so that would be a part of everything; I don't know what you're duplicating and/or what is incompatible...

Who's

perusio's picture

on that team? I think it to be a project authored by Grace (omega8cc).

My only issue with that is that it is a project that, AFAIK, is oriented towards using Aegir. I've never used Aegir. To me it looks like a UI built on Drupal that just deploys and manages a bunch of Drupal sites.

I'm much more of a command line kind of guy. And you can get all that Aegir provides using puppet or chef with drush. Now if Aegir had a dashboard that showed all your sites' data, like Monit summary/status, Munin graphs, break-in attempts on the firewall, that would be interesting.

I don't think it provides that. Perhaps I'm wrong.

SqyD's picture

Hi,

I am the author of the purge module. I was already working on drush and rules integration for purge.
For the rules integration I decided to submit a patch to the cache_actions module to keep these actions in one place. I haven't done this yet.
For the drush integration I already had some basic functionality and have some options in mind (proxy settings per request) that would make integration into the drush functionality of expire impossible. A little overlap won't kill us I guess...

Paul K

Update: Drush in Expire, Native Nginx support in Purge

SqyD's picture

Over the past few weeks I've done the following on these topics:

Nginx
