Using a simple, cross-platform static cache to cut server load by 85%

chipk's picture

I work on a newspaper site (http://www.gazettenet.com) and have worked out a simple static caching technique outside of Drupal that others might find helpful, assuming their traffic patterns resemble those of a typical newspaper site.

We found that around 40% of all pages served by the system were from a relatively small number of URL targets, namely the home page (25%) and a handful of landing pages (local news, obituaries, sports, etc.). In addition, because landing pages are typically content-rich, they are also often the most expensive to serve from a performance standpoint. In our case, this same small collection of landing pages accounted for more than 85% of total server load.

With that understanding in hand, we saw a huge performance boost opportunity if just those few pages could be served from a static cache directly by Apache - i.e. no call to mod_php, drupal, or mysql.

The technique we devised has three pieces:

  1. a cron job on the local server using 'wget' to build a cached page for each of the target URLs
  2. Apache .htaccess code to differentiate between the calls from #1 above and all other calls
  3. client-side JS that calls back to the server for user login status

The basic idea is that the 'wget' calls from #1 are passed through to Drupal, which builds and returns the pages for the target URLs. Those returned pages are stored on disk as the static cache. The .htaccess code in #2 distinguishes the wget calls (which come from the server's own IP) from user requests, and serves the pre-built pages from the static cache to all users. The JS in #3 is needed in our case so that the static pages can be updated to reflect the user's login status.

Parts #1 and #2 are very easy to implement. Part #3 is fairly complex, but could be eliminated if you don't need to display any user-specific data or status on the cached pages.

Overall, the system is very reliable and resulted in dropping server load by 85%. I'd be happy to share details if others might be interested.

Comments

why not boost?

jredding's picture

Did you give any thought to using boost to take care of 1 & 2?

-Jacob Redding

chipk's picture

I was interested in working out a solution that could give us the full throughput of an Apache-only static cache - i.e. no invocation of mod_php or load of any part of the drupal framework. It ended up being simple to do and has the advantage of being platform-neutral - i.e. the technique can be applied to any site using any framework/platform. Also, because the complex bit is updating the cached page with user-specific data, most of the hard work is in that part... not sure whether boost.module addresses that part of the problem.

That's just what boost does, though

bhuga-gdo's picture

Boost makes a static copy of the page and changes Drupal's normal .htaccess file to redirect queries from anonymous users to cached pages without touching php in any way.

It's not platform-neutral, but there's no good way to do a platform-neutral cache like this. Boost can be configured to refresh the cache in a number of ways, some of which are 'aware' of content changes. This can't be done on an external system like the one you describe.

Further, there's no need for client-side js: boost checks for drupal session cookies and will correctly send logged-in users to non-cached content. Further, Boost will automagically cache anything an anonymous user fetches, instead of the wget system you describe, in which you hand-pick the pages to cache; such a system would not save you from a deep link on digg or the front page of google news.

I can't be sure about your situation, so the platform-neutrality (within the scope of anything that runs on Apache) may overcome these benefits, but I would suggest taking another look at boost.

needed cache to serve all users

chipk's picture

Caching that works only for anonymous users would be a non-starter for us. And caching every page wasn't a requirement - we just wanted to cache the handful of landing pages that accounted for 85% of server load. All other pages - deep links included - are served dynamically. The MySQL DB cache does a more than good enough job of optimizing the builds of those content-limited pages. Just-in-time cache building is a nice feature if you are caching a huge number of pages, but it was unnecessary for bringing our server load down to a whisper.

Proxy Cache

mukesh.agarwal17's picture

hi Chip,

I'm not an expert in this area, but I would suggest using a proxy cache, like Squid. You can serve the home page and the landing pages from the proxy server's cache. You could optionally write a small shell script which purges these cache elements every hour or 2 hours or whatever suits your needs. Squid is the solution that the Wikipedia guys use to serve their static pages.

Mukesh
www.ilovebolly.com

Cheers,
Mukesh Agarwal
www.innoraft.com
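
For readers wondering what such a purge script might look like, here is a minimal sketch, not something anyone in this thread has posted. It assumes the `squidclient` tool that ships with Squid is installed, that squid.conf has been set up to accept the PURGE method from this host, and that Squid listens on localhost:3128 (a reverse-proxy setup may well use another port). The SQUID_HOST and SQUID_PORT variables and the purge_url function name are made up for the example.

```shell
#!/bin/sh
# Sketch: ask a Squid cache to drop selected pages so the next request
# is fetched fresh from the backend.
# ASSUMPTIONS: squidclient is installed, Squid permits PURGE from this
# host, and the host/port defaults below match your setup.
SQUID_HOST="${SQUID_HOST:-localhost}"
SQUID_PORT="${SQUID_PORT:-3128}"

# Send one PURGE request for the given URL.
purge_url() {
    squidclient -h "$SQUID_HOST" -p "$SQUID_PORT" -m PURGE "$1" >/dev/null
}

# URLs to purge are given as arguments.
for url in "$@"; do
    purge_url "$url" || echo "purge failed: $url" >&2
done
```

Run from cron with the landing-page URLs as arguments (for example, hourly with http://www.domainname.com/ as the only argument), this would approximate the periodic purge suggested above.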

re:Squid Proxy Cache

chipk's picture

Hi Mukesh,

Thanks for the suggestion re: Squid. The technique I am using is, in effect, just that - a proxy mechanism with components for generating the cache and serving from it. I've presented the example and technique here in case it might offer others some general insight into the static cache issue, and because it might be a technique to consider if folks' caching needs are in line with my example. I'd imagine Squid is a good solution with a lot of flexibility, but it probably also requires an investment of time and energy in learning/deploying/tweaking. Great if you need it, but maybe overkill if your needs are relatively modest compared to super-high-volume destinations like Wikipedia.

  • Chip

yes, proxy cache

greggles's picture

In general I think your idea is interesting, but I'm not sure I'd follow the exact path you mention.

components for generating the cache

With something like a proxy cache, you don't need to actively generate the cache - it happens automatically and "lazily" (i.e. only when the cache data needs to be generated).

probably also requires an investment of time and energy in learning/deploying/tweaking

You're comparing a known technology (proxy caches like squid) with a homebuilt system. My preference (along with most Drupal users...that's why we use Drupal after all) is to use the known technology. Either way I have to learn or build something. With the known open source technology it will have benefits and use outside of one company...

--
Growing Venture Solutions | Drupal Dashboard | Learn more about Drupal - buy a Drupal Book

Squid sounds promising

mcurry's picture

I like the idea of using Squid (I've used it in the past on a local network to improve web browsing performance on a slow internet connection).

Anyone out there willing to share configuration tips and/or a howto on the subject of configuring a Squid proxy with Drupal?

Michael Curry
Classified Ads Module For Drupal 6 | My hangout

Hi Michael - you might get

chipk's picture

Hi Michael - you might get more traction by starting a new post - e.g. Need howto on configuring a Squid proxy with Drupal

Here it is

kbahey's picture

Here it is Increasing Drupal's speed via the Squid caching reverse proxy.

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.

Thanks for that!

mcurry's picture

All help is appreciated...!

Michael Curry
Classified Ads Module For Drupal 6 | My hangout

would you mind sharing code?

ajayg's picture

@Chip
Would you mind sharing the code? I am in a situation like yours and think caching a few pages (even for logged-in users) is what I need. There is nothing user-specific on those pages, so they are prime candidates for caching.

code implementing simple static cache for home page

chipk's picture

Sure :

----------------------------------------------------------------
1. build_static_cache.sh - shell script to build static cache
----------------------------------------------------------------

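#!/bin/sh
# Work inside the cache directory; abort if it is missing
cd /var/www/vhosts/domainname.com/httpdocs/static_cache || exit 1
# Fetch the page to a temp file (-O sets the output file, -o /dev/null discards
# wget's log) and rename it over the live copy only if the download succeeded.
# The rename is atomic, so users never see a partially written file.
wget http://www.domainname.com/ -O _cache_homepage.tmp -o /dev/null &&
  mv -f _cache_homepage.tmp _cache_homepage.html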

----------------------------------------------------------
2. cron entry to re-build static cache every 5 minutes
----------------------------------------------------------

*/5 * * * * ~/bin/build_static_cache.sh

---------------------------------------------------------------
3. code in Drupal .htaccess to differentiate between calls
to build the cache and all other calls
---------------------------------------------------------------

# TARGET URL: /
# MYHOSTIP == 12.34.56.78
#
# Fail next condition for calls from the host machine
#
RewriteCond %{REMOTE_ADDR} !^12\.34\.56\.78$
#
# Check for target URL
#
RewriteCond %{REQUEST_URI} ^/*$
#
# Serve the cached page if we get through to here
#
RewriteRule ^.*$ /static_cache/_cache_homepage.html [L]
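
The script in part 1 builds only the home page. As a rough sketch of how it might be extended to several landing pages, the version below takes the URL paths as arguments and derives a cache file name from each. The BASE_URL and CACHE_DIR variables, the function names, and the file-naming scheme are all assumptions made for the example; only the home page name (_cache_homepage.html) comes from the code above.

```shell
#!/bin/sh
# Sketch: rebuild static copies of several landing pages.
# ASSUMPTIONS: BASE_URL, CACHE_DIR and the naming scheme are illustrative.
BASE_URL="${BASE_URL:-http://www.domainname.com}"
CACHE_DIR="${CACHE_DIR:-/var/www/vhosts/domainname.com/httpdocs/static_cache}"

# Map a URL path to a cache file name,
# e.g. /section/hs-sports -> _cache_section_hs-sports.html
cache_name() {
    case "$1" in
        /) echo "_cache_homepage.html" ;;
        *) echo "_cache$(printf '%s' "$1" | tr '/' '_').html" ;;
    esac
}

# Fetch one URL into a file; kept separate so it can be swapped out.
fetch() {
    wget -q -O "$2" "$1"
}

# Build one page: download to a temp file, then rename it over the live
# copy only if the download succeeded and the file is non-empty.
build_page() {
    target="$CACHE_DIR/$(cache_name "$1")"
    tmp="$target.tmp"
    if fetch "$BASE_URL$1" "$tmp" && [ -s "$tmp" ]; then
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Paths to cache are given as arguments.
for path in "$@"; do
    build_page "$path" || echo "failed to rebuild $path" >&2
done
```

A cron line could then pass the paths, e.g. `build_static_cache.sh / /section/hs-sports /section/umass-sports`. Each cached path would also need its own RewriteCond/RewriteRule block in .htaccess pointing at the matching file, along the lines of part 3.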

Thanks for this perspective

SerenityNow's picture

Thanks for this perspective - I am about to implement something similar and this is very useful info.

This doesn't make very much

slantview's picture

This doesn't make very much sense to me. If you absolutely had to have a server without php, why wouldn't you just use boost, use an NFS mount (you are most likely going to have to anyway if you use multiple servers) and then use 2 sets of servers, APP servers and STATIC servers, and redirect to the STATIC servers for anything that doesn't have the DRUPAL-UID cookie set by boost? This seems like a lot of reinventing the wheel. Boost will recreate the pages for you on your max lifetime and can also cache only certain pages.

I wouldn't necessarily even do that though. I have not seen a scenario of a site yet that couldn't be scaled with current high performance technologies. Boost/Cache Router/Memcache/Squid/etc, plus a bit of fixing queries and indexing tables, etc.

just my $0.02,

Steve

one solution among many

chipk's picture

Sure, there are plenty of potential solutions for static caching with varying degrees of effort required. I offered this technique because of its simplicity and the surprisingly large impact such a small, targeted solution could have. I also thought the example might be helpful to others wondering about high bang-for-effort solutions. In our case, we serve 1.2M+ pages/day from a single quad-processor box, split into two virtual boxes (apache/mysql) with 2 cores each. Conservatively, I'd guess we could serve 2x or 3x that with this technique and no change to either the software stack or hardware. If we needed to go further, I'd probably consider Squid next to cache all pages, but at this page view level it's not close to being needed.

Below are a few of the handful of processor-expensive pages we cache:

http://www.gazettenet.com/

http://www.gazettenet.com/section/hs-sports

http://www.gazettenet.com/section/umass-sports

Don't get me wrong, I think

slantview's picture

Don't get me wrong, I think this is a very creative solution; I just wonder whether you tried existing solutions before implementing this. A lot of the existing solutions would likely have given you as much bang for the buck as you needed, and they are a little more tested, so you don't end up with any weird problems. I do, however, like the idea of not hacking core and doing something creative like this. All of this info can be extremely helpful.

If your site is getting 1.2M+ page views per DAY, I can't imagine doing that without a second server. If for nothing else, for redundancy.

The other thing I don't understand is why your Alexa rank is so low if you are doing so much traffic. That is very odd. I know sites that do much less traffic than you and have a much higher Alexa score.

http://www.quantcast.com/gazettenet.com#traffic

http://www.alexa.com/data/details/traffic_details/gazettenet.com

no worries

chipk's picture

No worries - just laying out the thinking that went into the solution. I actually wanted to try a simple custom solution just to better understand the problem. Among the suggestions made, Squid would be the next step if/when we need more complete static caching. Even so, the hard part is still updating the cached pages on the fly to fold in dynamic bits of UI - e.g. LogIn/LogOut link based on auth status.

re:Alexa - I'm only guessing the comment below is a clue. Also, the 1.2M views is across 5 sites running on that same box, not just gazettenet.com.

you can forget the alexa

bennos's picture

You can forget the Alexa rank. It's shit: only people with the Alexa toolbar installed are tracked.

Not really

markus_petrux's picture

They also collect data from other (unspecified) sources. :P

You may wish to look at google trends as well:

http://trends.google.com/websites?q=gazettenet.com

1.2M per MONTH..

chipk's picture

Sorry - meant 1.2M per MONTH. Wishful thinking!

Simple is good; we like simple.

ariostel's picture

Chip, this is really a simple solution, well-explained, and easy to implement. Your inclusion of the code makes it possible for anyone to try this. Thanks for the post.

High performance
