Can page cache be configured to ignore certain query parameters?

scottatdrake's picture

When our newsletters go out, links are appended with user-unique tags for tracking purposes and whatnot. Something like example.com/page?userID=1234. It is my understanding that these unique values are all treated as unique pages and effectively bust the cache.

We experience big spikes in web server and database load as soon as these newsletters are sent out. Is there any way to keep those query parameters in the URL, but have the caching mechanism ignore specific ones?

I feel like someone must have run into this before, but my google-fu is coming up short.

Comments

Questions

mikeytown2's picture

Why do you need the userID value set? Is this for drupal or for google analytics? Running Apache?

There are two options to solve this: create a rewrite rule at the server level to strip userID=1234, or strip it out in settings.php and use hook_boot to add it back in.

Varnish?

cashwilliams's picture

I read this differently. Are you using Varnish and trying to get it to respond even with ?userID=1234?

Do you need that userID info?

ngaur's picture

I've dealt with something similar to this, where a link to a movie file, linked from within a Flash applet, had a random number in a URL argument in order to bust the cache in the user's browser. Trouble was that we were delivering that file about 15 times a second at peak, and we needed caching to work on our server.

We were using Squid as a caching reverse proxy server. By doing a URL rewrite in Squid, we were able to strip the parameter out of the incoming URL and map all these separate URLs from the users onto one URL going to the web server. In fact, Squid would serve that from its single cache entry rather than asking the web server at all. Problem solved.

However, this means that your web server never sees the variable argument in the URL at all. Think about what that means for you. If you block any GET parameter called userID across your whole site, it's bound to break something. You'll want to do this only for some range of URLs, e.g. exclude all of /admin/ from this rule. Search your logs for problem areas, i.e. search for hits with a Referer on your own site (which excludes the mail links) and with 'userID=' in the URL. If the naming produces too many clashes, you might want to change the parameter name in the links in your outgoing mail.

You can presumably do much the same with Varnish, nginx, or whatever else you might use as a caching proxy. If you don't use a proxy at all, you could still do it with mod_rewrite in Apache. Much less efficient of course, but it will mean that the Drupal page cache will be able to do its job.


dalin's picture

You can set up Varnish to ignore query parameters pretty easily. You can then use some JS on the page to gather the data and submit it via an AJAX request.

Add this somewhere in vcl_recv():

  # Strip various GET params that are only used by JavaScript and should not
  # cause multiple versions to be stored in the Varnish cache.
  # ([?&]) captures the separator before userID so it can be put back as \1.
  set req.url = regsuball(req.url, "([?&])userID=[^&\s]*&?", "\1");
  # Remove any trailing & or ?
  set req.url = regsuball(req.url, "[?&]+$", "");
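
The rule above applies to every request. As a rough sketch of how it might be combined with the earlier advice to leave some paths alone, the variant below skips anything under /admin. The /admin exclusion pattern is only an illustration taken from that comment, not a tested configuration, so adjust it to whatever paths actually need the parameter.

```vcl
sub vcl_recv {
    # Assumption: only pages outside /admin should be normalized.
    # The second test skips requests that carry no userID at all.
    if (req.url !~ "^/admin(/|\?|$)" && req.url ~ "[?&]userID=") {
        # Drop userID=..., keeping the ? or & that preceded it.
        set req.url = regsuball(req.url, "([?&])userID=[^&]*&?", "\1");
        # Remove any ? or & left dangling at the end of the URL.
        set req.url = regsub(req.url, "[?&]+$", "");
    }
    # No return here, so the rest of vcl_recv still runs.
}
```

With this in place, /page?userID=1234 and /page?userID=5678 both become /page before the cache lookup, while /admin/people?userID=1234 is passed through unchanged.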

--


Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his

Watch out for caveats

blazindrop's picture

I had the exact same scenario come up in our organization, where load on the sites increased as newsletters were deployed. Dalin's approach is pretty elegant and probably a little simpler than what I have implemented (which was some parameter-stripping logic in vcl_hash()). The benefit of the latter is that the URL remains the same, so Drupal can "see" the entire query string, but the hashing essentially ignores any parameters that would bust the cache. You need to be careful with this approach, because it can introduce side effects with the Drupal cache and can expose these tracking links to everyone who visits your site.
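
For illustration only, here is a minimal sketch of what parameter stripping in vcl_hash() could look like. It is not the configuration described above, which was not posted. It assumes Varnish 4 or later, where vcl_hash ends with return (lookup); on Varnish 3 that line would be return (hash). The userID parameter name comes from the original question.

```vcl
sub vcl_hash {
    # Hash a copy of the URL with userID removed. req.url itself is not
    # modified, so the backend (Drupal) still receives the full query
    # string whenever the request is a cache miss.
    hash_data(regsub(regsuball(req.url, "([?&])userID=[^&]*&?", "\1"), "[?&]+$", ""));

    # Mirror the built-in vcl_hash: vary the cache key by host as well.
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }

    # Return here so the built-in vcl_hash does not also hash the raw req.url.
    return (lookup);
}
```

Because only the first request for a page reaches Drupal, whatever that response contains, including any markup built from that visitor's userID, is what every later visitor gets from the cache. That is the side effect the caveat above warns about.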
