Cache Warmer Customizations

EndEd's picture

Hi, we are using the "cache_warmer" module to make HTTP requests to a lot of pages,
so that BOOST cache files are created and our cache is always fresh and warm.
We use only the HUB PAGES file.

We are having problems trying to crawl pages in PARALLEL mode, but not in SINGLE mode.

  • Problem_1
    In our site, we redirect all anonymous users to a Welcome page if they don't have a "saw_welcome" Cookie (which is set after visiting that page).

  • Problem_2
    We have some paths we want to crawl that use parameters in the URL like the following examples:
    catalog?brands[]=1545&sort_by=title&sort_order=ASC
    catalog?categories[]=1260&sort_by=title&sort_order=ASC
    product/my-first-product?product_id=426
    product/my-second-product?product_id=426

We managed to do all this in the SINGLE function by adding a new cURL option "CURLOPT_HTTPHEADER":

<?php
function cache_warmer_crawl_single($base_uri = '', $uris = array(), $hub_pages = '', $timeout) {
  // ...
  // cURL request basic options.
  curl_setopt_array($ch, array(
    CURLOPT_NOBODY => TRUE,
    CURLOPT_TIMEOUT => $timeout,

    // New cURL option.
    CURLOPT_HTTPHEADER => array(
      'Cookie: DRUPAL_UID=0; saw_welcome=1',
      'Cache-Control: private, no-cache, no-store, must-revalidate, max-age=0',
      'Pragma: no-cache',
    ),
  ));
  // ...
}
?>

With the previous code, every single request is made with my custom Cookie (so Problem_1 is resolved).
Problem_2 is not a problem in this SINGLE function either, as I can see that the BOOST files are created without trouble for
those paths including parameters.

So, trying to do the same thing in PARALLEL mode, I modified the function in the same way as the single one:

<?php
function cache_warmer_crawl_multiple($base_uri = '', $uris = array(), $hub_pages = '', $timeout, $parallel, $crawler_uri) {
  // ...
  // cURL request basic options.
  curl_setopt_array($ch, array(
    CURLOPT_POST => TRUE, // POST request.
    CURLOPT_TIMEOUT => $step_timeout,
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_URL => $crawler_uri,

    // New cURL option.
    CURLOPT_HTTPHEADER => array(
      'Cookie: DRUPAL_UID=0; saw_welcome=1',
      'Cache-Control: private, no-cache, no-store, must-revalidate, max-age=0',
      'Pragma: no-cache',
    ),
  ));
  // ...
}
?>

This doesn't work at all.
No cache files are created until I comment out my code that redirects users who don't have the "saw_welcome" Cookie.
If I comment out my redirection code, all files are created, but we end up with our second problem:
all paths with parameters are created without them (only the first parameter survives). So for example:

catalogo_brands[0]=1247&sort_by=title&sort_order=ASC.html

is created as:

catalogo_brands[0]=1247.html

I see these portions of code inside the PARALLEL function...

<?php
// ...
// Fill in the POST data array.
for ($j = 0; $j < $parallel; $j++) {
  $post_data["data$j"] = $all_uris[$j + ($i * $parallel)];
}
// ...
// Make the POST request.
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data, '', '&'));
// ...
?>

...but can't see how I can change it.

Does the PARALLEL mode accept the CURLOPT_HTTPHEADER option?
Does the PARALLEL mode accept URLs with parameters?

Thanks in advance :)

Comments

That doesn't work because

perusio's picture

the parallel mode works by POSTing the URIs to be crawled to the Lua location. So you need to POST the options and extract them at the Lua level. This needs to be fixed so that you can set header options. Can you try replacing the cache_warmer_requests.lua script that ships with cache_warmer with this one: https://gist.github.com/2870811 and report back?

Thanks, it works for the

EndEd's picture

Thanks, it works for the cookie part of the problem. Now the Boost caches are created, but what needs to be created like this:

catalogo_brands[0]=1247&sort_by=title&sort_order=ASC.html

is created like:

catalogo_brands[0]=1247.html

so no parameters :/

That's probably

perusio's picture

because http_build_query percent-encodes the URIs. So the URI to be hit has a lot of % signs, and I guess Boost doesn't handle URL-encoded URLs. Could you check the server logs?
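
To see what that encoding does, here is a minimal sketch in Python rather than PHP, so it can be run standalone. `urllib.parse.urlencode` is only a stand-in for PHP's `http_build_query`, and `parse_qs` for whatever decodes the form body on the receiving side; both stand-ins are assumptions for illustration, not the module's code. The `data0`, `data1` and `base_uri` field names and the sample URIs are the ones quoted in this thread.

```python
from urllib.parse import urlencode, parse_qs

# URIs as they appear in a hub pages file (taken from the thread).
uris = [
    "catalogo",
    "catalogo?brands[0]=1247&sort_by=title&sort_order=ASC",
]

# Build the POST fields the way the PARALLEL crawler does: data0, data1, ...
post_data = {f"data{j}": uri for j, uri in enumerate(uris)}
post_data["base_uri"] = "http://es.gmidos.nx"

# Stand-in for http_build_query(): '?', '&', '[', ']' and '=' inside each
# value are percent-encoded, so they cannot be confused with field separators.
body = urlencode(post_data)

# What a form decoder on the receiving side recovers from that body.
decoded = {key: values[0] for key, values in parse_qs(body).items()}

print(body)
print(decoded["data1"])
```

If the receiving side decodes the form body once, the full URI comes back with all of its parameters, which would suggest the percent signs in the POST body are not what drops them.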

If that's the case, can you replace

perusio's picture

cache_warmer_requests.lua by this: https://gist.github.com/2871262 and report back?

We had no luck with this

EndEd's picture

We had no luck with this one. It behaves the same with that code change.

This is the crawler vhost access log:

127.0.0.1 - - [05/Jun/2012:00:11:29 +0100] "POST /cache-warmer HTTP/1.1" 200 76 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:11:43 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:09 +0100] "POST /cache-warmer HTTP/1.1" 200 76 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:23 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:51 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:51 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:14:45 +0100] "POST /cache-warmer HTTP/1.1" 200 39 "-" "-"

nothing to post in the crawler vhost error log.

Also the --hub-pages file is

EndEd's picture

Also the --hub-pages file is more or less like this...

catalogo
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=5
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=6
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=7
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=8
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=9
catalogo?categories[0]=1311&sort_by=title&sort_order=ASC
catalogo?categories[0]=1447&sort_by=title&sort_order=ASC
catalogo?categories[0]=1423&sort_by=title&sort_order=ASC
catalogo?categories[0]=1268&sort_by=title&sort_order=ASC
catalogo?categories[0]=1336&sort_by=title&sort_order=ASC
catalogo?categories[0]=1400&sort_by=title&sort_order=ASC
catalogo?categories[0]=1249&sort_by=title&sort_order=ASC
catalogo?categories[0]=1373&sort_by=title&sort_order=ASC
catalogo?categories[0]=1258&sort_by=title&sort_order=ASC
catalogo?categories[0]=1284&sort_by=title&sort_order=ASC
catalogo?brands[0]=1425&sort_by=title&sort_order=ASC
catalogo?brands[0]=1393&sort_by=title&sort_order=ASC
catalogo?brands[0]=1406&sort_by=title&sort_order=ASC
catalogo?brands[0]=1412&sort_by=title&sort_order=ASC
catalogo?brands[0]=1292&sort_by=title&sort_order=ASC
catalogo?brands[0]=1504&sort_by=title&sort_order=ASC
catalogo?brands[0]=1386&sort_by=title&sort_order=ASC
catalogo?brands[0]=1473&sort_by=title&sort_order=ASC
catalogo?brands[0]=1460&sort_by=title&sort_order=ASC
catalogo?brands[0]=1247&sort_by=title&sort_order=ASC
producto/la-piramide-prohibida?product_id=11860
producto/pacto-de-amor-pasion-y-aventura?product_id=11861
producto/lr6aa-alcalinas-panasonic-xtreme-power?product_id=11862
producto/lr03aaa-panasonic-powermax3?product_id=11863
producto/lrv08-alcalina-panasonic-powercells?product_id=11864
boletines-de-noticias-de-gmidos
boletin-de-noticias/boletin-de-noticias-de-febrero-2012
ayuda/ayuda-general
ayuda/faq
ayuda/terminos-de-uso

It's strange because we just

EndEd's picture

It's strange, because we just saw the ngx.unescape_uri command in the nginx wiki and on paper it seems to do exactly what we need :(

Can you add the following code to

perusio's picture

In the main loop of cache_warmer_requests.lua do:

-- Loop over the post_data table (contains the URIs to be hit).
for _, u in pairs(post_data) do
   -- All requests are HEAD requests.
   ngx.log(ngx.ERR, 'uri: ' .. u)
   ngx.log(ngx.ERR, 'uri_u: ' .. ngx.unescape_uri(u))
   table.insert(requests, { build_req_uri(base_uri, ngx.unescape_uri(u)), { method = ngx.HTTP_HEAD }})
end

And post the error log so that we can see what it's doing?

First of all, thanks for

EndEd's picture

First of all, thanks for continuing to help us with this...

Here is a portion of the crawler vhost error log:

2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?categories[0]=1400&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?categories[0]=1400&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?categories[0]=1249&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?categories[0]=1249&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"

It seems that the URIs are correctly constructed

perusio's picture

so unescaping it is superfluous. The problem is downstream from here, either in the multi-request handling or on the Boost side.

Can you post the logs of the server being hit? I'd like to see what URI is hit.

Thanks,

we put

EndEd's picture

we put this:

drush_log(http_build_query($post_data, '', '&'), 'error');

in the function cache_warmer_crawl_multiple, inside the main loop, and we get this kind of log on the console:

data0=&data1=catalogo&data2=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20&data3=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D1&data4=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D2&data5=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D3&data6=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D4&data7=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D5&data8=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D6&data9=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D7&base_uri=http%3A%2F%2Fes.gmidos.nx

Finally we made one step

EndEd's picture

Finally we made one step forward. With this code inside cache_warmer_requests.lua...

-- Loop over the post_data table (contains the URIs to be hit).
for _, u in pairs(post_data) do
   -- All requests are HEAD requests.
   u = ngx.re.gsub(u, '&', '%26', 'i')
   table.insert(requests, { build_req_uri(base_uri, u), { method = ngx.HTTP_HEAD }})
end

...it works, but the URL that needs to be created like this:

catalogo?brands[0]=1247&sort_by=title&sort_order=ASC

is created like this:
catalogo_brands[0]=1247%26sort_by=title%26sort_order=ASC.html

The problem seems to be the '&'...

It seems that

EndEd's picture

It seems that the...

catalogo_brands[0]=1247%26sort_by=title%26sort_order=ASC.html

...generated pages were badly generated. They are all Views pages, and after Boost caches them all the files have the same size and the generated HTML has problems (Views empty results), so that piece of code doesn't work.

I looked at my Boost stats and

superfedya's picture

I looked at my Boost stats and there are 12000+ pages O___O Is that normal?

Have any of you tried

perusio's picture

microcaching as opposed to Boost? It simplifies your setup and makes it faster. One less thing on the drupal side. Move the cache to the server layer.

I suppose I should give that

EndEd's picture

I suppose I should give that a try. I'm a total noob with Nginx, but I'm using your drupal-nginx config, which comes with microcaching by default. I will report back then.

Ok, we made our first try in

EndEd's picture

Ok, we made our first try in setting up the microcaching system... with no luck :)

We have Boost uninstalled and we are using your nginx configuration with microcaching activated for anon users in drupal 7.

The only change we made in the files (only for debugging purposes) was to these 2 lines:

fastcgi_cache_valid 200 301 60m;
fastcgi_cache_valid 302 60m;

We want to use this in conjunction with "cache_warmer": not microcaching per se, but running cache warmer frequently enough and giving the Nginx cache a big TTL. Before testing cache warmer, we tried browsing the site directly.

When browsing the site normally as anonymous and viewing the headers in Firebug, we get "X-Micro-Cache: MISS" in the response headers for all pages.

Also, if we go to the directory where cache files should be saved (/var/cache/nginx/microcache), there are no files inside.

Truth is that we have some custom Cookies. Maybe this is too much but we will try to explain some of them:

  • saw_welcome:
    This Cookie is set to "0" for all users (anonymous and authenticated) the first time he visits any of our pages.
    In every page load we look at this Cookie and if the user is ANONYMOUS and the Cookie is "0" (or is not set) the user is redirected to a welcome page.
    If the user submits the form in the welcome page, this Cookie is set to "1" and the user is redirected to the page he was trying to see.
    All subsequent page loads will see that the user has this Cookie set to "1" and will not redirect him again.

  • has_qtip:
    We use the jQuery QTips plugin so we can use custom tooltips in our links.
    This Cookie is set to "1" for all users (anonymous and authenticated) the first time he visits any of our pages.
    In every page load we look at this Cookie and if the user is AUTHENTICATED and the Cookie is "1", we add a
    special class "qtip" to all links in our page so JavaScript can get those links and apply jQuery QTips to them.
    The point of this Cookie is that we have a TOGGLE link on top of every page that the user can click to ENABLE/DISABLE this feature.
    So every time the user clicks on this link, JavaScript toggles the Cookie value from "1" to "0" or from "0" to "1".
    With the Cookie at "0" the user will see the normal browser tooltip.
    This TOGGLE link is only presented to AUTHENTICATED users, so anonymous users will always see the jQuery QTips tooltips (as they have the Cookie set to "1").

We have 3 more Cookies (very similar to the "has_qtip" Cookie; they are TOGGLES for accessibility/usability purposes) but you get the idea.
This was all working well in Boost, as Boost doesn't care about Cookies and all cached files were created successfully.

So our questions are:

1) Is there a way of configuring microcaching so pages are cached and served even if there are Cookies?
So for example, in the case of the "has_qtip" Cookie, the pages should be cached and served with the special class "qtip" on all links. We could get rid of all of our custom Cookies except the "saw_welcome" one, but we think this would not solve anything, as we can see other Drupal Cookies (and Google Analytics ones).

2) We are trying to get it to work like Boost, so is there a way to tell the microcache which pages we want cached and which we don't, using something like wildcards inside your nginx config?

We are trying a lot of things by reading the "HttpFastcgiModule" help pages (http://wiki.nginx.org/HttpFcgiModule) but this is all brand new to us.

Thanks in advance, you have been very helpful already.

It shouldn't be a problem

perusio's picture

unless you have a session cookie. It's the only cookie that bypasses the cache. Here's a little example on a D7 site:

curl -I -H 'Cookie: sam_welcome = 1' http://d7
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 06 Jun 2012 09:00:46 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Keep-Alive: timeout=10
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Last-Modified: Wed, 06 Jun 2012 09:00:42 +0000
Cache-Control: no-cache
ETag: "1338973242"
X-Protected-Asset: Y
Content-Language: en
X-Micro-Cache: HIT

and in the debug log the FCGI request params show the cookie untouched:
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "SCRIPT_FILENAME: /var/www/sites/d7/index.php"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_USER_AGENT: curl/7.25.0 (x86_64-pc-linux-gnu) libcurl/7.25.0 OpenSSL/1.0.1c zlib/1.2.7 libidn/1.24 libssh2/1.4.0 librtmp/2.3"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_HOST: d7"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_ACCEPT: */*"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_COOKIE: sam_welcome = 1"
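
The rule described here (only a session cookie bypasses the cache) can be sketched in Python as follows. This is an illustration under an assumption, not the actual nginx config: it assumes the session is recognised by cookie name, using Drupal 7's default naming of `SESS` (or `SSESS` over HTTPS) followed by a hash. The `bypasses_cache` helper and the sample cookie values are hypothetical.

```python
import re

# Assumption: a Drupal session cookie is recognised by its name, which starts
# with SESS (or SSESS over HTTPS) followed by a hash.
SESSION_COOKIE = re.compile(r"(?:^|;\s*)S?SESS[0-9a-zA-Z]+=")


def bypasses_cache(cookie_header: str) -> bool:
    """Return True when the Cookie header carries a session cookie."""
    return bool(SESSION_COOKIE.search(cookie_header))


# Custom cookies only: the request can still be served from the cache.
print(bypasses_cache("saw_welcome=1; has_qtip=1"))

# A logged-in session cookie is present: the cache is bypassed.
print(bypasses_cache("saw_welcome=1; SESSabc123=xyz"))
```

Under that assumption, custom cookies such as "saw_welcome" or "has_qtip" would not by themselves keep a page out of the cache.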

Try the

perusio's picture

microcache_fcgi_auth.conf instead. It handles the Cookie and Set-Cookie headers differently. They're not analysed at all, they're passed untouched to the backend.

Hi Perusio... We get HITs

EndEd's picture

Hi Perusio... We get HITs when we use microcache_fcgi_auth.conf, but ONLY if we modify this line:

fastcgi_ignore_headers Cache-Control Expires;

to

fastcgi_ignore_headers Cache-Control Expires Set-Cookie;

I'm beginning to understand the logic of all this. This works great with a real microcaching approach (5s to 15s), because everything is cached for so short a time that it barely matters except for scaling. In our case, as we are trying to mimic Boost and we have an e-commerce site, we were thinking more of an approach where fastcgi_cache_valid is set to 12h and cache warmer does its job with whatever frequency we find useful. Because we have it configured like that (12h), the admin pages keep getting HITs for admin roles, and that's not good. So, as with Boost, we need a way to tell nginx to cache only certain pages, and possibly only for anonymous users.

1) Our original Boost cacheability settings -> Cache specific pages -> Only the listed pages ->

home*
catalogo*
contacto*
producto/*
ayuda/*
boletines-de-noticias-de-gmidos
boletin-de-noticias/*

as Boost supports wildcards.

2) The second part is trying to get it to work with an anonymous config and making exceptions (no_cache) depending on whether the cookie is set or not (we are still trying to figure out how to do this).

As we said before, we need to mimic how it worked when Boost was installed.

Any insight into any of this will be much appreciated.

After learning a lot of what

EndEd's picture

After learning a lot of what your config codes do, we have managed to microcache our commerce site for Anonymous users.

Now we want to limit the cache to only specific pages as we had in Boost (so pages like "/admin" or the Shopping Cart, CheckOut pages, Profile pages, etc... will remain uncached).

We have the following code in "microcache_fcgi.conf":

set $no_cache "0";
if ($uri !~* "^/(catalogo|contacto|producto/|ayuda)") {
  set $no_cache "1";
}
fastcgi_no_cache $no_cache;
fastcgi_cache_bypass $no_cache;

The only problem we have is that we can't seem to cache the HOME page.
We have tried a lot of possible combinations for calling this page:
""
"/"
"/home"
"$document_root"
"^/"
"^/$"
""
...etc...

We have also tried with the following conditional without luck:

if ($uri = "ALL-COMBINATIONS-WE-CAN-IMAGINE-FOR-CALLING-THE-FRONT-PAGE") {
  set $no_cache "0";
}

We have also tried using $request_uri instead of $uri, with the same result.

In Drupal (/admin/config/system/site-information) we have configured the front page to be "/home".

So is there an easy way of checking the front page $uri?
Thanks in advance.

No need to use an if

perusio's picture

you can add a map directive for that in map_cache.conf:

map $uri $no_cache_uri {
    default 1;
    / 0;
    ~^/(?:catalogo|contacto|producto|ayuda) 0;
}

And then add:
fastcgi_no_cache $no_cache $no_cache_uri;
fastcgi_cache_bypass $no_cache $no_cache_uri;

Try it.
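
The lookup that map performs can be sketched in Python: an exact match on "/" is checked first, then the regular expression, and anything else falls back to the default. This is only an illustration of the matching logic; the `no_cache_uri` function name and the sample URIs are hypothetical.

```python
import re

# Same pattern as the regex entry in the map (case-sensitive, anchored at the start).
CACHEABLE_PREFIXES = re.compile(r"^/(?:catalogo|contacto|producto|ayuda)")


def no_cache_uri(uri: str) -> int:
    """Return 0 when the URI may be cached, 1 when it must skip the cache."""
    if uri == "/":                      # exact entry: the front page
        return 0
    if CACHEABLE_PREFIXES.search(uri):  # regex entry: whitelisted sections
        return 0
    return 1                            # default: do not cache


for uri in ["/", "/catalogo", "/producto/la-piramide-prohibida", "/admin", "/user/1"]:
    print(uri, no_cache_uri(uri))
```

The front page gets its own exact entry because the prefix pattern requires a section name after the slash and so never matches "/" alone.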

Thank you again perusio and

EndEd's picture

Thank you again perusio and sorry for the late response. We are using "map" now to define all of our conditions inside "map_cache.conf" and everything is working well.

Well, of course we are having more trouble :) This time it's form submits and Drupal messages.

A couple of days ago we were having trouble logging in to the site with the login block (it's on every page of our site). After login, the page was refreshed but you were still anonymous until you went to an un-cached page. So we ended up redirecting this form to the user's profile page (which is not cached).

All good until we tried to use other forms on our site, like the contact form, the "subscribe to our newsletter" block (simplenews), the "send this product to a friend" form (forward), etc... After submitting any of these forms, a message is shown (for example "Your message was sent successfully" for the contact form).

The problem is that:
- If you submit the form on an already cached page, no message is visible until you reach an un-cached page (where it is then displayed).
- If you submit the form on a non-cached page, the message is displayed, but the page is cached, so other anonymous users who go to the same page will see that message.

We understand it's a POST problem? We have a map like the following:

map $request_method $no_cache_method {
    default 0;
    POST 1;
}

and then:
fastcgi_no_cache $no_cache $no_cache_uri $no_cache_method;
fastcgi_cache_bypass $no_cache $no_cache_uri $no_cache_method;

but this is not working at all.

Do we have to alter all forms to add a "?nocache=1" query to the redirection and use the $arg_nocache variable?

fastcgi_no_cache $no_cache $no_cache_uri $arg_nocache;
fastcgi_cache_bypass $no_cache $no_cache_uri $arg_nocache;

Will try tonight some more tests but we were wondering if there's another way. What we need is to bypass saving/serving a cache if a form was submitted.
Thanks again :)

Hmmm

perusio's picture

By default the cache methods are only HEAD and GET.

If a user submits a form, then it's a POST and it's not cached. The problem is that the message appears after the POST, hence it's a regular GET request to get that page.

We need a way to keep state. My suggestion is to create a cookie with a short life time that is set by the form submit handler and that pierces the cache.

Check this module: http://drupal.org/project/cookie_cache_bypass_adv

I think it solves your problem.

Set

perusio's picture

fastcgi_no_cache $no_cache $no_cache_uri $cookie_NO_CACHE;
fastcgi_cache_bypass $no_cache $no_cache_uri $cookie_NO_CACHE;

for that module.
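
How nginx evaluates those two directives can be sketched in Python: the cache is skipped as soon as any one of the listed values is non-empty and not equal to "0". The `skips_cache` helper is hypothetical, and the cookie value "Y" is an assumed placeholder, since the thread does not say what value the module stores in NO_CACHE.

```python
def skips_cache(*values: str) -> bool:
    """True if at least one value is non-empty and different from "0"."""
    return any(value not in ("", "0") for value in values)


# Arguments stand for: $no_cache, $no_cache_uri, $cookie_NO_CACHE

# Anonymous visitor on a cacheable page, no NO_CACHE cookie set.
print(skips_cache("0", "0", ""))

# Same page right after a form submit; "Y" is a placeholder cookie value.
print(skips_cache("0", "0", "Y"))

# A non-whitelisted URI skips the cache regardless of the cookie.
print(skips_cache("0", "1", ""))
```

An unset cookie yields an empty variable, which is why adding $cookie_NO_CACHE changes nothing for visitors who have not submitted a form.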

Ok, you are officially our

EndEd's picture

Ok, you are officially our hero now :)

The code from cookie_cache_bypass_adv module is simple enough to grab it and add it to one of our custom modules (one that manages all cache related things).

The NO_CACHE Cookie is set after form submits and $cookie_NO_CACHE is working as expected. We even changed our login block back so it no longer redirects to the user profile page, and it worked :)

Just a quickie here: what do you think is the best expire time for this Cookie? The default is 300 seconds, which we think is too much, so we made it 10 seconds.

Also you said that by default only GET/HEAD are cached. Do you mean that our map:

map $request_method $no_cache_method {
    default 0;
    POST 1;
}

is not needed?
We initially added it because the "cache_warmer" PARALLEL option is using POST and we thought that this method was the trouble maker.

Thanks again.

Not needed indeed

perusio's picture

because the default setting of the fastcgi cache is to cache only for GET and HEAD requests.

You should choose the value that suits you better. 10 seconds seems reasonable. That way the user gets the message as it should and quickly returns to the cached setup.

Thanks :)

EndEd's picture

Thanks :)

Hi again. We are still having

EndEd's picture

Hi again.

We are still having problems trying to crawl URIs like the following with the PARALLEL crawler:
catalogo?brands[0]=1500&sort_by=title&sort_order=ASC
Those URIs get crawled with only the first parameter:
catalogo?brands[0]=1500

This does not happen in the SINGLE crawler as stated before, so we were using the SINGLE crawler for these types of URIs and the PARALLEL one for the normal ones.
So far so good; we don't mind using the 2 crawlers.

The problem is that cache files created by the SINGLE crawler are empty. Well, not really empty, but they contain only the headers. If you go to that crawled page, you see a blank page.

The only thing we came up with is modifying the cURL options array and setting the CURLOPT_NOBODY option to FALSE:
CURLOPT_NOBODY => FALSE
After this cache files are full with the headers AND the html, and if you go to the crawled page you see it as it should be.

We thought that all was good, but yesterday we started creating log files.
Our log file (adding > path/log_file.log to the Drush command) is getting filled with HTML from all crawled pages, so it ends up being 90Mb.
Inside the log, if we scroll down to the end, we can see the JSON responses returned by cache_warmer_execute.

Setting again CURLOPT_NOBODY => TRUE makes the log with only the JSON responses, but makes our cache files empty.

Any ideas what could be the cause for the response HTML getting inside our logs?
Thanks in advance

Yes

perusio's picture

Because you've set CURLOPT_NOBODY => FALSE now it returns the body. Try setting the cURL options:

CURLOPT_NOBODY => TRUE
CURLOPT_HTTPGET => TRUE

I suspect those pages require a GET to be generated.

Sorry that didn't do

EndEd's picture

Sorry that didn't do it.
Cache files are ok but the log file is still getting full of HTML.

Seems that returning the body automatically fills the Drush log with those bodies.

Will keep trying, thanks again

Ok we ended up resolving this

EndEd's picture

Ok we ended up resolving this by using:

CURLOPT_NOBODY => TRUE,
CURLOPT_HTTPGET => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,

Now both the cache files and log file are ok.
Does that make any sense to you?

It does

perusio's picture

What I'm curious about is why the URIs with several & are not handled properly by the lua client. Could be some lua socket quirk though. That's code that is getting long in the tooth. I would like to move it to the Nginx Lua cosocket API as soon as possible.

It won't be able to do massive parallelization (thousands of requests in parallel) with lua socket.

This is work in progress.

HTTPRL

mikeytown2's picture

Might want to work on HTTPRL integration as it can handle massive parallelization from my tests. I currently limit it to 128 global connections and 8 domain connections but these values can be changed. http://drupal.org/node/1426856

perusio's picture

I'll have more time to deal with this after devdays. It will be cool to compare the resource usage and speed of httprl and Lua cosocket. I don't think PHP can compare with Lua: no JIT, a much larger language, and I don't think it goes beyond select() at the I/O event notification layer. My main interest in having httprl is that it is an alternative to following the Lua route.

OTOH I think that httprl will probably beat cURL multi easily in terms of performance.

So the move to lua cosocket

EndEd's picture

So the move to Lua cosocket will permit massive parallelization, unlike the current luasocket method? Is this correct?

With that move, do you think the problem with & URIs in the hub pages could be resolved?

Any Views page with exposed filters is affected right now, so we have to use the single cURL way, which is OK but not great, as you can imagine :)

I don't think

perusio's picture

it can handle thousands of simultaneous connections. The weak link is luasocket. With cosocket, yes, because it uses the event loop made available by the Nginx API.

Hi again, more tests trying

EndEd's picture

Hi again, more tests trying to crawl URIs with several & in the parallel crawler:

In cache_warmer_client.lua we can see that ngx.var.arg_u comes already with only the first parameter (removing the first & and everything after that).
So as you said, maybe it's a problem of ngx.location.capture_multi().

We ended up modifying cache_warmer_requests.lua, hardcoding paths following guidance from the HttpLuaModule help pages:

ngx.req.read_body()
local post_data = ngx.req.get_post_args()
local base_uri = post_data['base_uri']
post_data['base_uri'] = nil
local requests = {}

-- HARDCODED PATHS
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC", { method = ngx.HTTP_HEAD }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1393", { method = ngx.HTTP_HEAD, args = "sort_by=title&sort_order=ASC" }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1412", { method = ngx.HTTP_HEAD, args = { sort_by = "title", sort_order = "ASC" } }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo", { method = ngx.HTTP_HEAD, args = "brands[0]=1406&sort_by=title&sort_order=ASC" }})

local responses = { ngx.location.capture_multi(requests) }
for _, r in pairs(responses) do
  ngx.say(r.status) -- get the status only (HEAD)
end

All paths except the last path created a cache file with only the first parameter (eg. "/catalogo?brands[0]=1425").
The last path created a cache file without parameters at all ("/catalogo").

We also tried using ngx.HTTP_GET as method with the same luck.

The same thing happens using ngx.location.capture() as follows:

ngx.req.read_body()
local post_data = ngx.req.get_post_args()
local base_uri = post_data['base_uri']
post_data['base_uri'] = nil
local requests = {}

local responses = { ngx.location.capture("/x-parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC", { method = ngx.HTTP_HEAD }) }
for _, r in pairs(responses) do
  ngx.say(r.status) -- get the status only (HEAD)
end

So both capture and capture_multi have the same problem?
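
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the target URL is placed unescaped inside another query string (`?u=...`), so any query-string parser ends the value of `u` at the first `&`. The Python sketch below shows that effect with the standard library's `parse_qs`, which stands in for how `arg_u` would be extracted; the URL is one of the hardcoded paths above.

```python
from urllib.parse import parse_qs, quote

base_uri = "http://es.gmidos.nx"
target = base_uri + "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC"

# Unescaped: the inner '&' characters are read as separators of the OUTER
# query string, so u keeps only the first parameter.
raw_query = "u=" + target
print(parse_qs(raw_query)["u"][0])

# Escaped: the whole target URL travels as a single value and is decoded
# back to the full URL on the other side.
escaped_query = "u=" + quote(target, safe="")
print(parse_qs(escaped_query)["u"][0])
```

If this is indeed the cause, the Lua side would presumably need `ngx.escape_uri` when building the `u=` argument and `ngx.unescape_uri` when reading it; that is untested here.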

Hmm

perusio's picture

have you tried the last URI with a GET request instead?

This needs to be further investigated.

Yes, IIRC we tried all 4 URIs

EndEd's picture

Yes, IIRC we tried all 4 URIs with HEAD and GET with the same results. Will try again this weekend, though.