Hi, we are using "cache_warmer" module to make HTTP requests to some (a lot of) pages
so BOOST cache files are created and our cache is always fresh and warm.
We use only the HUB PAGES file.
We are having problems trying to crawl pages in PARALLEL mode, but not in SINGLE mode.
Problem_1
In our site, we redirect all anonymous users to a Welcome page if they don't have a "saw_welcome" Cookie (which is set after visiting that page).
Problem_2
We have some paths we want to crawl that use parameters in the URL like the following examples:
catalog?brands[]=1545&sort_by=title&sort_order=ASC
catalog?categories[]=1260&sort_by=title&sort_order=ASC
product/my-first-product?product_id=426
product/my-second-product?product_id=426
We managed to do all this in the SINGLE function by adding a new cURL option "CURLOPT_HTTPHEADER":
<?php
function cache_warmer_crawl_single($base_uri = '', $uris = array(), $hub_pages = '', $timeout) {
  // ...
  // cURL request basic options.
  curl_setopt_array($ch, array(
    CURLOPT_NOBODY => TRUE,
    CURLOPT_TIMEOUT => $timeout,
    // New cURL option.
    CURLOPT_HTTPHEADER => array(
      'Cookie: DRUPAL_UID=0; saw_welcome=1',
      'Cache-Control: private, no-cache, no-store, must-revalidate, max-age=0',
      'Pragma: no-cache',
    ),
  ));
  // ...
}
?>
Adding the previous code makes every single request go out with my custom Cookie (so Problem_1 is resolved).
Also Problem_2 is not a problem in this SINGLE function, as I can see that the BOOST files are created with no problem for
those paths including parameters.
So trying to make the same thing in PARALLEL mode I modified the function in the same way as the single one:
<?php
function cache_warmer_crawl_multiple($base_uri = '', $uris = array(), $hub_pages = '', $timeout, $parallel, $crawler_uri) {
  // ...
  // cURL request basic options.
  curl_setopt_array($ch, array(
    CURLOPT_POST => TRUE, // POST request.
    CURLOPT_TIMEOUT => $step_timeout,
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_URL => $crawler_uri,
    // New cURL option.
    CURLOPT_HTTPHEADER => array(
      'Cookie: DRUPAL_UID=0; saw_welcome=1',
      'Cache-Control: private, no-cache, no-store, must-revalidate, max-age=0',
      'Pragma: no-cache',
    ),
  ));
  // ...
}
?>
This doesn't work at all.
No cache files are created until I comment out my code for redirecting users who don't have the "saw_welcome" Cookie.
If I comment out my redirection code, all files are created but we end up with our second problem.
All paths with several parameters are created with only the first one, even with the redirection code commented out. So for example:
catalogo_brands[0]=1247&sort_by=title&sort_order=ASC.html
creates as:
catalogo_brands[0]=1247.html
I see these portions of code inside the PARALLEL function...
<?php
// ...
// Fill in the POST data array.
for ($j = 0; $j < $parallel; $j++) {
  $post_data["data$j"] = $all_uris[$j + ($i * $parallel)];
}
// ...
// Make the POST request.
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data, '', '&'));
// ...
?>
...but can't see how I can change it.
Does the PARALLEL mode accept the CURLOPT_HTTPHEADER option?
Does the PARALLEL mode accept URLs with parameters?
Thanks in advance :)
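The POST-building loop quoted above can be mimicked outside PHP to see what one batch looks like on the wire. This is a minimal Python sketch, not the module's code: `build_batches`, the sample URIs and `http://example.com` are illustrative names, and it assumes the last batch may simply be shorter than `parallel`.

```python
from urllib.parse import urlencode

def build_batches(all_uris, parallel, base_uri):
    """Yield one form-encoded POST body per batch of at most `parallel` URIs."""
    for start in range(0, len(all_uris), parallel):
        batch = all_uris[start:start + parallel]
        # Same key scheme as the PHP loop: data0, data1, ...
        post_data = {f"data{j}": uri for j, uri in enumerate(batch)}
        post_data["base_uri"] = base_uri
        # Form-encode the batch, as http_build_query() does in the PHP code.
        yield urlencode(post_data)

uris = [
    "catalog?brands[]=1545&sort_by=title&sort_order=ASC",
    "product/my-first-product?product_id=426",
    "catalog",
]
bodies = list(build_batches(uris, parallel=2, base_uri="http://example.com"))
for body in bodies:
    print(body)
```

Each printed line is one POST body. The `?`, `[`, `]`, `=` and `&` inside every URI are percent-encoded, so they cannot be confused with the separators of the POST body itself.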
Comments
That doesn't work because
the parallel mode works by POSTing the URIs to be crawled to the Lua location. So you need to POST the options and extract them at the Lua level. This needs to be fixed, so that you can set header options. Can you try replacing the cache_warmer_requests.lua script furnished with cache_warmer by this: https://gist.github.com/2870811 and report back.
Thanks, it works for the
Thanks, it works for the cookie part of the problem. Now the Boost caches are created, but what needs to be created like this:
catalogo_brands[0]=1247&sort_by=title&sort_order=ASC.html
is created like:
catalogo_brands[0]=1247.html
so no parameters :/
That's probably
because http_build_query percent-encodes the URIs. So the URI to be hit has a lot of % signs. And I guess Boost doesn't handle URL-encoded URLs. Could you check the server logs?
If that's the case, can you replace cache_warmer_requests.lua by this: https://gist.github.com/2871262 and report back?
We had no luck with this
We had no luck with this one. It behaves the same with that code change.
This is the crawler vhost access log:
127.0.0.1 - - [05/Jun/2012:00:11:29 +0100] "POST /cache-warmer HTTP/1.1" 200 76 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:11:43 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:09 +0100] "POST /cache-warmer HTTP/1.1" 200 76 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:23 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:51 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:13:51 +0100] "POST /cache-warmer HTTP/1.1" 200 51 "-" "-"
127.0.0.1 - - [05/Jun/2012:00:14:45 +0100] "POST /cache-warmer HTTP/1.1" 200 39 "-" "-"
nothing to post in the crawler vhost error log.
Also the --hub-pages file is
Also the --hub-pages file is more or less like this...
catalogo
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=5
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=6
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=7
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=8
catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=9
catalogo?categories[0]=1311&sort_by=title&sort_order=ASC
catalogo?categories[0]=1447&sort_by=title&sort_order=ASC
catalogo?categories[0]=1423&sort_by=title&sort_order=ASC
catalogo?categories[0]=1268&sort_by=title&sort_order=ASC
catalogo?categories[0]=1336&sort_by=title&sort_order=ASC
catalogo?categories[0]=1400&sort_by=title&sort_order=ASC
catalogo?categories[0]=1249&sort_by=title&sort_order=ASC
catalogo?categories[0]=1373&sort_by=title&sort_order=ASC
catalogo?categories[0]=1258&sort_by=title&sort_order=ASC
catalogo?categories[0]=1284&sort_by=title&sort_order=ASC
catalogo?brands[0]=1425&sort_by=title&sort_order=ASC
catalogo?brands[0]=1393&sort_by=title&sort_order=ASC
catalogo?brands[0]=1406&sort_by=title&sort_order=ASC
catalogo?brands[0]=1412&sort_by=title&sort_order=ASC
catalogo?brands[0]=1292&sort_by=title&sort_order=ASC
catalogo?brands[0]=1504&sort_by=title&sort_order=ASC
catalogo?brands[0]=1386&sort_by=title&sort_order=ASC
catalogo?brands[0]=1473&sort_by=title&sort_order=ASC
catalogo?brands[0]=1460&sort_by=title&sort_order=ASC
catalogo?brands[0]=1247&sort_by=title&sort_order=ASC
producto/la-piramide-prohibida?product_id=11860
producto/pacto-de-amor-pasion-y-aventura?product_id=11861
producto/lr6aa-alcalinas-panasonic-xtreme-power?product_id=11862
producto/lr03aaa-panasonic-powermax3?product_id=11863
producto/lrv08-alcalina-panasonic-powercells?product_id=11864
boletines-de-noticias-de-gmidos
boletin-de-noticias/boletin-de-noticias-de-febrero-2012
ayuda/ayuda-general
ayuda/faq
ayuda/terminos-de-uso
It's strange because we just
It's strange, because we just saw the ngx.unescape_uri command in the nginx wiki and on paper it seems to do exactly what we need :(
Can you add the following code to
In the main loop of cache_warmer_requests.lua do:
-- Loop over the post_data table (contains the URIs to be hit).
for _, u in pairs(post_data) do
  -- All requests are HEAD requests.
  ngx.log(ngx.ERR, 'uri: ' .. u)
  ngx.log(ngx.ERR, 'uri_u: ' .. ngx.unescape_uri(u))
  table.insert(requests, { build_req_uri(base_uri, ngx.unescape_uri(u)), { method = ngx.HTTP_HEAD }})
end
And post the error log so that we can see what it's doing?
First of all, thanks for
First of all, thanks for continuing to help us with this...
Here is a portion of the crawler vhost error log:
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=1, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=2, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=3, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?price[min]=&price[max]=&populate=&sort_by=title&sort_order=ASC&items_per_page=20&page=4, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?categories[0]=1400&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?categories[0]=1400&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:50: uri: catalogo?categories[0]=1249&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
2012/06/05 15:55:37 [error] 1536#0: *33 [lua] cache_warmer_requests.lua:51: uri_u: catalogo?categories[0]=1249&sort_by=title&sort_order=ASC, client: 127.0.0.1, server: crawler.nx, request: "POST /cache-warmer HTTP/1.1", host: "crawler.nx:8890"
It seems that the URIs are correctly constructed
so unescaping them is spurious. The problem is downstream from here, either in the multi-request handling or on the Boost side.
Can you post the logs of the server being hit? I'd like to see what URI is hit.
Thanks,
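The log output above shows `uri` and `uri_u` as identical strings. A small Python sketch of the same round trip may help explain why: the form body is encoded on the PHP side and decoded once when the POST arguments are read, so a second unescape has nothing left to do. This only mimics generic form encoding; it is not the module's Lua code, and the comparison of `parse_qs` with `ngx.req.get_post_args()` is an analogy.

```python
from urllib.parse import parse_qs, unquote, urlencode

uri = "catalogo?categories[0]=1400&sort_by=title&sort_order=ASC"

# Client side: form-encode the POST body (the job http_build_query does in PHP).
body = urlencode({"data0": uri})

# Server side: decoding the form body already restores the original string,
# which is the role ngx.req.get_post_args() plays in the Lua script.
decoded = parse_qs(body)["data0"][0]

print(body)
print(decoded)
# A second unescape changes nothing because no '%' is left to decode.
print(unquote(decoded) == decoded)
```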
we put
We put this:
drush_log(http_build_query($post_data, '', '&'), 'error');
in the function cache_warmer_crawl_multiple inside the main loop and we get this kind of log on the console:
data0=&data1=catalogo&data2=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20&data3=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D1&data4=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D2&data5=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D3&data6=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D4&data7=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D5&data8=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D6&data9=catalogo%3Fprice%5Bmin%5D%3D%26price%5Bmax%5D%3D%26populate%3D%26sort_by%3Dtitle%26sort_order%3DASC%26items_per_page%3D20%26page%3D7&base_uri=http%3A%2F%2Fes.gmidos.nx
finally we make one step
Finally we made one step forward. With this code inside cache_warmer_requests.lua...
-- Loop over the post_data table (contains the URIs to be hit).
for _, u in pairs(post_data) do
  -- All requests are HEAD requests.
  u = ngx.re.gsub(u, '&', '%26', 'i')
  table.insert(requests, { build_req_uri(base_uri, u), { method = ngx.HTTP_HEAD }})
end
...the full URI gets through, but the URL that needs to be created like this:
catalogo?brands[0]=1247&sort_by=title&sort_order=ASC
is created like this:
catalogo_brands[0]=1247%26sort_by=title%26sort_order=ASC.html
The problem seems to be the '&'...
It seems that
It seems that the...
catalogo_brands[0]=1247%26sort_by=title%26sort_order=ASC.html
...generated pages were badly generated. All are Views pages, and after Boost caches them all file sizes are the same and the generated HTML has problems (empty Views results), so that piece of code doesn't work.
I looked at my Boost stats and
I looked at my Boost stats and there are 12000+ pages O___O Is that normal?
Have any of you tried
microcaching as opposed to Boost? It simplifies your setup and makes it faster. One less thing on the drupal side. Move the cache to the server layer.
I suppose I should give that
I suppose I should give that a try. I'm a total noob in Nginx but I'm using your drupal-nginx config, which came with microcaching by default. I will report back then.
Ok, we made our first try in
Ok, we made our first try in setting up the microcaching system... with no luck :)
We have Boost uninstalled and we are using your nginx configuration with microcaching activated for anon users in drupal 7.
The only change we made in the files (only for debugging purposes) was to change these 2 lines:
fastcgi_cache_valid 200 301 60m;
fastcgi_cache_valid 302 60m;
We want to use this in conjunction with "cache_warmer": not microcaching per se, but running cache_warmer frequently enough with a big TTL on the Nginx cache. Before testing cache_warmer, we tried to browse the site directly.
When browsing normally the site as anonymous and viewing the headers in Firebug we get "X-Micro-Cache:MISS" in the Response Header for all pages.
Also, if we go to the directory where cache files should be saved (/var/cache/nginx/microcache), there are no files inside.
Truth is that we have some custom Cookies. Maybe this is too much but we will try to explain some of them:
saw_welcome:
This Cookie is set to "0" for all users (anonymous and authenticated) the first time he visits any of our pages.
In every page load we look at this Cookie and if the user is ANONYMOUS and the Cookie is "0" (or is not set) the user is redirected to a welcome page.
If the user submits the form in the welcome page, this Cookie is set to "1" and the user is redirected to the page he was trying to see.
All subsequent page loads will see that the user has this Cookie set to "1" and will not redirect him again.
has_qtip:
We use the jQuery QTips plugin so we can use custom tooltips in our links.
This Cookie is set to "1" for all users (anonymous and authenticated) the first time he visits any of our pages.
In every page load we look at this Cookie and if the user is AUTHENTICATED and the Cookie is "1", we add a
special class "qtip" to all links in our page so JavaScript can get those links and apply jQuery QTips to them.
The point of this Cookie is that we have a TOGGLE link on top of every page that the user can click to ENABLE/DISABLE this feature.
So every time the user clicks on this link, JavaScript toggles the Cookie value from "1" to "0" or from "0" to "1".
With the Cookie at "0" the user will see the normal browser's tooltip.
This TOGGLE link is only presented to AUTHENTICATED users, so anonymous users will always see the jQuery QTips tooltips (as they have the Cookie set to "1").
We have 3 more Cookies (very similar to the "has_qtip" Cookie; they are TOGGLES for accessibility/usability purposes) but you get the idea.
This was all working good in Boost, as Boost doesn't care about Cookies and all cached files were created successfully.
So our questions are:
1) Is there a way of configuring microcaching so pages are cached and served even if there are Cookies?
So for example, in the case of the "has_qtip" Cookie, the pages should be cached and served with the special class "qtip" in all links. We could get rid of all of our custom Cookies except the "saw_welcome" one, but we think this would not solve anything, as we can see other Drupal Cookies (and Google Analytics ones).
2) We are trying to get it to work like Boost, so is there a way to tell the microcache which pages we want cached and which we don't, using something like wildcards inside your nginx config?
We are trying a lot of things by reading the "HttpFastcgiModule" help pages (http://wiki.nginx.org/HttpFcgiModule) but this is all brand new to us.
Thanks in advance, you have been very helpful already.
It shouldn't be a problem
unless you have a session cookie. It's the only cookie that bypasses the cache. Here's a little example in a D7 site:
curl -I -H 'Cookie: sam_welcome = 1' http://d7
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 06 Jun 2012 09:00:46 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Keep-Alive: timeout=10
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Last-Modified: Wed, 06 Jun 2012 09:00:42 +0000
Cache-Control: no-cache
ETag: "1338973242"
X-Protected-Asset: Y
Content-Language: en
X-Micro-Cache: HIT
and in the debug log the FCGI request params show the cookie untouched:
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "SCRIPT_FILENAME: /var/www/sites/d7/index.php"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_USER_AGENT: curl/7.25.0 (x86_64-pc-linux-gnu) libcurl/7.25.0 OpenSSL/1.0.1c zlib/1.2.7 libidn/1.24 libssh2/1.4.0 librtmp/2.3"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_HOST: d7"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_ACCEPT: */*"
2012/06/06 11:00:42 [debug] 21082#0: *2547 fastcgi param: "HTTP_COOKIE: sam_welcome = 1"
Try the
microcache_fcgi_auth.conf instead. It handles the Cookie and Set-Cookie headers differently. They're not analysed at all, they're passed untouched to the backend.
Hi Perusio... We get HITs
Hi Perusio... We get HITs when we use microcache_fcgi_auth.conf, but ONLY if we modify this line:
fastcgi_ignore_headers Cache-Control Expires;
to
fastcgi_ignore_headers Cache-Control Expires Set-Cookie;
I'm beginning to understand the logic of all this. This works great if we really use a microcaching (15s - 5s) approach, because it caches everything for so little time that it barely matters except for scale. In our case, as we are trying to mimic Boost and we have an ecommerce site, we were thinking more of an approach with fastcgi_cache_valid set to 12h and cache warmer doing its job with the frequency we find useful. Because we have it configured like that (12h), the admin pages keep getting HITs for admin roles, and that's not good. So, as with Boost, we need a way to tell nginx to cache only certain pages and possibly only for anonymous users.
1) Our original Boost cacheability settings -> Cache specific pages -> Only the listed pages ->
home*
catalogo*
contacto*
producto/*
ayuda/*
boletines-de-noticias-de-gmidos
boletin-de-noticias/*
as Boost supports wildcards.
2) The second part is trying to get it to work with an anonymous config, making exceptions (no_cache) depending on whether the cookie is set or not (we are still trying to figure out how to do this).
As we said before, we need to mimic how it worked when Boost was installed.
Any insight into any of this will be much appreciated.
After learning a lot of what
After learning a lot of what your config codes do, we have managed to microcache our commerce site for Anonymous users.
Now we want to limit the cache to only specific pages as we had in Boost (so pages like "/admin" or the Shopping Cart, CheckOut pages, Profile pages, etc... will remain uncached).
We have the following code in "microcache_fcgi.conf":
set $no_cache "0";
if ($uri !~* "^/(catalogo|contacto|producto/|ayuda)") {
  set $no_cache "1";
}
fastcgi_no_cache $no_cache;
fastcgi_cache_bypass $no_cache;
The only problem we have is that we can't seem to cache the HOME page.
We have tried a lot of possible combinations for calling this page:
""
"/"
"/home"
"$document_root"
"^/"
"^/$"
""
...etc...
We have also tried with the following conditional without luck:
if ($uri = "ALL-COMBINATIONS-WE-CAN-IMAGINE-FOR-CALLING-THE-FRONT-PAGE") {
  set $no_cache "0";
}
We have also tried using $request_uri instead of $uri with the same luck.
In Drupal (/admin/config/system/site-information) we have configured the front page to be "/home".
So is there an easy way of checking the front page $uri?
Thanks in advance.
No need to use an if
you can add a map directive for that in map_cache.conf:
map $uri $no_cache_uri {
  default 1;
  / 0;
  ~^/(?:catalogo|contacto|producto|ayuda) 0;
}
And then add:
fastcgi_no_cache $no_cache $no_cache_uri;
fastcgi_cache_bypass $no_cache $no_cache_uri;
Try it.
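To see which paths the suggested map would mark as cacheable, its lookup order can be imitated in a few lines of Python. This is only a sketch of the matching rules (an exact string is tried first, then the regular expressions in order, then the default); `no_cache_uri`, `EXACT`, `REGEXES` and the sample paths are illustrative names, not part of the nginx configuration.

```python
import re

# Hypothetical mirror of the map block: pattern on the left, result on the right.
EXACT = {"/": "0"}
REGEXES = [(re.compile(r"^/(?:catalogo|contacto|producto|ayuda)"), "0")]
DEFAULT = "1"

def no_cache_uri(uri):
    """Return "0" (cacheable) or "1" (skip the cache) for a request path."""
    if uri in EXACT:  # exact strings are checked before any regex
        return EXACT[uri]
    for pattern, value in REGEXES:
        if pattern.search(uri):
            return value
    return DEFAULT

for path in ["/", "/catalogo", "/producto/foo", "/admin", "/user/login"]:
    print(path, no_cache_uri(path))
```

The front page is covered by the exact `/` entry, which is the case the regular expression alone could not express.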
Thank you again perusio and
Thank you again perusio and sorry for the late response. We are using "map" now to make all of our conditions inside "map_cache.conf" and all is working good.
Well, of course we are having more troubles :) This time it's form submits and Drupal messages.
A couple of days ago we were having trouble logging in to the site with the login block (it's on every page of our site). After login, the page was refreshed but you were still anonymous until you went to an un-cached page. So we ended up redirecting this form to the user's profile page (which is not cached).
All good until we tried to use other forms on our site, like the contact form, the "subscribe to our newsletter" block (simplenews), the "send this product to a friend" form (forward), etc... After submitting any of these forms, a message is shown (for example "Your message was sent successfully" in the contact form).
The problem is that:
- If you submit the form in a already cached page, no message is visible until you reach an un-cached page (where it then displays).
- If you submit the form in a non-cached page, the message is displayed, but the page is cached so other anonymous that go to the same page will see that message.
We understand it's a POST problem? We have a map like the following:
map $request_method $no_cache_method {
  default 0;
  POST 1;
}
and then:
fastcgi_no_cache $no_cache $no_cache_uri $no_cache_method;
fastcgi_cache_bypass $no_cache $no_cache_uri $no_cache_method;
but this is not working at all.
Do we have to alter all forms to add a "?nocache=1" query to the redirection and use the $arg_nocache variable?
fastcgi_no_cache $no_cache $no_cache_uri $arg_nocache;
fastcgi_cache_bypass $no_cache $no_cache_uri $arg_nocache;
Will try tonight some more tests but we were wondering if there's another way. What we need is to bypass saving/serving a cache if a form was submitted.
Thanks again :)
Hmmm
By default the cache methods are only HEAD and GET.
If a user submits a form, then it's a POST and it's not cached. The problem is that the message appears after the POST, hence it's a regular GET request to get that page.
We need a way to keep state. My suggestion is to create a cookie with a short life time that is set by the form submit handler and that pierces the cache.
Check this module: http://drupal.org/project/cookie_cache_bypass_adv
I think it solves your problem.
Set
fastcgi_no_cache $no_cache $no_cache_uri $cookie_NO_CACHE;
fastcgi_cache_bypass $no_cache $no_cache_uri $cookie_NO_CACHE;
for that module.
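The way the two directives combine their parameters can be sketched in Python. As far as the nginx documentation describes it, the cache is skipped when at least one parameter is non-empty and not equal to "0"; the function name `skips_cache` and the cookie value "Y" are illustrative assumptions, since the thread does not say what value the module writes into the NO_CACHE cookie.

```python
def skips_cache(*values):
    """True when at least one value is non-empty and not "0"."""
    return any(v not in ("", "0") for v in values)

# Each call passes $no_cache, $no_cache_uri, $cookie_NO_CACHE
# for one hypothetical request.
cached_page = skips_cache("0", "0", "")        # cacheable page, no cookie
after_form_submit = skips_cache("0", "0", "Y")  # same page, cookie present
excluded_page = skips_cache("0", "1", "")      # page excluded by the URI map

print(cached_page, after_form_submit, excluded_page)
```

Because the cookie is short-lived, only the requests made right after a form submit carry it, so the status message reaches the user and later requests fall back to the cached copy.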
Ok, you are officially our
Ok, you are officially our hero now :)
The code from the cookie_cache_bypass_adv module is simple enough to grab it and add it to one of our custom modules (one that manages all cache related things).
The NO_CACHE Cookie is set after form submits and $cookie_NO_CACHE is working as expected. We even modified our login block again to not redirect to the user profile page and it worked :)
Just a quickie here, what do you think is the best Cookie Expire Time for this Cookie? By default it is 300 seconds, which we think is too much, so we made it 10 seconds.
Also you said that by default only GET/HEAD are cached. Do you mean that our map:
map $request_method $no_cache_method {
  default 0;
  POST 1;
}
is not needed?
We initially added it because the "cache_warmer" PARALLEL option is using POST and we thought that this method was the trouble maker.
Thanks again.
Not needed indeed
because the default setting of the fastcgi cache is to cache only for GET and HEAD requests.
You should choose the value that suits you better. 10 seconds seems reasonable. That way the user gets the message as it should and quickly returns to the cached setup.
Thanks :)
Thanks :)
Hi again. We still having
Hi again.
We are still having problems trying to crawl URIs like the following with the PARALLEL crawler:
catalogo?brands[0]=1500&sort_by=title&sort_order=ASC
Those URIs get crawled with only the first parameter:
catalogo?brands[0]=1500
This does not happen in the SINGLE crawler as stated before, so we were using the SINGLE for these types of URIs and the PARALLEL for the normal ones.
So far so good, we don't mind using the 2 crawlers.
The problem is that cache files created in the SINGLE crawler are empty. Well, not really empty, but only with the headers. If you go to that crawled page, you see a blank page.
The only thing we came up with is modifying the cURL options array and setting the CURLOPT_NOBODY option to FALSE:
CURLOPT_NOBODY => FALSE
After this, cache files are full with the headers AND the HTML, and if you go to the crawled page you see it as it should be.
We thought that all was good, but yesterday we started creating log files.
Our log file (adding > path/log_file.log to the Drush command) is getting filled with HTML from all crawled pages, so it ends up being 90Mb.
Inside the log, if we scroll down to the end, we can see the JSON responses returned by cache_warmer_execute.
Setting again CURLOPT_NOBODY => TRUE makes the log contain only the JSON responses, but makes our cache files empty.
Any ideas what could be the cause for the response HTML getting inside our logs?
Thanks in advance
Yes
Because you've set CURLOPT_NOBODY => FALSE, now it returns the body. Try setting the cURL options:
CURLOPT_NOBODY => TRUE
CURLOPT_HTTPGET => TRUE
I suspect those pages require a GET to be generated.
Sorry that didn't do
Sorry that didn't do it.
Cache files are ok but the log file is still getting full of HTML.
Seems that returning the body automatically fills the Drush log with those bodies.
Will keep trying, thanks again
Ok we ended up resolving this
Ok we ended up resolving this by using:
CURLOPT_NOBODY => TRUE,
CURLOPT_HTTPGET => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
Now both the cache files and log file are ok.
Does that make any sense to you?
It does
What I'm curious about is why the URIs with several & are not handled properly by the Lua client. Could be some Lua socket quirk though. That's code that is getting long in the tooth. I would like to move it to the Nginx Lua cosocket API as soon as possible.
It won't be able to do massive parallelization (thousands of requests in parallel) with lua socket.
This is work in progress.
HTTPRL
Might want to work on HTTPRL integration as it can handle massive parallelization from my tests. I currently limit it to 128 global connections and 8 domain connections but these values can be changed. http://drupal.org/node/1426856
That has been on my TODO list for some time as you know
I'll have more time to deal with it after devdays. It will be cool to compare the resource usage and speed of httprl and Lua cosocket. I don't think PHP can compare with Lua. No JIT, much larger language, I don't think it goes beyond select() at the I/O event notification layer. My main interest in having httprl is that it is an alternative to following the Lua route.
OTOH I think that httprl will probably beat cURL multi easily in terms of performance.
So the move to lua cosocket
So the move to lua cosocket will permit massive parallelization, unlike the current lua socket method? Is this correct?
With that move, do you think the & URIs problem in the hub pages could be resolved?
Any views page with exposed filters is suffering now, so we have to use the single cURL way, which is ok but not great as you could imagine :)
I don't think
it can handle thousands of simultaneous connections. The weak link is luasocket. With cosocket yes, because it uses the event loop made available by the Nginx API.
Hi again, more tests trying
Hi again, more tests trying to crawl URIs with several & in the parallel crawler:
In cache_warmer_client.lua we can see that ngx.var.arg_u already comes with only the first parameter (the first & and everything after it is removed).
So as you said, maybe it's a problem of ngx.location.capture_multi().
We ended up modifying cache_warmer_requests.lua, hardcoding paths following guidance from the HttpLuaModule help pages:
ngx.req.read_body()
local post_data = ngx.req.get_post_args()
local base_uri = post_data['base_uri']
post_data['base_uri'] = nil
local requests = {}
-- HARDCODED PATHS
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC", { method = ngx.HTTP_HEAD }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1393", { method = ngx.HTTP_HEAD, args = "sort_by=title&sort_order=ASC" }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1412", { method = ngx.HTTP_HEAD, args = { sort_by = "title", sort_order = "ASC" } }})
table.insert(requests, { "/parallel-reqs?u=" .. base_uri .. "/catalogo", { method = ngx.HTTP_HEAD, args = "brands[0]=1406&sort_by=title&sort_order=ASC" }})
local responses = { ngx.location.capture_multi(requests) }
for _, r in pairs(responses) do
ngx.say(r.status) -- get the status only (HEAD)
end
All paths except the last path created a cache file with only the first parameter (eg. "/catalogo?brands[0]=1425").
The last path created a cache file without parameters at all ("/catalogo").
We also tried using ngx.HTTP_GET as method with the same luck.
The same thing happens using ngx.location.capture() as follows:
ngx.req.read_body()
local post_data = ngx.req.get_post_args()
local base_uri = post_data['base_uri']
post_data['base_uri'] = nil
local requests = {}
local responses = { ngx.location.capture("/x-parallel-reqs?u=" .. base_uri .. "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC", { method = ngx.HTTP_HEAD }) }
for _, r in pairs(responses) do
ngx.say(r.status) -- get the status only (HEAD)
end
So both capture and capture_multi have the same problem?
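One possible reading of these results, offered as a hypothesis rather than a confirmed diagnosis: the target URI is placed unescaped inside the `u` argument of the subrequest, so its own `&` separators are parsed as separators of the outer query string. That would match `ngx.var.arg_u` ending at the first `&`, and it would also match the earlier `%26` experiment, where the full string got through but was never unescaped on the receiving side. The Python sketch below only mimics generic query-string parsing; it does not run the module's Lua code, and whether pairing `ngx.escape_uri` on the sending side with `ngx.unescape_uri` on the receiving side fixes the module is an untested assumption.

```python
from urllib.parse import parse_qs, quote, unquote

base_uri = "http://es.gmidos.nx"
target = base_uri + "/catalogo?brands[0]=1425&sort_by=title&sort_order=ASC"

# Unescaped: each '&' of the target ends the value of `u` in the outer query.
raw_args = parse_qs("u=" + target)
print(raw_args["u"][0])
print(sorted(raw_args))

# Escaped: the whole target travels as a single `u` argument ...
escaped_query = "u=" + quote(target, safe="")
raw_value = escaped_query.split("=", 1)[1]
print(raw_value)

# ... but the receiving side must unescape it once before requesting it.
print(unquote(raw_value))
```

In the unescaped case `sort_by` and `sort_order` become arguments of the subrequest itself instead of staying part of the URI to crawl.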
Hmm
have you tried the last URI with a GET request instead?
This needs to be further investigated.
Yes, IIRC we tried all 4 URIs
Yes, IIRC we tried all 4 URIs with HEAD and GET with the same results. Will try again this weekend, though.