aggregated css/js 404 prevention module

mikeytown2's picture

Problem: If an aggregated css/js file doesn't exist, the request falls through to Drupal, which bootstraps just to look for it; the usual result is a 404 for the file.
Solution: Keep a lookup table pairing each old file with its newer replacement, and return the new aggregated file instead of the 404.

Create 2 menu callbacks (hook_menu)
file_directory_path() . '/css/%'
file_directory_path() . '/js/%'
The lookup table gets populated in a page preprocess function (see http://api.drupal.org/api/function/template_preprocess_page/6): $variables['styles'] for css files, $variables['scripts'] for js files. Related css/js files are matched by path, which is taken from $_GET['q']. A timestamp is stored so the most recent versions are tried first.

DB Table:
Saving
filename - aggregated filename; value from $variables['scripts'] or $variables['styles']; foreach() to grab all.
path - $_GET['q'];
timestamp - time();
filesize

Lookup
Given the filename, get its path & timestamp. Find all files recorded for that path with a timestamp greater than the one just returned, sorted by timestamp. If a file doesn't exist, try the next one. If no file can be found, try the file that is closest in filesize. If that still fails, request the path with a special query string that temporarily sets $GLOBALS['conf']['preprocess_css'] = TRUE; and $GLOBALS['conf']['preprocess_js'] = TRUE. Once that request is done, repeat the search. If there is still no match, finally give up with a 404.
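The lookup steps above could be sketched roughly like this. It is only an illustration in JavaScript (the real thing would be PHP inside Drupal); findReplacement, extensionOf, the row shape and the exists callback are all made-up names, and the final "rebuild the page and search again" step is left out.

```javascript
// Hypothetical sketch of the lookup described above; not module code.
// rows: [{ filename, path, timestamp, filesize }], one per recorded aggregate.
// exists: callback reporting whether a given file is still on disk.
function extensionOf(name) {
  return name.slice(name.lastIndexOf('.'));
}

function findReplacement(rows, requested, exists) {
  var old = rows.filter(function (r) { return r.filename === requested; })[0];
  if (!old) {
    return null; // Unknown file: nothing to go on, the caller ends up with a 404.
  }
  // Only consider other files of the same type (css vs js).
  var candidates = rows.filter(function (r) {
    return r.filename !== requested &&
      extensionOf(r.filename) === extensionOf(requested);
  });
  // Step 1: newer files recorded for the same path, most recent first.
  var newer = candidates
    .filter(function (r) { return r.path === old.path && r.timestamp > old.timestamp; })
    .sort(function (a, b) { return b.timestamp - a.timestamp; });
  for (var i = 0; i < newer.length; i++) {
    if (exists(newer[i].filename)) {
      return newer[i].filename;
    }
  }
  // Step 2: fall back to the existing file whose size is closest.
  var best = null;
  candidates.forEach(function (r) {
    if (!exists(r.filename)) {
      return;
    }
    var diff = Math.abs(r.filesize - old.filesize);
    if (best === null || diff < best.diff) {
      best = { filename: r.filename, diff: diff };
    }
  });
  return best ? best.filename : null;
}
```

A null return is where the "call the path with the special query string, then search again" step would kick in.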

This should prevent 404's in 99% of these cases. Thoughts?

Comments

Great idea. I see a lot of

mrfelton's picture

Great idea. I see a lot of 404's for aggregated css files in our webserver logs. I'm not sure about the specifics of this implementation or whether there would be a better way to accomplish the same thing, but the basis behind the idea is definitely sound and I'd be happy to help out with development and testing.

--
Tom
www.systemseed.com - drupal development. drupal training. drupal support.

I'd be interested in this.

christefano's picture

I'd be interested in this. The Onion reduced their "server time" (whatever that means) by optimizing 404s.


Exaltation of Larks
Founder, CEO
http://www.larks.la  
Droplabs
Founder, Lead Burrito Analyst
http://droplabs.net  
Greater Los Angeles Drupal
Organizer, Drupal Adventure Guide
http://drupal.la  

More Thoughts

mikeytown2's picture

Bypass core aggregation for more awesomeness. Each css or js file is md5-ed; this lets Drupal know which files have changed and thus which aggregated files need to be updated. Filenames use simple versioning via a build counter: 0, 1, 2, 3, etc. The name of an aggregated file is an md5 of the filenames contained inside. Example:

filename                  md5                               bundle
themes\garland\style.css  9f352cd54fe4de666830cea532cfd4ed  b585a8828fb32b8c5c8c04e2030dca7e, ...
modules\node\node.css     82a0944588e5f30ca0734e2364329701  b585a8828fb32b8c5c8c04e2030dca7e, ...
modules\user\user.css     3f5f69d06fd44b811f4bf16621756f33  b585a8828fb32b8c5c8c04e2030dca7e, ...

md5('themes\garland\style.css' . 'modules\node\node.css' . 'modules\user\user.css') = b585a8828fb32b8c5c8c04e2030dca7e

aggregated file would be /files/adv-css/css_b585a8828fb32b8c5c8c04e2030dca7e_0.css
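As a rough illustration of the naming scheme (in JavaScript under Node rather than PHP), the bundle name could be derived like this. bundleFilename is a made-up helper, and joining the names with no separator is an assumption taken from the md5(...) line above.

```javascript
var crypto = require('crypto');

// Hypothetical helper: hash the concatenated source filenames, as in the
// md5(...) example above, then append the build counter.
function bundleFilename(filenames, counter) {
  var hash = crypto.createHash('md5').update(filenames.join('')).digest('hex');
  return 'css_' + hash + '_' + counter + '.css';
}

var files = [
  'themes\\garland\\style.css',
  'modules\\node\\node.css',
  'modules\\user\\user.css'
];

// Same list of files, so the hash part is identical; only the counter differs.
console.log(bundleFilename(files, 0));
console.log(bundleFilename(files, 1));
```

Because only the filenames are hashed, editing the contents of a file leaves the hash alone; it is the counter that tells the two builds apart.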

If style.css changes then this turns into

filename                  md5                               bundle
themes\garland\style.css  666830cea532cfd4ed9f352cd54fe4de  b585a8828fb32b8c5c8c04e2030dca7e, ...
modules\node\node.css     82a0944588e5f30ca0734e2364329701  b585a8828fb32b8c5c8c04e2030dca7e, ...
modules\user\user.css     3f5f69d06fd44b811f4bf16621756f33  b585a8828fb32b8c5c8c04e2030dca7e, ...

aggregated file would be /files/adv-css/css_b585a8828fb32b8c5c8c04e2030dca7e_1.css. The old file css_b585a8828fb32b8c5c8c04e2030dca7e_0.css would be deleted and any request for it would be redirected to the newest number. This would allow css changes to take effect without having to flush the page cache, because a request for a file that doesn't exist is redirected automatically. The redirect will be done in .htaccess (I think this is possible) with a php fallback. If you change html & css together, it will be your responsibility to flush the (external) page cache. This redirect magic doesn't work when using a CDN, so overwriting the same file should also be supported, with the url query trick (?asdf) used to force a new download. CDN mode would be an option (the file always stays at _0). The other option for a CDN would be a second css file containing the diff; this would be impractical IMHO. That second file would always be local and always be present in the DOM, so any changes would be sent out.
Thoughts?

.htaccess

mikeytown2's picture

This looks like it can be done with Apache. RewriteMap cannot be declared in .htaccess; it must be done in httpd.conf.
http://httpd.apache.org/docs/trunk/rewrite/rewritemap.html#txt
httpd.conf

RewriteMap getdefaultcss txt:/var/www/sites/default/files/adv-css/lookup.txt

.htaccess

RewriteCond %{REQUEST_FILENAME} !-s
RewriteRule sites/default/files/adv-css/css_(.+)_([0-9]+)\.css$ sites/default/files/adv-css/css_$1_${getdefaultcss:$1|0}\.css [L]

lookup.txt

b585a8828fb32b8c5c8c04e2030dca7e    2
3f5f69d06fd44b811f4bf16621756f33    83  

So if a request comes in for css_b585a8828fb32b8c5c8c04e2030dca7e_0.css & that file doesn't exist, the contents of css_b585a8828fb32b8c5c8c04e2030dca7e_2.css will be sent. lookup.txt gets updated by Drupal. Drupal will also have a php backup in case the server doesn't use .htaccess. The backup will be a hook_menu entry on the sites/default/files/adv-css/% path. The rules above do not appear to be 100% compatible with a multisite install. I don't know if this will work:

RewriteCond %{REQUEST_FILENAME} !-s
RewriteRule sites/(.+)/files/adv-css/css_(.+)_([0-9]+)\.css$ sites/$1/files/adv-css/css_$2_${get$1css:$2|0}\.css [L]

Edit:
The other option is to allow all multisites to write to the same file, thus only requiring 1 RewriteMap. This seems to be the best option. It will require something like the cache directory in boost (a directory that is writable by all sites) and a lock file so the map file doesn't have 2 processes trying to update it at the same time. This will also have to solve the multi-server issue...
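For the php backup, the same lookup the RewriteMap does would have to happen in code. A minimal sketch of that logic, written in JavaScript purely for illustration: parseMap and resolveBundle are made-up names, and the file format assumed is the plain "key value" text map shown in lookup.txt above.

```javascript
// Parse a "txt:" style map: one "key value" pair per line, '#' starts a comment.
function parseMap(text) {
  var map = {};
  text.split('\n').forEach(function (line) {
    var trimmed = line.trim();
    if (trimmed === '' || trimmed.charAt(0) === '#') {
      return;
    }
    var parts = trimmed.split(/\s+/);
    if (parts.length >= 2) {
      map[parts[0]] = parts[1];
    }
  });
  return map;
}

// Mirror of the rewrite rule: keep the hash, swap in the counter from the
// map, and fall back to 0 when the hash is unknown (the "|0" default).
function resolveBundle(requestPath, map) {
  var m = /^(.*\/css_)(.+)_([0-9]+)\.css$/.exec(requestPath);
  if (!m) {
    return null; // Not an aggregated css request.
  }
  var known = Object.prototype.hasOwnProperty.call(map, m[2]);
  return m[1] + m[2] + '_' + (known ? map[m[2]] : '0') + '.css';
}
```

The menu callback would then serve (or redirect to) the resolved file if it exists, and only return a 404 when it doesn't.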

using javascript to detect missing css files

mikeytown2's picture

I'm playing around with some prototype code in javascript for detecting if a css file is missing. This JS does not depend on jQuery, so it could be added directly to the DOM. It's probably the wrong way to do this, but I found it interesting... paste it into the Firebug console and click run.

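// Newline-separated list of stylesheet URLs that failed to load.
var my_css_files = '';
// Number of requests that have finished.
var counter = 0;
// Number of requests that have been started.
var total = 0;

// Request a stylesheet URL asynchronously and record it if it is missing.
function check_404(url) {
  var xmlhttp;
  total++;
  if (window.XMLHttpRequest) {
    // IE7+, Firefox, Chrome, Opera, Safari.
    xmlhttp = new XMLHttpRequest();
  }
  else {
    // IE6, IE5.
    xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
  }
  xmlhttp.open("GET", url, true);
  xmlhttp.onreadystatechange = function () {
    // Use xmlhttp rather than "this"; the ActiveX object does not set "this".
    if (xmlhttp.readyState == 4) {
      // Status 0 means no response at all (e.g. a blocked cross-domain request).
      if (xmlhttp.status == 0 || xmlhttp.status == 404) {
        my_css_files = my_css_files + "\n" + url;
      }
      counter++;
    }
  };
  xmlhttp.send(null);
}

// Poll until every request has finished, then report the result.
function done_done() {
  if (total == counter) {
    if (my_css_files != '') {
      alert("Missing CSS files:\n" + my_css_files);
    }
    else {
      alert("All CSS files have been downloaded correctly");
    }
  }
  else {
    setTimeout(done_done, 250);
  }
}

// Check every linked stylesheet; inline <style> blocks have no href.
for (var i = 0; i < document.styleSheets.length; i++) {
  if (document.styleSheets[i].href != null) {
    check_404(document.styleSheets[i].href);
  }
}
done_done();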

If the client knows that the css files are missing then it could do a request to the server asking for the new name of the css file for this page and then load it into the DOM and go from there.

This page has example JS for loading css/js files dynamically.
http://www.javascriptkit.com/javatutors/loadjavascriptcss.shtml

So it looks like this can technically work.

Success?

beifler's picture

This appears to be a great solution. I found this thread after looking at the number of 404 errors in my logs for aggregated CSS/JS files.

Have you been running it on a production site/multisite, mikeytown2?

Still idea stage

mikeytown2's picture

This is still at the idea stage at this point. Haven't written any code yet.

10-4

beifler's picture

Understood, thanks. I'm looking forward to this. Are you aware if D7 addresses this issue?

Sorta

mikeytown2's picture

This issue kinda addresses parts of the problem; but doesn't offer a redirect like my idea does. Not sure if it keeps the same hash for the same grouping of files as well.
http://drupal.org/node/721400

D6 module groundwork

mikeytown2's picture

Build an Advanced Aggregation module, OR port these ideas into the BundleCache module.

Bypass core aggregation
Needs to use hook_theme_registry_alter to alter the "preprocess functions" of the page hook adding a special version of drupal_build_css_cache to alter the contents of the $variables['styles'] variable. This function HAS to run last; that is why it is going to use hook_theme_registry_alter instead of template_preprocess_page.

Database tables
I see 3 tables:
adv_agg_files: filename [text], filename_md5 [32varchar], file_mtime [int]
adv_agg_bundles: filename_md5 [32varchar], bundle_md5 [32varchar], media [32varchar], order [int], counter [int]
adv_agg_pages: bundle_md5 [32varchar], url [text]

When a cache clear happens it scans all adv_agg_files to see if the file_mtime has changed; if yes then it knows all bundles containing this file need to be rebuilt; that is what the adv_agg_bundles table is for. Using the adv_agg_pages table one could analyze what files are used on what pages to create smarter bundles in the future. Analysis would happen on cron once a week and the only setting is how many bundles per page to allow.
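The cache-clear scan could look something like the sketch below. It is JavaScript for illustration only; bundlesToRebuild and the currentMtime callback are made-up names, and the row fields simply follow the adv_agg_files and adv_agg_bundles columns listed above.

```javascript
// files: rows from adv_agg_files   -> { filename, filename_md5, file_mtime }
// bundles: rows from adv_agg_bundles -> { filename_md5, bundle_md5, ... }
// currentMtime: callback returning the mtime of a file as it is on disk now.
function bundlesToRebuild(files, bundles, currentMtime) {
  // Collect the md5 keys of every file whose mtime no longer matches.
  var changed = {};
  files.forEach(function (f) {
    if (currentMtime(f.filename) !== f.file_mtime) {
      changed[f.filename_md5] = true;
    }
  });
  // Every bundle containing a changed file needs a rebuild (listed once each).
  var seen = {};
  var result = [];
  bundles.forEach(function (b) {
    if (changed[b.filename_md5] && !seen[b.bundle_md5]) {
      seen[b.bundle_md5] = true;
      result.push(b.bundle_md5);
    }
  });
  return result;
}
```

An empty result means the cache clear has nothing to rebuild, so the existing aggregated files can stay where they are.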

Extra Stuff
Use D7 code when possible.
Will rebuild the css files on cache clear; once file is built then new requests will point to it. Most likely use an async worker for the cache clear & rebuild. Can be rebuilt with just the adv_agg_bundles table.
Will handle odd use cases like a S3 file system mount; which means that writing to the dir is not a 100% sure thing (http://drupal.org/node/755586 http://drupal.org/node/818818).
Will gzip css & js files if this setting is enabled.
Use drupal_delete_file_if_stale when flushing the adv_agg_(css|js) dir.
CDN compatible out the door.
Implement hooks for other modules to use...
http://drupal.org/project/unlimited_css
http://drupal.org/project/ie_css_optimizer
http://drupal.org/project/csstidy
http://drupal.org/project/css_emimage
http://drupal.org/project/closure_compiler
http://drupal.org/project/javascript_aggregator

Why?
Core's aggregation is not playing nicely currently. Also got some time at work to make this happen. Don't have boost try to handle missing css/js files.

D7?
Don't have enough hours on it in order to see what could be better.

more thoughts

mikeytown2's picture

Database tables & their uses
advagg_files: filename_md5 [PK], filename [text], file_checksum [32varchar]
Stores the files that get aggregated & a checksum. When a cache flush happens, every entry in here is checked against the file on disk; if the file's checksum (mtime or md5 of the file contents) differs from what is in the database, then look up the bundles that contain the filename_md5 & rebuild the bundles returned.
Example query:

SELECT *
FROM advagg_files AS af
INNER JOIN advagg_bundles AS ab USING ( filename_md5 )



advagg_bundles: bundle_filename_md5 [PK], filename_md5 [key], bundle_filename [text], counter [int], data object {info needed to rebuild the bundle}
(bundle_filename_md5 & filename_md5 = Unique key; counter is the same across this unique key)
If 1 or 20 files are bundled together it gets added to this database. bundle_filename_md5 is generated by the md5 of all the filenames contained in the bundle; something like md5(serialize($filenames[$types])).



advagg_bundles_mapper: full_bundle_md5 [PK], bundle_filename_md5 [key]
(full_bundle_md5 & bundle_filename_md5 = Unique key)
When a page is generated it uses md5(serialize($types)); this is the full_bundle_md5 value. This returns an array of bundles (at first it will be 1) that represents the bundles to use on this page.
Example query:

SELECT ab.bundle_filename
FROM advagg_bundles_mapper AS abm
INNER JOIN advagg_bundles AS ab USING ( bundle_filename_md5 )
WHERE full_bundle_md5 = '%s'



advagg_pages: url [PK], full_bundle_md5
Record what bundles get used on each page.



Typical page request
drupal_add_css has a bunch of files. $full_bundle_md5 = md5(serialize($types)); is used to generate the lookup for this query

SELECT *
FROM advagg_bundles_mapper AS abm
INNER JOIN advagg_bundles AS ab USING ( bundle_filename_md5 )
WHERE full_bundle_md5 = '%s'

Info from this query is used to generate the files used on the page. The counter from advagg_bundles is the ..._[n].css part. It gets incremented every time a file in the bundle changes. Might wrap around after counting up to 99,999.
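The counter bump might be sketched as follows. This is JavaScript for illustration; bumpCounters and COUNTER_LIMIT are made-up names, and the wrap point of 100000 is only an assumption based on the "99,999" mentioned above.

```javascript
// Assumption: the counter wraps back to 0 after 99,999.
var COUNTER_LIMIT = 100000;

// bundleRows: rows from advagg_bundles ->
//   { bundle_filename_md5, filename_md5, counter }
// Returns new rows with the counter incremented on every row of each bundle
// that contains the changed file, so the counter stays the same across a bundle.
function bumpCounters(bundleRows, changedFilenameMd5) {
  var affected = {};
  bundleRows.forEach(function (r) {
    if (r.filename_md5 === changedFilenameMd5) {
      affected[r.bundle_filename_md5] = true;
    }
  });
  return bundleRows.map(function (r) {
    if (!affected[r.bundle_filename_md5]) {
      return r;
    }
    return {
      bundle_filename_md5: r.bundle_filename_md5,
      filename_md5: r.filename_md5,
      counter: (r.counter + 1) % COUNTER_LIMIT
    };
  });
}
```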



Analysis
BFJ query (big freakin join) is done in order to figure out where all the files end up for all the URLs.

SELECT *
FROM advagg_pages AS ap
INNER JOIN advagg_bundles_mapper AS abm USING ( full_bundle_md5 )
INNER JOIN advagg_bundles AS ab USING ( bundle_filename_md5 )
INNER JOIN advagg_files AS af USING ( filename_md5 )

Have yet to figure out what to do with this data & how to analyze it. Only real data will let one know.



Notes
Bundle analysis will not be coded at first. This means that only 2 database tables are needed: advagg_files & advagg_bundles. I will build the module like this at first and will probably create a "bundle" sub module. "No 404's" for js/css is important enough as is.
Example query:

SELECT bundle_filename
FROM advagg_bundles
WHERE bundle_filename_md5 = '%s'

This means that at first the advagg_bundles_mapper will have the same data in full_bundle_md5 & bundle_filename_md5 because there will only be 1 bundle per page.

Use D6 code at first because I know it very well.

One more thing - Garbage collection

mikeytown2's picture

I need some sort of last used timestamp
Options:
- advagg_bundles could have a timestamp field. updated on every request.
- I can touch() the file when it's going to be used. updated on every request.
- Use fileatime() to read the last time it was used. Read on cron. fileatime is not reliable though.
- cache_advagg_bundle_timestamp could be used. Key is the bundle name, data is the timestamp. updated on every request.
I like the cache table the best because we use memcache. Any other input on the above options?

The advagg_bundles table will never be cleared automatically. Reason being it allows for a 404 handler to create the missing file on demand then. In the UI there will be an option to clear this table.

EDIT:
cache_get to read the timestamp; if the timestamp is older than 1 hour then cache_set with time(). Should make timestamps less of a performance issue.
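The throttled update from the EDIT can be sketched like this, with a plain object standing in for the cache table. touchBundle and ONE_HOUR are made-up names, and this is JavaScript for illustration rather than the actual cache_get/cache_set calls.

```javascript
var ONE_HOUR = 3600; // seconds

// cache: plain object standing in for the cache table (bundle name -> timestamp).
// now: current unix time in seconds.
// Returns true when a write happened, false when the stored value was fresh enough.
function touchBundle(cache, bundle, now) {
  var hasEntry = Object.prototype.hasOwnProperty.call(cache, bundle);
  if (!hasEntry || now - cache[bundle] > ONE_HOUR) {
    cache[bundle] = now; // Equivalent of cache_set with time().
    return true;
  }
  return false; // Read only: the common case on a busy site.
}
```

With this, each bundle costs at most one write per hour no matter how many requests use it, and garbage collection only loses an hour of precision.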

New Permission:

mikeytown2's picture

Create a new permission: one that allows you to turn off css/js aggregation via something like example.com/about-us?advagg=0. This should make theme debugging simpler. It only works if the user has permission to do this.
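The check itself is tiny. Here is a sketch in JavaScript for illustration; aggregationEnabled and the permission string 'bypass advanced aggregation' are made up for the example.

```javascript
// query: parsed query string, e.g. { advagg: '0' }
// userPermissions: list of permission names the current user holds.
function aggregationEnabled(query, userPermissions) {
  var wantsBypass = query.advagg === '0';
  // Hypothetical permission name.
  var allowed = userPermissions.indexOf('bypass advanced aggregation') !== -1;
  // Aggregation is only skipped when the user asks for it AND is allowed to.
  return !(wantsBypass && allowed);
}
```

Anonymous visitors adding ?advagg=0 to a URL would still get the aggregated files.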

md5('themes\garland\style.css

jcisio's picture

md5('themes\garland\style.css' . 'modules\node\node.css' . 'modules\user\user.css') = b585a8828fb32b8c5c8c04e2030dca7e

aggregated file would be /files/adv-css/css_b585a8828fb32b8c5c8c04e2030dca7e_0.css

Could you please explain briefly why you don't use /files/adv-css/css_b585a8828fb32b8c5c8c04e2030dca7e.css?0 instead? There won't be any problem.

S3

mikeytown2's picture

An S3 file system using CloudFront is the issue. Changes to the same filename don't replicate for up to 24 hours; thus a new file must be used. This is biting us in the butt pretty bad right now; we've hacked core and disabled css/js file deletion on cron, and this wouldn't be an issue except for the fact that $query_string is only 1 char long. Rather than hacking up core more, the plan is to create a module or a patch for an existing module. Using the bundle idea, the filename would never change and thus changes to css/js wouldn't go out for up to 24 hours. Changing the filename is the right way to do this; deleting all css files on a cache clear is what's dumb.

Inside drupal_get_css:

<?php
$query_string = '?' . substr(variable_get('css_js_query_string', '0'), 0, 1);
...
$filename = 'css_' . md5(serialize($types) . $query_string) . '.css';
?>

I like the way you think,

soyarma's picture

I like the way you think, Mikey. Rather than a year-long thread about how to splice this into core, a nice slick module to do it separately.

prototype

mikeytown2's picture

http://drupal.org/node/1063012
Garbage collection & 404 handling are not implemented yet.

?advagg=0 & gzip are implemented. Code is mainly D6 based.
At this point I feel like bringing in the 3rd party optimizers that are out there; as this module is incompatible with all of them due to the dir name change.

BTW, did someone say google cdn?

updated

mikeytown2's picture

http://drupal.org/node/1063012#comment-4102884
Garbage collection and 404 handling are now implemented.

Need to verify but once the aggregated file has been created, it should no longer hit the disk when generating the page. For us and our setup (shared files directory across multiple servers) this might shave 100ms off of every request.

counter

mikeytown2's picture

Add a changed counter to the files table; it will help with bundle analysis later.

It's out!

mikeytown2's picture

Created the module; now waiting for the dev release to get packaged.
http://drupal.org/project/advagg
Feedback is greatly appreciated.

1.0 is out

mikeytown2's picture

Let me know how it works. If it doesn't please file a detailed bug report :)
http://drupal.org/project/advagg

Great news will test it out

marios88's picture

Great news will test it out and report back

High performance
