I want to start this thread as a place for us to add information on the current implementation of the roadmap for the Drupal Caching subsystem for core.
During July 2009, the current cache.inc was rewritten, but not much future-proofing was done to make the cache as effective as possible for module developers as well as for core. Here is a list of goals for a good API for both core and contrib.
Goals for D8 Cache API
- Remove the cache_clear_all function. - This function is ambiguous and is the only "delete" function in the API. It should be split into a few clear and concise functions that are easy for developers to use and that fit the CRUD style used elsewhere in Drupal.
- Create simple effective configuration for multiple bin setup. (e.g. configuration syntax for mixing and matching cache storage engines)
- Create a hierarchical system for allowing cache tables to be "chained" - Currently memcache has memcache.inc and memcache.db.inc. Memcache should only have two types of bins, memcache and memcache shared bins. The database part should be abstracted out into core db cache.
- Create session support for Cache API - The session support should not be dependent on the cache module. You should be able to use any cache technology for session support.
- Create administrative interface - There should be an interface for managing or at a minimum viewing cache information. This should be in core and allow administrators access to view what is going on with the cache system. This will be something that can start as a contrib module for D7 and added to core for D8.
- Add contrib caching technologies to core - We should have APC + Memcache as core caching technologies as a baseline, and any technology that is effective and useful should be added to core as it matures (e.g. Xcache, file cache, mongodb, etc.)
As far as the API is concerned, here is a list of functionality available for D7 currently and what should be available based on cache systems for other projects.
Current Cache API
- cache_get($cid, $bin = 'cache')
- cache_get_multiple(array &$cids, $bin = 'cache')
- cache_set($cid, $data, $bin = 'cache', $expire = CACHE_PERMANENT)
- cache_clear_all($cid = NULL, $bin = NULL, $wildcard = FALSE)
- cache_is_empty($bin)
Proposed Cache API v2 for D8
- cache_get($cid, $bin = 'cache')
- cache_get_multiple(array &$cids, $bin = 'cache')
- cache_set($cid, $data, $bin = 'cache', $expire = CACHE_PERMANENT)
- cache_add($cid, $data, $bin = 'cache', $expire = CACHE_PERMANENT) (atomic version of cache_set)
- cache_delete($cid, $bin = 'cache')
- cache_delete_wildcard($prefix, $bin = 'cache') (remove 'string*' type of data or call _flush if $prefix is null)
- cache_flush($bin = 'cache')
- cache_is_empty($bin)
- cache_lock($bin = 'cache')
- cache_unlock($bin = 'cache')
DrupalCacheInterface implementation SHOULD implement all of these operations, and MUST implement a minimum set to be decided later. My initial feeling is that everything except atomic adds MUST be implemented, and if atomic adds are not available in the implementation, then the api function call should add the data but return false to signify the atomicity was not available. Although I'm not exactly sure how that should be handled.
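To make the proposed v2 operations concrete, here is a minimal sketch of what a backend covering them might look like. Every name in it (ExampleCacheBackend, ExampleArrayBackend, EXAMPLE_CACHE_PERMANENT) is hypothetical and chosen for illustration; it is not the DrupalCacheInterface in D7 core, and the array storage only stands in for a real engine. It also uses FALSE as the miss value, so it cannot distinguish a cached FALSE from a miss.

```php
<?php
// Hypothetical names throughout; this is not Drupal core's DrupalCacheInterface.
const EXAMPLE_CACHE_PERMANENT = 0;

interface ExampleCacheBackend {
  public function get($cid);
  public function set($cid, $data, $expire = EXAMPLE_CACHE_PERMANENT);
  // Returns TRUE only if $cid was absent and has now been stored.
  public function add($cid, $data, $expire = EXAMPLE_CACHE_PERMANENT);
  public function delete($cid);
  public function deleteWildcard($prefix);
  public function flush();
}

// In-memory implementation, standing in for a real storage engine.
class ExampleArrayBackend implements ExampleCacheBackend {
  protected $items = array();

  public function get($cid) {
    if (!isset($this->items[$cid])) {
      return FALSE;
    }
    $item = $this->items[$cid];
    // Treat items past their expiry timestamp as misses and drop them.
    if ($item['expire'] !== EXAMPLE_CACHE_PERMANENT && $item['expire'] < time()) {
      unset($this->items[$cid]);
      return FALSE;
    }
    return $item['data'];
  }

  public function set($cid, $data, $expire = EXAMPLE_CACHE_PERMANENT) {
    $this->items[$cid] = array('data' => $data, 'expire' => $expire);
  }

  public function add($cid, $data, $expire = EXAMPLE_CACHE_PERMANENT) {
    // Trivially atomic in a single process; a real backend would have to
    // rely on an atomic "add" operation provided by its storage engine.
    if ($this->get($cid) !== FALSE) {
      return FALSE;
    }
    $this->set($cid, $data, $expire);
    return TRUE;
  }

  public function delete($cid) {
    unset($this->items[$cid]);
  }

  public function deleteWildcard($prefix) {
    // An empty prefix falls back to a full flush, as the proposal suggests.
    if ($prefix === NULL || $prefix === '') {
      $this->flush();
      return;
    }
    foreach (array_keys($this->items) as $cid) {
      if (strpos((string) $cid, $prefix) === 0) {
        unset($this->items[$cid]);
      }
    }
  }

  public function flush() {
    $this->items = array();
  }
}
```

A backend that cannot offer atomic adds could still implement add() as a plain check followed by set(), which is where the open question about the return value comes in.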
We should also decide a naming convention for DrupalCacheInterface implementations. Currently people are using various naming conventions that may or may not be easy to continue.
I have more thoughts on this, but I wanted to get this out there for now. Please feel free to edit this Wiki as needed and decided.
Comments
A couple points
Session is not a key-value store and it's not clear why it should use the same technologies as cache. We always wanted to query both by sid and uid, but in D7 we query on ssid too. That's not easy to implement in anything that's not a DB...
I agree that cache_add is missing and that the cache umbrella cache_clear_all needs cleaner semantics (although I think the functionality is the same?)
cache_is_empty -- why is that even necessary? It might not be trivial (at all) to figure out whether a given cache bin is empty. There might be objects but they are expired etc and with a key-value store it's practically impossible to figure out.
cache_lock / unlock -- the existing locking framework might build on top of cache_add but I do not think the locking framework needs to be thrown out. Or do you mean a process wants to gain exclusive access to a bin? What's the use case?
I am very unsure of adding every sort of cache implementation to core, so far we have only added things to core that are available everywhere or most places. Of these, only file is such.
I am wondering, does anyone use memcache.db.inc at all? A generic cache hierarchy would involve some work to make sure caches are not stale and I am not 100% sure of the win.
Matching cache bins with available backends, that might be useful, on the other hand, cache is needed before the variable system so some cache configuration needs to be in settings.php still. Of course, this now is less of a problem with the update system at hand.
I agree about adding all the
I agree with chx's reservations about adding all the cache implementations to core. A majority of the sites out there run on shared hosting, so they wouldn't even have access to anything but file cache. In that case I think the overall community would be served better with a Boost-style system implemented in core.
What might be a better improvement is coming up with a standardized system in core to assign different bins to different engines. For example, the db caching in Cacherouter is somewhat mimicking the core caching in Drupal. If we had a standardized way of setting cache bin engines through settings.php, then you could set the bins you wanted to use DB and they would just use the core DB caching. You would also be able to add in only the engines you want.
So you would end up with caching working as-is (all db) out of the box, but you have the option to extend it in settings.php
$conf['cache'] = array(
  'cache_cache' => array(
    'engine' => 'sites/all/modules/cacherouter/engines/memcache.inc',
    // ...rest of the settings here...
  ),
);
Another option would be to follow the libraries model that modules like TinyMCE use, where we have a sites/all/engines/cache directory. On bootstrap Drupal scans that directory and builds a list of the engines installed in it, and then you can do a simple 'engine' => 'memcache' declaration.
As far as mechanics, cache would move to more of an OOP design and utilize factories for the common methods (set,get,flush,etc.).
$conf['cache']=array(
'cache_cache' => array(
'engine'=>'memcache', // Uses the second method of declaring cache engines I mentioned above.
)
);
// We define a factory abstract class. Each engine must provide its own set().
abstract class cacheEngine {
  abstract public function set($key, $value, $expires = CACHE_PERMANENT);
}
class memcacheCacheEngine extends cacheEngine {
  public function set($key, $value, $expires = CACHE_PERMANENT) {
    // Set your cache here.
  }
}
function cache_init() {
  global $conf;
  $loaded = array();
  if (!empty($conf['cache']) && is_array($conf['cache'])) {
    foreach ($conf['cache'] as $bin => $bin_data) {
      if (empty($loaded[$bin_data['engine']])) {
        // Load the engine here.
        $loaded[$bin_data['engine']] = TRUE;
      }
      /*
      Call the cache engine constructor and assign it to an internal variable on the cache bin array.
      We could add a further check here so that if that engine doesn't exist (accidentally deleted, etc.) cache would roll back to the db engine.
      We add in the CacheEngine to all classes to prevent collisions. Memcache is already a class in PHP, so if an engine is named memcache the engine class would collide with the internal memcache.
      This has to happen inside the loop so that every bin gets its engine, not just the last one.
      */
      $class = $bin_data['engine'] . 'CacheEngine';
      $conf['cache'][$bin]['_engine'] = new $class($bin_data);
    }
  }
}
function cache_set($key, $value, $bin = 'cache_cache') {
  global $conf;
  if (!empty($conf['cache'][$bin]['_engine'])) {
    // This bin has an external engine, so let's use that.
    $conf['cache'][$bin]['_engine']->set($key, $value);
  }
  else {
    // Do default db caching here.
  }
}
The other benefit of this system is that we could employ the static caching system from D7 in it. Basically, cache_set would reset the static cache and cache_get would check it first. That would prevent multiple calls to caching engines for the same data. It would also greatly reduce the amount of coding required for external caching systems, as they would now rely on core's cache_get, cache_set, etc. methods.
This also wouldn't add any significant overhead to the current caching system in Drupal. Instead it would provide a much more elegant and simpler way to extend the current caching mechanism.
(Disclaimer: That code was written with only a 1/2 cup of coffee in me, so no guarantees.)
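As a rough illustration of the static caching idea, the sketch below wraps a backend so that repeated reads of the same key within one request only hit the engine once, and a write resets the static entry. The class names (ExampleSlowBackend, ExampleStaticCache) are made up for this example and the backend is just an array with a read counter; it is not how D7's drupal_static() is wired into cache.inc.

```php
<?php
// Hypothetical backend that counts how often it is actually read.
class ExampleSlowBackend {
  public $reads = 0;
  protected $items = array();

  public function get($cid) {
    $this->reads++;
    return array_key_exists($cid, $this->items) ? $this->items[$cid] : FALSE;
  }

  public function set($cid, $data) {
    $this->items[$cid] = $data;
  }
}

// Hypothetical per-request static layer in front of any backend.
class ExampleStaticCache {
  protected $backend;
  protected $static = array();

  public function __construct($backend) {
    $this->backend = $backend;
  }

  public function get($cid) {
    // Serve repeated reads from memory without touching the engine.
    if (array_key_exists($cid, $this->static)) {
      return $this->static[$cid];
    }
    $data = $this->backend->get($cid);
    if ($data !== FALSE) {
      $this->static[$cid] = $data;
    }
    return $data;
  }

  public function set($cid, $data) {
    $this->backend->set($cid, $data);
    // Reset the static entry so the next get() re-reads from the engine.
    unset($this->static[$cid]);
  }
}
```

Misses are deliberately not remembered here, so a key written by another process later in the request can still be picked up.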
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
@intoxination -
Please take a look at the cache router module. All of this was done two and a half years ago.
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
Yeah I'm aware of that and use it in a few sites. I'm just saying that that principle would be a great thing to go into core and could be improved upon so that extra "engines" are much simpler to develop and maintain. Instead of having to rewrite all the core caching methods to implement a new system, you would go the OOP route and create a class basically doing the actual connections/inserts/updates/deletes. That's pretty much it.
But I'm pretty much with CHX when it comes to including support in core for Memcache, APC, etc. I would say that less than 20% (and that's a very liberal number) of Drupal sites out there run any kind of caching outside of normal DB or Boost. So keeping the engines out of core would probably be a better route, but there is definite room for improvement on expanding the current caching layer and exposing a much more robust API.
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
Maybe you haven't seen the code for Drupal 7 cache and cache router. Both of these do exactly what you are saying. In cache router, I implemented abstract classes with exactly what you are saying. In Drupal 7 cache.inc, chx implemented a class interface that defines the methods and created a db class implementing that interface. You should take a look at the code to see how both of these systems work.
Your notion that we shouldn't include these other engines in core because people aren't using them is poor logic. I believe that these modules are complex to set up, and a lot of the reason people are not using them is either that complexity or that they just don't know the modules are available. I think if Drupal shipped with alternative cache engines, we would be setting the standard for CMSs (CakePHP and Zend Framework already ship with these) and we could then simplify the setup and interface for managing them.
For instance, we could make recommendations based on currently installed software and generate the settings code to paste in, which I believe would lead to greater adoption and faster Drupal sites.
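A minimal sketch of that recommendation idea might look like the following. The function names are hypothetical, the list of extensions checked is only an assumption about which engines would be of interest, and the generated line follows the $conf['cache'] syntax proposed earlier in this thread rather than any setting that exists in core. The only real PHP call relied on is extension_loaded().

```php
<?php
// Hypothetical helper: list cache engines that appear usable on this server.
function example_detect_cache_engines() {
  // The database engine is always available, so it comes first.
  $available = array('db');
  // Assumed candidate list; extension names are the usual PHP ones.
  foreach (array('apc', 'memcache', 'memcached', 'xcache') as $extension) {
    if (extension_loaded($extension)) {
      $available[] = $extension;
    }
  }
  return $available;
}

// Hypothetical helper: build a settings.php line for one bin, using the
// $conf['cache'] layout proposed in this thread (not a core setting).
function example_settings_snippet($bin, $engine) {
  return "\$conf['cache']['" . $bin . "'] = array('engine' => '" . $engine . "');";
}

// Print one suggested line per detected engine for the page cache bin.
foreach (example_detect_cache_engines() as $engine) {
  print example_settings_snippet('cache_page', $engine) . "\n";
}
```

An administrative page could show these lines for the administrator to paste into settings.php, which keeps the actual configuration out of the database.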
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
I hadn't noticed that was in core, so I stand corrected and that is awesome (great work chx!).
I still believe though that these are better as separate projects. Before you call my logic "poor", consider this:
A majority of Drupal sites are on shared hosting ( I seem to remember something somewhere where Dries estimated over 80% of Drupal installations were on shared hosting). These people basically have access to DB or file based caching and that is it.
Now what about people not on shared hosting, but rather on VPS or dedicated hosting? Well they aren't going to have APC or memcache installed by default. If these packages are on their servers it's because they installed them with an intention of using them. If they are doing that for a machine running Drupal, then chances are they know that Drupal can work with these engines (APC could be the exception to the rule since a lot of people aren't aware of the user caching in it and view it strictly as an opcode cache).
Of course that also reduces the number of people who would get the "recommendations based on currently installed software", since chances are they won't have the software installed unless they have installed it with the intention of using it on Drupal. Yeah, there might be some cases where people have these systems set up for other sites they are running in other software packages that use these caching engines, but I do believe those situations are going to be rather limited.
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
Well, the problem with the way you view things is that you are building for today not innovating for tomorrow ;)
It does not hurt or bloat core to have these in core. More and more sites are being moved to VPS and cloud services, shared Drupal hosting is becoming more and more versatile, and I see a future where shared hosts offer APC and memcache as add-on services the same way Media Temple offers MySQL containers. If we continue to build only on what currently exists, we have no chance to set the stage for innovation and push the web hosts to move forward like we as a community did with the GoPHP5 project.
If we used the logic that most everyone was just shared hosting so why build for enterprise adoption, then we wouldn't be where we are today with many large enterprise businesses using Drupal. No offense, but I guess we will have to agree to disagree. I and many others work on big sites that are funding and driving the future of Drupal. I would like to see a future where people who don't have the resources my clients do be able to leverage some of the technology that we take for granted.
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
My biggest thing is that I can't see what would be achieved by having these cache engines in core as opposed to as external libraries. I can see some benefit of being able to say "Hey - Drupal supports memcache out of the box!", but I still don't see setting up cache router today as being any more complicated than installing APC or memcache.
Even if we did go with APC and Memcache in core, we really should still maintain an API to allow other engines to be used, like Ehcache, or external hosting solutions such as Terracotta. To not have that would be a step backwards from where we are today, or even 3 years ago when D5 came out.
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
Core Disadvantage
Any time something is added to core, it becomes much more difficult to patch / branch / etc. This is something to consider, and any module that plans on active D7 development really should not go into core.
There is also very little benefit to being added to core, so long as installing the module doesn't require adding a core patch. The core patch issues definitely should be taken care of, but otherwise the module system does a very good job of letting you add the needed functionality.
I think your secondary goal is just to make it so that people both know about the caching options and have easy access to installing them, but putting the code in core is hardly the only way to accomplish it. Once they are ready for prime time, Distributions should also make what is "core" less important.
Ken Winters
The other thing I should mention is that I am not opposed to Boost being in core as well; it is just not part of what I believe to be the core cache API. The reason is that Boost only improves performance for anonymous users. If you use Drupal's built-in page cache for anonymous users and put something like Varnish in front of it, or use the Pantheon stack, then you will see better performance for anonymous users, rendering Boost just about useless.
But I think boost works very well in combination with these technologies and where you can't use something like varnish because of shared hosting.
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
I understand the complexities of the session storage as I wrote the memcache session implementation. The storage of session lookup data can be added to any backend, we just need to work out the implementation details. I assure you this is doable.
Agreed about cache_add, but in my opinion, this should be atomic. We need some way to be able to guarantee a record gets inserted atomically.
I think there is a real benefit of removing cache_clear_all because of the ambiguity of the current implementation. I don't think that we are removing any functionality, we are just breaking it out into a cleaner API.
I have no idea why or where the empty function came from, as this is new to me too. I did notice that it came from this thread from damz and dropcube: http://drupal.org/node/575360. I included it because of its current implementation in the D7 cache API.
I see lock/unlock as being part of a complete cache api. While I don't see it as being part of the current locking framework that I was a part of helping develop, I do see gaining exclusive access to a bin as something that developers could potentially use for creative purposes. I don't necessarily think that we need current examples for something to be beneficial for a feature complete API.
I think that there is benefit to having different caching technologies in core. Think of it like this: just because not everyone needs PostgreSQL, Oracle, or MSSQL drivers for their database doesn't mean there isn't benefit to having them included in core. In the same way, just because not everyone needs xcache, apc, or memcache doesn't mean there isn't benefit for users to be able to use this technology out of the box (OTB). Are we a framework and a CMS or not? Being a framework to me means that we should be feature rich OTB.
I don't know if anyone uses memcache.db.inc (and for performance's sake I hope they don't). However, I think chaining storage engines is perfectly valid for enterprise sites where it is imperative that cache data not be lost. Think of a Drupal site where the government had a critical form for soldiers reporting enemy data. If the form API validation cache got emptied and there was no recovery, the form would be invalid and the soldier might have to rewrite the data while under enemy fire. Although this is an extreme example, I can think of other industries as well that might need this kind of redundancy. Think banking, etc.
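One way to picture chained storage engines is a wrapper that writes to every layer and reads from the fastest layer that still has the item, refilling the faster layers on the way back. The sketch below is an assumption about how such chaining could work, not the memcache.db.inc implementation; ExampleMemoryBackend and ExampleChainedCache are made-up names and both layers are plain arrays.

```php
<?php
// Hypothetical array-backed layer; stands in for memcache, the database, etc.
class ExampleMemoryBackend {
  protected $items = array();

  public function get($cid) {
    return array_key_exists($cid, $this->items) ? $this->items[$cid] : FALSE;
  }

  public function set($cid, $data) {
    $this->items[$cid] = $data;
  }

  public function flush() {
    $this->items = array();
  }
}

// Hypothetical chain: backends are ordered fastest first, most durable last.
class ExampleChainedCache {
  protected $backends;

  public function __construct(array $backends) {
    $this->backends = array_values($backends);
  }

  public function set($cid, $data) {
    // Write through to every layer so a lost fast layer can be rebuilt.
    foreach ($this->backends as $backend) {
      $backend->set($cid, $data);
    }
  }

  public function get($cid) {
    foreach ($this->backends as $i => $backend) {
      $data = $backend->get($cid);
      if ($data !== FALSE) {
        // Backfill the faster layers that missed.
        for ($j = 0; $j < $i; $j++) {
          $this->backends[$j]->set($cid, $data);
        }
        return $data;
      }
    }
    return FALSE;
  }
}
```

The sketch ignores the staleness question raised earlier in the thread: if one layer is updated without the others, the chain can serve old data.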
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
I'm not clear on how cache_set() and cache_add() would differ.
Ideally we'd be able to come up with a naming convention that would bring immediate understanding to what the function does. Reading the documentation would only be required to discover the details.
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his
only if it's not there
cache_set always sets, cache_add only if it's not present.
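The difference can be shown with a small sketch. The example_cache_* helpers below are hypothetical stand-ins backed by a static array, not the Drupal functions, and in a real backend the "only if absent" check would need to be atomic in the storage engine.

```php
<?php
// Hypothetical helpers backed by a plain array; not the Drupal functions.
function &example_cache_storage() {
  static $storage = array();
  return $storage;
}

// Always stores, overwriting any existing value.
function example_cache_set($cid, $data) {
  $storage = &example_cache_storage();
  $storage[$cid] = $data;
}

// Stores only if the key is absent; returns whether it stored.
function example_cache_add($cid, $data) {
  $storage = &example_cache_storage();
  if (array_key_exists($cid, $storage)) {
    return FALSE;
  }
  $storage[$cid] = $data;
  return TRUE;
}

function example_cache_get($cid) {
  $storage = &example_cache_storage();
  return array_key_exists($cid, $storage) ? $storage[$cid] : FALSE;
}

example_cache_set('greeting', 'hello');
example_cache_set('greeting', 'hi');            // Overwrites the first value.
$added = example_cache_add('greeting', 'hey');  // Refused: the key exists.
$first = example_cache_add('rebuild_lock', 1);  // Succeeds: the key was absent.
$second = example_cache_add('rebuild_lock', 1); // Refused: already taken.
```

The last two calls hint at why an atomic add matters: whoever gets TRUE back can treat the key as a simple lock.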
I am no expert on this
I am no expert on this subject but would like to share some ideas.
Since cache_delete_wildcard() is not possible with memory cache engines, perhaps it should be removed from the API so people won't use it. Or, not sure if this would work, add a $prefix parameter to cache_set so that we can keep track of all keys with the same prefix and be able to delete all those keys at once. For example,
foreach ($users as $u) {
  cache_set($cid = $u . "foo", $prefix = $u);
  cache_set($cid = $u . "bar", $prefix = $u);
  cache_delete_wildcard($prefix = $u); // only delete cache for one user
}
Also, please consider adding automatic cache stampede protection as in the memcache module.
--
http://ball.in.th - ชุมชนคอบอลพันธ์แท้, ผลบอล
That is not entirely true. Memcache, apc, etc support wildcard wipes as long as you maintain your own index of what is in the bin. This is how "shared" mode works in cache router for memcache and how apc/xcache/eacc etc work.
I think that automatic cache stampede protection should be something that makes it into the API as well. My only question is where and how. I am not entirely sure where this should live (engines, cache API, etc.)
I don't think you should have to be troubled with adding a prefix, because that is basically for searching. It should be seamless.
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
If I understand correctly, without a specific prefix, wildcard wipes in memcache would have to loop through all keys to check if they match the wildcard. This could be a big problem with lots of keys.
If we can limit the wildcard to be just the prefix, we can keep track of the keys easily and no longer need to loop. For example, in cache_set() we could do Memcached::append('lookup' . $prefix, newkey) to append the newkey to a list of keys for a given prefix. And in cache_delete_wildcard(), all we need to do is Memcached::setMulti(Memcached::get('lookup' . $prefix), expired). $prefix should be optional and only used when wildcard wipes are needed.
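The bookkeeping described above can be sketched without a memcache server. In this hypothetical ExamplePrefixIndexedCache, an array stands in for the key-value store and is only ever accessed by exact key, which is the constraint memcache imposes; the 'lookup:' index key mirrors the 'lookup' . $prefix idea. A real implementation would need an atomic append for the index, since the read-modify-write shown here could lose keys under concurrent writes.

```php
<?php
// Hypothetical sketch; the array is only accessed by exact key, like memcache.
class ExamplePrefixIndexedCache {
  protected $store = array();

  public function set($cid, $data, $prefix = NULL) {
    $this->store[$cid] = $data;
    if ($prefix !== NULL) {
      // Record the key in the per-prefix index entry.
      $index_key = 'lookup:' . $prefix;
      $keys = isset($this->store[$index_key]) ? $this->store[$index_key] : array();
      $keys[$cid] = TRUE;
      $this->store[$index_key] = $keys;
    }
  }

  public function get($cid) {
    return array_key_exists($cid, $this->store) ? $this->store[$cid] : FALSE;
  }

  // Deletes every key recorded under $prefix; returns how many were indexed.
  public function deleteWildcard($prefix) {
    $index_key = 'lookup:' . $prefix;
    if (!isset($this->store[$index_key])) {
      return 0;
    }
    $count = 0;
    foreach (array_keys($this->store[$index_key]) as $cid) {
      unset($this->store[$cid]);
      $count++;
    }
    unset($this->store[$index_key]);
    return $count;
  }
}
```

Only the index entry and the keys listed in it are touched, so the cost of a wildcard wipe depends on the number of keys under that prefix rather than on the size of the bin.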
--
http://ball.in.th - ชุมชนคอบอลพันธ์แท้, ผลบอล
slantview, thanks for your effort on this. I like most of what I'm seeing, especially getting rid of the WTF that is cache_clear_all().
I agree that core shouldn't try to support more than the most basic caching methods, though. Also, something like Boost in core sounds like an intriguing idea. How often do we see WordPress blogs being slashdotted/fireballed/boing boinged/etc because they aren't using any sort of caching?
The Boise Drupal Guy!
Brain-dead boost
The future of Boost is to make it dumber and make other modules (expire) smarter. Once that is done then it would be nice to see boost in core. Once this issue http://drupal.org/node/721400 is figured out then having a dumb file cache for pages makes more sense.
I think Boost has been battle tested, so I know most of its shortfalls and how to improve it. The biggest unaddressed issue is a full cache flush; if you have a lot of pages in the cache, the flush can take several minutes to complete (hard drives are slow). A quick fix for this is to rename the folder and create a new folder with the old name. Then on cron or ___ delete the old cache folder.
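A minimal sketch of that rename-then-delete flush, assuming a file cache kept in a single directory: the function names and the ".old." suffix are made up for this example and this is not Boost's code. It uses only standard PHP filesystem calls (rename, mkdir, scandir, unlink, rmdir).

```php
<?php
// Hypothetical helper: swap the cache directory for an empty one.
// Returns the path of the renamed old directory, or FALSE on failure.
function example_flush_by_rename($cache_dir) {
  $old_dir = $cache_dir . '.old.' . uniqid();
  // The rename is quick compared with deleting every cached file.
  if (!rename($cache_dir, $old_dir)) {
    return FALSE;
  }
  if (!mkdir($cache_dir, 0775, TRUE)) {
    return FALSE;
  }
  return $old_dir;
}

// Hypothetical helper for the later cleanup pass (e.g. from cron):
// recursively delete a renamed cache directory.
function example_delete_tree($dir) {
  foreach (scandir($dir) as $entry) {
    if ($entry === '.' || $entry === '..') {
      continue;
    }
    $path = $dir . '/' . $entry;
    if (is_dir($path) && !is_link($path)) {
      example_delete_tree($path);
    }
    else {
      unlink($path);
    }
  }
  return rmdir($dir);
}
```

Requests arriving between the rename and the mkdir would briefly see no cache directory at all, so a real implementation would have to tolerate that gap.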
The other big issue is the htaccess file, plus caching more than just HTML, and gzip support. Right now htaccess is dynamically generated based on the user's preferences.
I think the best option is to make Boost hook more into Drupal's caching system. It still acts like a 4.7.x module ported to 6.x (which it is). With 7.x I could do a major rewrite and greatly benefit from it (get it ready for 8.x core). Making the file-cache option available and having a simplified Boost module in core could be the right direction; it could be an example module for other page caching systems. No matter what happens though, some extra logic has to be added to Drupal. Once that page is in the cache, Drupal is out of the loop and you must rely on outside forces to make sure your site's working correctly. This extra logic would also benefit Varnish and nginx+memcache, since requests to either one of these caches do not start up Drupal.
Bottom line - Make page caching work flawlessly even when php is never run for a normal request.
Any insights you have and how
Any insights you have on how we could hook Drupal's cache API into something that allows for tighter integration with Boost and file caching would be great. Maybe we should have a performance summit soon and get a few of us together so we could do a big rewrite and some performance testing of Drupal, including a cache API sprint.
Slantview Media http://www.slantviewmedia.com/ | Blog http://www.slantview.com/
That rename/create-a-new-folder approach is exactly what I implemented on Crooks and Liars about a month ago, and it works great. I went a little further, though, beyond what Drupal would do. It renames the directory to old.XXX, and then I have a cron job that checks for those directories and deletes them late at night, when the site is slow, using ionice.
HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.
Cache invalidation API
There is an issue that was postponed to D8: http://drupal.org/node/636454
I think cache clearing should be removed or replaced with clearing by TAGS, as implemented in most systems and frameworks. Wildcard clearing is an RDBMS way of doing things, which is wrong for a cache.
cache that can be tagged and
A cache that can be tagged and cleared by tags, instead of wildcards, would be even better. ^^
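For anyone unfamiliar with tag-based clearing, here is a rough sketch of the idea: each item is stored with a list of tags, and clearing a tag removes every item that carries it, regardless of how the cache IDs are named. ExampleTaggedCache and its method names are hypothetical and the storage is an in-memory array; this is not the API discussed in the D8 issue.

```php
<?php
// Hypothetical in-memory cache that supports clearing by tag.
class ExampleTaggedCache {
  protected $items = array();
  // Maps tag => array of cache IDs carrying that tag.
  protected $tagIndex = array();

  public function set($cid, $data, array $tags = array()) {
    // Drop any previous version first so its old tags do not linger.
    $this->delete($cid);
    $this->items[$cid] = array('data' => $data, 'tags' => $tags);
    foreach ($tags as $tag) {
      $this->tagIndex[$tag][$cid] = TRUE;
    }
  }

  public function get($cid) {
    return isset($this->items[$cid]) ? $this->items[$cid]['data'] : FALSE;
  }

  public function delete($cid) {
    if (!isset($this->items[$cid])) {
      return;
    }
    foreach ($this->items[$cid]['tags'] as $tag) {
      unset($this->tagIndex[$tag][$cid]);
    }
    unset($this->items[$cid]);
  }

  // Removes every item that carries at least one of the given tags.
  public function deleteTags(array $tags) {
    foreach ($tags as $tag) {
      if (empty($this->tagIndex[$tag])) {
        continue;
      }
      foreach (array_keys($this->tagIndex[$tag]) as $cid) {
        $this->delete($cid);
      }
      unset($this->tagIndex[$tag]);
    }
  }
}
```

With this shape, saving node 5 could clear everything tagged 'node:5' (a made-up tag format) without the caller knowing which pages, blocks, or views cached it.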
--
http://ball.in.th - ชุมชนคอบอลพันธ์แท้, ผลบอล
IMO, this issue is where most
IMO, that issue/topic is where most of the interesting, important discussion needs to happen. Much of this other stuff is related to storage backends and how we fill the caches, and those are things that've been fairly well covered. A good cache invalidation system is what we're really missing now, and what we really need.
dev boost
The dev version of Boost has a new block that shows what is connected to the node. Long story short, play around with Boost if you want to see some cool cache invalidation logic up and running.
Nifty, probably worth looking
Nifty, probably worth looking at whenever someone gets around to working on that D8 issue. But at the same time, not really the level I'm talking about - without a better cache invalidation API, there is simply no possible way for anything to be as smart as it needs to be. Nodes are just one of many things that potentially need invalidation, and the best anybody can do right now is try to string together hooks to create an adequate system of invalidation triggering events.
Cache Browser ?
I would also love to see an API to a) obtain the list of cache tables, and b) to query the keys stored in the cache. See the Cache Browser module.
Issue in the Cache Router module queue to integrate with the Cache Browser:
http://drupal.org/node/381000
I'm using CacheRouter and
I've been using CacheRouter and Boost for quite some time now. Both are very stable and feature rich, and I'm more than satisfied with them.
I think moving boost to CacheRouter or at least making it to use CacheRouter API is the best place to start.