So, the fundamental problem: on a site using standard caching (in the database), it's very easy to make the cache system write huge numbers of cache entries, each a duplicate of a cached page or other cache object, rapidly filling disk space. As far as I know there is no protection available against such an attack, other than table size limits in MySQL (which is hardly an ideal solution).
This vulnerability has been around for a long time; nothing new here. Does anybody else think this needs to be addressed, whether in D8 or as an add-on module? I've searched but can't find any more discussion on the topic.
Copy of my post at: http://drupal.org/node/1245482#comment-7096812 ...
I found this issue by Googling for drupal page cache dos attack - it occurred to me today that URL variation could be a very easy way to bring a site down by using up excessive disk space.
GET parameters are not the only problem. Standard Drupal behaviour is to try to find a matching URL path and serve the page that is the closest fit. So, this issue page is at http://drupal.org/node/1245482 but I can also get to it at http://drupal.org/node/1245482/something-here and http://drupal.org/node/1245482/something-else-here
So, on a standard installation, every variation of a URL gets cached separately. A (D)DoS attack could very easily start filling the cache (a database table in the case of a basic installation), loading MySQL and potentially filling the disk.
But, the problem is even more extensive: Many modules and Drupal internals store objects in the cache, not just pages. Any time that an object is cached, and the cache-id is dependent on URL parameters, the same issue comes into effect - multiple cache entries of the same cacheable object.
So, in my mind, the correct solution would be to detect that an object is already cached: take an md5 (or similar) hash of the object, and keep a list of hash values we have already stored in the cache. When asked to cache an object, check whether the same object (by md5) has already been cached. If it has, store a pointer to that cache object rather than storing the entire object in cache again. The space required to store the pointer could be very small: an index into an array of something like $cache_index[$index_value][$md5][$target_cache_id].
Then there is the issue that we might just happen to get an md5 collision. So we might need a "check for collision" option, which would mean retrieving the existing cached object and checking if it's identical to the object currently requested to be cached. If it is not identical, we need to cache the new object separately. So then we would need $cache_index[$index_value][$md5][$variant_id][$target_cache_id]. How to generate the variant_id? I don't know. Some other hash algorithm maybe. But what about a double-collision? So, a much simpler alternative would be to just flag that md5 as "not cacheable", and resort to old behaviour (always store object in cache) for any object which had a matching md5 - in practice this would "never" happen.
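To make the idea concrete, here is a minimal Python sketch of the dedup approach. This is illustrative only: the class and method names are hypothetical stand-ins for cache_set()/cache_get(), and the collision fallback simply stores the colliding object under a separate per-cid key (the "not cacheable via md5" behaviour described above).

```python
import hashlib

class DedupCache:
    """Content-addressed cache sketch: each distinct object is stored once;
    every cache ID just records the md5 of the object it points to."""

    def __init__(self):
        self.store = {}  # md5 digest -> object bytes (one copy each)
        self.index = {}  # cache ID -> md5 digest (the small "pointer")

    def set(self, cid, data):
        digest = hashlib.md5(data).hexdigest()
        existing = self.store.get(digest)
        if existing is not None and existing != data:
            # md5 collision: two different objects, same digest.
            # Simplest safe fallback: qualify the key with the cid so the
            # colliding object is cached separately (old behaviour).
            digest = digest + ':' + cid
        self.store.setdefault(digest, data)
        self.index[cid] = digest

    def get(self, cid):
        digest = self.index.get(cid)
        return self.store.get(digest) if digest is not None else None
```

With this scheme, a thousand URL variants of one page cost one stored copy plus a thousand small index entries, rather than a thousand full copies.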
Some other complexities: We might need an "instance count" for the referenced cached object. Then, on a cache_clear request, clear the target object from the cache when the instance count reaches zero. We might choose to update the expiry data on an existing cached object when receiving a request to cache an identical object. Or probably better, we could store the new expiry in the cache index item. Some module somewhere might need to reliably read the timestamp for items which it has cached - this is possible with the current cache system, so should probably still be supported in some enhanced cache system.
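The instance-count idea could be sketched like this (again hypothetical names, not Drupal's actual cache API): the shared object is only dropped when the last cache ID referencing it is cleared.

```python
import hashlib

class RefCountedCache:
    """Sketch of reference-counted dedup: a shared object is removed only
    when the last cache ID pointing at it is cleared."""

    def __init__(self):
        self.store = {}  # digest -> object bytes
        self.refs = {}   # digest -> number of cache IDs pointing at it
        self.index = {}  # cache ID -> digest

    def set(self, cid, data):
        if cid in self.index:  # re-setting a cid: drop its old reference
            self.clear(cid)
        digest = hashlib.md5(data).hexdigest()
        if digest not in self.store:
            self.store[digest] = data
            self.refs[digest] = 0
        self.refs[digest] += 1
        self.index[cid] = digest

    def clear(self, cid):
        digest = self.index.pop(cid, None)
        if digest is None:
            return
        self.refs[digest] -= 1
        if self.refs[digest] == 0:  # last reference gone: free the object
            del self.store[digest]
            del self.refs[digest]
```

Per-cid expiry timestamps would then live alongside the pointer in the index, as suggested above, so each caller can still read back the timestamp for its own entry.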
Finally, we need to prevent the cache index from growing excessively (potentially the same DOS vulnerability that we started with). That's not too difficult - give it a maximum size, and clear out "least-used" values when it gets full. Ok, now we need an algorithm for that, LRU or LFU etc. - http://en.wikipedia.org/wiki/Least_recently_used - LRU is simple to implement, and probably adequate.
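A bounded LRU index along those lines can be sketched in a few lines of Python (illustrative; a real implementation would live in the cache backend, and the maximum size would be configurable):

```python
from collections import OrderedDict

class BoundedLRUIndex:
    """Sketch of the bounded cache index: holds at most max_size entries,
    evicting the least-recently-used key when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()  # key -> value, oldest first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def set(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the LRU entry
```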
Additionally, we could potentially detect DoS attacks. If we are getting large numbers of requests at distinct URLs which result in identical objects passed to cache_set(), then we apparently have a DoS attempt. Offending requests could be ignored, redirected, given a 404, etc.
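A naive version of that detection might look like the following (hypothetical names; the threshold value is arbitrary and would need tuning). It counts how many distinct URLs have produced each identical object, and refuses to cache new variants past a threshold:

```python
import hashlib
from collections import defaultdict

class DupDetector:
    """Sketch: track how many distinct URLs have produced each identical
    cached object; past a threshold, treat further variants as an attack."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.urls_per_digest = defaultdict(set)  # digest -> distinct URLs

    def should_cache(self, url, data):
        digest = hashlib.md5(data).hexdigest()
        urls = self.urls_per_digest[digest]
        urls.add(url)
        # Too many distinct URLs yielding one identical object: likely DoS.
        return len(urls) <= self.threshold
```

The caller would then skip cache_set() (or return a 404, etc.) when should_cache() says no. Note this tracker itself needs a size bound, for the same reason as the cache index above.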
Ok, this would all be considerable processing overhead, but in my mind it's the only way to address the underlying issue. Probably too much overhead to be included in core, but perhaps as a module for those who want it. The alternative approach of somehow detecting "valid" URLs is probably not feasible, for the reasons others have already mentioned above (Views filters, etc.).
A much simpler alternative: just limit the total number of entries or the total size of the cache, probably with independent limits per "bin". Use LRU or similar to clear out "old" items. The problem with that approach is that an attacker could effectively defeat the cache, leading to increased load, similar to point (2) in the original post above. But it would at least solve the issue of using excessive disk space (which could otherwise crash the server). A simple mechanism like this could, and perhaps should, be in core. "Advanced" users will be using Varnish, APC, etc., which can provide their own limits on storage size, but the majority of Drupal sites use the core cache system with database tables and so are vulnerable. Currently any such site running D5/D6/D7 is vulnerable to this type of attack.
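For completeness, a sketch of that per-bin size cap (illustrative Python with hypothetical names; a real version would run against the database cache tables): evict least-recently-used entries until the bin fits under its byte limit.

```python
from collections import OrderedDict

class SizeLimitedBin:
    """Sketch of a per-bin total-size cap: evict least-recently-used
    entries until total stored bytes fit under max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.total = 0
        self.entries = OrderedDict()  # cid -> bytes, oldest first

    def set(self, cid, data):
        if cid in self.entries:
            self.total -= len(self.entries.pop(cid))
        self.entries[cid] = data
        self.total += len(data)
        while self.total > self.max_bytes:
            _, evicted = self.entries.popitem(last=False)  # drop LRU entry
            self.total -= len(evicted)

    def get(self, cid):
        if cid not in self.entries:
            return None
        self.entries.move_to_end(cid)  # mark as recently used
        return self.entries[cid]
```

An attacker flooding the bin with URL variants then churns the cache (hurting the hit rate) but can never push it past its configured size.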