Fork Drupal for High Performance purposes.

Riki_tiki_tavi's picture

Today Drupal has become a very big part of the Web. It's a very powerful tool for creating a small or medium site of any kind with minimal effort, and it can satisfy almost all business needs on the market.

But there is a big gap between a business card website, which has 50 visitors per day on 10 pages, and a high performance site, which should handle 1k+ visitors per day and serve 500k+ pages.
Right now I can see a lot of contrib modules that solve some of the performance problems that will certainly appear as your site traffic grows.

There are several different approaches for making Drupal core able to handle a lot of traffic, but I believe that optimizing layers outside Drupal core is not efficient. The core should be optimized as much as possible by design.

So why not gather all those modules, use that experience and make a separate "Drupal-HA" core which will include all community solutions?

Right now I'm talking about D6 and D7. The high-performance stuff does not seem ready for production in D8 yet, and it seems that work on the contrib modules (which will make D8 high-performance-ready) has only just begun.

Here is how I see it:

  • Better caching API. D8's cache tags and cache contexts engine looks promising and could be implemented in some form in D6 and D7, using a procedural approach, as separate cache implementations.
  • Include the D8 strategy of "caching everything". Ideally a simple request should not make a database call at all. That opens up a lot of possibilities for horizontal scaling.
  • High-Performance-ready installation profiles that will include Memcache/Redis/APC support out of the box.
  • Core-provided tools for cache warmup. I think that when a client opens a page, it should already be in the cache. If the page is static, the entire page could be served from the default Drupal cache system. If the page is dynamic, all the objects/arrays/data needed to build and render the page should be cached. It's crucial for user experience that "everything works fast" on your site, and it also affects your site's position in search rankings.
  • Provide an entry point for separate database storages in the same way as for cache storages. For example, all data needed for bootstrapping Drupal may be stored in a small separate SQLite database file. All content-related stuff may be moved somewhere else, even to NoSQL storage, so Drupal could be less SQL-dependent. As a result, database connections, cache connections and other storage connections should be opened only on demand, which will give an additional speed boost.
  • Core support for Master/Slave installations out of the box, as in the autoslave module (probably for Master/Master too). Database replication is key for High Availability and Data Reliability.
  • Load modules on demand. When you have a big site with a lot of functionality and 50+ installed modules, each page request needlessly includes all of them. There is a workaround: the menu router contains data on which module and files it should load, and the module itself contains a dependencies list in its .info file. Also module_exists() and module_invoke_all() may load the necessary modules.
  • Rework some developer tools and profilers and add them to core in such a way that they can be run on a production site without hurting performance for your visitors. It could be very useful for discovering bottlenecks that appear only in production.
  • Make simple clustering tools that will work with other software for HA, e.g. a health check URL for HAProxy or Nginx.
  • A simple lightweight console that shows the status of cluster nodes and of the entire cluster would be very useful. It could also monitor, in real time, the status of all resources the cluster is using, e.g. database availability, Master/Slave status, cache backend status, free space left on disks and memory, CPU usage. Custom pluggable user metrics via hooks should also be supported. Yes, I'm aware of the bunch of existing monitoring systems, but most of them are very heavy and hard to configure. Drupal knows much more about itself and may provide a complete working solution out of the box.
  • An entry point for changing the theming of core elements. That would make it easy to include some modern libraries (like Bootstrap or Foundation).
  • Documentation and guidelines for different optimization strategies and environment setups, with examples of the cases in which one optimization strategy or tool or another may be suitable.
  • Since D6 support will end in Feb 2016 and D7 support will end in 2-3 years, both HA branches should have LTS (Long Term Support).
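
To make the cache tags idea from the first bullet a bit more concrete, here is a minimal sketch of tag-based invalidation. It is written in Python only to keep it short and language-neutral; it is not Drupal's API, and the names `TaggedCache`, `set`, `get` and `invalidate_tags` are hypothetical. It assumes a single in-memory store, whereas a real backend would sit on Memcache, Redis or a database table.

```python
from collections import defaultdict


class TaggedCache:
    """Hypothetical in-memory cache where every entry carries a set of tags."""

    def __init__(self):
        self._items = {}               # cache id -> cached value
        self._tags = defaultdict(set)  # tag -> cache ids that carry this tag

    def set(self, cid, value, tags=()):
        # Drop any previous entry first so stale tag links do not survive.
        self.delete(cid)
        self._items[cid] = value
        for tag in tags:
            self._tags[tag].add(cid)

    def get(self, cid, default=None):
        return self._items.get(cid, default)

    def delete(self, cid):
        self._items.pop(cid, None)
        for cids in self._tags.values():
            cids.discard(cid)

    def invalidate_tags(self, tags):
        # Remove every entry that carries at least one of the given tags.
        for tag in tags:
            for cid in self._tags.pop(tag, set()):
                self.delete(cid)


cache = TaggedCache()
cache.set("page:/node/1", "<html>node 1</html>", tags={"node:1", "user:7"})
cache.set("page:/frontpage", "<html>front</html>", tags={"node:1", "node:2"})
cache.set("page:/node/3", "<html>node 3</html>", tags={"node:3"})

# Saving node 1 only needs to name the tag, not every page that shows the node.
cache.invalidate_tags(["node:1"])
```

After the call, both pages that depended on node 1 are gone while the node 3 page is still cached. The point of the design is that the code saving a node does not need to know which pages, blocks or lists have rendered it.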

To have a better chance of persuading the Drupal community to make the right decision, I've gathered some statistics from our local Web counter about "big", "medium" and "small" websites in Kazakhstan.
By a "big" site I mean a website with more than 50k visitors per day. Companies with "big" sites usually have a big development department; they develop and support their own engines, they use a stack of different solutions and are not interested in Drupal. "Medium" sites are all sites with between 50 and 50k visitors per day; they are the "target group" for Drupal-HA.
Yes, our country is a very small part of the WWW and has not yet grown as much as the Europe/US/Japan segments have, but I think the statistics could be representative.
The numbers for one day are:

  • "Big sites": 36 sites total, 43.9 million hits per day total;
  • "Medium sites": 1316 sites total, 6 million hits per day total;
  • "Small sites" ("business card websites"): 2270 sites total, 92k hits per day total.

As you can see, there is a large number of small (possibly Drupal) sites, but their traffic is a drop in the ocean.
Drupal can't be used by "Web Giants", but a lot of the "medium" sites could use Drupal if it had the appropriate tools and experience. Yes, the "medium" sites list may include Drupal sites with a lot of hits per day, but they probably have only static content (with aggressive caching) or have already gone through a painful optimization process.
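
For anyone who wants to check the "drop in the ocean" remark, the arithmetic can be reproduced with a few lines of Python. This is only a sketch over the three totals listed above, reading the "big" figure as 43.9 million; the function name `summarize` is made up for illustration.

```python
# (number of sites, total hits per day) for each segment, as listed above.
segments = {
    "big": (36, 43_900_000),
    "medium": (1316, 6_000_000),
    "small": (2270, 92_000),
}


def summarize(data):
    """Return {segment: (share of all hits in percent, average hits per site)}."""
    total_hits = sum(hits for _, hits in data.values())
    return {
        name: (100 * hits / total_hits, hits / sites)
        for name, (sites, hits) in data.items()
    }


summary = summarize(segments)
for name, (share, per_site) in summary.items():
    print(f"{name:>6}: {share:6.2f}% of all hits, about {per_site:,.0f} hits per site per day")
```

By these numbers the small sites are the majority by count but account for less than 0.2% of all hits, while the medium sites account for roughly 12%.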

If you have doubts about supporting the "obsolete" D6: I have already done the nasty job and backported the D7 tests to D6. That's my private D6 fork with performance fixes. I don't suggest using it; it is just an example of how everything may work. A slightly modified D6 with core tests and core module tests may be cherry-picked.

Thank you for your attention.

Comments


Andre-B's picture

I mostly agree on the fact that core should be as fast as possible and support various caching methods to ensure fast page load times out of the box. Personally I think APC/Memcached stuff is something crucial for a stable release - apparently I am wrong with this opinion. On the other hand, fixing race conditions in the cache layer for high load websites doesn't seem to be an important thing, or I have one of the only websites running into this issue: https://www.drupal.org/node/1679344

Regarding your point on big sites: I think Drupal is a perfect fit for those as well, at least if you think of 50k plus visitors a day. The number of hits isn't that much of a problem if it's mainly anonymous traffic, and you can pretty much scale to those big numbers on a single server + Varnish + Cloudflare CDN. Drupal will have a tough time with dynamic traffic though, but with enough knowledge, a good team and a couple of servers that can be solved as well.


Riki_tiki_tavi's picture

The keyword is "anonymous traffic". Unfortunately I don't have statistics about how many "big" sites have static content which Drupal + CDN may handle (maybe a large news hub or wiki) and how many have dynamic content like Facebook or Twitter. But the main topic is not about analysing the web :-)

Node save issues

mikeytown2's picture

You have 2 options at the moment: use https://www.drupal.org/project/cache_consistent or abandon non-DB caches and use https://www.drupal.org/project/apdqc
I went for the all MySQL solution.

Anonymous traffic is easy these days; our big D6 site (1,300+ US TV station hyperlocal news sites) with 6 Apache rendering boxes and even more Varnish boxes maxed out our 10GB uplink a couple of times in its life. You can throw lots of hardware at it to make the problem go away. That's what Patch Media recently did: https://pantheon.io/resources/patch-new-media-model-technology-case-study but it never reached the scale that our setup did: http://streetfightmag.com/2012/03/15/do-dataspheres-50m-uvs-make-it-a-mo...

D8 was built for dynamic content with BigPipe rendering; backporting all of it would be a massive task.


Riki_tiki_tavi's picture

Nope, I don't have them. We are using PostgreSQL ;-)

Patch FWIW

outlandish josh's picture

FWIW Patch actually downsized their infrastructure footprint as part of that case study — from a large and complex custom AWS setup to Pantheon. The main benefit was that they'd get a sane way to develop and release new functionality (as well as getting off pager duty), but they're throwing a lot less hardware at the problem now that it's all tuned and optimized. ;)

More importantly, since re-platforming and getting the ability to innovate again they've seen significant growth, but the hyperlocal media game is notoriously spiky. If you're the go-to web destination for school closures, one solid set of winter storms across a region could make a blowout month easily.

In any case, if you need to build for a huge amount of unique traffic, you should probably pick a more modern tool. If you're stuck with D6 for the foreseeable future, you may want to consider a decoupled services-based approach: turn your existing site into a higher performance API and use something lightweight to assemble the final page for the end user.

All things to everyone

jamatulli's picture

I might suggest you become familiar, if you aren't already, with the history of PressFlow, which is a high performance fork that has been largely incorporated into core. There are other forks out there as well (like Backdrop) that I'm sure you can get involved with, but I don't see much of a point in forks of D7. It may or may not be helpful, and I think it is more often time rather badly spent, considering the nature of community projects.

Also, D8 will see the majority of the most used modules ported and out of dev within a year to 18 months, if history is any indication. Contribution to D8 is where it's at for high performance related activity, since it is so much better already in that department. The same can be said for theme related and most other aspects of Drupal. D8 just handles everything much better.

In the meantime, Drupal 7 is most definitely scalable for very high traffic sites without modifying core. In one case that I was directly involved with, a very careful (and costly) professional market audit was done by a world-renowned, household-name accounting firm, and Drupal was chosen as the base for all of the many high traffic sites of one of the largest purveyors of web content in the world. I'm talking about a few Alexa Top 500 to Top 3000 global websites (Top 100 to 500 U.S.) as well as dozens of others that are pretty high up there. It is already in production across many of these sites with core unmodified and will continue to be implemented and used as such over the coming years. There are many other high volume sites that use it as well, but D7 is only 4 years old and big legacy sites don't move that fast.

That said, there is certainly room for improvement, and implementing a high traffic site with Drupal 7 isn't easy. Please show something that is, if you know of it. There are many choices and they all have steep learning curves and usually take big teams to get to where Drupal starts. It's not perfect... but that is what the next version is always about: getting better.

There are some things that are more difficult than others. The biggest factor of course is anonymous vs authenticated users. How you choose to handle the issues of your specific implementation is largely a factor of your specific needs.

There was obviously a lot of room for improvement in D7, which is why D8 is a departure in so many ways. It is far more scalable, has much better performance ootb and it's so much more extensible, but as you said, all the contrib modules aren't quite there yet. Still, if you are starting a new implementation for a high performance website, it is not as simple as "D8 isn't ready". It may be ready for your needs right away, or with a little bit of your contribution it will get there quicker. The question really is: is it better to spend your time contributing toward a fork that few will use, contributing toward a version that may be legacy in another couple of years, or contributing toward a version that isn't quite perfect yet but that you can help shape and that will last much longer?


Riki_tiki_tavi's picture

PressFlow and Backdrop seem a better starting point than D8 for D6 sites, due to the high complexity and paradigm shift of D8.

I think further discussion of an HA fork makes no sense. I didn't know about PressFlow, so now I can see the light at the end of the tunnel. Thank you for the suggestion.


btopro's picture

Yeah I'm not sure I agree w/ the premise that Drupal isn't for high scale. Possibly high scale of a certain kind, but contrib and hardware planning gets in there. Its cache backend flexibility is what really shines. http://www.examiner.com/ runs Drupal w/ tons of authenticated and anonymous traffic from a global audience. whitehouse.gov, psu.edu, etc. are all high traffic sites.

Now, that doesn't mean we can't make D7 faster. I think looking to D8 you'd be incredibly jealous of the BigPipe and RenderCache capabilities. The Render cache in D8 has a backport that was initially tried out for D7 which needs some love to increase accuracy / simplicity of setup but can work in many instances (drupal.org uses it for comment cache invalidation on largely static pages) -- http://drupal.org/project/render_cache

I think a D7 distro focused on speed that comes with the following out of the box could be a good idea to put efforts towards (internally we've talked of doing a PureSpeed distro).
- Include common / popular cache backends out of the box (apc, apdqc, redis, memcache, varnish)
- Include performance minded core patches (there are 3-4 out there now that have dramatic improvements but haven't been accepted yet)
- Include common / obvious contrib modules well optimized OOTB (httprl, advagg, authcache, entitycache, blockcache_alter, expire)
- Include well documented example settings files of these all working together
- Include an install profile that loads and tunes all common contrib projects to provide a faster starting point
- Built in cache warmers / recommended drush plugins (like xmlsitemap, drush_ecl, httprl_spider, xmlrpc_page_load, etc)
- Recommended crontabs and elysia cron support for less aggressive cache purging OOTB
I don't think this would take much effort and would be very useful for getting people in the right direction from a scale perspective.

Other things can be borrowed from things tagged purespeed on DPE -- https://drupal.psu.edu/blog-post-tags/purespeed . We'd be happy to contribute some ideas towards a purespeed distribution if there's interest.


miro_dietiker's picture

I think Drupal 8 is much more what you are looking for.
Core is simply behaving much better and knows much more (if not everything) about clearing. Cache contexts, cache tags, ...

Many of the D6/D7 feature modules are in Drupal 8 core or are simply no longer needed.
And most things you need are already ported for Drupal 8. We have been working on many modules and the whole community is very active. It will not take a year.

We already started two huge D8 media sites (e.g. http://www.letemps.ch/) and Drupal did a great job for us.
If you want to build something exciting, take Drupal 8 and start in a much more advanced domain.

One last thing, if you are interested in more specific cache clearing of lists: https://www.drupal.org/project/views_custom_cache_tag


Riki_tiki_tavi's picture

Thank you for the suggestion, but D8 is not exactly what we need.

We decided to upgrade our D6 site to Yii 2. It has all the performance improvements that we need ootb:

  • 0 database requests for bootstrapping. We may not even use a database for most pages (Sphinx will do the job).
  • Great performance. 7 ms for a plain page vs 100 ms for Drupal 8 (both tests with OPcache turned on; Drupal was installed on a "tmpfs/in-memory" database).

Moreover, we have a lot of Yii 2 developers in our city. I'm not sure that we'll be able to find experienced Drupal 8 developers yet.


gor's picture

Hi all!

I recently finished the render boost module for Drupal 7:

https://github.com/itpatrol/render_boost/

It's a Drupal 7 render cache (boost) module. It stores rendered data in the database, and if the render array is the same (compared by md5 sum) it uses the cached data instead of re-rendering it every time. It works smartly with forms. It requires core patching.
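
The general idea described here (hash the render array, reuse the stored output when the hash matches) can be sketched as follows. This is not the render_boost code: it is a simplified Python illustration under the assumption that the render array is plain JSON-serializable data, with an in-memory dict standing in for the database table. `RenderCache`, `key_for` and `slow_render` are hypothetical names.

```python
import hashlib
import json


class RenderCache:
    """Hypothetical wrapper that memoizes a render function by content hash."""

    def __init__(self, render):
        self._render = render
        self._store = {}  # md5 hex digest -> rendered output (stand-in for a DB table)
        self.misses = 0   # how many times the real renderer had to run

    @staticmethod
    def key_for(render_array):
        # Serialize with sorted keys so equal arrays always give the same hash.
        canonical = json.dumps(render_array, sort_keys=True, separators=(",", ":"))
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()

    def render(self, render_array):
        key = self.key_for(render_array)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._render(render_array)
        return self._store[key]


def slow_render(element):
    """Toy renderer: turns a nested dict into markup."""
    children = "".join(slow_render(child) for child in element.get("children", []))
    return f"<{element['tag']}>{element.get('text', '')}{children}</{element['tag']}>"


cache = RenderCache(slow_render)
element = {"tag": "div", "children": [{"tag": "p", "text": "Hello"}]}
first = cache.render(element)
second = cache.render(dict(element))  # an equal copy hits the cache
changed = cache.render({"tag": "div", "children": [{"tag": "p", "text": "Bye"}]})
```

Here the second call returns the stored markup without running the renderer again, and any change in the array produces a different hash and a fresh render. Elements that must not be reused between requests, such as forms, would need special handling that this sketch leaves out.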

You can see results here:
http://drupal7.drupal-site.net/node/199 (no cache, drupal 7) - after opcache warmed up - ~130ms
http://rb.drupal7.drupal-site.net/node/199 (only render_boost enabled for caching) - after opcache and render cache warmed up - ~60ms

You can register and test the same page for authorized user.

I ran the same test for an authorized user on a Drupal 8 website, same page:
http://drupal8.drupal-site.net/node/199 - default cache enabled and warmed up, opcache warmed up - 150ms

Hosting server details: php5.6 memory limit 512M, CPU E5-1650 v3 @ 3.50GHz, HW SSD RAID5

Vs render_cache?

DamienMcKenna's picture

How does it compare against the existing Render Cache module (https://www.drupal.org/project/render_cache)?


gor's picture

Render_cache caches nodes only.
Render_boost caches all render() calls, so it caches all rendered elements, not only nodes.
