High volume Drupal sites - what do we need to know?

mombee's picture

I've been pointed in the direction of the High Performance group to discuss my requirements (http://drupal.org/node/317535). I’m looking at building a gallery site that runs the following predicted metrics each month:

• 320k visits
• 200k unique visits
• 5m page views

• 14k uploads
• 16 GB total uploads

Anecdotally I'm being told that running this on Drupal is a risk, but all the documentation suggests that as long as we have good (dedicated) hardware, good 'normal' website tuning practices and a caching facility, then we should be OK.

When I'm evaluating a development partner, what questions should I be asking about their understanding of high volume sites, to make sure that they build Drupal in the right manner to make our site fly?

Cheers, Mombee.

Comments

Well crap. I just posted

Jamie Holly's picture

Well crap. I just posted this in the forums. Didn't realize I clicked the wrong link. Here's what I said:

There's a lot of good help in the high performance group. I just finished moving a high traffic site from WP to Drupal this weekend (crooksandliars.com). We average about a quarter of a million visitors a day and 400,000 page views. Yesterday while monitoring we had periods with 6,000+ guests and 50+ logged in users on the site and the servers were sitting almost idle. We also had one hour where we hit over 30,000 page views and the servers never broke a sweat.

On WP this kind of traffic brought us down instantly, even with WP caching plugins. We average about 2,000 comments a day on the site, which creates problems. One of the biggest was flock waits on the WP cache files, since those plugins do all their caching via the filesystem. On high traffic sites with constantly updating content, filesystem caching is something that shouldn't be used.

We are running on two quad core machines. The first has 8 GB and handles Apache + memcache. The second also has 8 GB and handles MySQL plus our static content. There are some hacks you can do to the core in order to get all the static content served from a static subdomain (and I think plans to include this option in future versions of Drupal are in the works). Basically, we have an rsync script firing every minute that updates the static directory tree on the static server with the entire Drupal directory structure. We use excludes so that PHP, INC, MODULE, etc. files aren't copied over. I also have a mod_rewrite rule in the static directory that checks whether the file exists; if not, it rewrites the request back to the Drupal server. This helps when the rsync is lagging, especially with things like JavaScript and CSS aggregation files.
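The sync-and-fallback arrangement described above can be sketched roughly as follows. This is only an illustration in Python, not the rsync excludes or mod_rewrite rules actually used on the site; the extension list beyond the three named in the post, the origin host name, and the function names are all assumptions.

```python
import os

# Extensions kept off the static server. The post names PHP, INC and MODULE
# files; the rest of the real exclude list is not given, so this is assumed.
EXCLUDED_EXTENSIONS = {".php", ".inc", ".module"}

# Hypothetical host name of the server that runs Drupal.
ORIGIN = "http://www.example.com"


def should_sync(path):
    """Return True if a file may be copied to the static server."""
    _, ext = os.path.splitext(path)
    return ext.lower() not in EXCLUDED_EXTENSIONS


def resolve_static(static_root, request_path):
    """Serve the file if it has been synced, else send the request to the origin."""
    relative = request_path.lstrip("/")
    local = os.path.join(static_root, relative)
    if os.path.isfile(local):
        return ("serve", local)
    # Not synced yet (for example a fresh CSS/JS aggregate): fall back to Drupal.
    return ("redirect", ORIGIN + "/" + relative)
```

The fallback only goes one way (static server to Drupal server), which is what keeps a missing file from bouncing between the two hosts.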

For memcache, we are using the cache router module. This thing has been a lifesaver. Right now we are using the most basic configuration for it (a single cache bucket and one server). As traffic grows we will be expanding that.

Another hack I ended up doing to the core was the path lookup. I have it ignoring a bunch of paths like the comment edit, admin and other ones that won't be aliased. I also have the other paths caching in memcache, which greatly reduced the number of queries we need.
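As a rough sketch of that idea (in Python rather than the actual core patch, which is not shown here): paths that are never aliased skip the lookup entirely, and everything else is looked up once and then answered from the cache. The prefix list, the dict standing in for memcache, and the function names are assumptions for illustration.

```python
# Prefixes assumed never to carry an alias. The post mentions comment edit
# and admin paths; the exact list used on the site is not given.
SKIP_PREFIXES = ("admin", "comment/edit")


def is_skipped(path):
    """Return True for paths that are never aliased."""
    return any(path == p or path.startswith(p + "/") for p in SKIP_PREFIXES)


def lookup_alias(path, cache, query_alias):
    """Return the alias for an internal path, avoiding needless queries.

    cache is any dict-like store (standing in for memcache). query_alias is a
    callable that performs the real database lookup and returns None when the
    path has no alias.
    """
    if is_skipped(path):
        return path  # skip both the cache and the database
    if path in cache:
        return cache[path]
    alias = query_alias(path) or path
    cache[path] = alias  # misses are remembered too, so they are not re-queried
    return alias
```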

I keep our own Drupal with hacks under version control so I can easily create patch files and apply them when new versions of Drupal are released. The same goes for any contributed modules we are using, which is only a couple.

Depending on your needs, it can be beneficial to weigh using a contributed module against writing your own. Drupal has a lot of great modules out there, but a lot of them also end up using a lot of needless resources. This isn't a fault of the module designer, but rather a necessary evil of having modules that can be customized to fit any site. If it's something you can write easily enough, do so. I ended up with about 12 custom modules for this site. Most could have come from contributed modules, but it was much more beneficial on the server end to write my own, tailored to our exact needs.


HollyIT - Grab the Netbeans Drupal Development Tool at GitHub.

Wow. Thank you for this

Glad it helped. Something I

Jamie Holly's picture

Glad it helped. Something I forgot: make sure you check important queries like the front page query. Although we use the basic default query for the front page (select by status, promoted, ordered by created, stickied), I still ended up adding an index just for that query, which gave an enormous performance boost. I actually have a patch in to add this index to core. Without it MySQL resorts to doing a filesort on that query.

While optimizing that I also wrote a small module that runs the front page query and has its own pager_query function, which I use in the places I need it. Essentially my function (crooks_pager_query) is exactly the same as the one found in core (I just copied it), but I changed it so that I could retrieve the found records count by different means. I have the total records count saved as a variable. It gets updated on node saves/deletes and every five minutes via cron. That prevents a count running on the nodes table (which has over 30,000 rows right now) on every index page view.

The benefit of this depends on how often your node table is updated. On our site that table is constantly updated throughout the day with people working on posts, so it really helped. If your node table isn't updated that much then it might not help, since the query cache would kick in.
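The stored-count idea can be sketched like this. It is a Python illustration only, not the crooks_pager_query code; the class, the page_bounds helper, and the choice to recount on each refresh are assumptions.

```python
class CachedNodeCount:
    """Hold the total row count in a stored value instead of counting per page view."""

    def __init__(self, count_query):
        # count_query is a callable that runs the real COUNT query.
        self._count_query = count_query
        self._total = count_query()

    def refresh(self):
        """Recount; meant to be called on node save/delete and from cron."""
        self._total = self._count_query()
        return self._total

    def total(self):
        """Cheap read used by the pager on every index page view."""
        return self._total


def page_bounds(total, per_page, page):
    """Return (offset, number_of_pages) for a zero-based page number."""
    pages = max(1, -(-total // per_page))  # ceiling division
    return page * per_page, pages
```

The pager then reads total() on every request, and the expensive count only runs when something actually changes or when cron fires.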



Static caching

juan_g's picture

intoxination wrote:
>There are some hacks you can do to the core in order to get all the static content served from a static subdomain (and I think plans to include this option in future versions of Drupal are in the works).

Currently, for the static part, it's possible to obtain similar remarkable results in speed and performance -without core hacks- by installing the Boost module (static page caching for non-logged in visitors), which also uses mod_rewrite like your solution. (There is a working patch to port Boost to Drupal 6).

Very interesting post, by the way.

I had actually looked at

Jamie Holly's picture

I had actually looked at using that and used it on my personal site when it was on Drupal 5. Now I am using cache router configured to use the file system. My personal site doesn't see that much traffic though. I usually only get a couple hundred hits a day, but sometimes I get a big link, like once when CNN linked me and I got about 6,000 hits in 1 hour. At the time I was on shared hosting and had Boost running then and the pages still pumped out with no delay.

I decided not to use Boost on this site since we were using a memcached server. With cacherouter we can use page fast cache and the pages are stored in memcache. We are running memcache on a separate machine and only seeing about 500 MB of the available 1 GB ever used, so it could easily run on the same machine as Apache.

A nice benefit of this is that we can add in another server with no problem. Using file based cache you have to use a shared filesystem so you don't end up with stale data on one server. We actually had to do that Tuesday night during the debate when we saw close to 50,000 hits in 1 hour. At that time we also had a ton of comments posted and over 100 logged in users on the site. Hopefully in the next few weeks we are going to have a load balancer in place and can just keep up the two servers (or more). The rough part on a site like this (political) is that you can sit basically idle for days on end and then some big news story comes out and your traffic jumps over 1000% within minutes.

It's possible to serve static items via mod_rewrite (images, js, css, etc), but the problem we had was our fallback on the static server. Essentially a small mod_rewrite rule checks for the file and if it isn't there then it sends it back to the server with Drupal. If we used mod_rewrite on the Drupal server to rewrite static to static.crooksandliars.com this would result in an endless loop if the file wasn't on the static server.

The hack we ended up with was very small and caused no real problem, since we had already hacked the path lookup functions. Everything is contained in common.inc. When I did the 6.5 update this week that patch applied perfectly, so we were able to deploy 6.5 within 10 minutes of it being released. This was another big benefit over WordPress. We had so many hacks going on in there just to get things running the way we wanted that every security update turned into a 2-4 hour project to deploy, even with patches. I don't miss that at all.



These figures are nice

kbahey's picture

These figures are nice and achievable.

If you limit the number and type of modules, and blocks on the site, you can do better.

An example is a site that gets 1,029,771 page views per day, and exceeds 45,500 page views an hour at peak times. This comes to 21.8 Million page views per month.

This is on a SINGLE SERVER, not split on more than one machine (Dual Quad Core Xeons, 8 GB, separate disks for various things).

Memcache and APC are indispensable in these cases. Custom patching is needed, but not too much (session writes that do not happen on every page view, a whitelist for aliases).
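To put the figures in this thread on a common scale, they can be converted to average page views per second. This is a small Python sketch; the 30-day month is an assumption, and page views are not the same as HTTP requests, since each page also pulls in static files.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
DAYS_PER_MONTH = 30  # assumed month length


def per_second(page_views, seconds):
    """Average page views per second over the given period."""
    return page_views / seconds


forecast = per_second(5_000_000, DAYS_PER_MONTH * SECONDS_PER_DAY)  # 5m page views a month
daily = per_second(1_029_771, SECONDS_PER_DAY)                       # single-server site, per day
peak = per_second(45_500, SECONDS_PER_HOUR)                          # same site, peak hour

print(f"{forecast:.1f} {daily:.1f} {peak:.1f}")  # 1.9 11.9 12.6
```

On these averages the forecast in the original post is roughly a sixth of what the single-server site sustains, although real traffic arrives in bursts rather than evenly.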

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.


I am a huge crooks and liars fan

captain sisko's picture

have been for years - I check it regularly for the latest, best political video clips - I'm pleased to know it runs drupal now! Can I ask you a question? Why don't you host all of your videos on youtube?

Well, for starters C&L was

Jamie Holly's picture

Well, for starters C&L was offering video before YouTube was around. We also have permission to use videos that normally get deleted on YouTube. We do use YouTube for a lot, but we also offer download formats for people who want to share them more easily (or who are dead set against Flash media). In the near future we will be moving to our own embedded video player, but will still have the download formats available.



If this are really monthly

Etanol's picture

If these are really monthly stats then you're pretty safe. I have a site which, with a proper cache setup, runs on a single low end dedicated server (1x dual core Xeon, 2 GB RAM, 250 GB SATA drive) and handles 400k unique users and 7M page views a month without memcache. Granted, there are far fewer uploads (~2k).
Of course all of that is site specific: how 'heavy' the site is, the ratio of logged in users to guests, the distribution of page views (mostly front page and pages linked from it vs. an even spread across the content), etc. For example, in your case you have to consider the server load caused by image resizes.

Drupal.org 15 million/month +

Amazon's picture

Just an FYI, there are plenty of Drupal sites above 20 million page views per month. Obviously, there's a lot of LAMP tuning that's independent of Drupal which needs to occur. For Drupal performance tuning documents, check out: http://tag1consulting.com/Drupal_Performance_Agency

Kieran

Drupal community adventure guide, Acquia Inc.
Drupal events, Drupal.org redesign

5 million per month is doable

kbahey's picture

Well, here is an answer to the question: Can a Drupal site handle a million page views a day?

So, 5 million a month is certainly doable. Some articles that will help are in our Drupal performance tuning and optimization for large web sites section.

The devil is in the details though. Each site is different, and anything more than generic rules cannot be provided without getting into specifics.



Khalid, would you be so kind

meba's picture

Khalid, would you be so kind as to provide a link for the session writes patch, if such an issue exists? Thanks!

Simple ...

kbahey's picture

It is actually simple, but helps a lot in not having contention for session writes.

It is a backport of this code from HEAD in session.inc.

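<?php
// Update the user's access timestamp only when the stored value is older
// than the configured interval (180 seconds by default), so that {users}
// is not written on every page view.
if ($user->uid && REQUEST_TIME - $user->access > variable_get('session_write_interval', 180)) {
  db_query("UPDATE {users} SET access = %d WHERE uid = %d", REQUEST_TIME, $user->uid);
}
?>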

Check the current HEAD version for this snippet and change your older Drupal to it.
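For readers who want the decision logic in isolation, here is a language-neutral sketch in Python of the same condition. It is not Drupal code; the function name and parameters are hypothetical, and only the 180-second default comes from the snippet above.

```python
import time

SESSION_WRITE_INTERVAL = 180  # seconds, the default used in the snippet


def should_update_access(uid, last_access, now=None, interval=SESSION_WRITE_INTERVAL):
    """Return True only for logged-in users whose stored access time is stale.

    uid is 0 for anonymous visitors; last_access and now are Unix timestamps.
    """
    if now is None:
        now = int(time.time())
    return bool(uid) and (now - last_access) > interval
```

With the default interval, a logged-in user who loads a page every few seconds causes at most one access-time write roughly every three minutes instead of one per page view.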



Where is definition of

gansbrest-gdo's picture

Where is the definition of REQUEST_TIME? I've searched my code base for that constant and cannot find it. Could you explain, please?

use time()

kbahey's picture

For Drupal 6 and earlier, replace REQUEST_TIME with time().



Some clues

pmichelazzo's picture

Dear friends,

I'm reading this thread very carefully because it is very important for me. Today I have a client with a strange problem. They get 1.2 million accesses per month and cannot put any "bombs" on the homepage, because the resulting traffic peaks (something like 500 users at the same time) bring the server down.

In this scenario, what can I check to solve this problem? Move MySQL to another machine? Allow more connections in Apache? Any clues?

Just in case, the server is a Quad-Xeon with 8GB of RAM.

Thanks a lot

Paulino Michelazzo
http://www.michelazzo.com.br


Yes, I'm Brazilian and we don't speak Spanish here (but I can talk too).

hmm...

PlayfulWolf's picture

From my non-Drupal experience, I would recommend starting from the "low level": first the hardware itself (which I think is more than sufficient here), then MySQL (query cache, key buffer size, slow queries...), then PHP settings and PHP caches (APC and similar), Apache, memcached, and only then Drupal modules and Drupal caching.

Of course, that would be the way for me, because I have more LAMP experience than Drupal ;) Also, doing everything bottom-up means there is less chance that you will need to rebuild your site.

IMHO, MySQL on a separate server will give you twice the problems and longer access times. With 8 GB of RAM you can make your site fly!

What are the pageviews/day and hits/day?
Average Mysql queries/second?

BTW http://2bits.com/ - great resource on Drupal tuning!

drupal+me: jeweler portfolio

Buying hardware should be the last thing

Coornail's picture

I disagree. Hardware is expensive, caching is cheap.
First check out APC; it can really cut your page generation time.
Then take a look at the MySQL config; there is a good script for that.
After that go for apache.conf. Take the memory you want to spend on Apache processes, check how much memory one page load takes (it's much lower once APC is configured), divide the first by the second, and bang, that is your MaxClients setting.

If you're really sure you did everything, then you can go and buy more hardware (more memory always works wonders).
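The MaxClients rule of thumb above comes down to a single division. A minimal Python sketch, with made-up example figures (the 4 GB budget and 40 MB per process are assumptions, not measurements from any site in this thread):

```python
def max_clients(apache_memory_mb, per_process_mb):
    """Largest number of Apache processes that fits in the memory budget."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return int(apache_memory_mb // per_process_mb)


# Hypothetical figures: 4 GB set aside for Apache, 40 MB per process.
print(max_clients(4096, 40))  # 102
```

Measure the per-process figure on your own server under real load, and leave the rest of the RAM for MySQL, memcached and the operating system.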

Investing in hardware should

Etanol's picture

Investing in hardware should only be considered if you're absolutely positive that tuning software is no longer an option.

Compare a freshly installed LAMP + Drupal with no caching against a highly optimised MySQL + lightweight webserver (lighttpd/litespeed/nginx) + PHP compiled with only the libs you need + opcode cache (apc/xcache) + memcached + drupal cache + drupal advanced cache + block cache.

In the majority of cases the difference is an order of magnitude or, if the majority of your users are not logged in, two orders of magnitude.

A $150 dedicated server, if properly configured, can outperform two $900 servers (double the price rarely means double the power) running basic LAMP with default settings, and I am ready to prove it.

well.. seems you haven't

PlayfulWolf's picture

well.. seems you haven't read the post correctly - I have experience with non-Drupal scripts, which used to act unexpectedly with caching. I am really conservative about high-level caching.

my point was about where to start - from purely Drupal caching stuff or from the LAMP stack (which is my case...) ;)
for the majority of Drupal developers out here it may be easier to understand Drupal performance modules/options.


Hardware is cheap, System admins are expensive

Amazon's picture

What is cheaper: adding 6 GB of RAM, or paying your system administrator/developer to learn how to tune a LAMP stack?

For example: a small instance (1.7 GB RAM) on AWS is $72/month. A large instance (7.5 GB) is $288/month.

So for $216/month you get about 6 GB more RAM and can just increase your basic Apache, PHP, and MySQL configurations wildly.

Alternatively, pay a system administrator, or staff at the data center, $75/hour to tune a LAMP stack. Are you sure they will have tuned it correctly after three hours, or will they bill you more?

Developers often think their time is free and don't account for their learning in the costs compared to hardware.
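The comparison can be worked through with the prices quoted in this post. This is a small Python sketch; the helper name is hypothetical, and the only inputs are the two instance prices and the hourly rate given above.

```python
def break_even_hours(small_monthly, large_monthly, hourly_rate):
    """Hours of paid tuning per month that cost the same as the bigger instance."""
    return (large_monthly - small_monthly) / hourly_rate


extra_cost = 288 - 72       # USD per month between the two quoted instances
extra_ram_gb = 7.5 - 1.7    # GB of RAM gained
hours = break_even_hours(72, 288, 75)

print(extra_cost, round(extra_ram_gb, 1), round(hours, 2))  # 216 5.8 2.88
```

In other words, at these prices the larger instance costs about as much as three hours of administrator time each month. The hardware bill recurs monthly, though, while tuning is mostly paid for once.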

Case in point: I estimate the Drupal infrastructure team probably spent about 20K in billable hours (all volunteer, but at market rates) tuning Drupal.org. Finally, enough was enough: I pulled out the credit card and we bought $2000 of RAM for our webservers and database servers. Voila, no more optimizing; we threw hardware at the problem and it went away. Of course, hardware with highly competent tuning is the preferred course.

For a good primer on tuning your LAMP stack start here: http://tag1consulting.com/performance_checklist

But don't be afraid to throw RAM, CDNs at the problem.

Kieran


Opcode caches can be a

Coornail's picture

Opcode caches can be a really big improvement on performance (1, 2). The tests show roughly a 3x performance boost, but I've seen as much as 8x.
It's also really fast and easy to install: apt-get install php-pear php5-dev apache2-prefork-dev build-essential && pecl install apc && apt-get remove php5-dev apache2-prefork-dev build-essential && /etc/init.d/apache2 restart.

In my opinion:
Is it cheaper to buy more RAM? No (entering these lines takes less time than filling out a form to request RAM modules).
Is it easier to buy more RAM? Yes.

Not entirely true. Basic

Etanol's picture

Not entirely true.
Basic tuning is fast. You can have a reasonably well configured MySQL + installed and configured xcache/apc + lightweight webserver setup + PHP recompiled + some basic stuff in Drupal (just installing and configuring 3 modules) in less than 2 hours. I can tell you it can be done, because I had to do an emergency migration for a moderately sized site and managed to pull it off.
This basic tuning increases performance 5-15 times, depending on the site, compared to a default LAMP configuration.

This also delays the need for horizontal scaling. Once you're beyond 2 servers (separate database and separate webserver) you will need a system administrator anyway.

80/20

kvantomme's picture

Ah, a classic case of the 80/20 rule:

I think both of you are right:
- First do the 20% of development work that Coornail is talking about, which will get you 80% of the performance improvement.
- Then throw hardware at the problem, because you'll only get relatively small performance gains from further tweaking.


Check out our new company blog on http://www.pronovix.com/blog

--

I blog and Tweet

Traffic spikes on a hot

Etanol's picture

Traffic spikes on a hot story usually mean that 99% of the extra users viewing the site aren't logged in. If that is the case you have a wide range of options, from content caching (advanced cache patches), memcached, and static file cache (Boost module) to a reverse proxy (with a cookie check). If you choose a reverse proxy, go with Varnish (http://varnish.projects.linpro.no/) rather than Squid.
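The cookie check mentioned for the reverse proxy can be illustrated with a short Python sketch. This is not Varnish or Squid configuration. It assumes that session cookies can be recognised by a name starting with "SESS" and that anonymous visitors are not handed a session cookie, which may need a patch depending on the Drupal version; the function names are hypothetical.

```python
SESSION_COOKIE_PREFIX = "SESS"  # assumed naming convention for session cookies


def is_anonymous(cookie_header):
    """Return True if the Cookie header carries no session cookie."""
    for part in cookie_header.split(";"):
        name = part.strip().split("=", 1)[0]
        if name.startswith(SESSION_COOKIE_PREFIX):
            return False
    return True


def route(cookie_header):
    """Anonymous requests can be answered from cache; the rest go to Drupal."""
    return "cache" if is_anonymous(cookie_header) else "backend"
```

During a spike dominated by anonymous readers, nearly every request takes the "cache" branch and never reaches PHP or MySQL.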

If you need any more specific information, let me know.

Panther CDN for anonymous spikes

Amazon's picture

http://www.pantherexpress.net is the cheap and most popular alternative. Several big sites serve 100% of their anonymous Drupal traffic from CDNs.

There's a patch out there to do this, but I couldn't find it immediately.

Kieran


We're using something similar

christefano's picture

We're using something similar to Ted Serbinski's patch for Digital Dollhouse.


Exaltation of Larks
Founder, CEO
http://www.larks.la  
Droplabs
Founder, Lead Burrito Analyst
http://droplabs.net  
Greater Los Angeles Drupal
Organizer, Drupal Adventure Guide
http://drupal.la  

Are the requests

christefano's picture

Are the requests we're talking about regular page views? Digital Dollhouse is a site we built that uses Services and it's not straightforward when measuring the number of requests since our logged in users are using the Flash-based interface most of the time.

I'm curious about people's use of JMeter to measure Services requests and not just traditional page views. JMeter has a proxy mode for cases like this but I've forgotten what it's called.



subscribing

Himanshu's picture

subscribing

Connecting Indraprastha University | IPU Tech (http://www.iputech.com/)

Do not post messages with

moshe weitzman's picture

Do not post messages with just the word 'subscribe' in them on groups.drupal.org. That's a sorry convention on drupal.org which we do not tolerate here. There are many, many email and RSS notification options on this site. Please use those. Thanks.

I apologize

Himanshu's picture

I didn't mean it that way. I just wanted to save the link. I saw other people doing so and followed.
I'll try the other ways. Thanks for your advice.


.

entendu's picture

Rawr, watch those claws, Moshe ;)

Non-binary OS FTW

likewhoa's picture

My steps for setting up a high performance web server involve a from-source Linux distribution like Gentoo Linux (http://gentoo.org) with an optimized kernel and LAMP packages.

My personal experience is that optimization should begin with decent hardware, e.g. a $100 dedicated host (dual-core or more, 2 GB+ of RAM, 8 GB preferred). Then the focus should be on OS and software optimization before considering investing in better hardware, unless you have money to burn and just don't believe much in software optimization. For example, I've been implementing Linux software RAID on all my new dedicated servers instead of buying a hardware RAID controller, which averages around $500, when software RAID is just as good if not better.

My point is that you should never assume that expensive hardware alone will help; instead you should focus on software optimizations for performance tuning. My choice of httpd server so far has been lighttpd, unless a client absolutely can't use it because of some third party Apache modules they depend on; otherwise it's lighty all the way. So when looking for a development studio for your Drupal site, make sure they know both sides of the performance tuning world: software and hardware optimization.

bending technology to fit businesses.

nice progress

drupalsmutant's picture

How many visits does the site get now?
