I'm new to the High Performance group and I'm interested to find out if people here have run a site with the following stats:
- 4,000 simultaneous visitors, up to 500 who are logged in at once
- approaching 2 million comments
- 200,000 forum posts
- 30,000 registered users
I'm currently working with a client who's porting a site with this level of activity over to Drupal. We've set up the test site using the Nitro system on Media Temple, on HP ProLiant servers with Intel Quad-Core Xeon 2.33 GHz processors and 8 GB of Fully Buffered (DDR-2) RAM. We've already done some custom optimization, but we're still seeing much more load on the server than we'd expect. We're beginning to wonder whether this kind of setup is possible on Drupal's forum module. Can anyone out there testify, first or second hand, as to whether Drupal and the forum module can handle this?

Comments
"4,000 simultaneous
"4,000 simultaneous visitors, up to 500 who are logged in at once" How do you define "logged in at once"? "Requested a page in the past x minutes"?
Defining simultaneous and logged in more specifically
Good question. On the current site, we define simultaneous as requesting a page within 15 minutes of each other, both for logged in and anonymous users. Does that answer it for you?
Makes sense
Makes sense, and Drupal has a similar way of determining "visitors" via the session table.
The issue is that the load is very different for anonymous and logged in users. For anonymous users it is fairly easy to scale a site using Boost, memcache and/or Squid. For logged in users it is not so easy.
So, if the majority of those 4000 users are anonymous, then it is not a big concern. If the majority are logged in, then it can be a concern.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Yes, it does. I asked
Yes, it does. I asked because a lot of people ask for "simultaneous users" but obviously they don't quite mean "simultaneous".
There is no single good definition of "simultaneous users" that I am aware of.
mysql> select count(*) from sessions where unix_timestamp(now()) - timestamp < 15*60 and uid != 0;
+----------+
| count(*) |
+----------+
|      579 |
+----------+
mysql> select count(*) from sessions where unix_timestamp(now()) - timestamp < 15*60 and uid = 0;
+----------+
| count(*) |
+----------+
|     1934 |
+----------+
This is from drupal.org a bit after peak hour. So it roughly matches what you want, I guess.
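The two queries above can be tried against a toy table. This is a minimal sketch in Python using an in-memory SQLite database rather than MySQL. The table has only the `uid` and `timestamp` columns the queries touch, the sample rows are invented, and the 15-minute window is the definition given earlier in the thread, not something Drupal enforces.

```python
import sqlite3

WINDOW_SECONDS = 15 * 60  # "simultaneous" = requested a page in the last 15 minutes


def count_recent_sessions(conn, now, window=WINDOW_SECONDS):
    """Return (logged_in, anonymous) counts of sessions active within `window` seconds of `now`."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(uid != 0), 0), COALESCE(SUM(uid = 0), 0)
        FROM sessions
        WHERE ? - timestamp < ?
        """,
        (now, window),
    ).fetchone()
    return int(row[0]), int(row[1])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (uid INTEGER, timestamp INTEGER)")
now = 1_000_000  # arbitrary "current" Unix time for the example
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        (7, now - 60),    # logged in, active a minute ago
        (9, now - 1200),  # logged in, but idle for 20 minutes
        (0, now - 30),    # anonymous, active
        (0, now - 899),   # anonymous, just inside the window
    ],
)
logged_in, anonymous = count_recent_sessions(conn, now)
print(logged_in, anonymous)
```

Changing `WINDOW_SECONDS` changes the answer, which is why "simultaneous users" needs a stated window before numbers from two sites can be compared.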
Drupal's forum module does
Drupal's forum module does suffer from a few pretty bad queries in a few places, which currently make running a large forum difficult. It is possible, as seen on a number of Drupal forums similar in size to what you mentioned (including drupal.org itself), but only with an excellent caching system. Though all forum pages are affected, the only show-stopping, performance-killing query is on the main forum home page.
Not long ago I did the following testing: http://drupal.org/node/314443 and later re-tested with a somewhat more accurate process on server-class hardware, where the results were a lot better: http://drupal.org/node/314443#comment-1202018 (this is still with no caching). My hardware is about the same as yours, only with less RAM (4 GB) and a single server.
I have not yet successfully installed Memcache and some of the other enhancements that would (evidently) be necessary to bring the performance to an acceptable level. You can see from the example sites I mention in my aforementioned post that it is possible, though. My understanding is that several people are working on better solutions for this in D7. The key is that the various numbers (topic/post counts, new posts, etc.) should be stored ahead of time instead of calculated on the fly. The system I'm switching to Drupal from does this (it stores stats instead of calculating them in real time), and it loads in a few milliseconds the same data that takes Drupal up to 10+ seconds, depending on hardware, so I know this would be a major improvement.
I hope this helps.
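To illustrate the "store it ahead of time" idea, here is a minimal sketch in Python with an in-memory SQLite database. The `post` and `forum_stats` tables and the helper names are hypothetical, chosen for the example; they are not Drupal's schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE post (id INTEGER PRIMARY KEY, forum_id INTEGER, body TEXT);
    -- Hypothetical summary table: one row per forum, kept current on write.
    CREATE TABLE forum_stats (forum_id INTEGER PRIMARY KEY, post_count INTEGER NOT NULL);
    """
)


def add_post(conn, forum_id, body):
    """Insert a post and bump the stored counter in the same transaction."""
    with conn:
        conn.execute("INSERT INTO post (forum_id, body) VALUES (?, ?)", (forum_id, body))
        updated = conn.execute(
            "UPDATE forum_stats SET post_count = post_count + 1 WHERE forum_id = ?",
            (forum_id,),
        ).rowcount
        if updated == 0:
            conn.execute("INSERT INTO forum_stats VALUES (?, 1)", (forum_id,))


def stored_count(conn, forum_id):
    """Cheap read: a primary-key lookup instead of an aggregate."""
    row = conn.execute(
        "SELECT post_count FROM forum_stats WHERE forum_id = ?", (forum_id,)
    ).fetchone()
    return row[0] if row else 0


def computed_count(conn, forum_id):
    """On-the-fly read: counts matching rows at request time."""
    return conn.execute(
        "SELECT COUNT(*) FROM post WHERE forum_id = ?", (forum_id,)
    ).fetchone()[0]


for i in range(5):
    add_post(conn, forum_id=1, body=f"post {i}")
add_post(conn, forum_id=2, body="hello")
print(stored_count(conn, 1), computed_count(conn, 1))
```

The trade-off is that every write path (new post, delete, move, unpublish) has to keep the counter in step, which is where a custom solution of this kind tends to need the most care.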
Thanks to everyone who
Thanks to everyone who answered here.
Kbahey, you're right that our big concern is logged in users. What number of "simultaneous" logged in users would be a problem? It sounds like 500 or so shouldn't be a major one.
Gerhard, thanks for the stats from the Drupal.org site on simultaneous users.
Keyz, the performance tests you linked to are very helpful. We've actually been working on a customized solution that stores the topic/post counts, new posts, etc. in a separate table rather than building them on the fly. Of course, this will also seriously limit our upgrade capability...
Here are a few additional questions:
No fixed number
There is no magic number because it depends on many factors: how complex the site is, how many modules are installed and which ones, the data set sizes, the access patterns of users, which blocks are enabled, and so on. Remember, Drupal can be customized infinitely, and therefore answers cannot be generic.
In benchmarks we've done for clients, we've seen the ability to handle 300 logged in users on 2 boxes (quad core each with 4 GB each, db on one, and app on the other), with each user visiting a page in 30 seconds or less.
This particular site had some heavy queries that took 1300 ms each when only one user was accessing the site. They will be rewritten to avoid that, since the site cannot scale with those.
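A rough back-of-the-envelope reading of those benchmark figures, sketched in Python. It assumes every page load runs one of the 1300 ms queries and that the query time does not get worse under load (in practice it usually does), so treat the result as a lower bound, not a measurement.

```python
def request_rate(users, seconds_between_pages):
    """Average page requests per second if each user loads one page per interval."""
    return users / seconds_between_pages


def queries_in_flight(rate_per_second, query_seconds):
    """Little's law: average concurrency = arrival rate x time spent in the system."""
    return rate_per_second * query_seconds


rate = request_rate(users=300, seconds_between_pages=30)
in_flight = queries_in_flight(rate, query_seconds=1.3)
print(f"{rate:.1f} pages/s, about {in_flight:.0f} slow queries running at once")
```

Under these assumptions, 300 users at one page per 30 seconds is 10 pages per second, and a 1.3-second query means roughly 13 copies of it running concurrently on the database server, which is why such a query has to be rewritten before the site can scale.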
By the way, if you use node access control with a large number of nodes and permissions, things can get bogged down quickly. Check how Drupal's node_access table can negatively impact site performance. In this case it was forum + node access that caused the site to slow down.
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Running your database on a
Running your database on a separate server can make a big difference, and is generally a good idea when you start getting deeper into performance tuning. At the least, it frees the database from the heavy Apache/PHP I/O and process management on the front-end server. You might talk to your hosting provider about the option of configuring your quad-Xeon box as two virtual servers (e.g. VMware) with 4 GB and 2 cores each: a cheap and effective way to get the benefit of separate servers with no more hardware.
Yes, with some caveats
Yes, running the database on a separate box is good in general.
But you have to make sure that you have gigabit Ethernet between the boxes. In some cases, we have seen the bandwidth between the boxes throttled to 10 Mbps, which is not enough for a typical site with hundreds of queries per page load, or even for a small number of queries that return large data sets (e.g. the queries that return cached pages, the menu cache, and even bloated session data in some cases).
Even worse are misconfigured hosts that count the local traffic between the boxes against your bandwidth allowance! Even 100 Mbps is far better than the default of 10 Mbps that most cheap to midrange hosts provide.
If your site is not heavy on the database, keeping the database and PHP on the same box avoids all these caveats.
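To put the link speeds in perspective, here is a small Python calculation of raw transfer time between the web and database boxes. The 500 KB of query results per page load is an assumed figure for illustration only, and the calculation ignores latency, protocol overhead, and other traffic sharing the link.

```python
def transfer_ms(payload_bytes, link_mbps):
    """Milliseconds to move `payload_bytes` over a link of `link_mbps` megabits per second."""
    bits = payload_bytes * 8
    return bits / (link_mbps * 1_000_000) * 1000


# Assumed figure for illustration only: 500 KB of query results per page load.
PAYLOAD = 500 * 1000

for mbps in (10, 100, 1000):
    print(f"{mbps:>4} Mbps: {transfer_ms(PAYLOAD, mbps):.0f} ms per page")
```

With that assumed payload, a 10 Mbps link spends about 400 ms per page just moving data and saturates at a couple of pages per second, while gigabit brings the same transfer down to a few milliseconds.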
Drupal performance tuning, development, customization and consulting: 2bits.com, Inc..
Personal blog: Baheyeldin.com.
Quality hosters may be able
Quality hosts may be able to install additional NICs and even let servers share the same rack, so you can use a dedicated communication path between the db and front-end server. The worst that can happen is having all the internal traffic routed relatively slowly through the datacenter, or maybe even between different datacenters. Network latency may then eat up every advantage you'd expect to gain by splitting up server duties.
Alex
Definitely important to
Definitely important to address the throughput between front/back-end servers. But I think it's best to think of this as a requirement rather than a problem - it's a standard topic that can be addressed with your hosting provider when working out your server configuration.
Also memcached++
I would also recommend implementing memcached as a back-end for Drupal's other caches (e.g. cache, cache_menu, etc.), as this will keep your database from having to serve the variables and other application-level data for logged-in page loads.
In my experience it tends to reduce DB load (these aren't complex queries, but they can take up time, especially on complex sites where there's a large amount of application data) and speed up page build times for logged-in users. This can go double if you're moving the DB onto another box (which is a good idea), because then this data doesn't need to pass through that network I/O channel.
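The effect described here can be sketched in a few lines of Python. A plain dict stands in for memcached and `FakeDatabase` for the real database; both classes and their method names are hypothetical and are not the Drupal memcache module's API.

```python
class FakeDatabase:
    """Stand-in for the database; counts how often it is actually queried."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def load(self, key):
        self.reads += 1
        return self.rows[key]


class CacheAside:
    """Check the in-memory store first; fall back to the database on a miss."""

    def __init__(self, db):
        self.db = db
        self.store = {}  # plays the role of memcached in this sketch

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.db.load(key)
        return self.store[key]

    def set(self, key, value):
        """Write to the database and drop the stale cached copy."""
        self.db.rows[key] = value
        self.store.pop(key, None)


db = FakeDatabase({"variables": {"site_name": "Example"}, "menu": ["forum", "blog"]})
cache = CacheAside(db)
for _ in range(100):  # 100 logged-in page loads
    cache.get("variables")
    cache.get("menu")
print(db.reads)
```

After the first page load warms the store, the remaining loads never touch the database for this data, and with the database on another box they never cross the network link either.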
https://pantheon.io | http://www.chapterthree.com | https://www.outlandishjosh.com
Sorry to open up a dead
Sorry to open up a dead thread, but I've run into this a few times this week.
We have a database with close to 6 million rows in the term_node table and 200k in the node table, so some of the less pretty queries in the forum (and advanced_forum) module can bring our db server to its knees.
http://drupalcode.org/viewvc/drupal/contributions/modules/advanced_forum...
$sql = "SELECT r.tid, COUNT(n.nid) AS topic_count, SUM(l.comment_count) AS comment_count
  FROM {node} n
  INNER JOIN {node_comment_statistics} l ON n.nid = l.nid
  INNER JOIN {term_node} r ON n.vid = r.vid
  WHERE n.status = 1
  GROUP BY r.tid";
and
$sql = "SELECT n.nid, n.title, n.type,
  ncs.last_comment_timestamp,
  IF (ncs.last_comment_uid != 0, u2.name, ncs.last_comment_name) AS last_comment_name,
  ncs.last_comment_uid
  FROM {node} n
  INNER JOIN {users} u1 ON n.uid = u1.uid
  INNER JOIN {term_node} tn ON n.vid = tn.vid
  INNER JOIN {node_comment_statistics} ncs ON n.nid = ncs.nid
  INNER JOIN {users} u2 ON ncs.last_comment_uid=u2.uid
  WHERE n.status = 1 AND tn.tid = %d
  ORDER BY ncs.last_comment_timestamp DESC";
I was thinking of ripping this code out and replacing it with a simple variable_get(), then in a second module, variable_set()'ing the value once every 5 minutes.
This would then ensure that these painful queries are only run once every 5 minutes, and not 30 times when 30 authenticated users hit the page at the same time.
Is this the right approach, or is there some other Drupal kung fu I should be doing here?
a_c_m
Sounds reasonable, though you
Sounds reasonable, though you probably want to use cache_set() and cache_get() rather than variables. You should also run EXPLAIN on those queries to make sure that they can be executed in an optimal way.
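The "run the query at most once every five minutes" idea looks roughly like this as a time-based cache. This is a Python sketch of the pattern only: `TTLCache` and `expensive_forum_counts` are hypothetical names, and the real thing in Drupal would go through cache_set() and cache_get() with an expiry rather than this class.

```python
import time


class TTLCache:
    """Tiny time-based cache: recompute a value only after `ttl` seconds have passed."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = compute()
        self.entries[key] = (now + self.ttl, value)
        return value


calls = 0


def expensive_forum_counts():
    """Placeholder for the slow GROUP BY query; here it only counts its own runs."""
    global calls
    calls += 1
    return {"tid 1": 42}


fake_now = [0.0]  # injected clock so the example does not have to sleep
cache = TTLCache(ttl=300, clock=lambda: fake_now[0])

for _ in range(30):  # 30 users hit the page together
    cache.get_or_compute("forum_counts", expensive_forum_counts)
fake_now[0] = 301.0  # five minutes later
cache.get_or_compute("forum_counts", expensive_forum_counts)
print(calls)
```

One caveat of this pattern: on a multi-process web server, several requests arriving right at expiry can each see a stale entry and recompute at once, so the slow query may still run a few times per interval rather than exactly once.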
But you might just be able to improve the query and submit your changes back to the project. In the second one I can't see why u1 is in there. And it might be much quicker to pull out u2 and just do quick individual queries to get the user names. You could experiment with splitting the first one into two separate queries as well.
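As an illustration of pulling the user lookup out of the second query, here is a sketch in Python against an in-memory SQLite database. The schema is cut down to the columns involved, the term_node join and tid filter are left out for brevity, and the names are resolved with one follow-up `IN (...)` query, a variant of the "quick individual queries" idea. Whether this is actually faster on a given data set is something EXPLAIN and measurement would have to confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT, status INTEGER);
    CREATE TABLE node_comment_statistics (
        nid INTEGER PRIMARY KEY, last_comment_timestamp INTEGER,
        last_comment_name TEXT, last_comment_uid INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO node VALUES (10, 'First topic', 1), (11, 'Second topic', 1);
    INSERT INTO node_comment_statistics VALUES
        (10, 100, NULL, 2), (11, 200, 'a guest', 0);
    """
)


def latest_topics(conn):
    """Fetch topics without joining users, then resolve names in one follow-up query."""
    topics = conn.execute(
        """
        SELECT n.nid, n.title, ncs.last_comment_uid, ncs.last_comment_name
        FROM node n
        INNER JOIN node_comment_statistics ncs ON n.nid = ncs.nid
        WHERE n.status = 1
        ORDER BY ncs.last_comment_timestamp DESC
        """
    ).fetchall()
    uids = sorted({uid for _, _, uid, _ in topics if uid != 0})
    names = {}
    if uids:
        marks = ",".join("?" * len(uids))
        names = dict(
            conn.execute(f"SELECT uid, name FROM users WHERE uid IN ({marks})", uids)
        )
    # Same rule as the IF(...) in the original query: uid 0 uses the stored name.
    return [
        (nid, title, names.get(uid, stored) if uid != 0 else stored)
        for nid, title, uid, stored in topics
    ]


result = latest_topics(conn)
print(result)
```

Note that dropping an INNER JOIN can change the result set if some rows have no matching user, so the rewritten query should be checked against the original on real data before submitting it back to the project.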
--
Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his