Drupal 6 site crashing every few hours (high disk I/O)

manuj_78's picture

Background: The Drupal 6 site ran very well on a VPS until a few months ago. To reduce server costs and improve the site's performance further, we decided to move to the cloud and Pantheon. We set up a Pantheon server, moved a copy of the site to Rackspace Cloud, and tested it with JMeter using 1000 users continuously for 4 hours; the load never went above 1. The JMeter report showed an error rate of only 0.4% (which is normal, I think).

So we moved our main site to Pantheon and the cloud. However, as soon as we moved the main site, it started crashing every 3-4 hours.

We have now tried switching back to plain Drupal and even setting up a completely fresh cloud CentOS 5.5 server, but the problem does not go away.

Site traffic: We have around 2000 users visiting the site every day, with around 10-12 nodes added per day.

Server:
Drupal 6.17
CentOS 5.5
1024 MB RAM
Linux 2.6.33.5-rscloud on x86_64
APC
Image toolkit: gd
MySQL: 5.0.91
PHP: 5.2.13
PHP memory limit: 394 MB
Web server: Apache/2.2.15 (EL)

Problem: The site crashes every few (3 to 4) hours because the 1024 MB of RAM is completely used up and disk I/O usage is very high. Can someone please point us in the right direction on how to isolate the problem? Are there any tools we can use to figure out the cause?

One more thing I would like to check: what exactly does the Mercury profile change? After upgrading to the Mercury profile, is there any problem if we go back to basic Drupal?

Is there any chance Mercury changes something on the website?

Comments

Mercury just packages an

hunvreus's picture

Mercury just packages an optimized stack for running Drupal: Varnish + Apache 2/PHP 5/MySQL 5 + Pressflow (Drupal) + Memcached + Solr. Pressflow is no more than a version of Drupal patched for performance, which only supports PHP 5 and MySQL and adds support for a few additional optimization-centric features (reverse proxy, master/slave replication). You should have no problem switching back to a simpler architecture.

Now, you're saying you ran into problems when you made the switch to it:

  1. Did you roll in any new features along with the switch to Mercury?
  2. Do you have any recurrent event (like a cron) that would run every few hours?
  3. Can you provide more specific data on the RAM use? Did you try to track processes that are eating up your memory? Are they building up over a long period of time or do they suddenly skyrocket?
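One way to answer question 3 is to sample the resident memory of the Apache children at intervals and see whether each process grows steadily or jumps suddenly. Here is a minimal Python sketch, assuming a Linux server where /proc is readable and the Apache processes are named httpd (as on CentOS). The function names rss_kb_by_pid, growth, and watch are made up for this illustration and do not come from any tool mentioned in this thread.

```python
import os
import time


def rss_kb_by_pid(name):
    """Return {pid: resident memory in kB} for processes with this command name."""
    result = {}
    if not os.path.isdir("/proc"):  # not a Linux-style /proc: nothing to sample
        return result
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status", errors="replace") as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
        except OSError:
            continue  # the process exited or is not readable
        if fields.get("Name", "").strip() != name:
            continue
        rss = fields.get("VmRSS")  # missing for kernel threads
        if rss:
            result[int(entry)] = int(rss.split()[0])
    return result


def growth(before, after):
    """Return {pid: change in kB} for processes present in both samples."""
    return {pid: after[pid] - before[pid] for pid in after if pid in before}


def watch(name, interval_s, samples):
    """Print the RSS of each process and its change since the previous sample."""
    previous = rss_kb_by_pid(name)
    for _ in range(samples):
        time.sleep(interval_s)
        current = rss_kb_by_pid(name)
        for pid, delta in sorted(growth(previous, current).items()):
            print(f"{pid}: {current[pid]} kB ({delta:+d} kB)")
        previous = current
```

Calling watch("httpd", 60, 10) would print ten one-minute samples. A steady positive change on every child would point toward a leak that builds with each request, while a sudden jump on one child would point toward a single expensive page or cron run.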

I hope you have a decent sysadmin that can give you the right metrics to identify your issue.

--
Wiredcraft (http://wiredcraft.com) - Building Web and mobile solutions using Open Source technologies.

No new feature was added to

manuj_78's picture
  1. No new feature was added to the site, as we did not want to add another unknown in case something went wrong.
  2. Cron is scheduled to run every hour.
  3. Here is the output for the running processes, sorted by memory use:

ID Owner Size Command
1191 apache 671408 kB /usr/sbin/httpd
1210 apache 669872 kB /usr/sbin/httpd
1193 apache 665220 kB /usr/sbin/httpd
1223 apache 662568 kB /usr/sbin/httpd
1194 apache 661844 kB /usr/sbin/httpd
1227 apache 661816 kB /usr/sbin/httpd
1226 apache 653208 kB /usr/sbin/httpd
1254 manuj 653124 kB /usr/sbin/httpd
1205 apache 652536 kB /usr/sbin/httpd
1192 apache 652204 kB /usr/sbin/httpd
1255 apache 651652 kB /usr/sbin/httpd
1196 apache 651112 kB /usr/sbin/httpd
1197 apache 648760 kB /usr/sbin/httpd
1195 apache 648616 kB /usr/sbin/httpd
1198 apache 648420 kB /usr/sbin/httpd
1143 root 612252 kB /usr/sbin/httpd
1065 mysql 509056 kB /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-f ...
1151 root 74824 kB crond
1018 root 65944 kB /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/my ...
981 root 62632 kB /usr/sbin/sshd
2959 root 51968 kB /usr/libexec/webmin/proc/index_size.cgi
1092 nobody 51548 kB proftpd: (accepting connections)
1199 root 48852 kB /usr/bin/perl /usr/libexec/webmin/miniserv.pl /etc/webmin/miniserv.conf
2963 root 48852 kB /usr/bin/perl /usr/libexec/webmin/miniserv.pl /etc/webmin/miniserv.conf
963 haldaemon 30008 kB hald
1178 avahi 23156 kB avahi-daemon: running [srv.local]
1179 avahi 23156 kB avahi-daemon: chroot helper
964 root 21700 kB hald-runner
938 dbus 21260 kB dbus-daemon --system
1169 xfs 20268 kB xfs -droppriv -daemon
386 root 12612 kB /sbin/udevd -d
3013 root 10796 kB sh -c ps --cols 2048 -eo user:80,ruser:80,group:80,rgroup:80,pid,ppid,pgid,pcpu, ...
3014 root 10464 kB ps --cols 2048 -eo user:80,ruser:80,group:80,rgroup:80,pid,ppid,pgid,pcpu,vsz,ni ...
1 root 10356 kB init [3]
926 root 5916 kB syslogd -m 0
929 root 3812 kB klogd -x

The processes build over time.
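A listing like the one above is easier to judge once it is totalled per command. The sketch below is a rough Python illustration that parses lines in the same "ID Owner Size kB Command" shape; the function name total_kb_by_command and the shortened SAMPLE are assumptions made for this example. Note that the ps command visible at the bottom of the listing includes vsz, so the Size column is probably virtual size, which counts shared memory (for example the APC cache) once per process and so overstates real RAM use. That reading is an inference, not something confirmed in the thread.

```python
from collections import defaultdict

# A few lines copied from the listing above (the mysqld command is shortened).
SAMPLE = """\
ID Owner Size Command
1191 apache 671408 kB /usr/sbin/httpd
1210 apache 669872 kB /usr/sbin/httpd
1065 mysql 509056 kB /usr/libexec/mysqld --basedir=/usr
1151 root 74824 kB crond
"""


def total_kb_by_command(listing):
    """Sum the Size column (kB) per executable, ignoring arguments."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        parts = line.split(None, 4)
        # Skip the header and anything that is not "pid owner size kB command".
        if len(parts) < 5 or not parts[0].isdigit() or parts[3] != "kB":
            continue
        command = parts[4].split()[0]
        totals[command] += int(parts[2])
    return dict(totals)


if __name__ == "__main__":
    for command, kb in sorted(total_kb_by_command(SAMPLE).items(),
                              key=lambda item: -item[1]):
        print(f"{kb / 1024:8.0f} MB  {command}")
```

Run against the full listing, the httpd total comes to far more than the 1024 MB of RAM on the server, which is why the resident size (rss) of each process is the more useful number to collect next.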

Additional question: The site is now running under /home/manuj/sitename.com. Is it by any chance supposed to run under /var/www/, and would that make any difference?

Am I correctly reading your

hunvreus's picture

Am I reading your post correctly: you have about 15 concurrent Apache processes eating up 650 MB each? Why did you even have to raise your PHP memory_limit to as much as 394 MB? What are you running that would eat up that much memory?


That is exactly where we need

manuj_78's picture

That is exactly where we need help: tracing what could be eating up all the memory, as we have no clue.

This is what my sysadmin told me yesterday:

We have set a maximum of 15 simultaneous Apache processes. Each process should serve 4000 client requests and then die. In our case, some of the PHP code uses this client and does not close the connection, so all 15 processes end up serving 4000 connections each that are never closed (this happens after a couple of hours of running).

Each of the above clients uses memory until the full allocated RAM is used up, and then the server moves to swap.

So we need to trace which PHP code (or module) inside Drupal is causing the never-ending session.

You need to tweak that a lot.

jmccaffrey's picture

You need to tweak that a lot. 4000 is way too high for PHP; lower it to something like 100. You are seeing a PHP memory leak eat your RAM. That is why the per-process usage is so high.
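The numbers quoted in this thread (1024 MB RAM, 15 Apache processes, PHP memory_limit of 394 MB) can be checked with some quick arithmetic. The Python sketch below is only a back-of-the-envelope estimate: the 300 MB reserved for MySQL and the operating system is an assumed figure, the function names are invented for the example, and memory_limit is an upper bound per request rather than what each process actually uses.

```python
def worst_case_mb(children, per_child_mb):
    """Memory needed if every child reaches the per-child ceiling at once."""
    return children * per_child_mb


def max_safe_children(total_ram_mb, reserved_mb, per_child_mb):
    """How many children fit in RAM after reserving memory for everything else."""
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_child_mb <= 0:
        return 0
    return usable // per_child_mb


TOTAL_RAM_MB = 1024      # from the server description
CHILDREN = 15            # maximum Apache processes reported by the sysadmin
MEMORY_LIMIT_MB = 394    # PHP memory_limit from the server description
RESERVED_MB = 300        # assumption: MySQL plus the OS, not measured

if __name__ == "__main__":
    print("worst case:", worst_case_mb(CHILDREN, MEMORY_LIMIT_MB), "MB")
    print("children that fit:",
          max_safe_children(TOTAL_RAM_MB, RESERVED_MB, MEMORY_LIMIT_MB))
```

Under these assumptions the worst case is 5910 MB against 1024 MB of RAM, and only one child at the full limit fits. In practice this suggests either a much lower memory_limit or far fewer processes, with recycling children after fewer requests (as suggested above) limiting how long a leak can accumulate.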

That's not something that is

hunvreus's picture

That's not something that is going to be solved easily through the groups; you're going to need to identify where in your app the problem is. That means doing a complete audit of your app: is it coming from the theme? From one of the contrib modules you use (some well-known modules can be problematic)? From custom code you added?

Actually, as a start, it would be helpful to know whether you have custom-coded modules and a custom theme.



More information about your architecture?

spearhead93's picture

Do you use views?
Which caching solution are you using?
What is the cache lifetime that you are using?
Have you turned on the slow query log? You may be seeing Apache processes pile up because MySQL cannot cope with the load at peak time.

Reporting

sreyas's picture

Hi,

Sorry to jump in late.

I am the one trying to tune the server for best performance, and I have been trying different values for the Apache, PHP, and MySQL settings. None of them has worked yet...

  1. Yes, our Drupal theme is custom designed.
  2. Yes, we use Views extensively.
  3. We have tried Boost and external caching using Memcache.
  4. Minimum cache lifetime is set to 1 hour.
  5. We have tried turning on the slow query log in MySQL, but no queries are reported.

The main error we see when the load goes high is "client does not exit correctly, sending SIGTERM".

But one puzzle remains: this same site worked perfectly on another VPS with cPanel, and now that we are on a cloud we don't get performance anywhere near that of the earlier VPS.

Regards
Sreyas

I am not exactly sure as to

hunvreus's picture

I am not sure exactly how you're approaching the problem, but this kind of issue requires a very formal and structured process; you need to determine where the bottleneck is hitting you and what its source is. Blindly tweaking the Apache/PHP/MySQL configuration without really knowing the impact or reason for it will probably lead nowhere.

Do you have an experienced sysadmin who can help you track this down? If not, I'd recommend finding one as soon as possible.


I agree

sreyas's picture

@hunvreus: I agree that blindly changing the configuration does no good, but to find the best values for this particular web server we do need to try them out, right?

Anyway, we have been trying to debug this for some time. Actually, not many errors show up in any of the logs.

/var/log/messages says memory ran out; that's because Apache is eating up all the RAM and swap.
The Apache error log does not report any errors; it just hangs at the point where it shows "client did not exit, sending SIGTERM".
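Even when the error log shows nothing but SIGTERM messages, counting them per hour can show whether the hangs line up with something periodic, such as the hourly cron run mentioned earlier. The Python sketch below assumes the usual Apache 2.2 error log layout of "[timestamp] [level] message". The sample lines are invented for illustration and are not taken from the poster's server, and the function name did_not_exit_per_hour is likewise made up.

```python
import re
from collections import Counter
from datetime import datetime

# Invented sample lines in the assumed Apache 2.2 error log layout.
SAMPLE_LOG = """\
[Sat Aug 14 03:12:45 2010] [warn] child process 1191 still did not exit, sending a SIGTERM
[Sat Aug 14 03:12:45 2010] [warn] child process 1210 still did not exit, sending a SIGTERM
[Sat Aug 14 03:40:02 2010] [notice] caught SIGTERM, shutting down
[Sat Aug 14 06:15:10 2010] [warn] child process 1223 still did not exit, sending a SIGTERM
"""

LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")


def did_not_exit_per_hour(lines):
    """Count 'did not exit' messages, grouped by the hour they were logged."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or "did not exit" not in match.group("msg"):
            continue
        try:
            stamp = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
        except ValueError:
            continue  # timestamp in a different format: skip rather than guess
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


if __name__ == "__main__":
    for hour, count in sorted(did_not_exit_per_hour(SAMPLE_LOG.splitlines()).items()):
        print(hour, count)
```

If the counts cluster shortly after each hour, cron would be the first thing to examine; if they are spread evenly, normal page traffic is the more likely trigger.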

Actually, we tested the server performance using JMeter and it gave good results, but the issue happens only when we put the site into the real world.

This site was functioning pretty well on the old server, so why would that change once we moved to Pantheon? Just wondering what the Mercury profile does to the base system!

mercury tunables

sreyas's picture

One more thing: we had also tried the Mercury tunables as per this post: http://groups.drupal.org/node/70258

That is, without changing anything on our side, we just used Pantheon and the values suggested there.

Is PHP APC on? MySQL

Chris Charlton's picture

Is PHP APC on? MySQL Query_Cache ON? Is the database on its own server? Are you using MySQL or MySQLi to connect? Are most of your site users logged in or anonymous? Are you loading a lot of RSS feeds? Are you generating a lot of uncached dynamic feeds? Are you generating a lot of images/video, or encoding/transcoding media (on the same box)?

Just so you know, just because you are on a "cloud" doesn't mean that cloud has been set up well. You need to hire or pay for escalated support/consulting.

Chris Charlton, Author & Drupal Community Leader, Enterprise Level Consultant

I teach you how to build Drupal Themes http://tinyurl.com/theme-drupal and provide add-on software at http://xtnd.us

1) APC is turned ON 2) MySQL

manuj_78's picture

1) APC is turned ON
2) MySQL Query_Cache is ON
3) We do not have a separate server for the database, as that would increase the hosting cost, something we want to avoid.
When we were on the VPS the site used MySQL; however, when we switched to Pantheon, the database connection switched to MySQLi. We suspect Mercury might be the reason for the automatic switch, but I might be completely wrong about the cause of the switch from MySQL to MySQLi.
4) Most of the site's users are anonymous.
5) There are RSS feeds for taxonomy terms (approx. 100) on the site.
6) No video or media transcoding is done on the site; however, many of our stories have around 2 to 3 images associated with them.

I don't think we have ever said that our cloud server is set up well. That was one of the most important reasons why we wanted to use Pantheon as a base, as it has been created and tuned by experts in their fields.

The reason we are stuck at the moment is that none of the logs we have enabled contains any error that would help isolate the root cause of the issue.

did you have any component of

recrit's picture

Did you have any component of Mercury on your initial VPS, such as Varnish or Memcache?
If not, these might be misconfigured for your new cloud server such that the cache (Varnish or Memcache) fills up and causes the crash. Installing Munin could help you inspect this.

No as at that time i was the

manuj_78's picture

No; at that time I was the only one managing the site. I was following the development of Pressflow, Varnish, etc., but never gathered the courage to actually install and use them.

We have in fact disabled Varnish and Memcache on the cloud server but are still getting the crashes :-(

I will ask Sreyas to install Munin on the server and monitor it.

Is there anything else in terms of tools that we can use to help trace what is causing the Apache connections not to close?

Not really; you need to

hunvreus's picture

Not really; you need to determine where things are happening. The crashes seem to be caused by some of your code; I'd look first at your custom code (try benchmarking the site with another theme, try disabling some of the custom modules) and look for contributed modules with well-known performance problems.

One way or another, you're going to have a hard time solving this one if you don't have somebody who has a good grasp of Drupal and the server configuration.


I recommend profiling your

dalin's picture

I recommend profiling your site to see what portions of the page generation are problematic. For profiling I use a simple setup with xdebug and webgrind.

--


Dave Hansen-Lange
Director of Technical Strategy, Advomatic.com
Pronouns: he/him/his

XHprof is also good. Check

brianmercer's picture

XHProf is also good. Check out http://drupalperformanceblog.com/drupal-xhprof-profiling

There's a Drupal 6 module, and XHProf is integrated into Devel for D7.

Also, I packaged the PECL extension as an Ubuntu package for easy installation if you're on that distro: http://groups.drupal.org/node/82889

Site behaving good

sreyas's picture

Hi everyone,

The site has been behaving well for the last couple of days.

I have removed boost, apachesolr, memcache, varnish, etc. and ran update.php.

Now I will install one module at a time and review it for a day or two before moving on to the next one, so that I can find out where the problem starts.

regarding mod_ruid

sreyas's picture

One more thing:

I have mod_ruid installed on the server, and the web server is supposed to run as drupaluser. But I still see many web server processes running as apache (which is the server default). Can anyone give some insight into this?

Boost

hassan2's picture

I think the problem is Boost. Check the Boost module weight. It is usually -90, which means it takes precedence over the other modules in your Drupal site. Change it to 0 or maybe 1.
Hope this helps.