High load average alerts

Posted by jimsmith on July 6, 2013 at 9:41pm

I am having trouble keeping my installation of BOA on a Linode VPS from running into system spikes that cause service interruptions. The troubles began around the time of an upgrade to BOA-2.0.9. I don't necessarily blame the upgrade. Actually, I'm not sure what to blame at this point, but it is clear that some process is spiking to the extent that it is pulling down my system.

It's my understanding that when the server reaches a certain load level it kills functions, then restarts them once the load returns to an acceptable level. That appears to be what is happening, and I'm wondering whether database issues are the cause.

I'm only guessing because I don't know for sure where to look, so any assistance from any knowledgeable user would be most appreciated. I am not a sysadmin.

I received occasional emails indicating an SQL check ERROR, but I have not received one of those in a couple days. I am still getting other alerts, however, that indicate problems.

The down times seem to happen at various times but only for a few minutes. When the trouble happens I often get an email indicating a high 5 minute load average alert. The problem is I don't know how to interpret the information provided.

One I received this morning said the following:

Time:                    Sat Jul  6 04:19:40 2013 -0400
1 Min Load Avg:          184.21
5 Min Load Avg:          39.01
15 Min Load Avg:         12.90
Running/Total Processes: 2/223

A process status report was attached, which I have posted here.

If you can give me direction of where to look to find the source of the problem, that would help me a lot.

Thanks.

Comments

This usually happens on

Posted by omega8cc on July 6, 2013 at 10:03pm

This usually happens on Linode after upgrading RAM over 4 GB on a 32-bit system. But since you didn't provide any details, it is a pure guess, based on our experience with remotely supported BOA servers on Linode.

frequency

Posted by jimsmith on July 6, 2013 at 10:03pm

To give a better indication of the frequency of this problem, here is a list of recent high load average messages sent to me. Notice that many of them were sent around 4:20 a.m. Coincidence? An indicator of something else?

aegir:~# grep *LOAD* /var/log/lfd.log
Jul  1 04:19:03 aegir lfd[6825]: LOAD 5 minute load average is 27.10, threshold is 6 - email sent
Jul  1 19:45:13 aegir lfd[24851]: LOAD 5 minute load average is 9.35, threshold is 6 - email sent
Jul  2 04:18:41 aegir lfd[11393]: LOAD 5 minute load average is 12.09, threshold is 6 - email sent
Jul  2 16:26:26 aegir lfd[26212]: LOAD 5 minute load average is 21.99, threshold is 6 - email sent
Jul  4 04:19:23 aegir lfd[7561]: LOAD 5 minute load average is 6.69, threshold is 6 - email sent
Jul  5 04:19:00 aegir lfd[7238]: LOAD 5 minute load average is 7.31, threshold is 6 - email sent
Jul  6 04:19:40 aegir lfd[2282]: LOAD 5 minute load average is 39.01, threshold is 6 - email sent
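
One way to check whether the alerts really cluster around a particular hour is to tally them by hour of day. The following is a minimal sketch, assuming the log lines use the syslog layout shown above; the function name load_alerts_by_hour is made up for illustration. The grep pattern is quoted so that the shell does not expand it against file names in the current directory.

```shell
# Count lfd LOAD alerts per hour of day.
# $1 is a log file with syslog-style lines: month, day, HH:MM:SS, host, ...
load_alerts_by_hour() {
    grep ' LOAD ' "$1" |
        awk '{ split($3, t, ":"); count[t[1]]++ }
             END { for (h in count) print h, count[h] }' |
        sort
}

# Typical use on the server itself:
#   load_alerts_by_hour /var/log/lfd.log
```

Run against the seven lines above, this reports five alerts in hour 04 and one each in hours 16 and 19, which fits a job scheduled shortly after 4 a.m.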

You need to identify where

Posted by larsmw on July 6, 2013 at 10:11pm

You need to identify where the high load is. Is it MySQL or Apache/PHP? Then you can add some performance optimization and maybe some caching in the form of memcached or Varnish.

The /var/xdrago/daily.sh

Posted by omega8cc on July 6, 2013 at 10:14pm

The /var/xdrago/daily.sh script runs daily at 4:18, but it shouldn't cause a load that high unless the parent machine is seriously overloaded. If this is not a side effect of using >4 GB RAM on a 32-bit system, then ask Linode to move your VPS to another machine.

@larsmw

Posted by omega8cc on July 6, 2013 at 10:15pm

Neither Apache, memcached, nor Varnish is used in the BOA stack.

RAM

Posted by jimsmith on July 6, 2013 at 10:31pm

It is a 32-bit system (on Debian Squeeze), but it is not running 4GB RAM.

aegir:~# cat /proc/meminfo
MemTotal:        1543532 kB
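
To rule the ">4 GB on a 32-bit system" scenario in or out, two numbers are needed: total RAM and the word size of the system. A rough sketch follows; the helper names mem_total_mib and arch_bits are invented for this example, and the architecture list covers only the common x86 values printed by uname -m.

```shell
# Print total RAM in MiB from a meminfo-style file (default: /proc/meminfo).
mem_total_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Map a machine name, as printed by `uname -m`, to a word size.
arch_bits() {
    case "$1" in
        x86_64|amd64) echo 64 ;;
        i386|i486|i586|i686) echo 32 ;;
        *) echo unknown ;;
    esac
}

# Typical use on the server itself:
#   mem_total_mib
#   arch_bits "$(uname -m)"
```

For the MemTotal value above (1543532 kB), mem_total_mib prints 1507, i.e. about 1.5 GB, well under the 4 GB mark.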

I'm experiencing the same

Posted by burgs on July 7, 2013 at 12:08pm

I'm experiencing the same issue since the recent free Linode RAM upgrade was applied. It happens at the same time of day, though only every now and again. I think I updated to BOA 2.0.9 at about the same time. I upgraded from 512 MB to 1 GB RAM. I'm running Debian Squeeze on 32-bit (I think).

backups

Posted by attiks on July 7, 2013 at 12:25pm

That's the same time the system starts making backups of all databases. I see a slight load increase on my server as well, but not that high.

Move the VPS

Posted by omega8cc on July 8, 2013 at 12:06pm

I would strongly suggest asking Linode to move your VPS to some other machine. We have seen this too many times: people waste hours and days trying to figure out what the problem could be, only to see it magically fixed once they are moved away from noisy neighbors. Note that one migration may not be enough if you are migrated to another machine with another set of noisy neighbors.

Continue debugging only if this doesn't help. But since the load seems to be related to some cron tasks, it is almost certainly a disk I/O and/or CPU power shortage, a typical sign of being hosted on a critically overloaded machine.

Nothing in BOA itself can cause this, and the only other possibility is a DDoS attack, but your host should detect that even without your report.
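
Disk I/O wait and CPU time taken by the hypervisor ("steal") both show up in the aggregate cpu line of /proc/stat, so comparing two samples gives a rough indication of whether the host machine is the bottleneck. This is only a sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). cpu_wait_and_steal is a hypothetical helper name, and whether steal is reported at all depends on the kernel and hypervisor.

```shell
# Print iowait and steal as a percentage of all CPU time that elapsed
# between two samples of the aggregate "cpu" line from /proc/stat.
# $1 = earlier sample, $2 = later sample.
cpu_wait_and_steal() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i    # user .. steal
            if (NR == 1) { t0 = total; w0 = $6; s0 = $9 }
            else         { dt = total - t0; dw = $6 - w0; ds = $9 - s0 }
        }
        END {
            if (dt > 0)
                printf "iowait=%.1f%% steal=%.1f%%\n", 100 * dw / dt, 100 * ds / dt
        }'
}

# Typical use on the server itself, ideally while daily.sh is running:
#   before=$(grep "^cpu " /proc/stat); sleep 5; after=$(grep "^cpu " /proc/stat)
#   cpu_wait_and_steal "$before" "$after"
```

Persistently high steal points toward contention on the host, while high iowait points toward slow or saturated disks. Either would be consistent with the "move the VPS" advice.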

Thanks

Posted by jimsmith on July 8, 2013 at 1:38pm

Thank you for this advice.

lower mysql memory values?

Posted by jimsmith on July 8, 2013 at 7:05pm

It appears that the problem is the result of mysqld exhausting RAM. It is certainly consuming more than its share of memory.
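
A quick way to confirm which process holds the most memory is to sum the resident set size (RSS) per command name. This is a minimal sketch, assuming a procps-style ps; rss_by_command is a made-up helper name. Note that RSS counts shared pages once per process, so totals for multi-process services can overstate real usage.

```shell
# Sum resident memory (RSS, in kB) per command name and list the largest.
# Reads "RSS COMMAND" lines on stdin; $1 is how many rows to show (default 5).
rss_by_command() {
    awk '{ rss[$2] += $1 }
         END { for (c in rss) printf "%d %s\n", rss[c], c }' |
        sort -rn | head -n "${1:-5}"
}

# Typical use on the server itself:
#   ps -e -o rss= -o comm= | rss_by_command
```

Comparing the mysqld total with MemTotal from /proc/meminfo shows how much of the machine's RAM the database is actually holding.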

Linode has published the following recommendations for values in my.cnf:

key_buffer = 16K
max_allowed_packet = 1M
thread_stack = 64K
table_cache = 4
sort_buffer = 64K
net_buffer_length = 2K

Compare to what is set by BOA:

key_buffer = 2M
max_allowed_packet = 32M
thread_stack = 256K
table_cache = 128
sort_buffer_size = 64K

Should I lower the values, as recommended by Linode? Will this help to rein in mysqld, or will it have a negative effect on delivering Drupal pages?

No, no

Posted by omega8cc on July 8, 2013 at 9:03pm

They obviously know nothing about Drupal. You will kill your Drupal sites if you follow these unfortunate "recommendations". The values set by BOA are safe defaults and an absolute minimum if you don't want to crash databases often and see constant 502 Bad Gateway errors because of limits that are too low. Please ask them to move the VPS to a less crowded machine. More RAM may also help if you are using all of it.

Note also that these BOA

Posted by omega8cc on July 8, 2013 at 9:10pm

Note also that these BOA defaults for MySQL haven't changed for months or even years. So if they worked before, they can't be the source of your problem now.

Just to walk back this point

Posted by jimsmith on July 8, 2013 at 10:18pm

Just to walk back this point for a moment, the defaults haven't changed, but is there anything else that might have changed? As I noted in my original post, the problems started around the time of my upgrade to 2.0.9. Taking a look at the disk IO graph provided by Linode, there is clearly a change that happens on that date. Am I wrong to read significance in this?


No

Posted by omega8cc on July 10, 2013 at 2:31pm

There were no changes on the BOA side which could cause issues like this.

~Greg

Thanks again

Posted by jimsmith on July 8, 2013 at 9:29pm

I'm glad to get confirmation of what I was thinking. Thanks.

same thing has happened to me

Posted by Anonymous (not verified) on July 10, 2013 at 12:39pm

The same thing has happened to me after upgrading Barracuda to 2.0.9 stable. On roughly half of all mornings I get a high load warning alert, and I never had those before. Traffic to the sites (unfortunately) hasn't increased either. The only thing that has changed on the server is the addition of Solr 3 and 4. I have enabled both, but I am not using them yet, as the 18 sites I host have too little content to need search. This is a very basic dedicated server with a single-core 1.8 GHz Atom CPU and 4 GB of DDR2 RAM. The 2 TB disk is probably very slow, and the high load warning alerts arrive in the morning around the time the database backups are supposed to be made.

BOA-2.0.9 didn't introduce

Posted by omega8cc on July 10, 2013 at 2:35pm

BOA-2.0.9 didn't introduce anything new which could cause issues like this. All maintenance scripts always put some extra load on the system, but nothing which could cause critical load. We don't observe anything like this even on the smallest machine we have.

~Greg

Well

Posted by omega8cc on July 10, 2013 at 2:29pm

The script referenced in this thread doesn't do anything new: it enables/disables some modules and fixes permissions. If it started causing problems, it is an obvious sign that you are hosted on a weak and/or overloaded machine with a disk I/O and/or CPU shortage. Or Linode did something recently which drastically reduced your VM performance. While disabling/renaming/removing the /var/xdrago/daily.sh script can be a workaround if you don't need its functionality, it is still worth noting that it signals serious performance issues with the machine you are hosted on. We use the same script in all VMs on our systems and it has never caused a single problem, not even a visible load spike.

~Greg

csf restart with big csf.deny

Posted by naurisr on July 25, 2013 at 1:18pm

On one of my servers hosted on Linode I had this problem. I realized that the CPU was overloaded every two hours. After some research I found that during high load the csf firewall was being restarted (with the command csf -sf), so I checked the csf configuration. The main problem turned out to be that I had manually added all China IPs to csf.deny (about 3500 rows, taken from https://www.countryipblocks.net/country_selection.php), so it took some resources to load the whole csf configuration on each restart. Now that I have removed those ~3500 rows from csf.deny, the problem is solved on that server. The question is why a large csf.deny caused no problems before the BOA-2.0.9 upgrade and/or the Linode migration. I upgraded to BOA-2.0.9 and migrated the Linode at approximately the same time, so I don't know what actually triggered the problem.
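
To see how large a deny list has grown, count its active entries while skipping blank lines and comments. This is a small sketch; count_deny_entries is a hypothetical name, and /etc/csf/csf.deny is assumed to be where the file lives.

```shell
# Count active entries in a csf.deny-style file:
# every line that is neither blank nor a comment starting with "#".
count_deny_entries() {
    grep -Ecv '^[[:space:]]*(#|$)' "${1:-/etc/csf/csf.deny}"
}

# Typical use on the server itself:
#   count_deny_entries
```

A count in the thousands, as in the case described above, means every firewall restart has to load that many rules.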

On another server I manage, I have problems only when daily.sh is executing. There is a high CPU load and all the sites are offline for 10-15 minutes, although the server remains reachable over SSH. The CPU appears to be most heavily loaded during the "find" command, but I don't understand why all sites go down. Can someone explain that and tell me how to solve it? This started after the Linode migration. Last night I migrated the VPS to another server, so I will see tonight whether the problem persists.
