Best Practice use of Nagios and Munin

vegantriathlete's picture

I've got Nagios and Munin up and running for my infrastructure. Now I've got to figure out how to best utilize those tools so that I'm not wasting system resources in the process of monitoring the system!

I searched on "nagios" and "munin" to see if there were any posts that already addressed this question. I didn't find any that answered my specific questions. But, for the sake of starting out by adding some value with this post, here is the list of things that are at least relevant:
http://www.guidelightsolutions.com/blog/enhancing-nagios-monitor-drupal-...
https://docs.google.com/a/isaacsonwebdevelopment.com/presentation/d/1wT5...
http://2bits.com/articles/presentation-monitoring-drupal-using-nagios-in...
https://drupal.org/project/nagios
https://drupal.org/project/munin

I've got a basic Nagios monitoring setup in place. See the attached screenshot. I check the number of logged-in users, because I've got my servers set up with OpenVPN and there really should not be more than one user (me!) logged in at any point in time. I've got it set to send a Warning at 3 logged-in users and a Critical at 5. [I'm not really sure of the value of checking the number of running processes, but this was one of the plugins that was being used by default.]
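
For anyone curious what a check like that boils down to, here is a minimal sketch of a Nagios-style plugin in Python. It is not the stock plugin from the screenshot: the function names are made up for illustration, counting lines of `who` output is an assumption about how to count sessions, and the stock plugin may compare against its thresholds slightly differently. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(users, warn=3, crit=5):
    """Map a logged-in user count to an (exit_code, label) pair.

    Assumption: "Warning at 3" and "Critical at 5" mean >= comparisons.
    """
    if users >= crit:
        return CRITICAL, "CRITICAL"
    if users >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def count_logged_in_users():
    """Count login sessions as the non-empty lines printed by `who`."""
    result = subprocess.run(["who"], capture_output=True, text=True, check=True)
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def main(warn=3, crit=5):
    """Print one status line (with perfdata) and return the exit code."""
    try:
        users = count_logged_in_users()
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"USERS UNKNOWN - could not run who: {exc}")
        return UNKNOWN
    code, label = classify(users, warn, crit)
    print(f"USERS {label} - {users} users logged in | users={users};{warn};{crit};0")
    return code
```

To use it as an actual check, the script would end with `import sys; sys.exit(main())` so that Nagios sees the exit code.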

My real questions are how to best use Munin (given the fact that Nagios is already in use).

  1. Out of the box there are a ton of graphs being created and I don't know how to even interpret most of them. I feel like I'm definitely using up a bunch of resources to generate those graphs without getting any ROI. What is the point of the various graphs? Which ones do you use? How do I pick and choose which graphs I want to have generated? (I have been reading through the documentation I can find online on Munin and still have not come across answers to these questions. It looks like picking and choosing is a matter of adding and removing symlinks in /etc/munin/plugins.)
  2. What additional benefit beyond Nagios does Munin provide? It seems that there is definitely some overlap, as from what I understand Munin can also send alerts. However, also from what I understand, it is common to run both Nagios and Munin. I am thinking that Munin provides more opportunity to look for trends in the graphs. But, I have not yet generated any of the reports that Nagios provides, such as the Trends report. Am I correct in thinking that Nagios provides monitoring on-the-spot, while Munin can do more to keep track of historical data?

At the moment, I feel that Munin is basically duplicating the things I'm already doing in Nagios and I'm wondering if I should be using Munin.

Finally, what types of things do you feel are truly important to monitor (whether via Nagios or Munin)? Off the top of my head, I would suggest CPU Load, Disk Utilization and whether the server is even up.

Attachment: nagios-monitoring-setup.png (59.62 KB)

Comments

Different tools

kbahey's picture

They are different tools.

Think of it this way:

Nagios will alert you (via email or SMS) when something is down and needs attention NOW.

Munin is for resource usage over time, and useful for things like: Varnish is not caching as much since so and so date, what did we change to cause that? Or, we now have increased CPU usage. Why is that? It allows you to compare day, week, month and past year. Nagios does not provide that, just alerts when thresholds are crossed.

Use them both ...

Drupal performance tuning, development, customization and consulting: 2bits.com, Inc.
Personal blog: Baheyeldin.com.

vegantriathlete's picture

But, I'm still confused about Munin's ability to send alerts. That would seem to indicate that it can be used in place of Nagios. However, from the initial reading I've done of Munin documentation, I recall seeing something about Munin sending some message (maybe this isn't the correct technical term) to Nagios.

Seldom used

kbahey's picture

Yes, in theory you can do that, but I have not seen it actively used.

Even the Munin alerts page mentions Nagios integration as one option for alerts.
http://munin-monitoring.org/wiki/HowToContactNagios

Don't overcomplicate your setup without a good reason


How do you like to change the default installation?

vegantriathlete's picture

Just FYI, I installed Munin via apt-get install. The installation process creates a bunch of symlinks in /etc/munin/plugins, as logged in /var/log/munin/munin-node-configure.log.
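
Since the enabled plugins are just symlinks in that directory, a small script can show what is switched on and prune what isn't wanted. This is only a sketch: the function names are invented for illustration, and it defaults to a dry run so nothing is deleted by accident.

```python
import os


def enabled_plugins(plugin_dir="/etc/munin/plugins"):
    """Return {link_name: link_target} for every symlink in plugin_dir."""
    links = {}
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        if os.path.islink(path):
            links[name] = os.readlink(path)
    return links


def disable_plugins(names, plugin_dir="/etc/munin/plugins", dry_run=True):
    """Remove the symlinks for the given plugin names.

    Returns the names that were (or, in a dry run, would be) removed.
    Names that are not symlinks in plugin_dir are skipped.
    """
    removed = []
    for name in names:
        path = os.path.join(plugin_dir, name)
        if os.path.islink(path):
            if not dry_run:
                os.unlink(path)
            removed.append(name)
    return removed
```

For example, `disable_plugins(["entropy", "interrupts"])` would report which of those two links exist without touching them (the two names are only examples; check the output of `enabled_plugins()` for what is actually installed). Removing links requires root, and munin-node generally has to be restarted before it notices the change.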

So, having received confirmation of my understanding on question #2, I think the outstanding questions from this post (which, perhaps, are really the same question) are:

  • What do you typically monitor on your servers (whether via Nagios or Munin)?
  • Which Munin plugins do you enable and why?

graphs and alerts

Slurpee's picture

Graphs can be very important. As a sysadmin turned Drupal developer, I'm always curious why many projects don't have graphing or alerts. How the heck do you know if anything is wrong without clients telling you? I prefer running my own monitoring software with Cacti and Argus, but Munin and Nagios work too. New Relic is also getting better for less technical teams that have a budget. I simply wanted to mention Argus as an option, as it is easy for command-line sysadmin nerds to set up custom alerts (http://argus.tcp4me.com/).

Graph data becomes more useful over time. For example, it is easy to look at memory usage and see whether it is "truly time to upgrade to a new server with more memory", or how little Ethernet traffic is actually used. It also gives you evidence when novice developers blame the hosting infrastructure and ask for a bigger server with more memory. More importantly, you can monitor the database and how it is running. Once you review this data, the need for bigger projects such as "moving to a larger server/s" and "database optimization tweaks" becomes easy to see before issues arise. Graphs with alerts might also reveal issues you never knew about, such as backup systems running in a loop and taking the site down nightly (yes, true story).

Alerting through various methods helps you diagnose the problem faster. Yes, it is great to know something is wrong, but what is wrong? If the server is down, you call the host, but if it is a system problem, you call the developer. A basic ping to the IP will confirm the server is up, but you can monitor pretty much any type of service. For example, you can monitor port 80 and/or the www service specifically. If 80/www fails, you might have a problem with Apache/the web server. Another example is creating a MySQL account with privileges to view the Drupal database. If you can't log in, something is probably wrong with MySQL. Worried the webform might stop sending important contact messages? Set up an alert to monitor the SMTP service. On a more server-specific level, why not have an alert to monitor memory usage? For example, if memory is greater than 75% usage after 3 tests within 3 hours, send an alert. Then it is time to start looking at your memory graphs, and after that at more graphs to figure out why memory usage is so high.
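
Two of those checks can be sketched in a few lines of Python. This is illustration only, not how Nagios or Argus implement it: both function names are made up, and the memory rule assumes one sample per hour so that "3 tests within 3 hours" becomes "the last 3 samples are all above 75%".

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def memory_alert(samples, threshold=75.0, required=3):
    """Return True when the most recent `required` samples all exceed threshold.

    `samples` is a list of memory usage percentages, oldest first.
    Assumption: samples are taken about once per hour.
    """
    recent = samples[-required:]
    return len(recent) == required and all(s > threshold for s in recent)
```

Here `port_is_open("www.example.com", 80)` would be the web server check (the hostname is a placeholder), and the same function pointed at port 25 gives a crude SMTP check. Requiring several samples in a row keeps one short spike from paging anyone.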

Sending alerts is another story. You can do email, but SMS is better, as these alerts can be important. At the same time, you don't want to set up too many alerts, as you will stop paying attention to them and ultimately miss something important. You can also set up a tiered system for alerts. Maybe you want basic support to see the alerts as email for the first 5-minute intervals, then 30 minutes later as SMS sent to managers, then email/SMS alerts sent to bosses after 1 hour if not acknowledged.
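
As a rough sketch, a tiered setup like that comes down to a lookup on how long an alert has gone unacknowledged. The tier boundaries below (0, 30 and 60 minutes) are one reading of the example above, and the function, channel and group names are hypothetical; real tools express this in their own escalation configuration.

```python
# (minutes unacknowledged, channel, group); hypothetical tiers for illustration.
TIERS = [
    (0, "email", "support"),
    (30, "sms", "managers"),
    (60, "email+sms", "bosses"),
]


def escalation_targets(minutes_unacknowledged, tiers=TIERS):
    """Return the (channel, group) pairs that should have been notified by now."""
    return [
        (channel, group)
        for start, channel, group in tiers
        if minutes_unacknowledged >= start
    ]
```

So `escalation_targets(45)` returns the support and manager tiers, and once the alert is acknowledged the caller simply stops asking.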

Disclaimer... there are TONS of alerts and graphs, with various methods of configuration, that you can utilize. You can even create custom graphs. We started creating a few graphs related to Drupal, such as node/user stats (https://drupal.org/project/cacti). This is simply the ramblings of a sysadmin turned Drupal developer ;)