Sunday, February 24, 2013

Nagios and Ganglia: check_ganglia_metric.sh Failing

At work, we use Ganglia to meet the majority of our metrics needs. One handy thing that Ganglia brings with it is a set of scripts and PHP pages that allow Nagios to check metrics in a variety of ways and alert as necessary.

Last week I stumbled across an odd situation - our primary datacenter's Nagios instance was reporting "null" for all of our Ganglia checks (most of which call check_ganglia_metric.sh). This was particularly bad, because the null values were considered "OK" by Nagios, which left us blind to some potentially harmful circumstances.

Short Version

Raise the amount of memory a single PHP script can consume: edit /etc/php.ini, find the memory_limit setting (ours was memory_limit = 128M), increase it, and restart httpd. Hitting that limit is the most likely reason the check_metric.php page is choking and returning nothing.

Long Version

To see what was going on under the covers, I ran check_ganglia_metric.sh by hand on the Nagios host in each datacenter. One side ran fine and gave me legitimate Nagios-style feedback; the other gave me nothing at all. The shell script was simple enough - it just cobbled the passed arguments together into a URL that it then curl'd. The page it was curl'ing was check_metric.php, so that's where I dug next. After throwing in a bunch of extra debug statements, I was at least able to figure out where check_metric.php was dying.

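To make the wrapper's job concrete, here is a minimal Python sketch of the same idea: join the check arguments into a query string, then turn the page's response into a Nagios exit code. This is not the actual check_ganglia_metric.sh. The function names, the example URL, and the assumed "STATUS|message" response format are all illustrative assumptions. The one deliberate choice is that an empty or unrecognized response maps to UNKNOWN (3) rather than OK, which would have surfaced the failure described above.

```python
from urllib.parse import urlencode

# Standard Nagios plugin exit codes.
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def build_check_url(base_url, check_args):
    """Join key/value check arguments into a query string on base_url."""
    return base_url + "?" + urlencode(check_args)


def exit_code_for(response_body):
    """Map a response body to a Nagios exit code.

    Assumes (hypothetically) that a healthy response looks like
    "STATUS|message". Anything else, including an empty body,
    is reported as UNKNOWN instead of silently passing as OK.
    """
    status = response_body.strip().split("|", 1)[0].strip()
    return NAGIOS_EXIT_CODES.get(status, NAGIOS_EXIT_CODES["UNKNOWN"])


if __name__ == "__main__":
    # Hypothetical host and metric names, for illustration only.
    url = build_check_url(
        "http://ganglia.example.com/ganglia/nagios/check_metric.php",
        {"host": "web01", "metric_name": "load_one"},
    )
    print(url)
    print(exit_code_for(""))  # an empty response maps to UNKNOWN (3)
```
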
The if statement that checks the nagios_cache_time was the last successful statement in my debug output. So I started digging through to see where and how stale the nagios_cache_file was.
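
A quick way to answer the "how stale" question from the command line is to compare the cache file's modification time against the configured maximum age. The sketch below is only an illustration in Python, not the logic from check_metric.php. The function names are made up, the path is whatever your conf.php points at, and the 45-second default simply mirrors the default cache time quoted in this post.

```python
import os
import time


def cache_age_seconds(path, now=None):
    """Return how many seconds ago the file at path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_cache_stale(path, max_age_seconds=45, now=None):
    """True if the file is older than max_age_seconds."""
    return cache_age_seconds(path, now) > max_age_seconds


if __name__ == "__main__":
    import sys

    # Usage: python cache_age.py /path/to/nagios_ganglia.cache
    cache_path = sys.argv[1]
    print("age (s):", round(cache_age_seconds(cache_path)))
    print("stale:", is_cache_stale(cache_path))
```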

The default file name is nagios_ganglia.cache in your Ganglia install's conf_dir (check out your conf.php in ganglia-web's root directory). In my case, the file was a couple of weeks old, which is considerably older than the default 45-second maximum age it was configured for! When I looked at the working system in the other DC, I saw the file being updated multiple times per minute - more often than the 45 seconds configured in conf.php - so I began to suspect that the Nagios checks themselves were triggering the cache refresh. A quick check of /var/log/httpd/ganglia_error_log made the cause blindingly obvious:

[Thu Feb 21 03:23:26 2013] [error] [client 10.1.0.34] PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 1625427 bytes) in /var/www/html/ganglia-qd/nagios/check_metric.php on line 62

What that line does isn't terribly important at this point (serializing the data from the cache file), but what is important is that the global PHP memory limit is insufficient. Checking /etc/php.ini revealed: memory_limit = 128M
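
The number in the error message and the php.ini setting are the same limit written two ways. PHP's shorthand suffixes K, M, and G are powers of 1024, so 128M works out to exactly the 134217728 bytes in the log. Here is a small Python sketch of that conversion; the helper name is my own, and it only handles the plain integer and K/M/G forms.

```python
# PHP shorthand byte suffixes are powers of 1024.
_MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def php_size_to_bytes(value):
    """Convert a php.ini size such as '128M' to a number of bytes."""
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix in _MULTIPLIERS:
        return int(value[:-1]) * _MULTIPLIERS[suffix]
    return int(value)


if __name__ == "__main__":
    print(php_size_to_bytes("128M"))  # 134217728, the figure in the error log
    print(php_size_to_bytes("256M"))  # 268435456
```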

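Evidently, that just won't do for a cache file larger than 7 MB, since the data takes up far more room as PHP arrays in memory than it does serialized on disk. Bumping the limit to 256M and bouncing httpd did the trick!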