External command errors under moderate load


#1

Digest:

Under moderate load, nagios appears to either be reading the nagios.cmd file incorrectly, or nagios is failing to execute the command properly.

Details:

We are using nagios to monitor and alert on SNMP traps sent to the system. The method of doing this involves using an snmptrapd trap handle to send traps to snmptt for interpretation, which outputs the results to sec.pl, which ends up using a perl script to inject the PROCESS_SERVICE_REQUEST messages into the nagios.cmd file. (We are using a somewhat modified version of what is used here: samag.com/documents/s=9559/sam0503g/)

In general, this process works without any problems. The perl script that actually writes to the nagios.cmd file uses Nagios::Cmd, as follows:

my $fifo = ‘/var/spool/nagios/rw/nagios.cmd’;
mkfifo( $fifo, 600 );
my $writer = Nagios::Cmd->new( $fifo );

$writer->service_check($adjusted_hostname, $service_name, $severity_code, $data);

This seems to work fine, under general conditions. I even have the script writing the same data to a seperate “log” file, to verify what is getting added to the nagios.cmd file.

The problem comes in when we receive an excessive number of traps. Currently a couple of devices on our system send out routine alerts every 15 minutes. We don’t currently have the ability to disable this feature of the devices. So, every 15 minutes we receive 75 traps all at once. During this storm, we start getting some errors in nagios’ processing of traps, and apparently in its processing of other, active service checks (though these errors only seem to occur when nagios is in the middle of a trap storm).

The exact error we get is that generally, during the storm, one or more of the service checks will report the following sort of error:

[1131944430] SERVICE ALERT: device_name;service_name;CRITICAL;HARD;1;(Service check did not exit properly)

That is what appears in nagios.log. Oddly, those log entries are generally followed by these kind of entries:

[1131944430] Warning: Message queue contained results for service ‘s - TCA’ on host ‘’. The service could not be found!
[1131944430] Warning: Message queue contained results for service ’ - TCA’ on host ‘’. The service could not be found!
[1131944430] Warning: Message queue contained results for service ‘CA’ on host ‘’. The service could not be found!
[1131944430] Warning: Message queue contained results for service ‘A’ on host ‘’. The service could not be found!
[1131944430] Warning: Message queue contained results for service ‘A’ on host ‘’. The service could not be found!

Those messages appear to have to do with the service description, as the ‘TCA’ it is talking about appears in some of the trap service descriptions. I have tried to figure out what I can manually insert into the nagios.cmd file to reproduce these “Message queue contained results for service” errors, but so far I can’t reproduce it manually.

My thought is that, somehow, so much data is being written to nagios.cmd, so quickly, that some of it is being read mid-stream, before it is finished being written to the file. That seems like a long shot, but I haven’t thought of a better answer yet.

Sometimes, during the trap storms, other active service checks will also log the same kind of error, that the service check didn’t execute properly. However, for the service this was happening to we added a wrapper to the plugin, such that it would execute the actual plugin, and log the results before sending the results back to nagios. In every case where nagios complained that the service check didn’t execute properly, we showed the plugin was called, and executed without error.

At this point I’m at a loss to explain what is happening, unless there is some cap on what nagios can handle in a few seconds. In general, from the start of the trap storm to the end of it, 8 or 9 seconds will elapse. However, the delay appears to be in snmptrapd getting to the message, not to nagios processing it.

We have nagios configured to process the nagios.cmd file as often as possible. I don’t know if this is a problem, and if we should up the delay to 10 seconds, perhaps.

Does anyone have any ideas? Even small ideas would help, if you have any thoughts that could explain any aspect of what we’re seeing, especially those “Message queue contained results for service” messages.

Thanks,

-Dan


#2

It looks to me that you are right. Your scripts should perhaps be written to check to see if the file is already being edited. If so, then wait. I really suck at snmp traps, so all I can suggest is, is there another way to get what you need? I use check_snmp for 80% of my checks, and I have yet to find a reason to use traps.