Possible alerting bug

Palermo · February 11, 2009, 6:10am

Hi,
I found something that looks like a bug.
It all started with the upgrade from Nagios 3.0.5 to 3.0.6. That release has some security bug fixes, so it was a necessary upgrade.
After a successful upgrade I always perform a test to see if the alerting is still working.
I don't know any other way than to change the IP address of a switch in the config file to one that doesn't exist (of course it is not really an important switch).
After a while Nagios detects that the switch is no longer available. That takes 10 minutes, and then it starts alerting.

For alerting I use sms_client. The following shows up in the Nagios log:

[02-09-2009 13:44:30] Warning: Contact ‘nagiosadmin’ host notification command ‘/bin/sms_client -q kpn:06xxxxxxxx “VDXgro PROBLEM: SWITCH01 is DOWN:”’ timed out after 30 seconds
[02-09-2009 13:43:59] HOST NOTIFICATION: nagiosadmin;SWITCH01;DOWN;host-notify-by-pager;CRITICAL - Host Unreachable (10.xxx.xxx.xxx)
[02-09-2009 13:43:59] HOST ALERT: SWITCH01;DOWN;HARD;10;CRITICAL - Host Unreachable (10.xxx.xxx.xxx)

The sms_client log shows this:

Feb 09 13:43:59 [19656] : Dialing SMSC 06xxxxxxxx…
Feb 09 13:44:04 [19656] WARNING: read() Timeout
Feb 09 13:44:26 [19656] : Connection Established.

That's it, nothing more than this. It does not hang up the line. Nothing happens. I need to reset the modem to get it working again.

When I send a message from the Linux command line, like this, it all works fine:
sms_client kpn:06xxxxxxxx "test"
And the sms_client log:
Feb 10 10:50:23 [10353] : [000] kpn:06xxxxxxxx "test"
Feb 10 10:50:23 [10353] : Dialing SMSC 0653xxxxxx…
Feb 10 10:50:28 [10353] WARNING: read() Timeout
Feb 10 10:50:51 [10353] : Connection Established.
Feb 10 10:50:57 [10353] : Hangup…
Feb 10 10:51:00 [10353] : kpn Service Time: 37 Seconds

I tested all kinds of things, like changing the config file (COMMANDS.CFG).
Nothing works.

Then I decided to roll back from 3.0.6 to 3.0.5 again.
My problem was solved. This is the log:

[02-10-2009 12:27:50] HOST NOTIFICATION: nagiosadmin;SWITCH01;DOWN;host-notify-by-pager;CRITICAL - Host Unreachable (10.xxx.xxx.xxx)

Sms_client log:
Feb 10 12:27:50 [20455] : Dialing SMSC 0653xxxxxx…
Feb 10 12:27:55 [20455] WARNING: read() Timeout
Feb 10 12:28:17 [20455] : Connection Established.
Feb 10 12:28:23 [20455] : Hangup…
Feb 10 12:28:26 [20455] : kpn Service Time: 36 Seconds
Feb 10 12:28:26 [20455] : [000] kpn:06xxxxxxxx "VDXGRO PROBLEM: SWITCH01 is DOWN: $"
Feb 10 12:28:26 [20454] : Total Elapsed Time: 36 Seconds

My best guess is that there may be something wrong with how the command is executed…
Does anyone have the same problem?
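
One thing worth checking in the logs above is the timing. The sketch below only does arithmetic on timestamps copied from this post: the two successful sms_client runs took 36 to 37 seconds, while the 3.0.6 notification was stopped at the 30 second limit, a few seconds after "Connection Established" and before any "Hangup…" line. The helper name elapsed_seconds is made up for illustration.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Limit taken from the "timed out after 30 seconds" warning in the Nagios log.
NOTIFICATION_TIMEOUT = 30

# Manual run from the shell: first log line 10:50:23, "Service Time" line 10:51:00.
manual_run = elapsed_seconds("10:50:23", "10:51:00")

# Notification under 3.0.5: dial at 12:27:50, last log line at 12:28:26.
run_305 = elapsed_seconds("12:27:50", "12:28:26")

# Notification under 3.0.6: sent 13:43:59, connected 13:44:26, timeout logged 13:44:30.
connect_306 = elapsed_seconds("13:43:59", "13:44:26")
killed_306 = elapsed_seconds("13:43:59", "13:44:30")

print(f"manual run:        {manual_run}s")
print(f"3.0.5 run:         {run_305}s")
print(f"3.0.6 connected:   {connect_306}s after the notification")
print(f"3.0.6 timeout log: {killed_306}s after the notification")
print(f"full send fits in {NOTIFICATION_TIMEOUT}s limit: {run_305 <= NOTIFICATION_TIMEOUT}")
```

If that reading is right, the command is being cut off mid-call, which would fit the modem being left connected. It does not by itself explain why 3.0.5 lets the same command finish. The limit should be the notification_timeout setting in nagios.cfg, so raising it above the roughly 37 seconds a send takes may be worth a try.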

immaginos · March 4, 2009, 5:32pm

There are at least a couple of us having the same notification issue with email alerts (see Weird Notification Problem).

It seems like the “printf” output (i.e. message body) isn’t making it into the message delivery system (sendmail in my case, sms client in yours).

This is an issue with both 3.0.6 and 3.1.0 (I upgraded to testing in hopes it would help - no change).

Anyone else have any thoughts on this?

Palermo · March 6, 2009, 12:52pm

Hmm… so there is no solution yet…
Could we bring this to the attention of the maker of Nagios?
I guess this is a serious error. It means that I cannot upgrade to the new version 3.0.6 or 3.1.0.

Strides · March 6, 2009, 1:30pm

You could try reporting the issue to the dev mailing list…

[quote]
nagios-devel ([email protected])
This is a list for people with an interest in working on Nagios addons, event handlers, etc. It is also a place where you can report problems with the “core” code and suggest ideas for future enhancements. [/quote]

nagios.org/support/mailinglists.php

Also, the 3.1.x release train is, by all accounts, “the first development/testing branch of the Nagios 3.x code” and “ODD minor versions of Nagios releases (e.g. 3.1.x, 3.3.x) will be considered experimental/testing” (Nagios 3.1.0 Available), so fingers crossed for v3.2.0.

HTH

/S

immaginos · March 6, 2009, 4:28pm

I agree, it certainly should be reported - if we can figure out what the problem is. I don’t see much point in just submitting a “hey, this doesn’t work” report - especially since, based on my research so far, it appears that such reports are greeted with a “fix your sendmail”[1] response. I would really like to be able to point to a specific case where the failure can be reproduced consistently … but I don’t have enough data about what is going wrong. I was hoping to find some more information by posting here and getting others to come forward.

I have deployed/managed several Nagios installs in the last couple of years, and the (ubuntu) exim4 and (fedora8) postfix installations work great - but they weren’t deployed on an already-working, correctly configured (fedora10) sendmail host with multiple relays. However, since Nagios is the only thing not delivering mail, I have to presume the external command pipe is to blame. I have MILTER output that seems to confirm this in detail, if anyone is interested, and I will probably go ahead and post it to the dev team next week if I cannot find any more details.

Is there a good way to monitor the actual i/o of the command pipe? Like tee it to another file to see what is actually being written?

Thoughts?

–
immaginos

[1] which is, on the face of it, a valid response. Unfortunately, it seems to have become knee-jerk.
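
On the question of tee-ing what the notification command actually receives: a minimal sketch, assuming the command line in commands.cfg can be changed to go through a small wrapper. The wrapper name (notify_tee.py), the log path, and the function run_logged are all hypothetical; nothing here is part of Nagios. It records the arguments and stdin it was given, runs the real command with that same input, and records the command's output and exit code.

```python
#!/usr/bin/env python3
import subprocess
import sys
from datetime import datetime

# Hypothetical log location; pick any file the nagios user can write to.
LOG_PATH = "/tmp/nagios-notify-tee.log"


def run_logged(argv, stdin_bytes, log_path):
    """Run argv with stdin_bytes as its input and append a record to log_path."""
    with open(log_path, "ab") as log:
        # Record exactly what the wrapper was asked to run and what it was fed.
        log.write(f"--- {datetime.now().isoformat()} argv={argv!r}\n".encode())
        log.write(b"stdin:\n" + stdin_bytes + b"\n")
        result = subprocess.run(
            argv,
            input=stdin_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        # Record what the real command printed and how it exited.
        log.write(b"output:\n" + result.stdout + b"\n")
        log.write(f"exit={result.returncode}\n".encode())
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the real command and its arguments.
    sys.exit(run_logged(sys.argv[1:], sys.stdin.buffer.read(), LOG_PATH))
```

For example, a command line ending in `| /usr/sbin/sendmail -t` would become `| notify_tee.py /usr/sbin/sendmail -t` (both paths are placeholders). An empty stdin section in the log would point at the printf side of the pipeline; a complete one would point at the mailer.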