Notification is repeating when device is up / non-existent

jdrew · November 9, 2009, 10:12pm

Running Version 3.06.
Have notification interval set for 1440 (1 day).
I had a device that changed its IP address. Nagios noted it was “down”.
I altered the configuration to its new address. Device shows up again.
Device has ACTIVE checking for both Service (Ping) and Host (Ping)
In this instance, ping is the icmp check.

Nagios is sending an email once a day that the Host is Down.
I forced a Host check via the web page.
I restarted the service.
I have restarted the server.

A second host was pulled off the net.
I removed it from the configuration file.
Nagios is sending me an email once a day that it is down.
The problem has survived the service restart and the server restart.

If I were not doing an active host check (that is, if the host check ran only when the service check failed), I could understand how this might occur once the device returns to service and the service check works.

I modified the template to NOT notify me when the host was down (let the service failure do the notification).
Still getting the host down email.

I am stumped.

luca · November 11, 2009, 3:12pm

did you use nagios reload?
try stopping nagios, check there are no running nagios processes, kill any survivors, start nagios.

i had similar problems using restart (which didn’t work correctly).

But still these problems shouldn’t survive a restart of the server…
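
A minimal Python sketch of the "check there are no running nagios processes" step. It assumes a Linux-style `ps` that accepts `-eo pid,comm` and that the daemon's command name is exactly `nagios`; both the function names and that process name are assumptions for illustration.

```python
import subprocess


def find_nagios_pids(ps_output, name="nagios"):
    """Return PIDs whose command name equals `name` in `ps -eo pid,comm` output."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the "PID COMMAND" header row
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def running_nagios_pids():
    """Ask `ps` for the process table and filter it (assumes `ps` is on PATH)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,comm"], capture_output=True, text=True, check=True
    ).stdout
    return find_nagios_pids(out)
```

After stopping the service, `running_nagios_pids()` should return an empty list; any PIDs it does return are the survivors to kill before starting nagios again.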

jdrew · November 11, 2009, 5:54pm

Yes, I halted nagios and then started it via init.d/nagios start.
After the halt there were no instances of Nagios running.
And as you stated, if some rogue process were running amok, it should have died at the server restart.

Looking at retention.dat and status.dat, I can’t see a setting that would be triggering the system to think that it is still down.
I was thinking of stopping nagios and blowing away the retention.dat file, but one of the devices the notifications are about
is NOT in retention.dat, status.dat, or any of my config files. On that basis I felt that blowing away the retention.dat file
would not work, since one of the devices it is complaining about is not in that file. I am also reluctant to lose the information
in retention.dat.
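
The search described above (looking for a host in retention.dat, status.dat and the config files) can be sketched in Python as a plain substring scan. It makes no assumption about the file formats; the function name and the example paths are hypothetical and depend on how Nagios was installed.

```python
from pathlib import Path


def files_mentioning(needle, paths):
    """Map each existing file that contains `needle` to the matching line numbers."""
    hits = {}
    for p in paths:
        path = Path(p)
        if not path.is_file():
            continue  # skip files that are missing on this installation
        lines = path.read_text(errors="replace").splitlines()
        matched = [i for i, line in enumerate(lines, start=1) if needle in line]
        if matched:
            hits[str(path)] = matched
    return hits


# Hypothetical locations; adjust to the local install.
CANDIDATE_FILES = [
    "/usr/local/nagios/var/retention.dat",
    "/usr/local/nagios/var/status.dat",
    "/usr/local/nagios/var/objects.cache",
]
```

Calling `files_mentioning("hostname-or-ip", CANDIDATE_FILES)` on each suspect server returns an empty dict when the host really is gone from that machine.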

I read one account on the web (not this forum) where someone was experiencing the same type of problem, which also survived restarts.
He took his system down for a couple of days during a holiday break and when it came back, problem was gone.
I do not have that luxury, we run 24x7. No explanation of how that fixed his problem, it just did. I believe that was on a 2.x version.

Perhaps my next ploy is to turn off notifications in my templates, reload and wait 24 hours to see if any come out, and if not enable
notifications and see if it comes out after that.

luca · November 11, 2009, 6:25pm

what does the configuration page say? is the host listed there with the old ip or with the new one?

jdrew · November 11, 2009, 8:15pm

It has the New IP.
Device had been re-assigned an IP. When I became aware of that, I altered the config file and did a restart.

The second problem was that a separate device was permanently taken away. In this instance I removed the item from the config and restarted.
It’s this second problem that has me really troubled. The device name and IP are no longer in a config file, retention.dat, status.dat, or objects.cache.
So where is Nagios keeping this device, which is no longer in the system (that I can find) and which it considers down because it never saw it come up?

I can’t do a manual host check on this second device, it’s not in the system. Not that it would necessarily help, I did a manual host check on the device
that had its address change, and I still get my daily tickler that it’s down.

luca · November 11, 2009, 8:23pm

sorry, really no ideas left…

let’s hope somebody else comes up with something useful.

jdrew · November 13, 2009, 6:46pm

One more piece to the puzzle:

I disabled ALL notifications in nagios.cfg. I then restarted.
At the usual 24 hour timers (two different times for the two Zombied notifications) I still got the notification that the hosts were down.
Anyone got wooden stakes or silver bullets?

luca · November 13, 2009, 7:55pm

ok this may sound stupid… are you SURE it comes from your nagios server and not from some other test server? :mrgreen:
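
One way to check which machine a notification really came from is to read the `Received` headers of the raw email. The Python sketch below uses the standard `email` module; the function name is an assumption, and it relies on the usual convention that each relay prepends its own `Received` line, so the last one listed is normally closest to the originating server.

```python
from email import message_from_string


def relay_hops(raw_message):
    """Return the Received headers, newest first, with folded whitespace collapsed."""
    msg = message_from_string(raw_message)
    hops = msg.get_all("Received") or []
    return [" ".join(h.split()) for h in hops]


def likely_origin(raw_message):
    """Return the earliest Received header, or None if the message has none."""
    hops = relay_hops(raw_message)
    return hops[-1] if hops else None
```

Save one of the "Host is Down" emails with full headers and pass its text to `likely_origin`; a host name there that is not the production Nagios server points at a second sender.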

jdrew · November 13, 2009, 10:39pm

Luca, if you ever hit California, USA, I’ll buy you a beer.
Yes, a second server that had become a test server was probably doing it.
Any way to massage the From address to be nagios-machinename@domain rather than just nagios@domain?
I’ll hunt around for that, but I’m sure you’ve found the problem. I’ll know by tomorrow morning.

luca · November 13, 2009, 10:52pm

I’ll have one out of my fridge in the meanwhile… :mrgreen:

That’s postfix (or whatever MTA you use) stuff… you have to check whether you can rewrite the sender address. It can be done for sure, even if you lose part of the data; i don’t think you want to write back to the nagios address, and the reply-to could be correct anyway… there’s a difference between sender name and address… so it CAN be done… but don’t ask me how.
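
A different approach from MTA rewriting, sketched here only as an illustration: if notifications are sent by a custom script rather than the stock mail command, the script can put the machine name into the From header itself. Everything below (the function name, its parameters, the `nagios-<host>` pattern) is hypothetical, and the MTA may still set its own envelope sender.

```python
import socket
from email.message import EmailMessage


def build_notification(subject, body, to_addr, domain, hostname=None):
    """Build a message whose From header is nagios-<short hostname>@<domain>."""
    # Use the local machine's name unless one is supplied; keep only the short name.
    short = (hostname or socket.gethostname()).split(".")[0]
    msg = EmailMessage()
    msg["From"] = f"nagios-{short}@{domain}"
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

With a sender built this way, mail from a production server and mail from a test server would be distinguishable at a glance in the inbox.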

Glad you got it. Happy weekend