I use “check_ping” to monitor the status of several computers, which I identify by their DNS names. I just found out that when my DNS server goes down, the check_ping plugin dies with a “segmentation fault”. When this happens, the check exits with code ‘0’ (zero), which Nagios interprets as “Host OK”. As a result, all hosts in Nagios appear to be up as soon as the DNS server is down.
Does anyone else have this problem, and if so, what did you do about it? Of course I could rewrite the “check_ping” plugin, but I was wondering if anyone has already found a suitable solution.
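Short of rewriting the plugin, one option might be a thin wrapper around it. Below is a minimal Python sketch, assuming the crash shows up either as death by signal or as an exit with no output at all; neither assumption has been checked against the plugin version in this thread. The names `run_plugin` and the `wrapper:` messages are made up for the example. Anything that doesn't look like a clean plugin result is reported as UNKNOWN (3) rather than OK.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a Nagios plugin and never let a crash look like OK."""
import subprocess
import sys

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_plugin(argv, timeout=30):
    """Run the plugin command in argv; return (status_code, first_output_line)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return UNKNOWN, "wrapper: plugin timed out"
    except OSError as exc:
        return UNKNOWN, f"wrapper: cannot run plugin: {exc}"

    # On POSIX, a negative return code means the child was killed by a signal
    # (for example -11 for SIGSEGV).
    if proc.returncode < 0:
        return UNKNOWN, f"wrapper: plugin killed by signal {-proc.returncode}"

    if proc.returncode not in (OK, WARNING, CRITICAL, UNKNOWN):
        return UNKNOWN, f"wrapper: unexpected exit code {proc.returncode}"

    lines = proc.stdout.strip().splitlines()
    # Assumption: a plugin that prints nothing probably died early,
    # so its exit code is not trusted.
    if not lines:
        return UNKNOWN, "wrapper: plugin produced no output"
    return proc.returncode, lines[0]


# Only act as a command-line wrapper when a plugin command is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    status, message = run_plugin(sys.argv[1:])
    print(message)
    sys.exit(status)
```

To try it, the check command would call the wrapper with the real plugin command line as its arguments (the exact path to check_ping depends on the installation).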
Yes, a lot of hosts get their IP addresses dynamically, so checking them by IP isn’t an option.
And I agree that it would be the least of my problems, but it would be nice if it could be prevented. That way you could perhaps obtain more information about the overall network integrity. Perhaps DNS isn’t working because a switch is down or something like that…
I have a similar issue myself. Although I have two DNS servers in my resolv.conf, if one of them is in the process of restarting, it accepts the query but then times out (Microsoft DNS on Win2k3). In my setup, I think this shows all hosts as down.
I’m thinking of running a local copy of BIND on my nagios server unless someone has better ideas…?
MarkJ, your problem is the opposite of what I’m facing. When my DNS servers go down, all hosts appear to be up because check_ping exits with code 0 (after the segmentation fault).
But can’t your problem be solved by configuring the nameserver to stop accepting queries until it has completed its restart? I have absolutely no experience with MS DNS, so I’m just guessing…
You could define the switch as the parent of the DNS server.
At this point you could also think of making the other checks dependent on the DNS service…
Not sure if this works, but it’s worth a try.
If it works and Nagios can’t reach the DNS server, the DNS server goes into the unreachable state, and you could choose not to get notified about it because there is already a critical status further up in the Nagios tree… the switch problem…
Luca
PS: There have been other discussions about monitoring dynamic-IP machines…
PPS: While we are at it… if you often have problems on your DNS (as it looks because of this thread) you should possibly think about a backup DNS…
Switches of course don’t have dynamic IPs.
I don’t have that many problems with DNS actually, but I would like to deliver a monitoring system as good as it can get. There is already a failover DNS/DHCP server in the LAN.
I could try to implement the dependencies. It would be a lot of work, but it could work. If I decide to do this, I’ll let you know if it helped…
I’ve read the other thread…
[quote=“maarten”]…
I use “check_ping” to monitor the status of several computers. To identify those computers I use their DNS name. I just found out that when my DNS server goes down, the check_ping plugin generates a “segmentation fault”. When this happens, the check exits with a code ‘0’ (zero). Nagios interprets this as a “Host ok”-code. As a result, all hosts in Nagios appear to be up, as soon as the DNS server is down.
[/quote]
I just simulated this by setting the “address” in my hosts.cfg to “test”, which is not going to resolve. Nagios then reported the state “unknown” with this output:
"check_ping: Invalid hostname/address - test"
So, you need a newer check_ping plugin.
If you do get a newer plugin, you should be good to go as far as that goes. But Luca is right: if the DNS server is down, the other checks that fail because of it should show the host as “UNREACHABLE” and not “unknown”, as it was in my test.
It’s not all that hard to define parents in the “hosts.cfg” file, so give it a try.
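As a rough illustration of what such definitions could look like, here is a small Python sketch that prints `define host` blocks carrying a `parents` directive. The host names, the addresses, and the `generic-host` template are assumptions made up for the example; the template is assumed to supply the other required host directives. `notification_options d,r` asks for notifications on DOWN and recovery only, leaving out `u` (unreachable).

```python
# Sketch: generate Nagios host definitions with a "parents" directive.
# All host names and addresses below are hypothetical examples.

def render_host(host_name, address, parents=(), template="generic-host",
                notification_options="d,r"):
    """Return one 'define host' block as text.

    template is assumed to be an existing host template that provides the
    remaining required directives (check command, contacts, and so on).
    """
    fields = [("use", template), ("host_name", host_name), ("address", address)]
    if parents:
        # Several parents are written as a comma-separated list.
        fields.append(("parents", ",".join(parents)))
    # d = down, r = recovery; leaving out u suppresses "unreachable" notifications.
    fields.append(("notification_options", notification_options))

    lines = ["define host{"]
    lines += [f"    {key:<22}{value}" for key, value in fields]
    lines.append("}")
    return "\n".join(lines)


# The switch has no parent here; the DNS server hangs off the switch,
# and the pinged PCs hang off the DNS server.
print(render_host("switch1", "192.0.2.1"))
print(render_host("dns1", "192.0.2.10", parents=["switch1"]))
print(render_host("pc-042", "pc-042.example.lan", parents=["dns1"]))
```

The printed blocks could be pasted into hosts.cfg and adjusted by hand; generating them is only worth it when there are many hosts.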
I’d suggest that you take a look at your “status map” link and tell me truthfully: is that really how things are set up on your network? Do you really have one Nagios machine with all of those other hosts connected directly to it? I don’t think so. I think you have a Nagios host with an Ethernet card that connects to a port on a switch. That switch may be connected to a port on another switch. Then you may have a device that you are trying to monitor with “check_ping” connected to a port on that second switch.
Bottom line is, your status map should look identical to how your devices are actually wired together.
To make it simple, at least put your DNS server host as a parent for all of your “ping” test hosts. Then, if a ping fails, Nagios will see that the host has a parent called “DNS server”. Nagios will check that parent, see that it is down, and notify accordingly. All of the checks that have the DNS server as a parent will fail too, but with a status of “unreachable”, and they will NOT notify if you have your notification options set up accordingly. In other words: DON’T notify me of any “unreachable”, since I know that is not the problem. I want to know what is broken, so only notify me of “DOWN” and “recovered”.
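The parent logic described above can be sketched in a few lines of Python. This is a simplified model, not Nagios code: a host whose check fails counts as DOWN if it has no parents or at least one parent is UP, and as UNREACHABLE otherwise. The topology and host names are hypothetical.

```python
# Simplified model of parent-based reachability; host names are made up.
# Assumes the parent relationships contain no cycles.
PARENTS = {
    "switch1": [],
    "dns1": ["switch1"],
    "pc-042": ["dns1"],
    "pc-043": ["dns1"],
}


def classify(host, check_ok, parents=PARENTS):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for host.

    check_ok maps each host name to True if its own check succeeded.
    """
    if check_ok[host]:
        return "UP"
    host_parents = parents.get(host, [])
    # No parents, or at least one parent still UP: the host itself is the problem.
    if not host_parents or any(
        classify(p, check_ok, parents) == "UP" for p in host_parents
    ):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE: the fault lies further up the tree.
    return "UNREACHABLE"


def should_notify(state, notify_on=("DOWN",)):
    """Notify only for the listed states; UNREACHABLE is left out by default."""
    return state in notify_on


# DNS server fails while the switch is fine: only dns1 triggers a notification.
checks = {"switch1": True, "dns1": False, "pc-042": False, "pc-043": False}
for name in PARENTS:
    state = classify(name, checks)
    print(name, state, "notify" if should_notify(state) else "quiet")
```

With the switch failing as well, `dns1` would turn UNREACHABLE too and only `switch1` would be reported as DOWN.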
Hmm… could be, although I downloaded the plugin not too long ago (I’ve only been using Nagios for about a month or so). I will try to find a newer plugin. I’ll let you know if I can find one and how it works out.