Alert only when both servers fail


#1

I’m monitoring two DNS servers that belong to our ISP. The primary DNS server times out often and alerts whoever is on call. I’d like to set up some kind of relationship so that both DNS servers must fail before any notification is sent out. Any ideas?

Nagios 1.3 on Debian Sarge


#2

If it’s always your primary that times out, then it isn’t too hard.
Just use the parents directive in your hosts file.
Here is what will happen:
If a service fails on DNS-pri, Nagios checks the host and looks to see whether it has a parent host. In this case it does: DNS-sec. Nagios then checks DNS-sec to make sure it is up. If DNS-sec is UP, Nagios would normally send an alert for DNS-pri, but you don’t want that, so set notifications_enabled 0 on DNS-pri (or set notification_options r,u, or whatever suits you).
The Nagios web page will show DNS-pri as down, but you won’t be alerted, since notifications have been disabled for DNS-pri.
On the other hand, if Nagios checks the parent (DNS-sec) and finds that DNS-sec is ALSO down, a notification is sent stating that DNS-sec is DOWN. If you leave notifications enabled for DNS-pri with notification_options r,u, you will be notified TWICE: once for DNS-sec being DOWN and once for DNS-pri being UNREACHABLE.

define host{
    use                     generic-host    ; Name of host template to use
    host_name               DNS-pri
    alias                   DNS primary
    address                 x.x.x.x
    parents                 DNS-sec
    notifications_enabled   0
    }

define host{
    use                     generic-host    ; Name of host template to use
    host_name               DNS-sec
    alias                   DNS secondary
    address                 x.x.x.x
    }
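To make the outcomes of this setup explicit, here is a small bash sketch that tabulates which notification the configuration above would send for each combination of host states. It only models the behaviour described in this post (DNS-pri has DNS-sec as its parent and notifications disabled); it is not Nagios code, and `pages_for` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: models the notification behaviour described above for
# DNS-pri (parents DNS-sec, notifications_enabled 0). Not Nagios code;
# pages_for is a hypothetical helper name.
# $1 = DNS-pri host state, $2 = DNS-sec host state (UP or DOWN)
pages_for() {
    if [ "$2" = "DOWN" ]; then
        # DNS-sec has notifications enabled, so its DOWN state pages
        # someone. If DNS-pri is also failing it is marked UNREACHABLE,
        # but its own notifications are disabled.
        echo "DNS-sec DOWN"
    else
        # DNS-sec is UP: a failing DNS-pri shows in the web UI, but its
        # notifications are disabled, so nobody is paged.
        echo "none"
    fi
}

# Print the full truth table.
for pri in UP DOWN; do
    for sec in UP DOWN; do
        printf 'pri=%-4s sec=%-4s -> %s\n' "$pri" "$sec" "$(pages_for "$pri" "$sec")"
    done
done
```

The row with DNS-pri UP and DNS-sec DOWN is the catch: if only the secondary fails, you are still paged, which is why this approach assumes the primary is the flaky one.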


#3

The other option would be to identify the two DNS servers as a cluster and then ‘monitor the cluster as a collective entity.’ Check the docs for the HOW-TO:

nagios.sourceforge.net/docs/2_0/clusters.html
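The HOW-TO does this with the check_cluster plugin. As a rough illustration of the counting idea only (this is not check_cluster itself), here is a bash sketch of a hypothetical `cluster_state` helper. It takes the state IDs of the member service checks (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, the same values Nagios plugins return) and reports CRITICAL only when every member is non-OK.

```shell
#!/bin/bash
# Hypothetical helper: cluster_state STATE_ID...
# Prints and returns the state of the cluster as a whole:
#   OK       (0) when every member is OK
#   WARNING  (1) when some, but not all, members are non-OK
#   CRITICAL (2) when every member is non-OK
cluster_state() {
    local total=$# bad=0 s
    for s in "$@"; do
        # Any non-zero state (WARNING, CRITICAL, UNKNOWN) counts as bad.
        if [ "$s" -ne 0 ]; then
            bad=$((bad + 1))
        fi
    done
    if [ "$bad" -eq 0 ]; then
        echo "OK"; return 0
    elif [ "$bad" -lt "$total" ]; then
        echo "WARNING"; return 1
    else
        echo "CRITICAL"; return 2
    fi
}
```

With the two DNS server checks as input, `cluster_state 2 0` prints WARNING (one server down) and `cluster_state 2 2` prints CRITICAL, so restricting notifications for the cluster service to critical and recovery states (notification_options c,r) would page only when both servers fail.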

[Sweet, I’ve reached 100 posts on this forum! I’ve learned a lot since post number one. Many thanks to everyone for helping me along, esp. jakkedup…he’s like the Nagios guru.]


#4

I recently wrote a small shell script for checking DNS propagation delay, and I whittled it down to the bare minimum needed to do what you want. In Nagios you can configure something like:

define service{
    host_name               dns_servers
    service_description     DNS Servers

    check_command           check_dnsservers!4.2.2.1 4.2.2.2
    }

Replace the IPs above with a space-separated list of your DNS servers. You can use IP addresses, FQDNs, or even just ns1 or ns2 if /etc/resolv.conf is configured to search a default domain. Then add:

define command{
    command_name    check_dnsservers
    command_line    $USER1$/check_dnsservers.sh $ARG1$
    }

And finally for the check_dnsservers.sh script:

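#!/bin/bash
# check_dnsservers.sh - Nagios plugin: OK if every listed DNS server
# answers, WARNING if only some do, CRITICAL if none do.

# Servers may arrive as separate arguments or as one space-separated
# string, so let the shell word-split them into an array.
serverlist=($*)
servercount=${#serverlist[@]}

outtext="Unknown error"
outcode=3

# With no servers the checks below would wrongly report OK.
if [ "$servercount" -eq 0 ] ; then
    echo "UNKNOWN: no DNS servers given"
    exit 3
fi

index=0
loopresult=0
badservers=""

while [ "$index" -lt "$servercount" ]
do
    # dig exits non-zero (9) when a server sends no reply. Send stdout
    # to /dev/null first, then stderr, so both are discarded.
    # +time/+tries shorten dig's default timeout (adjust to taste).
    if dig @"${serverlist[$index]}" www.yahoo.com A +short +time=3 +tries=2 > /dev/null 2>&1 ; then
        loopresult=$((loopresult + 1))
    else
        badservers="$badservers ${serverlist[$index]}"
    fi

    index=$((index + 1))
done

if [ "$loopresult" -eq "$servercount" ] ; then
    outcode=0
    outtext="All DNS servers working"
elif [ "$loopresult" -eq 0 ] ; then
    outcode=2
    outtext="No DNS servers working"
else
    outcode=1
    outtext="DNS servers not responding:${badservers}"
fi

echo "$outtext"
exit $outcode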

I should add that having your primary DNS server down is still a bad thing: it can cause poorly written apps and OSs to time out or hang for a while before moving on to the next DNS server. When users start calling to complain that web pages are taking forever to load, the first thing I look for is a failed DNS server.
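To see how much delay each server adds, you could time a lookup against each one. The bash sketch below defines a hypothetical `query_ms` helper that prints the query time dig reports (from its ";; Query time: N msec" line), or "timeout" when dig gets no reply. The +time=2 +tries=1 values are an assumption chosen so a dead server costs about two seconds.

```shell
#!/bin/bash
# Hypothetical helper: query_ms SERVER NAME
# Prints dig's reported query time in milliseconds, or "timeout" when
# the server does not answer (dig exits non-zero, e.g. 9, in that case).
query_ms() {
    local out
    if out=$(dig @"$1" "$2" A +time=2 +tries=1 2>/dev/null); then
        # Pull the number out of the ";; Query time: N msec" line.
        printf '%s\n' "$out" | awk '/^;; Query time:/ { print $4; found = 1 }
                                    END { if (!found) print "timeout" }'
    else
        echo "timeout"
    fi
}
```

For example, `for s in 4.2.2.1 4.2.2.2; do echo "$s $(query_ms "$s" www.yahoo.com)"; done` prints one line per server, so a slow or dead primary stands out next to the secondary.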

-mike


#5

Thanks for the hints. I think the service cluster is exactly what I was trying to accomplish. I figured someone had already implemented it. That’s the great thing about open source. :)
Thanks to nagios my workplace actually knows the primary dns server is terribly unreliable. If only they had figured that out years ago. ;-p