We run a virtual hosting service and use Nagios to monitor both the virtual machines and the physical servers upon which they live.
If there’s a network outage, we only get notifications about the physical servers being down: because of how we’ve set up host dependencies, the VMs are merely unreachable. We pay per SMS for notifications, so we don’t want to overdo it.
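A minimal sketch of this kind of setup, for illustration only. It assumes the relationship is expressed with the `parents` directive and that UNREACHABLE is left out of the VMs' `notification_options`. The host names, addresses, and the `generic-host` template are placeholders, not our real config.

```
# Physical server: send notifications on DOWN and RECOVERY
define host {
    use                     generic-host      ; assumed template name
    host_name               phys01            ; hypothetical host name
    address                 192.0.2.10        ; placeholder address
    notification_options    d,r
}

# VM: its parent is the physical server, so while phys01 is DOWN
# Nagios classes the VM as UNREACHABLE rather than DOWN
define host {
    use                     generic-host
    host_name               vm01              ; hypothetical host name
    address                 192.0.2.101       ; placeholder address
    parents                 phys01
    notification_options    d,r               ; no "u", so no SMS while UNREACHABLE
}
```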
If there’s a power outage, the same thing happens at first. But when the physical servers come back, it takes time for the VMs to come back up. During that time Nagios sees that the physical servers are reachable but the VMs are not, decides that the VMs are now in the DOWN state, and starts sending out all the notifications we didn’t want to get. Not great, and it can get costly in terms of utterly pointless text messages.
I can see what Nagios is doing and why it’s doing it, but I don’t know how to configure it so that it doesn’t, i.e. so that it recognises that those VMs being “down” is part of the same outage in which they were just “unreachable”. Has anybody got any good ideas on how to do this?