I’ve set up a Nagios installation on Ubuntu for work.
It works really well, and everyone loves it.
So far I’ve got about 60 hosts and 150 services, and everything’s running mighty fine.
I’ve got a question though.
One of our servers (W2K3, secondary DC) crashed the other day.
There was an error message on the login screen, and the event viewer was filled with “delayed write” and “unable to write to registry” errors.
However, Nagios failed to alert us…
The server was somehow responding to pings, reporting its uptime, memory usage, and CPU usage.
So the host wasn’t actually ‘down’ according to Nagios.
The only warning was on the drive monitoring: NRPE timed out because the failure had left the performance counters unqueryable.
This left me wondering: is there a plugin or check command that will actually test to see if a Windows server is truly up and running?
A check that would perform tests such as:
- testing a registry read/write
- testing a disk read/write
- checking to make sure there are no error messages on the login screen
- checking that network shares are readable
Or am I just pushing it here?
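
To make it concrete, here is a rough Python sketch of the kind of check I'm picturing. It is only an illustration under several assumptions: that a Python 3 interpreter can run on the Windows box, that the script is called through NRPE like any other plugin, and that the registry key `Software\NagiosCanary` and all the function names are made up for this example. It covers the registry, disk, and share tests; I have no idea how to detect an error dialog on the login screen, so that one is left out. The only part I'm sure about is the standard plugin convention of one line of output plus a return code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
"""Sketch of an 'is this server really alive' check (hypothetical names)."""
import os
import tempfile
import uuid

try:
    import winreg  # standard library, but only present on Windows
except ImportError:
    winreg = None

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_disk_rw(directory):
    """Write a random token to a scratch file, read it back, and compare."""
    token = uuid.uuid4().hex
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w") as handle:
                handle.write(token)
                handle.flush()
                os.fsync(handle.fileno())  # ask the OS to push it to disk
            with open(path) as handle:
                data = handle.read()
        finally:
            os.remove(path)
    except OSError as exc:
        return CRITICAL, "disk read/write failed in %s: %s" % (directory, exc)
    if data != token:
        return CRITICAL, "disk read-back mismatch in %s" % directory
    return OK, "disk read/write fine in %s" % directory


def check_share_readable(path):
    """Treat a share (or any directory) as readable if it can be listed."""
    try:
        os.listdir(path)
    except OSError as exc:
        return CRITICAL, "cannot read %s: %s" % (path, exc)
    return OK, "%s readable" % path


def check_registry_rw(subkey=r"Software\NagiosCanary"):
    """Write and read back a value under HKCU (made-up key name)."""
    if winreg is None:
        return UNKNOWN, "registry test skipped (not on Windows)"
    token = uuid.uuid4().hex
    try:
        key = winreg.CreateKeyEx(winreg.HKEY_CURRENT_USER, subkey, 0,
                                 winreg.KEY_READ | winreg.KEY_WRITE)
        try:
            winreg.SetValueEx(key, "probe", 0, winreg.REG_SZ, token)
            value, _ = winreg.QueryValueEx(key, "probe")
        finally:
            winreg.CloseKey(key)
    except OSError as exc:
        return CRITICAL, "registry read/write failed: %s" % exc
    if value != token:
        return CRITICAL, "registry read-back mismatch"
    return OK, "registry read/write fine"


def summarize(results):
    """Pick the most severe status; CRITICAL outranks WARNING and UNKNOWN."""
    order = [OK, UNKNOWN, WARNING, CRITICAL]
    status = max((s for s, _ in results), key=order.index)
    message = "; ".join(m for _, m in results)
    return status, "%s - %s" % (LABELS[status], message)


def run(scratch_dir, shares=()):
    """Run every test and return (exit_code, one_line_output)."""
    results = [check_registry_rw(), check_disk_rw(scratch_dir)]
    results += [check_share_readable(p) for p in shares]
    return summarize(results)
```

The wrapper script would print the line returned by `run()` and exit with the returned code, so Nagios treats it like any other service check. One caveat I can already see: reading a file straight after writing it may be served from the cache, so the disk test proves less than I'd like.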