Monitoring NT Windows Event Logs

MarkJ · December 9, 2005, 3:50pm

Hi,

Been using nagios a little for a few months now, but I want to start using it a little more seriously.

I’ve installed the NC_Net plugin on a couple of Win2k3 servers and I’m using it to monitor various event logs - currently in active mode, but I will change that soon.

I can see from the docs that I can filter out errors that I don’t care about (yes, I know there is not a keyboard on my server…).

So I get warnings in my nagios display and can get notifications if required - great!

However, if I configure the check to look at the last hour of the log, the warning clears at the end of the hour.

Ideally, the warning would persist until I’ve looked at the log, resolved any issue and cleared the warning.

Is this possible, or do I need a different plugin?

Thanks,
Mark

jakkedup · December 9, 2005, 4:44pm

nagios.sourceforge.net/docs/1_0/ … vices.html

Checking a log file for a “string” every 5 minutes like I do, would make this a volatile service check. In other words, you want to be notified each and EVERY time the check finds a NON-OK state.
But no, you can’t do what you wanted, since that would suspend checking. You really want the checks to stop, until someone has acknowledged and fixed the first problem? I don’t think so. It makes more sense to go ahead and check the log again, and simply notify you if it finds something since the last log check.

MarkJ · December 13, 2005, 5:43pm

Volatile services sound ideal - I don’t care if checks continue after a problem has been spotted, I just need to ensure that the warning doesn’t go away once it drops out of the bottom of the log (or rather out of the last 15 mins, which is what I am checking).

So, I had a go at this with my current config, modifying my service to look like this :

define service{
    use                     generic-service
    hostgroup_name          edmuk_dc
    service_description     Directory Service Event Log
    is_volatile             1
    check_period            24x7
    max_check_attempts      2
    normal_check_interval   30
    retry_check_interval    3
    contact_groups          admins
    notification_interval   960
    notification_period     24x7
    check_command           check_win2k3_event!Directory Service!Error!15,0,0,0
    }

Unfortunately, once the error drops out of the bottom of the log, nagios resets the state to OK.

Am I missing something, or does this only work with passive checks ?

I’m using 2.0b6

Thanks for your help.

jakkedup · December 16, 2005, 2:23pm

A retry check interval with a volatile service? That’s not right. Think about it: we are grepping a log file for a “string”. If that string is found, alert me and show “Critical” in the display. Why would I then check the log again with a retry check? I know very well that the retry is going to come back “OK”, since log checks such as an event log check will not read the entire log over again. If they did, you would be alerted each and every time the check ran, since it would always find what you are looking for at least once (I assume).

Nagios log checks and event log checks remember the position in the log that they read up to last time. At the next normal check interval the plugin reads the log again, but only from where it left off. If it finds “string”, alert me; if not, don’t. So:
max_check_attempts = 1
normal_check_interval = 30 (if you wish)
retry_check_interval = doesn’t matter, since max_check_attempts is 1

MarkJ · December 20, 2005, 10:51am

This started off as an active check - I’ve now set the service’s active_checks_enabled to false.

However, the way that NC_Net event log checks appear to work is that you specify how far back in the logs to check (I’m now checking back over an 8 minute period). Every check run within those 8 mins will report a critical, and then once the error falls out of the bottom the check will respond with OK.

I was originally hoping to do as you suggest - is there something better I could use which works in the way you suggest on NT/Win2k3 ?

Otherwise I could set the interval for the passive check to match how far back I check in the log. If I do this, what happens if the passive check results don’t get back to the nagios server (i.e. a WAN interruption)? Will I lose the alert?

Anyway, at the moment the passive check is working and when an error appears in the event log, nagios shows me a Critical and I get an email. However, once the error drops out of the bottom of the log, nagios clears the alert back to OK. Should it not persist, or can I get it to persist until I manually acknowledge the problem?

Thanks,
Mark

jakkedup · December 22, 2005, 5:11pm

[quote=“MarkJ”]…
If I do this, what happens if the passive check results don’t get back to the nagios server (i.e. a WAN interruption)? Will I lose the alert?
[/quote]

The alert is coming from your nagios server when you are using a passive check such as this. If your network is messed up, then you won’t be getting any status updates to the nagios server with ANY passive checks. Shouldn’t you be monitoring the network devices too? i.e. switches, spanning tree, routers, etc?

[quote=“MarkJ”]…
Anyway, at the moment the passive check is working and when an error appears in the event log, nagios show me a Critical and I get an email.
[/quote]

Great, that is just how it should work.

[quote=“MarkJ”]…
However, once the error drops out of the bottom of the log, nagios clears the alert back to OK. Should it not persist, or can I get it to persist until I manually acknowledge the problem?
[/quote]

Again, that would be the worst thing you could want. If it worked that way, then nagios would suspend making any checks of your log and would not email you about any NEW problems until someone clicked the “Acknowledged” box. Absolutely not a good idea.

I don’t see what the problem is. Nagios is getting passive check results from a remote machine that looks at a log file. If a problem is found, nagios alerts you with an email and sets the status to “critical”. Upon the next check in 5 minutes, the log check passes with no trouble and the status is set to “OK”. Since it’s a volatile service, even if the check is found to be “critical” again, you will again get an email.

I have many log file checks and they all behave that way. Nobody in their right mind is going to be looking at the nagios website for errors such as this. This type of service check is transient and the ONLY way to handle it is with the email notification.

Again, if email is down, or the network is down, then the least of your troubles is some logfile check. You have much bigger problems at that point and should have been checking the network with nagios so you would know exactly what is broken, or what network drop cable is unplugged.

MarkJ · January 4, 2006, 5:38pm

Hi,

Hope everyone had a good new year - now back to the work…

MarkJ wrote: …
…
If I do this, what happens if the passive check results don’t get back to the nagios server (i.e. a WAN interruption)? Will I lose the alert?

Jakkedup Wrote:…

The alert is coming from your nagios server when you are using a passive check such as this. If your network is messed up, then you won’t be getting any status updates to the nagios server with ANY passive checks. Shouldn’t you be monitoring the network devices too? i.e. switches, spanning tree, routers, etc?

Yes, we are monitoring all the network switches, VPNs etc. for failures. What I want to avoid is a brief network issue preventing other issues from being identified and looked at. I guess the way to do this is to collect all the Nagios passive checks into a single location on each site and get these sent back to the single site. I guess I should see if NC_Net retries sending passive results if the first attempt fails…

MarkJ wrote: …
…
Anyway, at the moment the passive check is working and when an error appears in the event log, nagios show me a Critical and I get an email.

Jakkedup Wrote:…

Great, that is just how it should work.

MarkJ wrote: …
…
However, once the error drops out of the bottom of the log, nagios clears the alert back to OK. Should it not persist, or can I get it to persist until I manually acknowledge the problem?

Jakkedup Wrote:…

Again, that would be the worst thing you could want. If it worked that way, then nagios would suspend making any checks of your log and would not email you about any NEW problems until someone clicked the “Acknowledged” box. Absolutely not a good idea.

[Agreed - markj]

Jakkedup Wrote:…

I don’t see what the problem is. Nagios is getting passive check results from a remote machine that looks at a log file. If a problem is found, nagios alerts you with an email and sets the status to “critical”. Upon the next check in 5 minutes, the log check passes with no trouble and the status is set to “OK”. Since it’s a volatile service, even if the check is found to be “critical” again, you will again get an email.

I guess my issue here is partly due to the signal to noise ratio in my inbox. If I have an Active Directory replication problem over the weekend I don’t want to get a few hundred emails about it. I would like to be able to switch off notifications over the weekend (for some servers) and be able to check the nagios status page to see what’s broken.

Jakkedup Wrote:…

I have many log file checks and they all behave that way. Nobody in their right mind is going to be looking at the nagios website for errors such as this. This type of service check is transient and the ONLY way to handle it is with the email notification.

Perhaps the problem is that I’m not in my right mind! Specific services running on specific hosts I can check, e.g. that my SMTP server is working or that DNS responds. What I’m looking to catch is any errors in the log (filtering out the chaff), and for Nagios to display that something was not quite right, e.g. that my Active Directory has decided to stop syncing (it appears to only tell you this once). In other words I would like some visual clue in Nagios that someone should investigate the log in case we have a problem - this is an issue that won’t clear itself, but will stop complaining in the log.

Jakkedup Wrote:…

Again, if email is down, or the network is down, then the least of your troubles is some logfile check. You have much bigger problems at that point and should have been checking the network with nagios so you would know exactly what is broken, or what network drop cable is unplugged.

True, however some of these issues (e.g. VPNs going down) resolve themselves by the time I wake up - I still think I want to be able to see what’s broken with a quick look at a status webpage rather than trawling through my inbox.

It can’t just be me that thinks it’s easier to have a quick look at a webpage than find the important emails in amongst the spam in my inbox ?

Thanks for your help,
Mark

jakkedup · January 5, 2006, 6:10pm

Re: "If I have an Active Directory replication problem over the weekend I don’t want to get a few hundred emails about it. I would like to be able to switch off notifications over the weekend (for some servers) and be able to check the nagios status page to see what’s broken."

My reply:
You could schedule notifications to be turned off over the weekend, or you could define a time period named Not_on_weekends and put that in services.cfg for the check_period directive.

Re:“In other words I would like some visual clue in Nagios that someone should investigate the log in case we have a problem - this is an issue that won’t clear itself, but will stop complaining in the log”

My reply:
I see no other way than to have alerts sent by email. I suppose you could use a log checker that does not remember that it has already read the log up to point X. That way, it will read the entire log again and, of course, find the ERROR again and show “critical” again. In that kind of setup, you would then have to clear the “ERROR” out of the log file in order for nagios to stop showing CRITICAL. That wouldn’t be too hard, since you could simply go to the windows box and clear out the log file.

Another thing you could do is to not use NC_Net, or a mini nagios setup, but rather schedule a check_log using the remote system’s cron or windows scheduler. You could modify a log file plugin to ONLY execute the send_nsca script or the submit_check_result script in the event that “ERROR” was found during the log check. If “ERROR” is not found, since it found it 20 minutes ago, then your modified plugin will simply not submit anything at all. So, your nagios website will show “Critical” forever, once an error has been detected. It will not go back to OK unless you personally, and manually, submit a passive check result of OK by using the CGI interface.
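
A minimal Python sketch of that “only submit when something was found” idea, under some assumptions: the host name, service name and file paths used here are hypothetical, and the tab-separated line is the usual input format that send_nsca reads on stdin. Treat it as an outline rather than a finished plugin.

```python
import subprocess


def nsca_lines_for(new_log_lines, needle, host, service):
    """Build send_nsca input lines, one per matching log line.

    Returns an empty list when nothing matched, so the caller submits
    nothing and Nagios keeps whatever state it last received.
    """
    lines = []
    for entry in new_log_lines:
        if needle in entry:
            # send_nsca expects: host<TAB>service<TAB>return_code<TAB>output
            output = entry.replace("\t", " ").strip()
            lines.append("%s\t%s\t%d\t%s\n" % (host, service, 2, output))
    return lines


def submit(lines, nagios_server,
           send_nsca="/usr/local/nagios/bin/send_nsca",
           config="/usr/local/nagios/etc/send_nsca.cfg"):
    """Pipe the lines to send_nsca; both default paths are assumptions."""
    if not lines:
        return  # stay silent: no OK result is ever sent automatically
    subprocess.run([send_nsca, "-H", nagios_server, "-c", config],
                   input="".join(lines).encode(), check=True)
```

Because an OK is never sent automatically, the service stays critical on the status page until someone submits an OK by hand.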

[Edited Thu Jan 05 2006, 07:13PM]

MarkJ · January 12, 2006, 2:49pm

Firstly, thanks Jakkedup for sticking with me this long !

jakkedup:
You could schedule notifications to be turned off over the weekend, or you could define a time period named Not_on_weekends and put that in services.cfg for the check_period directive.

Which would mean I would lose notification about any issues over the weekend - not quite what I’m after…

Jakkedup:
I see no other way than to have alerts sent by email. I suppose you could use a log checker that does not remember that it has already read the log up to point X. That way, it will read the entire log again and, of course, find the ERROR again and show “critical” again. In that kind of setup, you would then have to clear the “ERROR” out of the log file in order for nagios to stop showing CRITICAL. That wouldn’t be too hard, since you could simply go to the windows box and clear out the log file.

It could work, but not ideal as it’s useful to have the log for future use sometimes.

But hang on, I think you’re onto something here…

Jakkedup:
Another thing you could do is to not use NC_Net, or a mini nagios setup, but rather schedule a check_log using the remote system’s cron or windows scheduler. You could modify a log file plugin to ONLY execute the send_nsca script or the submit_check_result script in the event that “ERROR” was found during the log check. If “ERROR” is not found, since it found it 20 minutes ago, then your modified plugin will simply not submit anything at all. So, your nagios website will show “Critical” forever, once an error has been detected. It will not go back to OK unless you personally, and manually, submit a passive check result of OK by using the CGI interface.

I think this is the way forward! Problem is that I’m going to end up running code I’ve hacked on our live servers - I’m always a bit worried when I run anything other than shell scripts that I’ve hacked myself!

Unfortunately, although we have some dev staff here, network monitoring doesn’t really have a budget (is it the case everywhere?), so I’d best go find a windoze C compiler. Should be a laugh.

Anyway, once again thanks for your input - I shall report back if I ever produce anything useful!

Regards,
Mark

jakkedup · January 12, 2006, 6:04pm

re:" Which would mean I would lose notification about any issues over the weekend - not quite what I’m after… "
You told me you wanted them off for some services, so you don’t get thousands of them, so turn them off. Do you or do you NOT want them off?

re:" It could work, but not ideal as it’s useful to have the log for future use sometimes. "
Just make a copy of the log before you clear it.

MarkJ · January 13, 2006, 5:37pm

I don’t want the notifications to come spewing out over the weekend (i.e. when my replication moans every 5 mins), but I do want to know that there was a problem. If notifications are switched off over the weekend (and I’m using a volatile check) I would lose all visibility of the fact there had been a problem.

re:" It could work, but not ideal as it’s useful to have the log for future use sometimes. "
Just make a copy of the log before you clear it.

Fair comment.

Anyway, looks like I’ve found a solution in Nagevlog (steveshipway.org/software/), which finds new messages in the windows event logs and posts them to nsca (or in my case NC_Net).

Now, once I’ve looked into the error I can manually submit a passive result through the nagios front end to clear the warning.
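
For anyone who would rather script the manual clear than click through the web front end, the same effect can be had by writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The sketch below is only an outline: the host name, service name and command file path are assumptions and will differ per install.

```python
import time


def passive_result_command(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def clear_service(host, service,
                  command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Send an OK result for the service (the command file path is an assumption).

    The command file is normally a named pipe that Nagios reads from,
    so the line is simply written to it.
    """
    line = passive_result_command(host, service, 0,
                                  "OK - reviewed and cleared by hand")
    with open(command_file, "w") as fh:
        fh.write(line)
```

External commands need to be enabled in the main config for this to work, which is also what the CGI passive result form relies on.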

Now my only wish is that every time a passive result is received, the Current Attempt counter could increment past max_check_attempts - I only want this for informational purposes though.

Anyway, thanks for your help.

jakkedup · January 13, 2006, 5:47pm

max_check_attempts=1 with a passive check such as a log check. Do you really want the log checker to look again to make sure that it did actually find the words “something is busted” all over again? I think once should suffice. I don’t understand why you would want to check the log again.

MarkJ · January 13, 2006, 5:55pm

Yep, I’ve got my max_check_attempts set to 1 and am using a passive check.

Was just thinking that if another passive check was received with an error it would be nice if there was some way of knowing by seeing a counter, i.e. so you know how many passive checks have been received in the current state.

Perhaps it’s not really very useful - just thought it might be interesting.

jakkedup · January 13, 2006, 5:57pm

So is Nagevlog going to submit to nagios every 5 minutes that it checked the log? Or is it only going to submit when it finds the words “whatever” in the event log? If the latter, then you will get what you wanted, i.e. the nagios CGI page will STAY in warning mode and you will only get ONE notification. But if it reports to nagios every 5 minutes, then your CGI page is going to be reset to “OK”.

Let me know, it looks like it might work and would be great for others too.

jakkedup · January 13, 2006, 6:02pm

[quote=“MarkJ”]Yep, I’ve got my max_check_attempts set to 1 and am using a passive check.

Was just thinking that if another passive check was received with an error it would be nice if there was some way of knowing by seeing a counter, i.e. so you know how many passive checks have been received in the current state.

Perhaps it’s not really very useful - just thought it might be interesting.[/quote]

I think I get it. So if the remote host sends you 10 alerts from reading the log file, you would see a counter like “10 alerts since last “OK” state” or something like that.
To see that type of info, use the “View Trends For This Service” link in the service detail screen.

MarkJ · January 13, 2006, 6:20pm

Nagevlog checks the log every 1 min (in my setup) and reports every time it finds an event of type Error (or rather that matches the configured filter). If no errors are found it stays silent (so nagios will remain in the warning state), and if multiple errors are found it will report them in turn (hence my question about a counter).

So, in my setup I will disable email notifications over the weekend and switch them back on Monday at 8am, yet I can still look at the nagios status page to see if we had any log errors over the weekend.

For info, it works using filters which you can apply to one or all of the logs. You can filter on event type (e.g. warnings, errors, info or audits) and then match on (or exclude) eventid, event source or a string in the event text.

One thing to watch out for is having different filters for warnings and errors reporting to the same service in nagios (so nagios can show a warning on a warning and critical on an error). If you get an error and then a warning in your event log, nagios will go to critical and then to warning, which is not ideal.
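
One possible way around that downgrade, sketched in Python under the assumption that results pass through a small relay script of your own before they reach nagios (this is not a Nagevlog feature, and the class name is made up): only forward a result if it does not lower the severity already shown, and let an explicit OK reset things.

```python
class WorstStateFilter:
    """Forward a result only if it does not lower the severity already shown.

    Intended for return codes 0 (OK), 1 (WARNING) and 2 (CRITICAL).
    """

    def __init__(self):
        self.current = {}  # (host, service) -> last forwarded return code

    def should_forward(self, host, service, return_code):
        key = (host, service)
        if return_code == 0:
            # An explicit OK (for example a manual clear) always resets the state.
            self.current[key] = 0
            return True
        if return_code >= self.current.get(key, 0):
            self.current[key] = return_code
            return True
        # A warning arriving while critical is showing is dropped.
        return False
```

The simpler alternative is to point the warning filter and the error filter at two separate services, so neither can overwrite the other.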

Still, it appears to be a nice app (runs as a service) and a good solution to my requirement - if only I’d spotted it a little sooner I could have saved us both some time talking about it.

jakkedup · January 13, 2006, 6:35pm

But the important thing is that we both, and maybe others, have found another nice little plugin. It sounds like it would help me out too, since the log checker I am using only reports what it found, and if it finds it more than once it reports a count (#somenumber) and then the “last event found” line, unlike the one you found, which shows every instance it found.