Logfile monitoring and acknowledgement handler

happyfrog · July 31, 2008, 3:30pm

Hello everybody,

I’m thinking about logfile monitoring with error persistence and manual handling. I mean that if an error happens, the plugin should stay in the error state until a manual reset, because there is no other way for the plugin or Nagios to know that the error has been fixed (I am assuming that the logfile contains only error messages).

Without persistence, if the next logfile check succeeds, the service status will go back to OK while the initial error remains unfixed. I really want to avoid this behavior.

So what kind of “manual reset” should it be? It seems to me that “problem acknowledgement” is the best way to do that. But there would need to be some ‘acknowledgement handler’ that calls a script, which in turn signals the check plugin to turn the alarm off. At this time such a handler doesn’t exist, but it would be great to see it.
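
To make the idea concrete, here is a minimal sketch in Python of a check that latches into the error state. It is only an illustration: the names check_log and reset, the JSON state file and the "ERROR" pattern are all made up for this example. The only Nagios-specific part is the usual plugin exit-code convention (0 = OK, 2 = CRITICAL).

```python
import json
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def _load(state_file):
    """Read the saved state, or start fresh if it is missing or unreadable."""
    try:
        with open(state_file) as fh:
            return json.load(fh)
    except (FileNotFoundError, ValueError):
        return {"offset": 0, "latched": None}


def _save(state_file, state):
    with open(state_file, "w") as fh:
        json.dump(state, fh)


def check_log(logfile, state_file, pattern="ERROR"):
    """Scan only the new part of the logfile; once an error has been seen,
    keep returning CRITICAL until reset() is called."""
    state = _load(state_file)
    size = os.path.getsize(logfile) if os.path.exists(logfile) else 0
    if size < state["offset"]:  # logfile was rotated or truncated
        state["offset"] = 0
    if size:
        with open(logfile, "rb") as fh:
            fh.seek(state["offset"])
            for raw in fh:
                line = raw.decode("utf-8", "replace").rstrip()
                if pattern in line:
                    state["latched"] = line  # remember the latest error
            state["offset"] = fh.tell()
    _save(state_file, state)
    if state["latched"]:
        return CRITICAL, "CRITICAL - unacknowledged error: " + state["latched"]
    return OK, "OK - no unacknowledged errors"


def reset(state_file):
    """The manual reset: clear the latch but keep the read offset."""
    state = _load(state_file)
    state["latched"] = None
    _save(state_file, state)
```

The plugin would print the message and exit with the returned code. Here reset() stands in for whatever the ‘acknowledgement handler’ would call; until something like that exists, the technician would have to run it by hand.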

What do you think about that, or about other possible ways to achieve the same goal?

Loose · August 1, 2008, 8:56am

Hi!

I didn’t really understand the question, but here is (maybe) another way to monitor a logfile (that’s the way we do it anyway):

In fact, most error logfiles will keep on logging errors as long as the problem is still there. So our way to monitor them is to set a “clear threshold”: if the last error message is older than X minutes, we consider the problem solved.
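
A rough Python sketch of that clear threshold, just to show the logic. It assumes the logfile contains nothing but error messages, so the file's modification time can stand in for the time of the last error. The function names and the 30-minute default are invented for this example; 0 and 2 are the usual Nagios OK and CRITICAL exit codes.

```python
import os
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def status_for_age(last_error_time, now, clear_after_minutes):
    """OK once the last error is older than the threshold, else CRITICAL."""
    age_minutes = (now - last_error_time) / 60.0
    return OK if age_minutes > clear_after_minutes else CRITICAL


def check_logfile(logfile, clear_after_minutes=30, now=None):
    """Assumption: the logfile holds only errors, so its mtime is the
    time of the last error. A missing or empty file counts as OK."""
    if not os.path.exists(logfile) or os.path.getsize(logfile) == 0:
        return OK
    now = time.time() if now is None else now
    return status_for_age(os.path.getmtime(logfile), now, clear_after_minutes)
```

If the logfile also contains non-error lines, the modification time is no longer good enough and the timestamp of the last matching line would have to be parsed instead.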

(or, on some logfiles, when a technician solves the problem, he just deletes the logfile so a new one is created, empty of error messages :))

hope this helps a little bit ^^

happyfrog · August 1, 2008, 9:57am

Hi!

Yes, you got the point.
A timeout is not a full solution because, as you say, the alarm will disappear after X minutes, but I want to keep it until somebody resets it. In practice, it should be reset by the technician who solved the problem, whether that takes an hour, a day or a week.

It seems that logfile deletion is a really simple and affordable solution.
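
For the record, a deletion-based check could be as small as the Python sketch below. It assumes the logfile contains only error messages; the function name check_error_log is made up, and 0 and 2 are the usual Nagios OK and CRITICAL exit codes.

```python
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_error_log(logfile):
    """CRITICAL while the logfile holds any error line. Deleting (or
    emptying) the file is the manual reset."""
    if not os.path.exists(logfile):
        return OK, "OK - no error logfile"
    with open(logfile, errors="replace") as fh:
        lines = [ln.strip() for ln in fh if ln.strip()]
    if not lines:
        return OK, "OK - error logfile is empty"
    return CRITICAL, "CRITICAL - %d error line(s), last: %s" % (
        len(lines), lines[-1])
```

The alarm stays up for as long as the file exists with content in it, so it survives any number of successful checks until the technician removes the file.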

Thank you.