Help! Conflicted between 24x7 vs. specific check period

Our company has a real dilemma with Nagios built-in functionality. We’re torn
between two courses of action.

We have processes that are started and stopped out of cron on a daily basis.
We want to monitor these processes during running hours, and alert ONLY during
running hours.

So one would think setting up a notification_period to match running hours would work.
The problem is with check_period behavior.

You see, we have a LOT of folks that look at our Nagios web page, and they DON’T want
to see red (or yellow, or orange) if there’s not a problem. This includes management,
and I’d like to please the people writing my paycheck. With check_period set to 24x7,
the checks look for running processes that aren’t there at night, so each check
turns red and produces red entries in various places on the UI.

Alternatively, I am concerned about side effects of setting check_period to be the same
as notification_period. (And I remember reading warnings about this in the Nagios docs.)
Often, a process needs to be manually stopped just a couple of minutes before
its official stop time. Nagios notices and flags it as down. Now, the next morning
comes along, and let’s say that there’s a fatal error in the start process. We would want
an alert right away; I think this would only be possible if Nagios viewed this situation as
a state change. Or is there another way?

My ultimate question is this.

What is the best balance between “clean”, “correct”, “easy”, and “use only built-in functionality” for solving the following problem? (I understand all 4 may not be possible.)

We want to be guaranteed of (relatively) immediate alerts in the morning if our
processes don’t successfully start, but we don’t want “red pollution” anywhere in the UI
during times when A) our processes are not running, and B) we don’t care that they’re not running.

On a longer-term note, I would love it if the devs would consider adding a state
to the UI / engine, which I would dub “Not checked right now”. Perhaps a
couple of config options to allow or prevent preservation of earlier state. Maybe a soft
blue color in the UI. Haven’t thought about long-term implementation / implications yet.

Thanks in advance for any feedback - it is greatly appreciated.

Hi!

Good question!
We have exactly the same problem (or almost :)), and we “solved” it by writing a small script that is called before calling NRPE (or sometimes after) and takes the start and stop times of the check as arguments:
=> if the current time is inside these bounds, the check is performed (the script takes all the arguments and calls NRPE)
=> if the current time is outside these bounds, the check returns OK

==> here you go :) you have a nice green page when outside the bounds, and the check is performed when it should be
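
A minimal sketch of a wrapper along these lines, in Python. This is an illustration, not the script mentioned above: the file name window_check.py, the function names, and the HH:MM argument format are all assumptions. It relies only on the standard plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
#!/usr/bin/env python3
"""Time-window wrapper for a Nagios check (illustrative sketch, names are hypothetical)."""
import subprocess
import sys
from datetime import datetime, time


def parse_hhmm(text):
    """Turn an 'HH:MM' string into a datetime.time."""
    return datetime.strptime(text, "%H:%M").time()


def in_window(now, start, stop):
    """True if now is in [start, stop); also handles windows that cross midnight."""
    if start <= stop:
        return start <= now < stop
    return now >= start or now < stop


def run_check(start, stop, command, now=None):
    """Return (exit_code, output) for the wrapped check."""
    if now is None:
        now = datetime.now().time()
    if not in_window(now, start, stop):
        # Outside running hours: report OK without running the real check.
        window = "%s-%s" % (start.strftime("%H:%M"), stop.strftime("%H:%M"))
        return 0, "OK - outside %s, check skipped" % window
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:
        # The wrapped plugin could not be started at all.
        return 3, "UNKNOWN - could not run check: %s" % exc
    # Inside running hours: pass the real plugin's result straight through.
    return result.returncode, result.stdout.strip()


def main(argv):
    if len(argv) < 4:
        print("UNKNOWN - usage: window_check.py HH:MM HH:MM command [args...]")
        return 3
    code, output = run_check(parse_hhmm(argv[1]), parse_hhmm(argv[2]), argv[3:])
    print(output)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

It could be used as the check command with something like `window_check.py 08:00 18:00 /path/to/check_nrpe -H <host> -c <command>` (path, host and command are placeholders). Because the wrapper reports OK overnight, a failed start in the morning shows up as an OK-to-CRITICAL transition, so with check_period left at 24x7 it should count as a state change and alert as usual. The tradeoff is that the overnight green only means "not being checked", not "running".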

As a side note: you’re right, Nagios should have this option ^^

Thank you!! It’s nice to know I’m not crazy and that
other people are having this problem.