I posted this on the other forum site, which now looks like it has disappeared off the Internet, but I digress.
I have a Nagios installation that I migrated from 1.x to 2.x (2.5 to be exact) on Gentoo. I kept all the service/host/command configs and started fresh with nagios.cfg and cgi.cfg. The problem now is that when a host is in scheduled downtime (slept), service recovery notifications are still sent. This is annoying, since when you sleep a host you expect no notifications to come from it at all. The 2.x documentation even says this should be the case.
I tried this migration before (around 2.3) and it did the same thing then. With the same configs (apart from the minor adjustments needed for contact groups), 1.x works properly.
So is this a problem with 2.x or something hidden in my configs or host/service templates?
Any clues are appreciated.
Try 2.6, and see if that fixes it.
I upgraded to 2.6. It still has the same behavior, but I can expand on the problem.
It seems that services that have already sent problem notifications will send recovery notifications even if the host is “slept”, BUT services that haven’t sent problem notifications (i.e. the host was slept before the service went critical) will not send recovery notifications while the host is still slept.
Again, this is not proper behavior to my knowledge.
During the period of scheduled downtime, my understanding and experience is that no notifications will be sent. A change of state outside the period of scheduled downtime will result in a notification being sent. For example, a service goes down during the period of downtime, the period of downtime ends and then the service comes up - this results in a recovery notification being sent as the recovery happened outside the period of scheduled downtime.
Scheduled downtime does not affect whether checks are done, nor whether the status is displayed on the screen. It only affects whether notifications are sent or not (and I guess escalation).
That’s my experience anyway; I haven’t gone so far as to check under the bonnet to see if that’s what the code is supposed to do.
I think I see the problem now. You have only scheduled downtime for the host. You must schedule downtime for the services as well. Try that and see if it works.
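For anyone who wants to try that from a script rather than the web interface, here is a rough sketch that builds the two external command lines (host downtime, plus downtime for every service on that host). SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME are standard Nagios external commands, but check the external command list for your version. The function name, the host name and the other values in the usage line are made up for illustration.

```python
import time


def downtime_commands(host, start, end, author, comment, now=None):
    """Build external command lines that put a host and all of its
    services into fixed scheduled downtime from `start` to `end`
    (both Unix timestamps)."""
    now = int(time.time()) if now is None else int(now)
    start, end = int(start), int(end)
    # Field order: host;start;end;fixed;trigger_id;duration;author;comment
    # fixed=1 (fixed window), trigger_id=0 (not triggered by another downtime).
    args = f"{host};{start};{end};1;0;{end - start};{author};{comment}"
    return [
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{args}",
        f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{args}",
    ]


if __name__ == "__main__":
    # Hypothetical host and window: one hour starting now.
    t = int(time.time())
    for line in downtime_commands("web01", t, t + 3600, "admin", "maintenance"):
        print(line)
```

Each printed line would then be appended to the Nagios command file (the path set by command_file in nagios.cfg) for Nagios to pick up.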
I just completed some experiments with this very problem using Nagios 3.2.0 on RHEL5.2. Here is what I found.
Service A Critical hard state reached and corresponding notification sent.
Downtime scheduled for Service A’s host until well beyond when the recovery occurred.
Service A Recovery hard state (Ok) reached and corresponding notification sent.
The next test I did was to verify that I didn’t need to schedule downtime for the service as well, so I repeated the above scenario except that I scheduled downtime for Service A and for Service A’s host. The results were the same. Finally, as a sanity test, I did the following…
Schedule downtime for Service A’s host (only host downtime) until well beyond the next set of tests.
Service A Critical hard state reached and NO notification sent.
Service A Recovery hard state (Ok) reached and NO notification sent.
I still need to verify this on 3.2.3. I will do so and post results when I get to it. This raises the question, though, of why a notification was sent out during a scheduled downtime window. The only difference here is that the service entered the non-OK hard state before the downtime began. If all events occurred within the downtime window, it worked exactly as expected.
In my opinion, this appears to be a bug in the code, but I’m willing to listen to reasoning as to why it would behave in this fashion. This appears to be a seldom-encountered scenario. However, if you forget to schedule downtime for something until after you have started maintenance, it’s kind of a pain to get these notifications while you’re in the outage window, and there doesn’t appear to be any way to stop them unless you specifically disable notifications for the service.
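To make the difference between the two test runs concrete, here is a toy model of the behavior I observed. It is not taken from the Nagios source; it only encodes what the tests above showed, namely that a recovery notification goes out whenever a problem notification was sent earlier, regardless of downtime. The event names and the function are my own invention.

```python
def simulate(events):
    """Replay a sequence of events and return the notifications that the
    OBSERVED behavior would produce. This is a model of the test results
    only, not of the actual Nagios notification logic.

    Recognized events: "downtime_start", "downtime_end",
    "critical" (hard CRITICAL state), "ok" (hard OK state).
    """
    in_downtime = False
    problem_notified = False
    sent = []
    for event in events:
        if event == "downtime_start":
            in_downtime = True
        elif event == "downtime_end":
            in_downtime = False
        elif event == "critical":
            # Problem notifications are suppressed inside downtime.
            if not in_downtime:
                sent.append("PROBLEM")
                problem_notified = True
        elif event == "ok":
            # Observed: the recovery is sent if a problem notification
            # went out earlier, even though downtime is now active.
            if problem_notified:
                sent.append("RECOVERY")
            problem_notified = False
        else:
            raise ValueError(f"unknown event: {event}")
    return sent


if __name__ == "__main__":
    print(simulate(["critical", "downtime_start", "ok"]))
    print(simulate(["downtime_start", "critical", "ok"]))
```

The first sequence (problem before downtime) yields a PROBLEM and a RECOVERY notification, while the second (everything inside downtime) yields none, matching the two runs described above.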
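As a stopgap, disabling notifications for the affected service can be scripted too. This is only a sketch: DISABLE_SVC_NOTIFICATIONS and ENABLE_SVC_NOTIFICATIONS are standard external commands, but the function names, the host and service names, and the command file path in the usage lines are placeholders, so substitute the command_file value from your own nagios.cfg.

```python
import time


def svc_notification_command(host, service, enable, now=None):
    """Build the external command line that enables or disables
    notifications for a single service."""
    now = int(time.time()) if now is None else int(now)
    name = "ENABLE_SVC_NOTIFICATIONS" if enable else "DISABLE_SVC_NOTIFICATIONS"
    return f"[{now}] {name};{host};{service}"


def submit(command_file, line):
    """Append one command line to the Nagios command file."""
    with open(command_file, "a") as fh:
        fh.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical names; print instead of writing to a real command file.
    print(svc_notification_command("web01", "HTTP", enable=False))
    # submit("/path/to/nagios.cmd", svc_notification_command("web01", "HTTP", False))
```

Remember to send the matching ENABLE_SVC_NOTIFICATIONS command once the maintenance is over, since this setting is not tied to the downtime window.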
I just confirmed this behavior with Nagios 3.2.3 as well.
Confirmed this on 3.3.1 too. I’ll submit a bug to the Nagios tracker.