Distributed monitoring

dierk · July 24, 2008, 7:02am

Hello,

I’ve set up Nagios distributed monitoring using the NSCA plugin.
Everything is working fine, but there is one thing I cannot figure out.

Normally, when a service goes down for the first time (SOFT 1), the next check is scheduled using the retry check interval.
So if you set up a service with a normal check interval of 5 minutes and a retry check interval of 1 minute, it should take at most 7 minutes before an alert is sent (up to 5 minutes until the first failed check, then two retries 1 minute apart, given 3 check attempts).
But in my situation it takes 15 minutes before an alert is sent.
Any ideas?
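
The arithmetic behind the 7 and 15 minute figures can be sketched as follows. This is only an illustration: the function name is made up, and the value of 3 for the maximum check attempts is an assumption taken from the HARD;3 entry in the central log.

```python
def minutes_to_hard_state(check_interval, retry_interval, max_check_attempts):
    """Worst-case minutes from the failure until the HARD state is reached.

    The first failing check can come up to one normal interval after the
    failure; each remaining attempt follows one retry interval later.
    """
    return check_interval + (max_check_attempts - 1) * retry_interval


# Expected: normal interval 5 min, retry interval 1 min, 3 attempts (assumed).
expected = minutes_to_hard_state(5, 1, 3)

# Observed: consistent with every result arriving 5 minutes apart,
# i.e. the retry interval is effectively not applied.
observed = minutes_to_hard_state(5, 5, 3)

print(expected, observed)
```

The second call models the case where the retries are spaced by the normal interval instead of the retry interval, which gives the 15 minutes described above.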

This is the log of the central Nagios server:
[07-24-2008 07:17:57] SERVICE NOTIFICATION: NB_mail;test;PING;CRITICAL;notify-by-email;CRITICAL - Host Unreachable (192.168.17.204)
[07-24-2008 07:17:57] SERVICE ALERT: test;PING;CRITICAL;HARD;3;CRITICAL - Host Unreachable (192.168.17.204)
[07-24-2008 07:12:57] SERVICE ALERT: test;PING;CRITICAL;SOFT;2;CRITICAL - Host Unreachable (192.168.17.204)
[07-24-2008 07:07:57] SERVICE ALERT: test;PING;CRITICAL;SOFT;1;CRITICAL - Host Unreachable (192.168.17.204)
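
To make the spacing between the state changes explicit, the timestamps in the central log above can be compared with a small script. This is a rough sketch; the helper names are hypothetical, and the timestamp format is assumed to be month-day-year as it appears in the central log.

```python
from datetime import datetime

# SERVICE ALERT lines copied from the central Nagios log.
LOG_LINES = [
    "[07-24-2008 07:17:57] SERVICE ALERT: test;PING;CRITICAL;HARD;3;CRITICAL - Host Unreachable (192.168.17.204)",
    "[07-24-2008 07:12:57] SERVICE ALERT: test;PING;CRITICAL;SOFT;2;CRITICAL - Host Unreachable (192.168.17.204)",
    "[07-24-2008 07:07:57] SERVICE ALERT: test;PING;CRITICAL;SOFT;1;CRITICAL - Host Unreachable (192.168.17.204)",
]


def alert_time(line):
    """Parse the bracketed timestamp at the start of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S")


def gaps_in_seconds(lines):
    """Return the seconds between consecutive alerts, oldest first."""
    times = sorted(alert_time(line) for line in lines)
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(LOG_LINES)
print(gaps)  # [300, 300]
```

Both gaps are 300 seconds, so the central server moves from SOFT 1 to HARD 3 at the normal 5 minute interval rather than at the 1 minute retry interval.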

This is the log of the Nagios server that sends the check results to the central Nagios server:
[24-07-2008 07:14:59] HOST ALERT: test;DOWN;SOFT;9;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:13:49] HOST ALERT: test;DOWN;SOFT;8;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:12:39] HOST ALERT: test;DOWN;SOFT;7;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:11:29] HOST ALERT: test;DOWN;SOFT;6;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:10:21] HOST ALERT: test;DOWN;SOFT;5;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:09:09] HOST ALERT: test;DOWN;SOFT;4;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:07:59] HOST ALERT: test;DOWN;SOFT;3;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:07:49] SERVICE ALERT: test;PING;CRITICAL;HARD;1;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:06:49] HOST ALERT: test;DOWN;SOFT;2;CRITICAL - Host Unreachable (192.168.17.204)
[24-07-2008 07:05:39] HOST ALERT: test;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%

dierk · July 24, 2008, 9:17am

I’ve done some more testing with the freshness_threshold parameter on the central Nagios server.
If this option is set in the service definition with a value of 600, the alert message is sent after roughly 15 minutes.

If I lower this value to 60 seconds, all the passive services keep going down and coming back up.

If I don’t set this option at all, it works fine, but that is not the recommended setting for distributed monitoring.
[07-24-2008 10:22:13] SERVICE NOTIFICATION: NB_mail;test;PING;CRITICAL;notify-by-email;CRITICAL - Host Unreachable (192.168.17.204)
[07-24-2008 10:22:13] SERVICE ALERT: test;PING;CRITICAL;HARD;3;PING CRITICAL - Packet loss = 100%
[07-24-2008 10:21:53] Warning: The results of service ‘PING’ on host ‘test’ are stale by 0d 0h 0m 44s (threshold=0d 0h 1m 15s). I’m forcing an immediate check of the service.
[07-24-2008 10:20:03] SERVICE ALERT: test;PING;CRITICAL;SOFT;2;CRITICAL - Host Unreachable (192.168.17.204)
[07-24-2008 10:19:03] SERVICE ALERT: test;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100%
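
One way to reason about the two freshness_threshold values tried above is to compare the threshold with how often passive results reach the central server. The sketch below is a simplified model, not how Nagios implements freshness checking: the function names are hypothetical, the 300 second arrival period is an assumption based on the 5 minute spacing in the first central log, and any slack Nagios adds on top of the configured threshold is ignored.

```python
def is_stale(seconds_since_last_result, freshness_threshold):
    """A result counts as stale once it is older than the threshold."""
    return seconds_since_last_result > freshness_threshold


def stale_seconds_per_cycle(result_period, freshness_threshold):
    """Seconds per cycle during which the last result is already stale.

    result_period is the assumed time between two passive results.
    """
    return max(0, result_period - freshness_threshold)


RESULT_PERIOD = 300  # assumed: one passive result every 5 minutes

for threshold in (600, 60):
    stale = stale_seconds_per_cycle(RESULT_PERIOD, threshold)
    print(f"threshold={threshold}s -> stale for {stale}s of every {RESULT_PERIOD}s")
```

Under these assumptions a threshold of 600 never marks a result as stale, while a threshold of 60 leaves results stale for most of each cycle, so the central server would force its own check again and again. That would be consistent with the passive services going down and up.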