Problem with distributed monitoring

granjerox · July 10, 2009, 6:57am

Hi everyone,

I would like to describe a problem I have run into that some of you may already
have come across.

First of all, our servers run RHEL 4 Update 4 or Update 2, and we are running
Nagios 3.0.3, Nagios plugins 1.4.11, NSCA 2.7.2 and NRPE 2.12.

The scenario is as follows:

SERVER <---- NRPE ----> LOCAL NAGIOS <---- NSCA ----> CENTRAL NAGIOS
SERVER <---- NRPE ---->                                     |
SERVER <---- NRPE ----> LOCAL NAGIOS <---- NSCA ----> ------|
SERVER <---- NRPE ---->                                     |
SERVER <---- NRPE ----> LOCAL NAGIOS <---- NSCA ----> ------|
SERVER <---- NRPE ---->                                     |
SERVER <---- NRPE ----> LOCAL NAGIOS <---- NSCA ----> ------|
SERVER <---- NRPE ---->

The NSCA communication between the Nagios instances is handled by an event_handler
that sends information on every service status change.
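
For illustration, here is a minimal sketch of what such an event handler could look like. It is not the script the LOCAL NAGIOS actually runs (we have no access to that); the function names, the send_nsca binary path and the config path are assumptions. What is not assumed is the input format of send_nsca: one tab-delimited line per service result (host, service description, return code, plugin output). A handler like this would typically be called with the $HOSTNAME$, $SERVICEDESC$, $SERVICESTATEID$ and $SERVICEOUTPUT$ macros, and the state ID and output have to come from the same check result.

```python
import subprocess


def format_nsca_line(host, service, state_id, output):
    """Build one passive service check line in the tab-delimited
    format that send_nsca reads on stdin."""
    return f"{host}\t{service}\t{int(state_id)}\t{output}\n"


def send_result(line, central_host,
                send_nsca="/usr/local/nagios/bin/send_nsca",    # assumed path
                config="/usr/local/nagios/etc/send_nsca.cfg"):  # assumed path
    """Pipe one formatted line to send_nsca (-H: NSCA server, -c: client config)."""
    subprocess.run([send_nsca, "-H", central_host, "-c", config],
                   input=line.encode(), check=True)


# Example with values taken from the log excerpt in this post
line = format_nsca_line("SERVER_nodo1", "Eth1 Ancho de Banda", 3,
                        "UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!")
```

If the real handler builds the return code from a different source than the output text (for example from its own exit status), that would be one place where the two could get out of sync.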

Now that I have explained the scenario, here is the problem. This is a
filtered excerpt of /var/log/messages from the central Nagios:

Jul  7 15:19:58 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;3;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!
Jul  7 15:20:08 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;0;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!

Jul  7 15:20:28 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth0 Ancho de Banda;1;WARNING on inteface eth0 : 0 % , Enlace a 100Mb/s
Jul  7 15:20:38 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth0 Ancho de Banda;0;WARNING on inteface eth0 : 0 % , Enlace a 100Mb/s

Jul  7 15:20:58 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;3;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!
Jul  7 15:21:08 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;0;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!

Jul  7 15:21:28 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth0 Ancho de Banda;1;WARNING on inteface eth0 : 0 % , Enlace a 100Mb/s
Jul  7 15:21:38 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth0 Ancho de Banda;0;WARNING on inteface eth0 : 0 % , Enlace a 100Mb/s

Jul  7 15:21:58 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;3;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!
Jul  7 15:22:08 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;Eth1 Ancho de Banda;0;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!

As you can see, 10 seconds after each event we receive a second NSCA event with the same Status Information but with an OK status
value (0), which I think is wrong. I have checked the NRPE logs on the servers and the check results are right: only one execution per
period, with the correct output value. So the problem must be located at the LOCAL NAGIOS. One additional difficulty is that we only own
the servers and the CENTRAL NAGIOS, so we cannot work directly on the LOCAL NAGIOS; we can only suggest changes to its administrators.
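
To make the symptom easier to spot in a long log, a small script along these lines could flag every passive result whose return code does not match the state name at the start of the plugin output. This is only a sketch: it assumes the log line layout shown above and assumes the plugin output always begins with the state name (as it does for these two services); `find_mismatches` is a name made up for this example.

```python
import re

# Matches the "PASSIVE SERVICE CHECK" entries Nagios writes to syslog
LINE_RE = re.compile(
    r"PASSIVE SERVICE CHECK: (?P<host>[^;]+);(?P<service>[^;]+);"
    r"(?P<code>\d+);(?P<output>.*)$"
)
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def find_mismatches(lines):
    """Return (host, service, code, state_in_output) for every entry whose
    numeric return code disagrees with the state name leading the output."""
    mismatches = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        code = int(m.group("code"))
        words = m.group("output").split(None, 1)
        first_word = words[0] if words else ""
        if first_word in STATE_NAMES.values() and STATE_NAMES.get(code) != first_word:
            mismatches.append((m.group("host"), m.group("service"), code, first_word))
    return mismatches


sample = [
    "Jul  7 15:19:58 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;"
    "Eth1 Ancho de Banda;3;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!",
    "Jul  7 15:20:08 nag_srv nagios: PASSIVE SERVICE CHECK: SERVER_nodo1;"
    "Eth1 Ancho de Banda;0;UNKNOWN on inteface eth1 : 0 % , Enlace a Unknown!",
]
print(find_mismatches(sample))
```

On the two sample lines only the second entry (return code 0 with an UNKNOWN output) is reported; the first one is consistent.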

I hope someone can help me find out what is wrong in the system and how to correct it.

Thank you very much in advance.