Distributed nagios inconsistent results

basvanveen · January 27, 2010, 4:51pm

Hey,

I’m getting pretty annoyed by the fact that Nagios shows different states whenever I refresh my master GUI. For instance, a given service can be CRITICAL at one moment and OK after I hit refresh (and vice versa). The state duration is inconsistent too: it can say the service has been CRITICAL for a few hours, and after a refresh it says it has been OK. Can someone explain to me how this can occur?

Here are some details of my setup:

hosts : 1000+
services : 4000+

We have one master receiving check results from 6 different slaves through NSCA. The master itself doesn’t run any checks; it only processes the passive results. I can’t see anything unusual in the event or debug log. If we scale down the number of results, the problem doesn’t occur.

Hope someone has an explanation for this behavior.

Best Regards,
Bas

luca · January 29, 2010, 12:27pm

Stop nagios.

Check for any surviving processes with ps -ef | grep nagios
Kill these processes.

Start nagios.

If it solves your problem, NEVER use /etc/init.d/nagios restart again.
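The steps above can be sketched as a small shell helper. This is only an illustration, not the init script's own logic: the function names survivor_pids and clean_restart are made up here, the paths are the ones that appear later in this thread, and the sleep durations are arbitrary guesses.

```shell
#!/bin/sh
# Sketch: stop Nagios, kill any surviving daemons, then start it again.
# Paths are taken from this thread; adjust them for your install.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/nagios}"

# Hypothetical helper: reads "PID COMMAND ARGS..." lines on stdin and
# prints the PIDs whose command is exactly the Nagios binary.
survivor_pids() {
    awk -v bin="$NAGIOS_BIN" '$2 == bin { print $1 }'
}

# Hypothetical helper: stop, clean up survivors, start.
clean_restart() {
    "$INIT_SCRIPT" stop
    sleep 5    # arbitrary grace period for the daemon to exit

    pids=$(ps -e -o pid= -o args= | survivor_pids)
    if [ -n "$pids" ]; then
        echo "Killing surviving nagios processes: $pids"
        kill $pids    # unquoted on purpose: one argument per PID
        sleep 2
        # Escalate to SIGKILL only for processes that ignored SIGTERM.
        pids=$(ps -e -o pid= -o args= | survivor_pids)
        if [ -n "$pids" ]; then
            kill -9 $pids
        fi
    fi

    "$INIT_SCRIPT" start
}
```

Run clean_restart as root in place of the restart command. Matching on the exact binary path keeps the grep process itself and the ndo2db daemons out of the kill list.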

basvanveen · January 29, 2010, 1:32pm

Luca, thanks for your reply! The process list doesn’t look strange, although I sometimes see a lot (10-15) of /usr/local/nagios/bin/ndo2db-3x -c /usr/local/nagios/etc/ndo2db.cfg processes.

I disabled the ndomod and ndo2db modules yesterday and this seems to make the setup stable. Is there some way NDO can cause this strange behavior, and if so, what can I change to make it stable? We use the ndo2db extension to fill a database with details for our Cacti/event handling, so disabling it for a long period isn’t much of an option.

luca · January 29, 2010, 5:20pm

Hello, you didn’t say if you tried checking the process list after stopping nagios.

basvanveen · February 1, 2010, 9:37am

Whoops, my bad, I misinterpreted your answer. It seems you’re right about the lost processes:

ps aux | grep nagios.cfg
nagios 11355 2.7 0.1 20884 9024 ? Ssl 10:05 0:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

-bash-3.2# /etc/init.d/nagios stop
Stopping nagios: No lock file found in /usr/local/nagios/var/nagios.lock
(after this the process is still alive)

After an /etc/init.d/nagios start:
-bash-3.2# ps aux | grep nagios.cfg
nagios 4042 2.0 0.1 20992 9196 ? Ssl Jan31 15:16 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11355 2.4 0.1 20884 9028 ? Ssl 10:05 0:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 19658 0.0 0.0 2920 716 pts/0 R+ 10:38 0:00 grep nagios.cfg
nagios 24477 2.8 0.0 19860 8160 ? Ssl 10:25 0:22 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
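The "No lock file found" message suggests why these daemons pile up: Nagios writes its PID to the lock file, and a stop that cannot find the file has no PID to signal. A rough way to spot such orphans is to list every nagios daemon whose PID differs from the one recorded in the lock file. The sketch below assumes the paths shown in this thread; the function names orphan_pids and report_orphans are made up for illustration. Short-lived child processes forked by the running daemon would also be listed, so treat the output as a hint rather than a kill list.

```shell
#!/bin/sh
# Sketch: list nagios daemons that the lock file does not account for.
# Paths are taken from this thread; adjust them for your install.
LOCK_FILE="${LOCK_FILE:-/usr/local/nagios/var/nagios.lock}"
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"

# Hypothetical helper: reads "PID COMMAND ARGS..." lines on stdin and
# prints each PID that runs the Nagios binary but is not the PID passed
# as $1 (the PID recorded in the lock file, or empty if there is none).
orphan_pids() {
    awk -v bin="$NAGIOS_BIN" -v locked="$1" \
        '$2 == bin && $1 != locked { print $1 }'
}

# Hypothetical helper: compare the live process list to the lock file.
report_orphans() {
    locked=""
    if [ -r "$LOCK_FILE" ]; then
        locked=$(cat "$LOCK_FILE")
    fi
    ps -e -o pid= -o args= | orphan_pids "$locked"
}
```

With no lock file present, report_orphans prints every running nagios daemon, since none of them can be reached through the init script's stop action.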

So you suggest running /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg manually?

luca · February 1, 2010, 10:23am

The only thing to do is NOT to run the restart command. Use reload, or stop and then start.

basvanveen · February 1, 2010, 4:23pm

OK, thanks for the suggestions. I will try that this week and see if it has the right effect. I still wonder why this only occurs on a restart :S