Nagios hanging


#1

Hi all.
I’m curious if anybody else has this problem…

The Nagios install is running quite fine together with nagiostat. The problem is that sometimes it looks like Nagios simply stops executing checks. In fact Nagios IS running (and possibly even performing the checks), but the last check times were all two hours back… (and the nagiostat graphs stopped at the same time).

root@submariner(sun58):/# ps -ef | grep nagios
nagios 23770 23769 0 0:00
nagios 5389 1 0 02:01:02 ? 8:52 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg
nagios 23769 5389 0 07:27:10 ? 0:00 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg

Killing the 23769 process has brought Nagios back to life… and all last check times went somewhere near the actual time… going to check some http logs to see if Nagios checked the sites in the meanwhile or what else…
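For anyone wanting to spot the stuck child without eyeballing the listing, here is a minimal Python sketch. It assumes the `ps -ef` column layout shown above (UID PID PPID C STIME TTY TIME CMD); the function names `parse_ps` and `daemon_children` are made up for illustration, and the binary path is just the one from this install.

```python
def parse_ps(ps_output):
    """Parse `ps -ef`-style lines into (uid, pid, ppid, command) tuples."""
    rows = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip headers and anything without numeric PID/PPID columns.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        # Assumed layout: UID PID PPID C STIME TTY TIME CMD...
        # Caveat: an STIME printed as a date with a space would shift the columns.
        command = " ".join(fields[7:])
        rows.append((fields[0], int(fields[1]), int(fields[2]), command))
    return rows


def daemon_children(rows, binary="/export/nagios2/bin/nagios"):
    """Return PIDs whose parent is the main daemon (the one with PPID 1)."""
    daemons = {pid for _, pid, ppid, cmd in rows
               if ppid == 1 and cmd.startswith(binary)}
    return sorted(pid for _, pid, ppid, _ in rows if ppid in daemons)


sample = """\
nagios 23770 23769 0 0:00
nagios 5389 1 0 02:01:02 ? 8:52 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg
nagios 23769 5389 0 07:27:10 ? 0:00 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg
"""

# Direct children of the main daemon are the candidates to inspect.
print(daemon_children(parse_ps(sample)))
```

This only lists the daemon's direct children; deciding whether one of them is really hung (rather than a check that is legitimately running) is still a judgment call.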

At the moment, to limit this, I have a “nagiosreload” script running each night, so at most it loses a day (but that’s BAD anyway).
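A tighter alternative to a blind nightly reload might be to look at how old the newest check is and only react when it goes stale. The Python sketch below is not the actual “nagiosreload” script; it assumes the status file contains lines of the form `last_check=<epoch seconds>`, which may not match every Nagios version, and the names `newest_last_check` and `checks_are_stale` are hypothetical.

```python
import re
import time

# Assumption: the status file has lines like "last_check=1130000000".
LAST_CHECK = re.compile(r"^\s*last_check=(\d+)\s*$", re.MULTILINE)


def newest_last_check(status_text):
    """Return the most recent last_check timestamp found, or None."""
    stamps = [int(value) for value in LAST_CHECK.findall(status_text)]
    return max(stamps) if stamps else None


def checks_are_stale(status_text, max_age_seconds=900, now=None):
    """True when no check ran within max_age_seconds (or none were found)."""
    now = time.time() if now is None else now
    newest = newest_last_check(status_text)
    return newest is None or (now - newest) > max_age_seconds


# Made-up status snippet, only to show the shape of the input.
example = """\
service {
    last_check=1000
}
service {
    last_check=1600
}
"""

print(newest_last_check(example))
print(checks_are_stale(example, max_age_seconds=900, now=2000))
```

Run from cron every few minutes against the real status file, a stale result could trigger the reload (or just an alert), so a hang would cost minutes instead of up to a day.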

Any ideas from those running 2.03b? BTW I have had this with all nagios2 releases.

Thank you in advance, Luca

PS: Running on solaris 8 together with an MRTG instance.

EDIT: An update on this: in fact killing the above process woke Nagios up, but only for a really short time. It looks like it did a complete check of all hosts/services and hung again…


#2

Sorry, but I’m running rock solid with Nagios 1.2 using nagiostat to keep graphs. No troubles.


#3

Jakkedup… i know…
“never touch a running system” :smiley:

I made that mistake :slight_smile:

Luca


#4

a small update on this one…

it just hung again…

ps -ef | grep nagios gave this as a result:

root@submariner(sun58):/# ps -ef | grep nagios
nagios 16926 16925 0 0:00
root 20698 20696 0 16:52:17 pts/1 0:00 grep nagios
nagios 16925 24670 0 14:33:20 ? 0:00 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg
nagios 24670 1 0 02:01:01 ? 20:30 /export/nagios2/bin/nagios -d /export/nagios2/etc/nagios.cfg

killing -9 the 16925 process brought Nagios back to life and the dead child vanished… BUT I had no more info forwarded to nagiostat… maybe this helps to pinpoint the problem :slight_smile:

Luca


#5

That’s too bad. I was hoping v2.0 would be out in a stable version soon. We actually need a good Nagios system up and running, so I’m not about to play around with the beta much yet. I installed it to help answer some of these questions, and found tons of bugs just trying to get it to start, let alone run stable.


#6

in the meanwhile i tried disabling nagiostat as it could have something to do with the hangups…
let’s see what happens :slight_smile:

Luca


#7

thought i got him… but it wasn’t the usual problem…
Nagios was hanging, and this time the console said Nagios wasn’t running, which it usually doesn’t… but it was because nagiostat’s debug.log file had filled the disk… I emptied the file and all is normal again… quite error tolerant :slight_smile:

Luca


#8

HUH?? you mean that nagios was hanging due to local disk filling up? Is that what you are saying?
I hope you have changed
my $DEBUGLEVEL = 0;
in your nagiostat file if that is the case.
You really only need the output when you add new things, or want to take a look for little problems.
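If the debug log ever has to stay on for a while, a small check like this Python sketch could warn before the disk fills. The thresholds and the name `log_needs_attention` are assumptions picked for illustration, not anything nagiostat ships with.

```python
import os
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.free / usage.total if usage.total else 0.0


def log_needs_attention(log_path, max_bytes=50 * 1024 * 1024, min_free=0.10):
    """True if the log is larger than max_bytes or the disk is nearly full."""
    size = os.path.getsize(log_path)
    directory = os.path.dirname(os.path.abspath(log_path))
    return size > max_bytes or free_fraction(directory) < min_free
```

Pointed at the debug.log path from cron, a True result could send a mail or truncate the file; which reaction is right depends on whether the debug output is still needed.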


#9

yes but it was not my usual hangup problem…
and I changed the DEBUGLEVEL to 0 (I forgot to last time, when I added some services which weren’t working) :slight_smile:

Still on the lookout for my other hangup problem… :frowning:

Luca