Nagios init script creates concurrent daemons

Environment: RHEL4 U5 (x86_64)
Nagios Version: 3.0.4, 3.0.6
Symptom:

Scenario 1:

  • Your check results seem to be frozen at a single point in time
  • You see multiple lines in nagios.log reading:
    Warning: fork() in my_system() failed for command …
    Warning: Could not stat() check result file ‘/var/nagios/spool/checkresults …’
  • Memory consumption of one of the nagios processes is very high, up to the remaining physical memory, so your machine starts swapping
  • Restarting seems to solve the problem, but after a couple of hours the same errors occur again

Scenario 2:

  • Your check results seem to switch each time you refresh your browser window (F5) (“OK”… “UNKNOWN”… “OK”)

Research:
I figured out that multiple parent daemon processes were running (ps -C nagios -o pid,stat | grep s; the “s” in the STAT column marks a session leader). But how could this happen?
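The ps one-liner above can be turned into a count. This is only a minimal sketch: the function name count_session_leaders is made up for illustration, and it assumes the procps ps found on RHEL, where a session leader carries an “s” in the STAT column.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID STAT" lines on stdin and prints how many
# of them are session leaders (STAT contains a lowercase "s").
count_session_leaders() {
    awk '$2 ~ /s/ { n++ } END { print n + 0 }'
}

# Usage on a live system (the "=" suppresses the header line):
#   ps -C nagios -o pid=,stat= | count_session_leaders
```

A healthy system should print 1. Anything above that means several parent daemons are running.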

When it starts, the nagios binary looks for the nagios.lock file to detect a running instance. If it finds one, it tries to find the corresponding PID. If this PID is not alive, it starts. But if the nagios.lock file does not exist at all, nagios starts without trying to find another instance. (This should be useful if you intentionally run independent instances of nagios.)
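The three cases can be illustrated with a small shell function. This is not the nagios source, just a sketch of the behaviour described above. The name lock_status is hypothetical, and using kill -0 as the liveness test is my assumption.

```shell
#!/bin/sh
# Hypothetical helper: classifies a lock file as "missing", "running" or "stale".
# Assumption: kill -0 is good enough to tell whether the PID is alive
# (it can fail for processes owned by another user unless you are root).
lock_status() {
    lock_file=$1
    if [ ! -f "$lock_file" ]; then
        echo missing      # nothing to check: a new daemon would start blindly
        return
    fi
    pid=$(cat "$lock_file")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo running      # lock file points at a live process
    else
        echo stale        # lock file left behind by a dead process
    fi
}
```

The “missing” case is the dangerous one here: if something deletes the lock file while the old daemon is still alive, nothing stops a second daemon from starting.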

If you use the init script /etc/init.d/nagios to stop nagios, it will try to kill the nagios process and wait for up to 10s. If this fails, the message ‘Warning - nagios did not exit in a timely manner’ appears, but some files, including the nagios.lock file, are removed anyway. As a consequence, if you run “/etc/init.d/nagios restart” and nagios is not able to shut down in time, you end up with two instances running. Bang.

Solution:
To verify the presence of two or more nagios daemons I wrote a passive check script. If you implement it, you will have a short window to recognize the situation before nagios stops working as you expect :roll:
Maybe you want to add some email feature so that the alert comes directly from the script, not from a half-dead nagios.
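For readers who do not want to unpack the attachment, here is a rough idea of what such a passive check can look like. It is not the attached script, only a sketch. The function name, the service name nagios_procs and the command file path in the usage comment are assumptions; the PROCESS_SERVICE_CHECK_RESULT line follows the Nagios external command format.

```shell
#!/bin/sh
# Hypothetical helper: builds one passive check result line.
# Arguments: host name, service description, number of parent daemons, unix time.
format_passive_result() {
    host=$1
    service=$2
    count=$3
    now=$4
    if [ "$count" -eq 1 ]; then
        code=0
        text="OK - 1 nagios daemon running"
    else
        code=2
        text="CRITICAL - $count nagios daemons running"
    fi
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$service" "$code" "$text"
}

# Usage from cron (command file path is an assumption, adjust to your setup):
#   count=$(ps -C nagios -o stat= | grep -c s)
#   format_passive_result "$(hostname)" nagios_procs "$count" "$(date +%s)" \
#       >> /var/nagios/rw/nagios.cmd
```

Because the result goes through the very nagios that may be misbehaving, sending a mail from the same cron job when the count is not 1 is the more reliable alarm.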

Additionally, I suggest changing the init script so that the lock file is not removed as long as the process refuses to die.
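The idea behind that change, reduced to a stand-alone function. Again only a sketch and not the patched init script itself: stop_daemon and its arguments are hypothetical names, and the 10 second default simply mirrors the wait time mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: stop a daemon, but only remove its lock file once the
# process is really gone. Arguments: PID, lock file, timeout in seconds.
stop_daemon() {
    pid=$1
    lock_file=$2
    timeout=${3:-10}
    kill -TERM "$pid" 2>/dev/null || true
    waited=0
    # Poll once per second until the process is gone or the timeout is reached.
    while [ "$waited" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || break
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        # Still alive: keep the lock file so a new instance refuses to start.
        echo "Warning - nagios did not exit in a timely manner; keeping $lock_file" >&2
        return 1
    fi
    rm -f "$lock_file"
    return 0
}
```

With the lock file left in place, a following start finds a live PID and does not launch a second daemon, so a failed “restart” stays a visible failure instead of silently producing two instances.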

The two scripts are attached. Feel free to use them and adapt them to your needs. Be careful about simply overwriting your init script with mine; please back up and patch your own.

[attachment=0]check_nagios_proc.tgz[/attachment]

In the init script, I changed line 174 so that it no longer removes ‘$NagiosRunFile’ and added line 171 to remove it instead (only in case nagios is really dead).