Cached status causing flapping?


#1

Hey guys. Looks like I have something set up wrong.

I have noticed that the nagios web interface seems to be serving old information. I have disabled proxies and forced a refresh with Shift. The same behaviour occurs in IE and Firefox.

Here is a series of refreshes for a host.

Last status check       13:21  13:25  13:25  13:26  13:30  13:31  13:32
Last state change       13:10  13:05  13:10  13:05  13:10  13:10  13:05
Last host notification  13:10  n/a    13:10  n/a    13:10  13:10  13:05

The status check time seems correct, and I believe the records where the host notification is n/a are correct. Where are the other records coming from?

regards

Marty


#2

I’m not sure what you are looking at; perhaps a link would help. One thing you might not realize is that, in version 1.2 and earlier, hosts are never checked unless a service check fails.


#3

It is the same for services. For example
/nagios/cgi-bin/extinfo.cgi?type=2&host=apchndc01&service=Ping

Basically I have been playing with "Disable notifications for this service"
Initially the notifications are disabled.
I hit disable notifications.
I hit force-refresh until, after ten seconds or so, the page shows notifications as disabled.
I continue to hit force-refresh every minute, but almost every time the page flips between the information from before the notifications change and the information from after it.

I don’t believe the problem has anything to do with notifications, however; they just make it easier to see. I can do this for any page in the nagios web interface and watch the information cycle.

Where could nagios be caching the old information?


#4

The status is kept in files or a MySQL database, depending on how you compiled it. It almost sounds like a browser problem rather than a nagios problem. I don’t see how nagios could be reporting old data to you when the data was fresh the minute before. Sorry, but I’m at a loss.


#5

Check that you don’t have two instances of nagios running. Stop nagios, wait a minute, kill any nagios processes that are still alive, and then start it again.

Luca


#6

Good call, Luca.
Try a ps -ef|grep nagios
just to make sure they are all dead.
I’ve had a few of them running before, and I don’t know why.
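
For anyone following along, here is a minimal sketch of how the leftover-process check could be scripted. It reads ps -ef output on stdin and counts nagios processes that were started before today. count_old_nagios is a made-up helper name, and the sketch assumes that ps prints STIME as HH:MM for processes started today and as a date (such as Mar14) for older ones, and that the command is in the eighth column. More than one old process is only a hint that stale daemons may be around, not proof.

```shell
# Hypothetical helper: count nagios processes started before today.
# Reads `ps -ef` output on stdin.
# Assumptions: column 1 is the user, column 5 is STIME (HH:MM for today,
# a date such as Mar14 otherwise), column 8 is the start of the command.
count_old_nagios() {
    awk '$1 == "nagios" && $5 !~ /:/ && $8 ~ /^(\[nagios\]|.*\/nagios)$/ { n++ }
         END { print n + 0 }'
}

# Typical use on a live system:
#   ps -ef | count_old_nagios
```

A daemon that was started today will not be counted, so on a freshly restarted box a result of 0 is normal.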


#7

Guys

Thanks for the tips.

apausprx01:~# ps -ef | grep nagios
nagios 23396 1 0 Mar14 ? 00:00:10 [nrpe]
nagios 27642 1 13 Mar14 ? 4-13:46:55 [nagios]
nagios 18026 1 13 Mar15 ? 4-12:55:00 [nagios]
nagios 27366 1 13 Mar15 ? 4-12:19:59 [nagios]
nagios 1166 1 13 Mar15 ? 4-11:54:57 [nagios]
nagios 10889 1 13 Mar15 ? 4-11:21:48 [nagios]
nagios 14818 1 13 Mar16 ? 4-07:45:14 [nagios]
nagios 5151 1 12 Mar16 ? 4-03:38:39 [nagios]
nagios 14999 1 0 Mar18 ? 00:07:19 [nagios]
nagios 23557 1 0 Apr14 ? 00:00:52 [nagios]
nagios 2835 23557 0 09:11 ? 00:00:00 [nagios]
nagios 2836 2835 14 09:11 ? 00:00:00 /usr/bin/perl /usr/local/nagios/nagiostat/nagiostat -p 1113779470|!!|apchndc02|!!|Ping|!!|OK|!!|PING OK - Packet loss = 0%, RTA = 433.70 ms|!!|

Are there meant to be that many nagios processes running?


#8

[root@tia2665 etc]# ps -ef|grep nagios
nagios 1898 1 0 Apr17 ? 00:12:49 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

I wonder why you don’t get the entire line, as I do above.
How do you start nagios anyway?
Mine is started by /etc/rc.d/init.d/nagios, which starts the daemon with the command that you see above.

Anyhow, if you have done /etc/rc.d/init.d/nagios stop, then you should no longer see any nagios procs running.

And yes, there will be many of them running at times, since you most likely have this in your nagios.cfg:

# MAXIMUM CONCURRENT SERVICE CHECKS
# This option allows you to specify the maximum number of
# service checks that can be run in parallel at any given time.
# Specifying a value of 1 for this variable essentially prevents
# any service checks from being parallelized. A value of 0
# will not restrict the number of concurrent checks that are
# being executed.

max_concurrent_checks=0


#9

jakkedup

I figured out why the init.d/nagios script was failing.

As you also picked up, for some reason the process list does not show the process as /usr/sbin/nagios but as [nagios].

The check_nagios plugin therefore always failed, so the script assumed nagios was not running.

In the init.d script, I changed /usr/sbin/nagios to [nagios], and then it started working properly.

But why is the process shown as [nagios]?

This is Debian (mainly stable, but some from the testing branch) with a 2.4.18 Debian kernel.

Below is the portion of the init.d script that was causing the problem.

check_started () {
    if [ -f /etc/nagios/check_nagios_db ]; then
        if ! /etc/nagios/check_nagios_db > /dev/null; then
            return 1 # isn't started
        fi
    elif [ -f /usr/lib/nagios/plugins/check_nagios ]; then
        if ! /usr/lib/nagios/plugins/check_nagios -e 15 -F /var/log/nagios/status.log -C '/usr/sbin/nagios' > /dev/null; then
            return 1 # isn't started
        fi
    fi
    if [ -f $PIDFILE ]; then
        pid=`cat $PIDFILE`
        if [ "$pid" ] && ps $pid >/dev/null; then
            return 0 # Is started
        fi
    fi
    return 1 # Isn't started
}
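
The mismatch between the /usr/sbin/nagios string the script looks for and the [nagios] name that ps reports can be illustrated with a small sketch. is_nagios_cmd is a hypothetical helper, not part of the Debian init script or of check_nagios; it only shows one way a match could accept both the full path and the bracketed short name.

```shell
# Hypothetical helper: succeed if a ps CMD column looks like the nagios
# daemon, whether ps shows the full path or only the bracketed name.
# $1 = the CMD column of one ps line.
is_nagios_cmd() {
    first=${1%% *}              # first word of the command column
    case "$first" in
        '[nagios]'|*/nagios|nagios) return 0 ;;
    esac
    return 1
}

# Example:
#   is_nagios_cmd '[nagios]' && echo "looks like nagios"
```

Only the first word is compared, so a plugin or helper script that merely lives under a nagios directory is not mistaken for the daemon.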

Here is the full ps. As you can see, there are other processes which have the same [format]. They are all system daemons.

apausprx01:~# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Apr18 ? 00:00:04 init [2]
root 2 1 0 Apr18 ? 00:00:00 [keventd]
root 3 0 0 Apr18 ? 00:00:00 [ksoftirqd_CPU0]
root 4 0 0 Apr18 ? 00:00:01 [kswapd]
root 5 0 0 Apr18 ? 00:00:00 [bdflush]
root 6 0 0 Apr18 ? 00:00:00 [kupdated]
root 29 1 0 Apr18 ? 00:00:34 [kjournald]
root 413 1 0 Apr18 ? 00:00:05 /sbin/syslogd
root 416 1 0 Apr18 ? 00:00:00 /sbin/klogd
root 421 1 0 Apr18 ? 00:00:00 /usr/sbin/inetd
root 436 1 0 Apr18 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe
root 472 436 0 Apr18 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql 473 472 0 Apr18 ? 00:00:00 [mysqld]
root 474 472 0 Apr18 ? 00:00:00 logger -p daemon.err -t mysqld_safe -i -t mysqld
mysql 477 473 0 Apr18 ? 00:00:00 [mysqld]
mysql 478 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 479 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 480 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 481 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 482 477 0 Apr18 ? 00:00:03 [mysqld]
mysql 483 477 0 Apr18 ? 00:00:03 [mysqld]
mysql 484 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 485 477 0 Apr18 ? 00:00:00 [mysqld]
mysql 486 477 0 Apr18 ? 00:00:00 [mysqld]
nagios 536 1 0 Apr18 ? 00:00:00 [nrpe]
root 540 1 0 Apr18 ? 00:00:11 /usr/sbin/nmbd -D
root 542 1 0 Apr18 ? 00:00:00 /usr/sbin/smbd -D
root 548 542 0 Apr18 ? 00:00:00 /usr/sbin/smbd -D
root 549 1 0 Apr18 ? 00:00:16 /usr/sbin/sshd
root 561 1 0 Apr18 ? 00:00:00 /usr/sbin/winbindd
root 592 561 0 Apr18 ? 00:00:00 /usr/sbin/winbindd
root 641 1 0 Apr18 ? 00:00:00 [sendmail-mta]
root 659 1 0 Apr18 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.pid
root 675 1 0 Apr18 ? 00:00:00 /usr/sbin/squid -D -sYC
proxy 678 675 0 Apr18 ? 00:05:38 [squid]
daemon 694 1 0 Apr18 ? 00:00:00 [atd]
root 700 1 0 Apr18 ? 00:00:00 /usr/sbin/cron
proxy 708 678 0 Apr18 ? 00:00:01 (unlinkd)
root 713 1 0 Apr18 ? 00:00:00 /usr/sbin/apache
root 719 1 0 Apr18 tty1 00:00:00 /sbin/getty 38400 tty1
root 720 1 0 Apr18 tty2 00:00:00 /sbin/getty 38400 tty2
root 721 1 0 Apr18 tty3 00:00:00 /sbin/getty 38400 tty3
root 722 1 0 Apr18 tty4 00:00:00 /sbin/getty 38400 tty4
root 723 1 0 Apr18 tty5 00:00:00 /sbin/getty 38400 tty5
root 724 1 0 Apr18 tty6 00:00:00 /sbin/getty 38400 tty6
nagios 4050 1 0 Apr19 ? 00:00:20 [nagios]
root 1476 1 0 Apr20 ? 00:00:00 /usr/bin/perl /usr/share/webmin/miniserv.pl /etc/webmin/miniserv.conf
www-data 6589 713 0 Apr20 ? 00:00:00 [apache]
www-data 6593 713 0 Apr20 ? 00:00:00 [apache]
www-data 6594 713 0 Apr20 ? 00:00:00 [apache]
www-data 6595 713 0 Apr20 ? 00:00:00 [apache]
proxy 3532 678 0 06:26 ? 00:00:01 (dnsserver)
proxy 3533 678 0 06:26 ? 00:00:00 (dnsserver)
proxy 3534 678 0 06:26 ? 00:00:00 (dnsserver)
proxy 3535 678 0 06:26 ? 00:00:00 (dnsserver)
proxy 3536 678 0 06:26 ? 00:00:00 (dnsserver)
proxy 3537 678 0 06:26 ? 00:00:05 (squidGuard) -c /etc/chastity/squidGuard-chastity.conf
proxy 3542 678 0 06:26 ? 00:00:00 (squidGuard) -c /etc/chastity/squidGuard-chastity.conf
proxy 3543 678 0 06:26 ? 00:00:00 (squidGuard) -c /etc/chastity/squidGuard-chastity.conf
proxy 3546 678 0 06:26 ? 00:00:00 (squidGuard) -c /etc/chastity/squidGuard-chastity.conf
proxy 3547 678 0 06:26 ? 00:00:00 (squidGuard) -c /etc/chastity/squidGuard-chastity.conf
www-data 12670 713 0 08:24 ? 00:00:00 [apache]
www-data 12824 713 0 08:25 ? 00:00:00 [apache]
www-data 23474 713 0 10:41 ? 00:00:00 [apache]
root 25497 549 0 11:06 ? 00:00:00 sshd: root@pts/0
root 25568 25497 0 11:07 ? 00:00:00 sshd: root@pts/0
root 25577 25497 0 11:08 pts/0 00:00:00 -bash
nagios 29710 1 0 11:59 ? 00:00:00 [nagios]
nagios 29714 29710 0 11:59 ? 00:00:00 /usr/lib/nagios/plugins/check_ping -H 172.18.102.130 -w 9999,99% -c 10000,100
nagios 29716 29714 0 11:59 ? 00:00:00 [ping]
nagios 29724 1 0 11:59 ? 00:00:00 [nagios]
nagios 29739 1 0 11:59 ? 00:00:00 [nagios]
nagios 29741 1 0 11:59 ? 00:00:00 [nagios]
nagios 29743 1 0 11:59 ? 00:00:00 [nagios]
nagios 29745 1 0 11:59 ? 00:00:00 [nagios]
nagios 29746 29741 0 11:59 ? 00:00:00 sh -c /usr/lib/nagios/plugins/check_tcp -H 172.18.68.9 -p 139
nagios 29747 29739 0 11:59 ? 00:00:00 /usr/lib/nagios/plugins/check_tcp -H 202.73.43.26 -p 22
nagios 29748 29745 0 11:59 ? 00:00:00 /usr/lib/nagios/plugins/check_ping -H 172.18.100.110 -w 9999,99% -c 10000,100
root 29751 25577 0 11:59 pts/0 00:00:00 ps -ef


#10

From uw713doc.sco.com/en/man/html.1/ps.1.html:

Under the -f option, ps tries to determine the command name and arguments given when the process was created by examining the user block. Failing this, the command name is printed, as it would have appeared without the -f option, in square brackets.
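
The fallback quoted above can be sketched in shell for a Linux box. This is an assumption-laden illustration, not how any particular ps is implemented: it assumes the full command line is in /proc/PID/cmdline (NUL-separated) and the short name is on the Name: line of /proc/PID/status. ps_style_cmd and its second argument (an alternative proc root, handy for trying it against a copy) are made up for this example.

```shell
# Hypothetical helper: print a process's command roughly the way ps -f
# does, i.e. the full command line if there is one, otherwise the short
# name in square brackets.
# $1 = PID, $2 = proc root (assumed to default to /proc)
ps_style_cmd() {
    root=${2:-/proc}
    # cmdline holds the arguments separated by NUL bytes
    cmdline=$(tr '\0' ' ' < "$root/$1/cmdline" | sed 's/ *$//')
    if [ -n "$cmdline" ]; then
        printf '%s\n' "$cmdline"
    else
        # Fall back to the short name from the status file
        name=$(sed -n 's/^Name:[[:space:]]*//p' "$root/$1/status")
        printf '[%s]\n' "$name"
    fi
}

# Example:
#   ps_style_cmd 1
```

If the command line cannot be read for a process, the second branch is what produces output like [nagios].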


#11

I upgraded the Debian kernel from 2.4.18 to 2.4.27, and the nagios process then displayed properly under ps. I put back the original init.d script and now it works sweet!

thanks guys