FIFO size limit on nagios.cmd


#1

Hi All,
We are building a distributed Nagios environment. We have monitoring nodes in our two main datacenters. The monitoring nodes forward check results to the central server via send_nsca. We are running RHEL4 and have not changed the default named pipe size limit of 4096 bytes. Our Nagios command_check_interval is set to -1.

I have a limited understanding of how named pipes work, so I’m going theoretical here…

Is there any point where I need to worry about the named pipe getting filled faster than it is getting read? If so, how will I know? Will there be an NSCA error message in /var/log/messages? Alternatively, is there any way to check how full the named pipe is at any given moment?

What have you experienced?

Thanks,
Brian


#2

Hey Brian,

With your Nagios command check interval at -1, your system will tank long before you’re going to run into problems with pipe size. I’ve got 2400+ services coming in every 5-10 mins. A named pipe is just a handle for interprocess communication.
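If you want to see the pipe limit for yourself, here is a rough Python sketch (not anything Nagios ships). It creates a throwaway FIFO, never reads from it, and keeps doing non-blocking writes until the kernel refuses more. The file name demo.cmd and the function name are made up for illustration. A writer that uses ordinary blocking writes would simply stall at this point until the reader drains the pipe, rather than get an error.

```python
import os
import tempfile


def measure_fifo_capacity(chunk_size=1024):
    """Fill a throwaway FIFO with non-blocking writes; return the bytes accepted."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.cmd")  # hypothetical stand-in for nagios.cmd
    os.mkfifo(path)
    # Open the read end first: a non-blocking write-only open fails (ENXIO)
    # when no process has the FIFO open for reading.
    rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    written = 0
    try:
        while True:
            try:
                written += os.write(wfd, b"x" * chunk_size)
            except BlockingIOError:
                # The kernel buffer is full and nothing is draining it.
                break
    finally:
        os.close(wfd)
        os.close(rfd)
        os.remove(path)
        os.rmdir(tmpdir)
    return written


if __name__ == "__main__":
    print("FIFO accepted", measure_fifo_capacity(), "bytes before filling up")
```

The number printed depends on the kernel, so treat it as a measurement of your own box rather than a fixed constant.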

Have you had a look at using Cacti to graph your nagios data? Very useful, lots of tutorials on how to integrate the two.

In the future, you may want to set up redundant central servers so that you aren’t shit out of luck when a central server goes down. If you ever do this, be sure to grab OCP_Daemon. It’s a nice little Perl script I use to handle sending results back to my central servers, so you can run a redundant central server and not have to worry about one or the other going down and losing data. Here’s the wiki page for it:
nagioscommunity.org/wiki/ind … OCP_Daemon
It also caches results. I’ve gotten bombed by a syslog-ng alert that went batshit crazy and spammed 80 megs worth of alerts into my nagios.cmd for a good 6 mins, and I still didn’t lose any graph data.

The reason I tell you to grab that is that Nagios’s OCSP command doesn’t play nicely when it times out. You can set the submit_check_result script on your distributed nodes to send to multiple central servers, but when one goes down, Nagios chokes and waits for the OCSP command to finish before processing and sending more service check results :frowning:
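To make the problem concrete, here is a minimal Python sketch of a submit script that forwards one result to several central servers and gives up on each after a fixed timeout, so a dead server cannot hang the submit indefinitely. This is not OCP_Daemon and it does no caching. The function names, the 5 second timeout, and the config path /etc/nagios/send_nsca.cfg are assumptions for illustration; only the send_nsca -H and -c options and its tab-delimited input format are taken as given.

```python
import subprocess


def format_result(host, service, return_code, output):
    """Build one tab-delimited passive service check line for send_nsca's stdin."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def submit_to_all(servers, line, timeout=5, send_nsca="send_nsca",
                  config="/etc/nagios/send_nsca.cfg"):
    """Send one result line to each server; return {server: True/False}.

    The timeout and the config path are assumptions; adjust them for your setup.
    """
    results = {}
    for server in servers:
        cmd = [send_nsca, "-H", server, "-c", config]
        try:
            proc = subprocess.run(
                cmd,
                input=line.encode(),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,  # kill the child if the server does not answer in time
            )
            results[server] = proc.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # Server unreachable or send_nsca missing: record the failure and move on.
            results[server] = False
    return results
```

Called as submit_to_all(["central1", "central2"], format_result("web01", "HTTP", 0, "OK")) with hypothetical host names, the worst case is one timeout per dead server. Results that fail are simply dropped here, which is exactly the gap a caching daemon fills.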

Little off topic but should save you a headache in the future


#3

For anyone else who may ask the same question: I did find that nagiostats, at least on Nagios 2.9, provides statistics on named pipe usage.
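Outside of Nagios itself, one way to peek at how many bytes are sitting unread in a pipe on Linux is the FIONREAD ioctl. The sketch below is only a demonstration on a throwaway FIFO (demo.cmd and the function names are made up), not something I have run against a live nagios.cmd. If you adapt it, open the real command file read-only and non-blocking and never actually read from it, or you will eat commands meant for Nagios.

```python
import fcntl
import os
import struct
import tempfile
import termios


def fifo_bytes_pending(fd):
    """Return the number of unread bytes queued in the pipe behind fd."""
    raw = fcntl.ioctl(fd, termios.FIONREAD, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def demo(payload=b"[0] DEMO_COMMAND\n"):
    """Write one fake command line into a throwaway FIFO; return (before, after) counts."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.cmd")  # hypothetical stand-in for nagios.cmd
    os.mkfifo(path)
    # Reader first, so the non-blocking writer open does not fail with ENXIO.
    rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        before = fifo_bytes_pending(rfd)
        os.write(wfd, payload)
        after = fifo_bytes_pending(rfd)
    finally:
        os.close(wfd)
        os.close(rfd)
        os.remove(path)
        os.rmdir(tmpdir)
    return before, after


if __name__ == "__main__":
    print("bytes pending before/after write:", demo())
```

Polling this every few seconds and comparing it with the pipe size would give a crude "how full is it" reading.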

Hi MP,
Thanks for the info on OCP_Daemon and the submit_check_result issue. That’s exactly how we have ours set up (forking the results to active and passive front-ends). :cry:

We are currently using nagiosgraph to graph our performance data. I have not had the bandwidth to evaluate Cacti yet but it looks pretty interesting.

Thanks,
Brian