Hey guys, this is a rather advanced question regarding distributed monitoring with "central" server redundancy.
Currently I've got two "central" Nagios servers which receive passive check results from 7 distributed Nagios servers via NSCA.
My OCSP command on the distributed servers just runs send_nsca twice:
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/bin/send_nsca -H 18.104.22.168 -to 5 -c /etc/nagios/send_nsca.cfg
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/bin/send_nsca -H 22.214.171.124 -to 5 -c /etc/nagios/send_nsca.cfg
The "-to 5" option tells send_nsca to time out after 5 seconds, and my OCSP command timeout in Nagios is set to 11 seconds. As you can see, it executes both send_nsca commands, sending results back to both central servers. This works great until one of the central servers goes down. When that happens, the server that is still up doesn't receive results either! Results cut out 100%; none come in at all. Changing the order in which the commands run doesn't help either.
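One variation I've been considering, in case the serial sends are the problem (the first send_nsca blocking for its full 5-second timeout on every single check result while a server is down): background both sends so they run in parallel. This is just a sketch, assuming the standard submit_check_result arguments (host name, service description, numeric return code, plugin output); the NSCA_CMD variable is only there so the transport can be stubbed out, it's not part of any real script.

```shell
# Sketch only: submit the check result to both central servers in
# parallel, so a dead server costs at most its own 5 s timeout instead
# of also delaying the send to the healthy server.
# NSCA_CMD is a hypothetical hook for stubbing out send_nsca in tests.
NSCA_CMD=${NSCA_CMD:-/usr/bin/send_nsca}

submit_to_both() {
    # $1 = host name, $2 = service description,
    # $3 = numeric return code, $4 = plugin output
    line=$(printf '%s\t%s\t%s\t%s' "$1" "$2" "$3" "$4")
    # Background both sends; each pipes the same tab-separated line.
    printf '%s\n' "$line" | $NSCA_CMD -H 18.104.22.168 -to 5 -c /etc/nagios/send_nsca.cfg &
    printf '%s\n' "$line" | $NSCA_CMD -H 22.214.171.124 -to 5 -c /etc/nagios/send_nsca.cfg &
    wait    # both sends finish (or time out) within ~5 s total
}
```

That way the whole OCSP command should stay well under the 11-second timeout even with one server down.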
This is especially strange because when I duplicate the issue in a test environment (i.e. one "central" Nagios server up and accepting NSCA results, and a fake IP entered for the other one), the "central" Nagios server receives results just fine. That test environment only had 20 or so services, though.
I'm stumped on this one. Is the distributed Nagios process getting overwhelmed because the OCSP command is timing out? Is the generic submit_check_result script faulty? Why does this method work fine in a smaller test environment? Examples of your own redundant distributed setups would be appreciated too. I've thought about making the distributed servers' OCSP script robust enough to realize when the server it is sending to is down and fail over to the other one, but I'd rather have it send to both.