Hey guys, this is a rather advanced question regarding distributed monitoring with "central" server redundancy.
Currently I've got two "central" Nagios servers which receive passive check results from 7 distributed Nagios servers via NSCA.
My OCSP command on the distributed servers just runs send_nsca twice:
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/bin/send_nsca -H 18.104.22.168 -to 5 -c /etc/nagios/send_nsca.cfg
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/bin/send_nsca -H 22.214.171.124 -to 5 -c /etc/nagios/send_nsca.cfg
The "-to 5" option tells send_nsca to time out after 5 seconds, and my OCSP command timeout in Nagios is set to 11 seconds. As you can see, it executes both send_nsca commands, sending results back to both central servers. This works great until one of the central servers goes down. When that happens, the server that is still up doesn't receive results either! Results cut out 100%; none come in at all. Changing the order in which the commands run doesn't help either.
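One variation I've been considering, in case the serial sends are the problem (the first send_nsca blocking for its full 5-second timeout on every single check result while a server is down): background both sends so they run in parallel. This is just a sketch, assuming the standard submit_check_result arguments (host name, service description, numeric return code, plugin output); the NSCA_CMD variable is only there so the transport can be stubbed out, it's not part of any real script.

```shell
# Sketch only: submit the check result to both central servers in
# parallel, so a dead server costs at most its own 5 s timeout instead
# of also delaying the send to the healthy server.
# NSCA_CMD is a hypothetical hook for stubbing out send_nsca in tests.
NSCA_CMD=${NSCA_CMD:-/usr/bin/send_nsca}

submit_to_both() {
    # $1 = host name, $2 = service description,
    # $3 = numeric return code, $4 = plugin output
    line=$(printf '%s\t%s\t%s\t%s' "$1" "$2" "$3" "$4")
    # Background both sends; each pipes the same tab-separated line.
    printf '%s\n' "$line" | $NSCA_CMD -H 18.104.22.168 -to 5 -c /etc/nagios/send_nsca.cfg &
    printf '%s\n' "$line" | $NSCA_CMD -H 22.214.171.124 -to 5 -c /etc/nagios/send_nsca.cfg &
    wait    # both sends finish (or time out) within ~5 s total
}
```

That way the whole OCSP command should stay well under the 11-second timeout even with one server down.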
This is especially strange because when I duplicate the issue in a test environment (i.e. one "central" Nagios server up and accepting NSCA results, and a fake IP entered for the other one), the "central" Nagios server receives results just fine. That test environment only had 20 or so services, though.
I'm stumped on this one. Is the distributed Nagios process getting overwhelmed because the OCSP command is timing out? Is the generic submit_check_result script faulty? Why does this method work fine in a smaller test environment? Examples of your own redundant distributed setups would be appreciated too. I've thought about making the distributed servers' OCSP script robust enough to realize when the server it is sending to is down and fail over to the other one, but I'd rather have it send to both.