Redundant failover configuration

I have two sites, currently in a distributed configuration. Everything is working swimmingly: updates from the remote come in correctly, we get notifications in a timely manner regardless of site. If the secondary goes down, the primary actively checks the remote hosts and services.

However, I would like to put them in a combined distributed and failover configuration. That is, if the primary server goes down (or becomes generally non-contactable), the secondary becomes a fully active server (including attempting to monitor the primary site). The problem I’m having is that, to maintain continuity, I would have to have passive checks going from the primary to the secondary. I’m worried that doing so could result in a “passive check loop”: check_whatever triggers on either server, the updated information goes to the other server, which triggers OCSP and sends the information back, and so on.
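For reference, the setup I have in mind would leave obsession turned on in nagios.cfg on both servers, roughly like this (the command name is just a placeholder for whatever wrapper submits the passive result):

```
# nagios.cfg on both servers (illustrative)
obsess_over_services=1
ocsp_command=submit_check_result
```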

Is this actually a concern? If so, can someone think of a reliable (and hopefully simple :slight_smile: ) way to stop such a loop from forming?

I never thought about complete failover… but if the main server crashes, it probably wouldn’t be sending passive checks to the secondary anyway…
You could just put a simple “is nagios running” check in some form on the secondary and have it warn you if the primary isn’t working anymore…
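Something like this service definition on the secondary would do, assuming you have NRPE and check_nagios available on the primary (host and command names are just examples):

```
# on the secondary — assumes check_nrpe / check_nagios are set up on the primary
define service {
    use                     generic-service
    host_name               nagios-primary
    service_description     Nagios Process on Primary
    check_command           check_nrpe!check_nagios
}
```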

Sorry, my explanation was a bit…circuitous. :slight_smile:

What I’m worried about is actually during normal operations. I would much prefer to have full logs on each server; the normal way to do this is via passive checks from the ‘active’ server to the ‘passive’ server. So in my case the remote server acts as a normal distributed node (set up per the docs), but the primary would also be sending passive checks to the remote (purely for data retention).

For example, we’re running check_whatever as an active check on the primary. Every time check_whatever runs, the primary server does the normal thing, then performs ocsp_command and sends the result as a passive check to the remote server. Ideally the remote server simply logs the check, and the primary does whatever would normally be done.
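Concretely, I’m picturing something along the lines of the submit_check_result setup from the distributed-monitoring docs; the paths and the target host name below are just examples:

```
# command definition on the sending server
define command {
    command_name    submit_check_result
    command_line    /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
}
```

```
#!/bin/sh
# submit_check_result — rough sketch of the wrapper, based on the sample in the docs
# $1 = host name, $2 = service description, $3 = numeric return code, $4 = plugin output
printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" \
    | /usr/local/nagios/bin/send_nsca -H other-server -c /usr/local/nagios/etc/send_nsca.cfg
```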

However, my concern is that since both servers would be forwarding check results to each other, a single check would in effect get bounced between the two. To continue my example above, check_whatever is logged by the remote server, which then performs its own normal ocsp_command (being a distributed node) and forwards the check result back to the primary…which originated the check in the first place. The primary would receive a new check result, log and display it, forward it to the remote server, and so on ad infinitum.

What about creating two instances of Nagios on the same server (using different ports and whatever), one acting as primary and the other as secondary? Or try a different way of replicating the logs… (I used syslog-ng logging to a MySQL DB and used a DB replica to have copies of the logs.)
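The syslog-ng side was basically a pipe destination with an INSERT template, read into MySQL by a small loop — roughly this, though the source/filter names and table layout below are just examples and the exact option syntax depends on your syslog-ng version:

```
# syslog-ng.conf (illustrative)
destination d_mysql {
    pipe("/var/log/mysql.pipe"
        template("INSERT INTO logs (host, datetime, priority, message) VALUES ('$HOST', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PRIORITY', '$MSG');\n")
        template_escape(yes)
    );
};
log { source(s_local); filter(f_nagios); destination(d_mysql); };
```

Then something like `mysql -u logger syslog < /var/log/mysql.pipe` running in a loop feeds the pipe into the database.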

The sites are physically separated, so they are on separate physical servers.

I’ve considered just syncing the primary logs back to the secondary (via rsync or a logging system); I was just curious whether anyone had tried the same thing and come up with a purely internal solution.
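For the rsync route, I was thinking of something as simple as a cron entry on the secondary (paths are just the Nagios defaults):

```
# crontab on the secondary — pull the primary's archived logs every 10 minutes (illustrative)
*/10 * * * * rsync -az primary:/usr/local/nagios/var/archives/ /usr/local/nagios/var/primary-archives/
```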

Thanks for the suggestions, luca. :slight_smile:

MySQL DB replication works quite well. If you only want the Nagios logs, it shouldn’t be too much data. :slight_smile:
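The my.cnf side of it is fairly small — roughly this, assuming the logs go into a dedicated database (the name is just an example), plus the usual CHANGE MASTER TO / START SLAVE on the replica:

```
# master my.cnf (illustrative)
[mysqld]
server-id     = 1
log-bin       = mysql-bin
binlog-do-db  = nagios_logs

# slave my.cnf (illustrative)
[mysqld]
server-id        = 2
replicate-do-db  = nagios_logs
```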

Let us know what you come up with in the end; I think it may be useful for other purposes too :slight_smile: