Host Down Delays

dgoyette · January 19, 2006, 2:33pm

I am familiar with the following text from the Nagios documentation:

“Also of note - when Nagios is check the status of a host, it holds off on doing anything else (executing new service checks, processing other service check results, etc). This can slow things down a bit and cause pending service checks to be delayed for a while, but it is necessary to determine the status of the host before Nagios can take any further action on the service(s) that are having problems.”

However, I interpreted that to mean that Nagios would need to delay future service checks on a particular host while it determined if the host was up or down. In reality, I’ve found that it delays all checks on all hosts. I’m wondering if this is fully intentional, or if there is a workaround.

As an example, we have a moderately sized network with approximately 300 hosts being monitored by Nagios. Now, if some accident happens such that 100 of these hosts are suddenly unreachable by Nagios, I have observed that nagios will check the service on one of the hosts, find it is down, and check the host, and determine it is down, before moving on to any other services. This means that if it takes, for example, 20 seconds to confirm that the host is down, it might be a half hour before nagios notifies that each of those 100 hosts is down. This seems excessive.
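The "half hour" figure follows from simple arithmetic. A minimal sketch, assuming host checks run strictly one at a time and each takes the same 20 seconds (the function name is illustrative only):

```python
# Back-of-the-envelope estimate of the delay described above.
# Assumption: host checks are serialized and each takes the same time.
def serial_notification_delay(hosts_down: int, seconds_per_host_check: float) -> float:
    """Seconds until the last down host has been confirmed, checking serially."""
    return hosts_down * seconds_per_host_check


delay = serial_notification_delay(100, 20)
print(f"{delay:.0f} s = {delay / 60:.1f} min")  # 2000 s = 33.3 min
```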

Is there any way for Nagios to be a bit more clever about this? For it to understand that if one host is down it doesn’t need to pause every other check on every other host while it checks the status of the first host? Not only does this delay important outage information, it also delays our performance data checking services, such that the performance data we collect and log gets thrown off, basically unavailable during an outage, even for those services that are not affected by the outage at all.

I know about dependencies as a way to prevent certain cases of this happening, but there are always times when things go wrong in an unpredictable way. It seems that Nagios behaves most poorly in cases when quick responses are needed most.

luca · January 19, 2006, 3:35pm

if a hundred hosts go down it’s probably a router or switch going down… so it should be an unreachable state and not a host down… so you should disable unreachable notifications and use parents to find the host which is effectively down…

Not sure if I understood your problem correctly, but with 1/3 of the hosts going down for different reasons I think it doesn’t change much if I get paged 100 times in a minute or once every 20 seconds… possibly the second one is even better… it could even wake me up…

Luca

system · January 19, 2006, 3:37pm

Well, IMHO, if you’ve got 1/3 of your hosts going down, you’ve got a bigger problem to worry about than Nagios delaying checks on other hosts.

What you might do to trim down latency is to cut out some unnecessary retry checks. There’s a section in the docs about fine-tuning Nagios for better performance you could check out. I don’t know if there’s a way to work around how Nagios checks the services like that.

jakkedup · January 19, 2006, 5:35pm

Nagios can’t be smart about making checks, by the sheer fact of its existence. We are monitoring hosts and services, and nagios can’t magically ignore a service check. Quote: “I have observed that nagios will check the service on one of the hosts, find it is down, and check the host, and determine it is down, before moving on to any other services.” And that is EXACTLY what you want it to do. That’s the only reason we use nagios in the first place.

Now, if nagios is showing that you have 100 hosts down, and that in fact, they are not down, then that is not nagios’ fault, but your own.

You need to take a hard look at why you have 100 hosts down when, in fact, all that is down is a switch port, router, etc. Now, make nagios smarter by making your hosts.cfg file smarter. It’s imperative in your case to use the parent/child relationship and set up nagios in the fashion that I’ve described dozens of times in the forum already.

In other words, if you look at your status map cgi page, and the layout that you see is NOT identical to how your network is CABLED together, then something is wrong.

Does your status map show the switch port that the nagios box is plugged into, does it show that switch connected to another port on a different switch, does it show that switch with a port connected to a router… and so on, till finally it shows some port on a switch connected to the eth0 card of the host that you REALLY wanted to monitor in the first place?

I understand that a lot of people get nagios and ONLY want to monitor hosts. So they end up with a status map that looks like hell, with ONE nagios host in the center and 100 hosts all connected to it. Well, you reap what you sow. If that is what you have for a status map, then you can expect to have BIG troubles when your router goes down, or a switch port.

jakkedup · January 19, 2006, 5:37pm

Just a FYI, yes, my status map shows EVERY connection from nagios box, all the way to the host in question. Every port used on a switch, every router used, every ethernet card. The reason? If someone unplugs a cable or cuts a cable, I know it, and I know exactly where to look and how to replace it. Actually, you could remove every piece of fiber/cat 5 from the entire network and I could plug it all back in, just from looking at the status map.

dgoyette · January 20, 2006, 11:23am

Thank you for your response. That gives me some hope. Just a question on the implementation of your suggestion, jakkedup:

I understand that parent/child relationships will reduce notifications, but will it actually reduce the time it takes to handle each host? Meaning, if hosts A and B are considered children of host X, and X goes down, Nagios will need to check if host A is up, then check if X is up, and then page about X but not A. Then, when it goes to check if B is up, does it remember that X is still down, or does it need to do another check of X every time one of its children is unreachable?

I ask, because it seems that parent/child relationships would significantly increase the number of host checks during an outage period. Instead of checking A, and calling it down, then B, and calling it down, Nagios would check A, then X, and A unreachable and X down, then check B, then X, and call B unreachable and X down. It seems like twice the number of checks, and twice the time, to respond, and that’s just with a single layer. I would prefer not to have to fire off more host checks if I can avoid it, unless it’s simply something we have to deal with.
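The worry about doubled checks can be made concrete with a toy count. This is only a model of the scenario described above, not of Nagios's actual scheduler; the function and the `cache_parent_state` switch are hypothetical, and whether a given Nagios version reuses a recent parent result is not established in this thread.

```python
def count_host_checks(children, parent_of, cache_parent_state):
    """Count host checks when every child in `children` fails its check.

    parent_of maps a child to its single parent (a missing key means no parent).
    cache_parent_state=True models remembering that a parent is already down.
    """
    checks = 0
    known_down = set()
    for child in children:
        checks += 1  # check the child itself
        parent = parent_of.get(child)
        if parent is None:
            continue
        if cache_parent_state and parent in known_down:
            continue  # parent result is remembered, no extra check
        checks += 1  # check the parent
        known_down.add(parent)
    return checks


children = ["A", "B"]
print(count_host_checks(children, {}, False))                      # 2: no parents defined
print(count_host_checks(children, {"A": "X", "B": "X"}, False))    # 4: X rechecked per child
print(count_host_checks(children, {"A": "X", "B": "X"}, True))     # 3: X checked once
```

Under this model the extra cost of parents depends entirely on whether the parent's state is rechecked for each child, which is the question being asked.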

A previous poster offered that if we have an outage we should not be too worried about the loss of execution of other service checks. In response to that, it is not atypical for several hosts to go down at once, due to a network issue between nagios and a sub network, causing several hosts to go down (or be unreachable). We use Nagios for two things: Monitoring actual service checks, and using the service check performance data to generate performance graphs. It is just unfortunate that even a small scale outage of a few hosts can invalidate several minutes worth of performance data for every service check we have running, simply because nagios was too busy checking hosts to run the service checks.

Anyway, I’ll test the parent relationships in the mean time to see if it saves time in effect. Thanks for the replies.

jakkedup · January 20, 2006, 5:55pm

Without parent/child you are making it very hard for nagios due to notifications and having to perform every service check over and over. Just the opposite of what you are thinking.
nagios.sourceforge.net/docs/1_0/ … ility.html
nagios.sourceforge.net/docs/1_0/ … tages.html

I’m pretty sure, that when it’s time for a service to be checked again, if it sees that it is not on the host that is causing the network outage, it will walk the tree to see if the blocking host is up yet or not. I could be wrong. But since nagios utilizes a “priority” service check system, the priority given to a “network blocking” service is higher than a normal service check.

Bottom line is, if you don’t want 100 emails for hosts being down when, in fact, the only thing that is down is one router, then make the router the parent for all 100 hosts. If later you find you need to add switches, or ports on a switch, etc, then add them also.

My first experience with network trouble was a cable plugged into a port on a switch. Reseating the cable fixed it. So after that, I set up nagios showing exactly how port 1 on switch A connects to port 25 on switch B, etc. by using host parents. Yes, I defined a port on a switch as a “host” and its service check is check_snmp for the interface status.
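As a rough illustration of the "make the router the parent" advice, the sketch below generates host definition fragments that all name the same parent. The host names, the `edge-router` parent and the `generic-host` template are made up for the example, and the fragments are incomplete: a real definition also needs an address, a check command and the other required directives. The `notification_options d,r` line leaves out `u`, which is one way to follow the earlier suggestion of disabling unreachable notifications.

```python
def host_definition(host_name, parent, template="generic-host"):
    """Return a partial Nagios host definition naming `parent` as the parent.

    All names passed in are hypothetical; required directives such as
    address and check_command are assumed to be supplied elsewhere.
    """
    return (
        "define host{\n"
        f"    use                   {template}\n"
        f"    host_name             {host_name}\n"
        f"    parents               {parent}\n"
        "    notification_options  d,r\n"
        "    }\n"
    )


# 100 hypothetical hosts that all sit behind one hypothetical router.
definitions = [host_definition(f"host{n:03d}", "edge-router") for n in range(1, 101)]
print(definitions[0])
```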

jakkedup · January 20, 2006, 6:03pm

Yes it is unfortunate, that is why you need to fix the network asap. You can’t do that if all you have is nagios checking:

itself with the check_nagios check.
Checking 100 hosts that have 3 services each.

There is no way nagios is going to help you when you have a cable unplugged. You have to make a decision. Are you only concerned with perf data and services on hosts or do you have a need to make sure there is a network available for all of these services to be provided with. If you are like me, I want both. I want to fix the network asap so that we can again have nagios collecting stats for services. You can’t do that, if you don’t have nagios checking each and every connection/port/router/switch between the nagios server and the host(s) that are important to us.

Here is a pic of 2 switch STACKS connected together via 2 fiber connections in my status map. Spanning tree is blocking one path to prevent a network loop and provides for redundancy. The pic shows:

Switch names Sparc-As400 and Sparc-OHR
Unit 1 port 25 connects to Unit 1 port 25 in one path.
Unit 2 port 25 connects to Unit 2 port 25 in the other path.

By having the hosts.cfg file set up with parents, if one host goes down, I will know within 5 minutes exactly where the network block has occurred. Network will be operational in 10 minutes or so, depending on problem.
http://www.cybaclan.com/nuke/modules/coppermine/albums/userpics/10002/status.cgi11.jpg

dgoyette · January 23, 2006, 1:12pm

Thank you again. I see your point.

I’m about to start going through our hosts and adding parent relationships. One question you might be able to answer on this, since it seems like you have your network set up this way, judging from the image you attached, is this: (I’ll try to give a simple example).

Nagios allows me to specify multiple parents. I’m wondering what the direct effect of this is. I did not see it on the “Determining Status and Reachability of Network Hosts” section of the docs.

So, for example, if I set the parents of host A to B and C, (Host A has two parents, namely, B and C), and only one of the parents goes down (in this case, host B ), does nagios know to consider host A still reachable via host C (which is still up) ?

I’m just wondering this because we have a lot of redundancy on our network, and I wanted to be sure nagios would behave intuitively with this setup.
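The intuitive behaviour being asked about can be written down as a small rule. This is a sketch of one plausible rule, assumed here rather than confirmed anywhere in this thread: a host that fails its check is DOWN if at least one parent is UP, and UNREACHABLE only if every parent is DOWN or UNREACHABLE. The function name and the `responds` table are hypothetical, and the parent graph is assumed to be free of cycles.

```python
UP, DOWN, UNREACHABLE = "UP", "DOWN", "UNREACHABLE"


def classify(host, responds, parents):
    """Classify a host under the assumed rule described above.

    responds maps host -> bool (did the host answer its check?).
    parents maps host -> list of parent hosts (missing key = no parents).
    """
    if responds[host]:
        return UP
    host_parents = parents.get(host, [])
    if not host_parents:
        return DOWN  # nothing upstream to blame
    for parent in host_parents:
        if classify(parent, responds, parents) == UP:
            return DOWN  # a working path exists, so the host itself is at fault
    return UNREACHABLE


parents = {"A": ["B", "C"]}
print(classify("A", {"A": False, "B": False, "C": True}, parents))   # DOWN
print(classify("A", {"A": False, "B": False, "C": False}, parents))  # UNREACHABLE
```

Whether Nagios actually applies this rule to hosts with several parents is worth confirming on a test setup.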

Thanks again,

-Dan

dgoyette · January 23, 2006, 2:28pm

I figured I should be more specific.

Consider this network layout. It is overly simplified to make things easier, but it’s not so far off from reality. Consider Routers A, B, C and D, all interconnected redundantly. Nagios reaches the rest of the devices through router A, and Host Q is considered to be behind Router D.

http://home.gwi.net/~chudd/redun.png

Now, since this is a network, we imagine that there are several potential paths between Nagios and Host Q. The most direct path being:

[blockquote]Nagios → A → D → Host Q.[/blockquote]

In that case, D’s parent would be A.

However, if the link from A to D drops, then we would use one of the following two routes to get from Nagios to Host Q:

[blockquote]Nagios → A → B → C → D → Host Q[/blockquote]

or

[blockquote]Nagios → A → C → B → D → Host Q[/blockquote]

In the first case, B would be the parent of C. But in the second case, C would be the parent of B. Both are valid routes.

The question, then, is: what happens if B and C suddenly blow up? Nagios might start by checking B, determining that it is down, then checking its parent C, to determine if it is down as well. But when it checks C, will it notice that B is its parent, and not bother to page down, since it considers itself unreachable?

Overall, this is a question of how I should best set up child/parent relationships for redundant networks. For a complex network, with multiple paths from one point to another, it almost seems that I would need to specify each path that traffic might take to get from nagios to every device, assuming any link might break.

So, then, I wonder if nagios will handle the parent definitions gracefully if B is the parent of C, while C is the parent of B. Will it send out any notifications? Perhaps it keeps a stack of some sort going so it does not get trapped in infinite recursion? Something I can test, I imagine. Just wondering how other people set up their parent relationships, and whether they ever have a case where the parent of one device is also listed as that device’s child.

dgoyette · January 23, 2006, 2:44pm

Well, I confirmed by testing that nagios will fail to pass the configuration test if it detects a circular parent/child relation. So, that rules that question out, somewhat. But it doesn’t explain how nagios handles redundant networks.
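A circular parent/child relation of the kind described can be spotted with an ordinary depth-first search over the parent links. The sketch below is a generic cycle check, not Nagios's own implementation; the function name and the dict layout are assumptions for illustration.

```python
def find_parent_cycle(parents):
    """Return a list of hosts forming a parent cycle, or None if there is none.

    parents maps host -> list of parent hosts.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}

    def visit(host, path):
        state[host] = GREY
        path.append(host)
        for parent in parents.get(host, []):
            seen = state.get(parent, WHITE)
            if seen == GREY:
                # parent is already on the current path: cycle found
                return path[path.index(parent):] + [parent]
            if seen == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[host] = BLACK
        return None

    for host in list(parents):
        if state.get(host, WHITE) == WHITE:
            cycle = visit(host, [])
            if cycle:
                return cycle
    return None


print(find_parent_cycle({"B": ["C"], "C": ["B"]}))  # ['B', 'C', 'B']
print(find_parent_cycle({"D": ["A"], "Q": ["D"]}))  # None
```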

luca · January 23, 2006, 2:47pm

wouldn’t it be best to create a test situation?

even if the interconnection doesn’t exist, if you have a host (possibly even a fake one) which you can take off the net you could take another host and put two parents (one is the host you will be taking off the net, the other one will remain up).

Even if the hosts are not really parents (nagios doesn’t know this) you will be able to check what you are trying to find out…

Keep us informed

Luca

jakkedup · January 23, 2006, 3:25pm

If you looked at my pic, you see that I have a circular network and it is redundant. 2 fiber cables from switchA to switch B. I lied to nagios, in ONE of the host definitions is all, so that nagios wouldn’t cry about a circular network.
I did NOT lie in the primary path, but only lied about one of the hosts in the secondary/backup path. We use spanning tree to block one path. Others might use MLTs (multilink trunks). If using an MLT, then make that MLT into one host. In other words, the host will signify all paths from A to B switch, so in that case, nagios won’t care one bit.

How you achieve redundancy, I don’t know. But I do know that a little lying to nagios isn’t going to hurt. My main goal was to get a status map that shows all of the interconnections of switches/routers/servers. The other benefit was that I get good reporting on which host is down and which are unreachable, due to parent definitions. OK, so I lied to nagios in a few places, no big deal.

jakkedup · January 23, 2006, 4:02pm

[quote=“jakkedup”] http://www.cybaclan.com/nuke/modules/coppermine/albums/userpics/10002/status.cgi11.jpg
[/quote]

I’ll attempt to describe how I got around the circular network problem in nagios.

My picture above is actually upside down, if you think of the parent as being on top. So ignore that. I used nagiosmap to move stuff around by hand.

The top dog in the picture is:
Sparc-OHR with parents=some_hosts_NOT_shown_in_picture
Sparc-OHR-1p25 with parents=Sparc-OHR
Sparc-AS400-1p25 with parents=Sparc-OHR-1p25
Sparc-AS400 with parents=Sparc-AS400-1p25,Sparc-AS400-2p25
The above completes the one path of the network.
The other path is:
Sparc-OHR-2p25 with parent Sparc-OHR
Sparc-AS400-2p25 with parent Sparc-OHR-2p25

Now that I look at it, I don’t even see how I thought I was telling nagios a fib. It looks correct to me.
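The layout listed above can be written as a parent map to see why it does not trip the circular-parent check: every parent link points toward Sparc-OHR, so there are two paths but no loop. A minimal sketch, using the host names from the listing and a hypothetical helper that assumes the map has no cycles:

```python
# Parent map copied from the listing above.
parents = {
    "Sparc-OHR": [],  # its real parents are not shown in the picture
    "Sparc-OHR-1p25": ["Sparc-OHR"],
    "Sparc-AS400-1p25": ["Sparc-OHR-1p25"],
    "Sparc-OHR-2p25": ["Sparc-OHR"],
    "Sparc-AS400-2p25": ["Sparc-OHR-2p25"],
    "Sparc-AS400": ["Sparc-AS400-1p25", "Sparc-AS400-2p25"],
}


def paths_to_root(host, parents):
    """List every chain of parents from `host` up to a host with no parents."""
    host_parents = parents.get(host, [])
    if not host_parents:
        return [[host]]
    return [
        [host] + rest
        for parent in host_parents
        for rest in paths_to_root(parent, parents)
    ]


for path in paths_to_root("Sparc-AS400", parents):
    print(" -> ".join(path))
```

This prints one chain through the 1p25 ports and one through the 2p25 ports, both ending at Sparc-OHR.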

system · January 23, 2006, 5:01pm

[EDIT: stupid me, clicking wrong buttons!]

system · January 23, 2006, 5:01pm

Well, in your case, Host Q is only accessible via Router D…so just make Router D the sole parent of Host Q.

Not sure how to handle the mesh network, though. I assume there’s some way to configure Nagios to handle it, but I’m not sure…

system · January 23, 2006, 5:04pm

GAH!! I’m confusing the EDIT button with the QUOTE button!!! Sorry, haven’t had anything to eat in about 20 hours and my brain is fried. Can we get some of my stupid posts deleted? I don’t mean to be spamming the board like this.