Too many alert msgs


#1

Hi

I’ve got a question: is it possible to set Nagios to send just ONE alert message (or a defined number of them) if a device in the network goes down?

For example:
A network consists of 300 switches connected to a router. The router hangs or freezes, whatever. Nagios then sends 301 alert SMS, because it lost the link to the router and to all the switches connected to it, effectively flooding the network administrator’s cell phone.


#2

Use parent dependencies and set Nagios not to notify you for “unreachable” alerts.
Not sure if this works correctly though. There were a couple of threads about this, if I’m not mistaken.
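
A minimal sketch of what that might look like in hosts.cfg, assuming a `generic-host` template already exists to supply the remaining required directives (check command, contacts and so on). The host names and addresses are made up for illustration; `parents` and `notification_options` are the standard host directives.

```
# hosts.cfg (sketch only; names and addresses are hypothetical)
define host{
        use                     generic-host    ; assumed template
        host_name               core-router
        alias                   Core router
        address                 192.0.2.1
        notification_options    d,r             ; down and recovery only
        }

define host{
        use                     generic-host
        host_name               switch-01
        alias                   Access switch 01
        address                 192.0.2.11
        parents                 core-router     ; Nagios reaches this switch through the router
        notification_options    d,r             ; no "u", so no unreachable notifications
        }
```

With something like this in place, a dead core-router should leave switch-01 flagged as unreachable rather than down, so only the router produces a notification.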

Luca


#3

yes luca, it does work.

If you have 300 switches all connected to one router I wanna see this router. Never seen one with that many ports.

But if what you have is 5 networks connected to one router, and each network has ONE port on ONE switch connected to it, then surely you can specify in the hosts.cfg who the parent is.

Once you have your entire network laid out, and the parents all defined, your status map will look exactly the way your network is cabled up.

And if you have 500 switches connected to that router, please post a pic of it.

If you do not care to spend the time on defining what switch connects to what switch and who the parent host is for each of your switches, then there is NOTHING that nagios can do for you to decrease your messages to just ONE. Nagios will not know which one is the actual failure point and neither will you.

So you are telling me that you have all this monitoring going on, but nagios isn’t telling you anything except that your entire network is down?

But in fact, all that is down is the cable in the back of your nagios server??? Well, then you didn’t use the parent feature of nagios very well. The nagios process on your pc connects to the eth0 port of your pc, so show that by defining an eth0 check and give it no parent, so it defaults to the nagios process.

Now eth0 connects to switch1/port10, so make a host called sw1port10 and make an SNMP ifOperStatus check of that port. Make its parent the eth0 host… and so on. When you are done, it will look like my nagios, and show every connection from my pc to the switch to the router, to the 5 networks, to every switch in the building and EXACTLY how they are connected.
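
A rough sketch of that chain, under several assumptions that are not from this thread: `check_local_link` is a hypothetical placeholder command (nothing by that name ships with Nagios), `check_ifoperstatus` is a made-up wrapper around the stock `check_snmp` plugin, the community string is `public`, and port 10 is assumed to have ifIndex 10 (check your own switch). A `generic-host` template is assumed to provide the other required directives.

```
# Sketch only: command names, addresses, community string and ifIndex are assumptions.
# ifOperStatus is IF-MIB OID .1.3.6.1.2.1.2.2.1.8.<ifIndex>; a value of 1 means "up".
# "-r 1" is a loose regex match on the returned value ("1" or "up(1)").
define command{
        command_name    check_ifoperstatus
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.2.1.2.2.1.8.$ARG2$ -r 1
        }

define host{
        use             generic-host
        host_name       eth0
        alias           Nagios server NIC
        address         127.0.0.1
        check_command   check_local_link        ; hypothetical; no "parents" line, so it hangs off the Nagios process
        }

define host{
        use             generic-host
        host_name       sw1port10
        alias           switch1 port 10
        address         192.0.2.2               ; management address of switch1 (made up)
        check_command   check_ifoperstatus!public!10
        parents         eth0
        }
```

The next hop (the port on the far end of whatever is plugged into switch1) would be another host with `parents sw1port10`, and so on down the cabling.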

I know darn well that you don’t have 500 cables plugged into that router.


#4

To give you an idea, here is a pic showing ONE switch in one building connected to another switch in another building. It shows the ports used on the switches, the status of the TWO radios’ hardware including the outdoor radio transmitter, and finally the ports used on the other end, including the switch itself.

A switch doesn’t just connect to another switch; it’s a port on a switch that connects to another port on a switch, so show that in nagios. By doing so, someone could unplug EVERY single cable on our network, and I could get every switch/router/radio/etc connected back up, just from the nagios status map. If your nagios status map is not that thorough, then you can’t tell EXACTLY what port on what switch is your problem.


In the above pic, you would have gotten 2 emails, one for each down host.

If you noticed, there are 2 ways to get from point A to point B, using either radio A or B, so if both are down, then surely you want to get 2 emails. That picture is a perfect representation of what “parents” definitions in the hosts.cfg file will do for you.

Actually, I noticed that there is a flaw in my layout, there are a couple of them down, I’ll have to look at that to see why it shows up that way.

[Edited Thu Aug 11 2005, 10:47AM]


#5

Really, the reason it shows like that is a circular path, which is not allowed in nagios. In real life, spanning tree blocks one path, so that there is no network loop. So you have to lie to nagios about things like that. But all in all, you get the idea: one host is down, and the rest of them are simply unreachable, and 99.9% of the time your network will be laid out with no circular paths.
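
One way to read “lie to nagios” here is to describe the redundant link as a tree: list both radios as parents of the far-side port, and never list anything on the far side as a parent of the near side. The sketch below assumes that reading; all host names and addresses are hypothetical, the near-side ports are assumed to be defined elsewhere, and a `generic-host` template is assumed. The `parents` directive does accept a comma-separated list, and as I understand it a host with several parents is only treated as unreachable when none of them can be reached.

```
# Sketch only; names and addresses are made up.
define host{
        use             generic-host
        host_name       radio-a
        alias           Radio link A
        address         192.0.2.21
        parents         bldg1-sw-port23         ; near-side switch port, defined elsewhere
        }

define host{
        use             generic-host
        host_name       radio-b
        alias           Radio link B
        address         192.0.2.22
        parents         bldg1-sw-port24         ; near-side switch port, defined elsewhere
        }

define host{
        use             generic-host
        host_name       bldg2-sw-port1
        alias           Building 2 switch, port 1
        address         192.0.2.31
        parents         radio-a,radio-b         ; two ways in, but no path leading back out
        }
```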


#6

No need to get all worked up. I’m still very new to Nagios, the company I’m a trainee at uses it. I only want to learn more about it, and perhaps try to improve on how it works for them now.

The 300 switches are not directly connected to the router; they’re connected via a lot of other switches. Sorry if I confused things. And when I look at the “Status map” I can see a hierarchical view of the network layout, which closely resembles how the network looks out in the field.

Now the problem is this: IF one of the major switches, or the router above them, goes down, Nagios will send MANY e-mails, resulting in a lot of SMS. Therefore the company has disabled the checks of the “non-vital” switches, and the on-call person must look at Nagios every hour or so to manually check if the shit is still alive.
What would really improve things would be IF we could enable the check for ALL devices in the network, BUT if more than 10 devices die at the same time, Nagios only sends a FIXED number of e-mails. Can this be done??? And if so, plz tell me how or guide me to a link.

Thanx for the feed-back, even though I didn’t grasp how it’ll fix my problem.

/Gymmarn - Nagios noob :slight_smile:



#7

Hi, I’m a trainee at a network company so I’m not so good at Nagios yet. Trying to learn, by fiddling with a little Nagios box of my own :wink:

Yes, they have already done all that. The Nagios server will send an e-mail as soon as a bigger switch/router goes down. But the lesser ones are set to no notification at all. That’s because they get tons of e-mails from Nagios if a bigger switch/router dies and a whole link goes down. As a result, the on-call engineer has to watch Nagios manually every hour or so; it would be better to get an SMS if a device went down.

SO to my question finally :slight_smile:
Can it be done so that Nagios only sends like 1-5 (or a set number of) e-mails regardless of how many devices in the network go down?? And if so, plz guide me to a page which describes this or show me some examples. It would be a great improvement :slight_smile:

/Gymmarn - Nagios noob :slight_smile:



#8

As I told you, how is nagios to know “which 5 of the 500 alerts” it should send? Is that really what you want? Absolutely useless. You might as well remove nagios completely. And don’t say “don’t get all worked up” either, I’m just telling you the way it really is.

What is the point of having nagios tell me that the router is down, and that 300 switches are down as well? I want to fix the problem, not be told that everything on the network is broken! If the parents directive is defined in the hosts.cfg correctly, then port10 on switchA connects to port 5 on switch B, which connects to… etc. And when a problem occurs, you get ONLY ONE email, stating port10switchA is down.

But if your nagios is all set up like that, then it’s simply how you have defined your alerts. If you don’t want to be notified about “unreachable” then don’t define it that way.
services.cfg
notification_options w,c,r

Yours is most likely
notification_options w,c,r,u


#9

Actually, it should be your hosts.cfg that contains:
notification_options d,r
Down and recovered only. That way, you won’t get unreachable.
A “u” in the services.cfg means unknown, which may be something that you SHOULD be notified about, since that service check is just not getting a reply from the remote host.
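
To make the difference concrete, here is a small sketch of the two definitions side by side. The host and service names, the address and the `generic-host` / `generic-service` templates are assumptions for illustration; the option letters are the standard ones.

```
# hosts.cfg (sketch; names are hypothetical)
define host{
        use                     generic-host        ; assumed template
        host_name               switch-01
        alias                   Access switch 01
        address                 192.0.2.11
        parents                 core-router
        notification_options    d,r                 ; d = down, r = recovery; a "u" here would mean unreachable
        }

# services.cfg (sketch)
define service{
        use                     generic-service     ; assumed template
        host_name               switch-01
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        notification_options    w,u,c,r             ; w = warning, u = unknown, c = critical, r = recovery
        }
```

Contacts have their own filters as well (`host_notification_options` and `service_notification_options`), so an unwanted “u” could also be coming from the contact definition rather than from the host.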