Check plugins for services.cfg nagios 1.4


#1

Hello.

I need some major assistance with a problem I am unsure of how to solve.

I notice that when I use check_ping to monitor a service in service.cfg, Nagios will oftentimes tell me something is down when it is not: a false positive. I know it’s not down because I can ping it myself, and the host check, which also pings it, never shows it down.

I tried changing these over to polling (SNMP) instead of pinging, thinking maybe it wanted an actual service to monitor, since I had only seen this problem with ping and nothing else that I monitor.

Well, now I see the same problem with polling using check_snmp.
I have enough bandwidth; it’s not pegging the box or the core network it sits on. I have also tried a new box with a different Ethernet card. I needed a faster, better box anyway.

Also, this is database-driven, using Postgres.

To make this even more fun, we’re monitoring 853 hosts and counting.

Also, could I eliminate service.cfg and just use hosts.cfg without Nagios warning me that ‘there is no service associated with this host’?

Does anyone have any insight on this one?

Thank you for any advice.


#2

You don’t have to have a service defined for every host, but why?
What are your numbers for these (min / max / average)?
Check Execution Time: 0.00 sec 0.27 sec 0.089 sec
Check Latency: 0.00 sec 0.43 sec 0.120 sec
Most likely, you are showing a high latency due to having too many active nagios checks. Consider a passive setup.
Also, it might help to know what problem you are having. They ping but your check_ping fails? How is that again? A bit more detail would help that sound more logical.


#3

I was thinking about removing the pings/polls.

[quote]
What is your
Check Execution Time: 0.00 sec 0.27 sec 0.089 sec
Check Latency: 0.00 sec 0.43 sec 0.120 sec
[/quote]

Hmm, honestly I’m not familiar with how to see those results, other than going by what Nagios reports back, in which case there is no out-of-the-ordinary latency.

[quote]
Most likely, you are showing a high latency due to having too many active nagios checks. Consider a passive setup.
[/quote]

Okay, I shall read up on that!

[quote]
Also, it might help to know what problem you are having. They ping but your check_ping fails? How is that again? A bit more detail would help that sound more logical.
[/quote]

Yes. If I look at the list of hosts that I ping, they show fine. If I look at the list of services that I ping, oftentimes a group of them will show down, for example 15 or 20 services that begin with the letter V. Most are in alphabetical order. :slight_smile:

I just know that we don’t have 20 VSGs down all at once :wink:

So both the service and the host are checked via ping, but the host shows fine and the service doesn’t, and the behavior is sporadic. Does this make things clearer?

I wonder if this has something to do with service freshness, because it only happens sometimes, so I’ll start to mess with some settings. It’s got to be something with a setting somewhere :slight_smile:


#4

Freshness checking is for passive checks. Are you scheduling host checks? If so, why? I ask because you say your host checks are fine, but the service pings are not. Just an FYI, Nagios doesn’t run host checks every 5 minutes unless you have specifically configured it to do so, and you shouldn’t. It severely impacts performance.
The output of a service check would help a lot more than just “pings fail”.
Provide the latency and execution time info by looking at Performance Info in the Nagios web interface.


#5

No, not scheduling host checks here.

Alright, next time it goes flaky on me I’ll get a few examples. It may be today, it may be tomorrow, but I’m sure it’ll happen again :).

Thanks!

PS. Here are the numbers (min / max / average):
Check Execution Time: < 1 sec 5 sec 0.364 sec
Check Latency: < 1 sec 66 sec 18.716 sec


#6

Hopefully this gives some idea. Those VSGs were not down at all.

select host_name,plugin_output,execution_time,latency from servicestatus where service_status = 'CRITICAL';
     host_name      |               plugin_output               | execution_time | latency
--------------------+-------------------------------------------+----------------+---------
 SomeOne            | SNMP problem - No data received from host |              8 |     265
 MainStreet_B3K_AP3 | SNMP problem - No data received from host |              9 |     254
 SomeTwo            | SNMP problem - No data received from host |              9 |     263
 VSG1               | SNMP problem - No data received from host |              8 |     265
 VSG2               | SNMP problem - No data received from host |              8 |     264
 VSG3               | SNMP problem - No data received from host |              9 |     254
 VSG4               | SNMP problem - No data received from host |              8 |     261
 VSG5               | SNMP problem - No data received from host |              8 |     269
 VSG6               | SNMP problem - No data received from host |              8 |     265
 VSG7               | SNMP problem - No data received from host |              8 |     264
 VSG8               | SNMP problem - No data received from host |              8 |     254


#7

At times, devices won’t reply to SNMP requests in a timely fashion because they give priority to other tasks. You may have to increase the timeout for your plugins in nagios.cfg.


#8

Alright.

Lemme show you what I currently have them set to.

service_check_timeout=30
host_check_timeout=15
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5

Does this seem reasonable? It does to me, but then again, I’m not entirely sure!

Thanks again!


#9

Something else to note is the #define settings in the source code of check_snmp.c itself.

#define DEFAULT_COMMUNITY "public"
#define DEFAULT_PORT "161"
#define DEFAULT_MIBLIST "ALL"
#define DEFAULT_PROTOCOL "1"
#define DEFAULT_TIMEOUT 1
#define DEFAULT_RETRIES 5

Could this factor into my issues?


#10

Yeah, those settings in nagios.cfg look fine.
So use the -t option in check_snmp to increase the timeout of the plugin from its default. Keep increasing it until you stop having the periodic problem.

Your last post simply shows the defaults that are used if you don’t specify values when you run the command.
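
As a rough sketch of the -t suggestion, a command definition along these lines could carry the longer timeout. The command name check_snmp_t5 and the argument layout (community string in $ARG1$, OID in $ARG2$) are only assumptions for illustration, and it presumes $USER1$ points at your plugin directory in resource.cfg:

```
# Sketch only: the command name and argument order are hypothetical.
# -t 5 raises the plugin timeout to 5 seconds; adjust it as needed.
define command{
        command_name    check_snmp_t5
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -t 5
        }
```

A service would then reference it with something like check_command check_snmp_t5!public!.1.3.6.1.2.1.1.3.0 (sysUpTime, picked purely as an example OID). Whatever value you pass to -t should stay below service_check_timeout (30 in your nagios.cfg), otherwise Nagios will kill the check before the plugin gives up on its own.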


#11

Alrighty then, apparently I had already made changes to it. Its default was 1. When I went to modify it today it was set at 2, so I moved it to 5.

Wish me luck! :slight_smile:

Thanks again.


#12

Good luck!