Nagios doesn't check services


#1

Hello folks,
I'm reposting my problem.
I was on version 3.03 and upgraded to 3.05, but nothing changed.
I have some checks that run only during certain periods. For example:
define timeperiod{
timeperiod_name 24x5_8am
alias 24 Hours A Day, 5 Days A Week , 8am
monday 08:00-08:15
tuesday 08:00-08:15
wednesday 08:00-08:15
thursday 08:00-08:15
friday 08:00-08:15
}

OK, so I have a service (it checks a backup) that should run in this period.
In the service state information I get:
Last Check Time: 11-10-2008 08:52:01
Check Type: ACTIVE
Next Scheduled Check: N/A

The definition of the service is:

define service{
use generic-service
host_name AS400DEV
service_description Check_Backup
check_period 24x5_8am
normal_check_interval 10
check_command check_file!'/mnt/flr_as400dev/ok_bkp_dev.txt'!
}

I tried restarting Nagios, rebooting the server, and deleting the retention.dat file, as I read here.
But nothing changes.
This happens to some services, not all. Maybe it is related to “custom” timeperiods?
Any help is appreciated.
Regards to all


#2

Well, first look at the configuration, and see which services are failing with check execution. Maybe some variable blocks the checks. Check the template you’re using for those services. What does nagios.log say about those unchecked services?

I don’t see anything wrong with the timeperiod defined. Are you sure all those unchecked services are using the same timeperiod definition?


#3

As another user of this forum wrote, a workaround seems to be: stop Nagios, delete those 2 .dat files, and change max_check_attempts. Setting max_check_attempts to 1 in the generic-service template in templates.cfg and restarting Nagios seems to solve the problem. Anyway, in my opinion there's a bug in Nagios, and I know how to reproduce the problem: force an OK status onto a service in “critical” status (in my case a weekly backup), which I did with the command /usr/local/nagios/libexec/eventhandlers/submit_check_result AS400DEV 'Check_Backup' 0 'Fake OK'. Anyway, the workaround solved the problem temporarily. Thanks.


#4

Well, of course :)
max_check_attempts needs to be set to 1 if normal_check_interval is set to 10 (I assume those are minutes), which it is, and if retry_interval is set to a higher number.

That means: if your check timeperiod is 15 minutes long and the regular check starts at, for example, the 7th minute, then with max_check_attempts set higher than 1 and retry_interval higher than 8, the checks will run past the 15 minutes and the check won't be completed for that day.

I suggest that, if you keep normal_check_interval at 10, you lower max_check_attempts to 1 and keep it that way. In that scenario you can be sure your check will be completed within those 15 minutes.
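
To make that arithmetic concrete, here is a minimal Python sketch. It is not Nagios code: the function name and the simplified model (each retry starts exactly retry_interval minutes after the previous attempt) are assumptions made for illustration.

```python
def attempts_fit_in_window(window_minutes, first_check_minute,
                           max_check_attempts, retry_interval):
    """Return True if every attempt starts inside the check window.

    Simplified model (an assumption, not Nagios internals): the first
    check starts first_check_minute after the window opens, and each
    further attempt starts retry_interval minutes after the previous one.
    """
    last_attempt_minute = (first_check_minute
                           + (max_check_attempts - 1) * retry_interval)
    return last_attempt_minute < window_minutes


# 15-minute window, first check at minute 7, one retry 9 minutes later:
# the retry would start at minute 16, after the window has closed.
print(attempts_fit_in_window(15, 7, 2, 9))  # False

# With max_check_attempts 1 there is no retry that could miss the window.
print(attempts_fit_in_window(15, 7, 1, 9))  # True
```

Under this model, the only way to be independent of retry_interval is max_check_attempts 1, which is the suggestion above.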


#5

I understood… but not entirely…
So… if I need to schedule a check only once a day (let's say at 14:00) from Monday to Friday… how should I set the max_check_attempts and retry_interval values correctly?
Thanks again, very kind.
Regards


#6

Set:

max_check_attempts 1
normal_check_interval 5
retry_check_interval 1

max_check_attempts 1 means Nagios will send a notification for Warning or Critical as soon as it gets the first Warning or Critical exit code from the check plugin.

normal_check_interval 5 means it will reschedule the check every five minutes if the exit status of the plugin is OK, and

retry_check_interval 1 means it will reschedule the check every minute once it gets a Warning or Critical exit status from a plugin.

I have set up this example so that the check timeperiod has to last at least a minute, but I assume you will leave it at 15 minutes or more, as in your previous timeperiod example.

You can set this in different combinations, but the math of those intervals and attempts has to be respected, like this:

max_check_attempts 2
normal_check_interval 3
retry_check_interval 2

In this example, Nagios will check the service 2 times before it sends a notification of a Warning or a Critical, with an interval of 2 minutes between those checks. Altogether it takes 5 minutes: the normal interval is 3 minutes and we start from an OK status, so the first check comes after 3 minutes. Say that check returns a Warning status. Nagios now reschedules the next check using retry_interval, which is 2 minutes. That check returns a second Warning exit status, 5 minutes after the OK state, and because max_check_attempts is set to 2, the notification is sent. While in OK status, Nagios will reschedule checks every 3 minutes.
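
The same sum can be written as a tiny Python sketch. The function name is hypothetical and the model is simplified on purpose: it assumes the service was OK, the next regular check returns a problem, and every retry after that returns a problem too.

```python
def minutes_until_notification(max_check_attempts, normal_check_interval,
                               retry_check_interval):
    """Minutes from the last OK check until the notification is due.

    Simplified model (an assumption): one normal interval until the first
    failing check, then one retry interval per remaining attempt.
    """
    return (normal_check_interval
            + (max_check_attempts - 1) * retry_check_interval)


# The example above: 3 minutes to the first Warning, 2 more to the second.
print(minutes_until_notification(2, 3, 2))  # 5

# With max_check_attempts 1 the first failing check already notifies.
print(minutes_until_notification(1, 5, 1))  # 5
```

The check timeperiod has to be at least as long as the part of this sequence you want to see completed.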

Hope this helps :)


#7

I followed your suggestions, but I still cannot get the services checked…

Check_Backup_2 PENDING N/A 0d 22h 20m 42s+ 1/1 Service is not scheduled to be checked…

If I manually try “Re-schedule the next check of this service”, the service still isn't checked. And it is not present in the scheduling queue…

I will play with those values some more, but it seems that the problem is with services that use custom timeperiods.

I want a service checked only ONCE a day, at 14:00:
define timeperiod{
timeperiod_name 24x5_14am
alias 24 Hours A Day, 5 Days A Week , 14am
monday 14:00-14:01
tuesday 14:00-14:01
wednesday 14:00-14:01
thursday 14:00-14:01
friday 14:00-14:01
}

The service is now defined like this:
define service{
use generic-service
host_name AS400DEV
service_description Check_Backup_2
check_period 24x5_14am
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
normal_check_interval 1
check_command check_file!'/mnt/flr_as400dev/ok_bkp_dev2.txt'!
}

A question… should those max_check_attempts, normal_check_interval and
retry_check_interval values be put in the template file or in the service definition?
Thanks again…


#8

How did you set the “ORPHANED HOST/SERVICE CHECK OPTIONS” variables in nagios.cfg?
They should be set to 1 in a scenario like yours:
check_for_orphaned_services=1
check_for_orphaned_hosts=1

See if that helps.


#9

My values are…

check_for_orphaned_services=1
check_for_orphaned_hosts=1

I will make some more attempts. Unfortunately I don't know how to “debug” this, but I will “play” with the cfg files.


#10

It's unbelievable, but the checks still don't work…
One strange thing is that the check is scheduled a bit early: it should run at, for example, 8am, but in Nagios I see:
Service check scheduled for Mon Nov 24 07:59:57 CET 2008
Or
Service is not scheduled to be checked…
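
As a small illustration of why that timestamp looks odd, the Python sketch below tests whether a moment falls inside a Monday-to-Friday window such as 08:00-08:15. The function name is hypothetical, and treating the end of the window as exclusive is an assumption. It only shows that 07:59:57 lies three seconds before the window opens; it says nothing about how Nagios itself handles such a check.

```python
from datetime import datetime, time

WEEKDAYS = range(0, 5)  # Monday=0 .. Friday=4


def in_check_window(moment, start, end):
    """True if moment is on a weekday and its time of day is in [start, end)."""
    return moment.weekday() in WEEKDAYS and start <= moment.time() < end


window_start, window_end = time(8, 0), time(8, 15)

# The scheduled time reported above: Mon Nov 24 07:59:57 2008.
print(in_check_window(datetime(2008, 11, 24, 7, 59, 57),
                      window_start, window_end))  # False

# Exactly 08:00:00 on the same day is inside the window.
print(in_check_window(datetime(2008, 11, 24, 8, 0, 0),
                      window_start, window_end))  # True
```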

I ask you all…
If you needed to check a service ONLY ONCE A DAY, from Monday to Friday at 8am… how would you do that?
I am giving up…
Thanks and regards


#11

Check these settings in nagios.cfg. They should be set as I've specified to test this problem of yours. Of course, first read the explanation of each one to be sure you want to enable them and set them the way I've specified.

retain_state_information=1
use_retained_scheduling_info=1
service_inter_check_delay_method=s

If you don’t have many services that are scheduled often, then set this to 0

And let’s go over some things again just to be sure.
You are saying that nothing is written in nagios.log for those services?
Are you sure those services are set to active_checks_enabled 1?

Post us the template you're using for the problematic services. And post us the command definition.
Also check whether the check plugin maybe hangs or times out, but beyond the timeperiod set for the service to be checked.

If that doesn’t help, let’s go deeper to the root of the problem.
First, in what state is your server? Do you happen to have too much CPU usage on the server, maybe too many forks… check those. How many services do you have configured in Nagios?
Have you tried reinstalling Nagios and Nagios-plugins?

For a service checked once a day it is enough to define a timeperiod of about 3 minutes and set normal_check_interval to about 1 minute and retry_check_interval also to 1, with max_check_attempts 1. That way the service should be checked within those 3 minutes. Why that is not happening in your case is really a mystery to me for now. Hope we will solve it.


#12

> retain_state_information=1
> use_retained_scheduling_info=1
> service_inter_check_delay_method=s

That is exactly what is in my nagios.cfg.

> You are saying that nothing is written in nagios.log for those services? Are you sure those services are set to active_checks_enabled 1?

Exactly: nothing is written in nagios.log, and that value is set to 1.

> Post us the template you're using for the problematic services. And post us the command definition.
> Also check whether the check plugin maybe hangs or times out, but beyond the timeperiod set for the service to be checked.

define timeperiod{
timeperiod_name 24x5_8am
alias 24 Hours A Day, 5 Days A Week , 8am
monday 08:00-08:03
tuesday 08:00-08:03
wednesday 08:00-08:03
thursday 08:00-08:03
friday 08:00-08:03
}

define service{
use generic-service
host_name AS400DEV
service_description Check_DskSts
check_period 24x5_8am
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
check_command check_file!'/mnt/flr_as400dev/ok_dsk_dev.txt'!
}

The check_file plugin is also used for other periodic checks (though not with custom timeperiods).
And if I replace the custom check_period with another one (for example timeperiod_name 24x7), the service is checked.

> First, in what state is your server? Do you happen to have too much CPU usage on the server, maybe too many forks… check those. How many services do you have configured in Nagios?
> Have you tried reinstalling Nagios and Nagios-plugins?

The server is absolutely not under pressure. Moreover, I cannot understand why ONLY the checks with a
custom timeperiod (checked once a day) give me problems!
I updated Nagios from 3.03 to 3.05; I did not reinstall it. Should I? Uninstall and reinstall?

> For a service checked once a day it is enough to define a timeperiod of about 3 minutes and set normal_check_interval to about 1 minute and retry_check_interval also to 1, with max_check_attempts 1. That way the service should be checked within those 3 minutes. Why that is not happening in your case is really a mystery to me for now. Hope we will solve it.

I set the values as you suggested.
Tomorrow morning I will check again, but I am sure I will get nothing again, just a “no service is scheduled”…
The service now shows:
Check_DskSts

PENDING 	N/A 	0d 0h 9m 55s+ 	1/2 	Service check scheduled for Tue Nov 25 08:00:00 CET 2008 

I’ll post tomorrow how it will be.
But i am pessimistic…

Thanks one more time for your patience and kindness
Regards


#13

Of course the checks did not work. Here is the situation…
I give up!


Check_DskSts
PENDING N/A 0d 14h 46m 23s+ 1/2 Service check scheduled for Wed Nov 26 08:00:00 CET 2008

Check_DskSts
PENDING N/A 0d 14h 48m 6s+ 1/2 Service check scheduled for Wed Nov 26 08:00:00 CET 2008

Check_Vision
PENDING N/A 0d 14h 49m 43s+ 1/2 Service check scheduled for Wed Nov 26 08:00:00 CET 2008

Check_Backup_Notte
PENDING N/A 0d 14h 51m 16s+ 1/1 Service is not scheduled to be checked…

Of course there is absolutely nothing in nagios.log regarding these services.

In the scheduling queue I can see that the services should be checked, but they are not executed…


AS400DEV Check_Backup_2 N/A 11-25-2008 13:59:54 Normal ENABLED


#14

Someone should explain this to me…
After reading other users' messages, I tried it this way:

define timeperiod{
timeperiod_name 24x5_14am
alias 24 Hours A Day, 5 Days A Week , 14am
monday 13:00-13:59
tuesday 13:00-13:59
wednesday 13:00-13:59
thursday 13:00-13:59
friday 13:00-13:59
}

define service{
use generic-service
host_name AS400DEV
service_description Check_Backup_2
check_period 24x5_14am
check_interval 20
check_command check_file!'/mnt/flr_as400dev/ok_bkp_dev2.txt'!
}

define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 10
retry_check_interval 2
contact_groups admins
notification_options w,u,c,r
notification_interval 60
notification_period none
notification_period workhours
register 0
}

And this is what happened …
Current Attempt: 1/1 (HARD state)
Last Check Time: 25-11-2008 13:39:51
Check Type: ACTIVE
Check Latency / Duration: 0.146 / 0.811 seconds
Next Scheduled Check: 26-11-2008 12:59:56
Last State Change: 25-11-2008 12:59:51

So … HOW CAN I MAKE THIS SERVICE BE CHECKED AT 2pm DAILY ???


#15

The only thing you're missing is that you have defined your timeperiods as 13:00-13:59. The values Next Scheduled Check: 26-11-2008 12:59:56 and Last State Change: 25-11-2008 12:59:51 would move one hour forward if you defined the timeperiods as 14:00-14:59. To get the service checked only once during that hour, set normal_check_interval and retry_check_interval to 60.
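
A quick way to see why an interval of 60 gives a single check in a one-hour window is the Python sketch below. The function name is hypothetical, and the model assumes the first check runs right when the window opens and every check returns OK.

```python
def check_minutes_in_window(window_minutes, check_interval,
                            first_check_minute=0):
    """Minutes (counted from the window start) at which checks would run.

    Simplified model (an assumption): checks repeat every check_interval
    minutes and only those starting before the window closes are run.
    """
    return list(range(first_check_minute, window_minutes, check_interval))


# One-hour window with an interval of 60: a single check at the start.
print(check_minutes_in_window(60, 60))  # [0]

# The same window with check_interval 20: three checks.
print(check_minutes_in_window(60, 20))  # [0, 20, 40]
```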


#16

Still nothing…
I tried setting the timeperiod:
define timeperiod{
timeperiod_name 24x5_14am
alias 24 Hours A Day, 5 Days A Week , 14am
monday 14:00-14:59
tuesday 14:00-14:59
wednesday 14:00-14:59
thursday 14:00-14:59
friday 14:00-14:59
}

define service{
use generic-service
host_name AS400DEV
service_description Check_Backup_Notte
check_period 24x5_8am
check_interval 20
normal_check_interval 60
retry_check_interval 60
check_command check_file!'/mnt/flr_as400dev/ok_bkp_dev.txt'!
}

And the service seems to be scheduled correctly:
Service check scheduled for Thu Nov 27 13:59:39 CET 2008
But the service is not executed:
Last Check N/A
Service State Information

This service has not yet been checked, so status information is not available.

:(


#17

I reckon it would be easier to schedule this outside of nagios and submit the result passively…


#18

But it is an iSeries (AS400) server, not Linux or Windows. Are passive checks also possible for an iSeries? (There is NO NSCA client…)


#19

So, in the end I found out what the problem was…
My Nagios installation is on a VM guest, and it is a member of a Windows Active Directory domain.
The VM's date/time must be synchronized periodically with the Domain Controller (5 minutes is the limit), because the VM's clock is not accurate and keeps losing minutes.
So I had inserted a line in crontab that synchronizes the date/time (with ntpdate) every minute.
And… I disabled this synchronization and… bingo! Now every service is always checked without a problem!
It was not a problem with timeperiods or service definitions!
Now I have to find a way to make time synchronization work without making Nagios play the fool again. One solution would be a scheduled job that runs stop nagios && ntpdate && start nagios.
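
For anyone who wants to see how large such clock steps are on their own machine, here is a minimal Python sketch. It is not part of Nagios and the function names are hypothetical. It relies on the fact that Python's time.monotonic() is not affected by system clock updates, while time.time() follows them.

```python
import time


def clock_step_seconds(wall_before, mono_before, wall_after, mono_after):
    """Estimate how far the wall clock was stepped between two samples.

    The wall clock elapsed time includes any step applied to the system
    clock; the monotonic elapsed time does not. Their difference
    approximates the size of the step (positive means stepped forward).
    """
    return (wall_after - wall_before) - (mono_after - mono_before)


def sample():
    """Take one (wall clock, monotonic clock) sample."""
    return time.time(), time.monotonic()


wall0, mono0 = sample()
time.sleep(0.1)
wall1, mono1 = sample()
print("wall clock stepped by about %.3f s"
      % clock_step_seconds(wall0, mono0, wall1, mono1))
```

Sampling once before and once after the cron-driven ntpdate run would show how many seconds the clock jumps each minute.
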
Thanks to everyone for trying to help!
Regards


#20

Nah, I meant: still run the check from the Nagios server, but run the check script regularly from another script that is scheduled using cron, and use your cron script to parse the check script's results and submit them straight into Nagios via external commands.
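
A rough Python sketch of the submitting half of such a cron script follows. The helper names and the plugin output text are made up for illustration. PROCESS_SERVICE_CHECK_RESULT is the Nagios external command for passive service results; the location of the command file is an assumption you need to replace with the command_file value from your own nagios.cfg.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        timestamp, host, service, return_code, output)


def submit(line, command_file):
    """Write the command line to the Nagios external command file.

    command_file is whatever command_file points to in nagios.cfg;
    nothing here guesses that path.
    """
    with open(command_file, "w") as handle:
        handle.write(line)


# Example line for the service from this thread; return code 0 means OK.
# The output text is a made-up placeholder.
print(passive_result_line("AS400DEV", "Check_Backup", 0,
                          "backup flag file found", now=1227600000), end="")
```

The cron script would run the existing check script, map its exit code to return_code, and call submit() with the resulting line.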

Any road up it’s a non-issue now, but glad you got it sorted.