Scheduling Problem

Hi!

I recently got Nagios installed and set up on a server running AIX 5.3. The install went (more or less) smoothly, and I got things configured and running. Things look good at first when I start or restart Nagios.

I’m doing very simple checks, using check_ping as the “service” for each host, and check-host-alive to do the host checks.

The problem I’m having is that once I have a host go down, the scheduling gets “off” and several hosts don’t get checked in time, and drop out of the queue. They never get rescheduled… an obvious problem.

I’ve tried changing the scheduling parameters around, and going through the docs & following what it says (including using the “smart” option for the inter-check delay), but I can’t find a way around this.

Here’s the output of the -s switch:

Nagios 2.0b3
Copyright © 1999-2005 Ethan Galstad (www.nagios.org)
Last Modified: 04-03-2005
License: GPL

Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION

Total hosts: 246
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION

Total services: 246
Total scheduled services: 246
Service inter-check delay method: USER-SUPPLIED VALUE
Inter-check delay: 0.30 sec
Interleave factor method: SMART
Average services per host: 1.00
Service interleave factor: 1
Max service check spread: 30 min
First scheduled check: Mon May 23 16:42:41 2005
Last scheduled check: Mon May 23 16:43:54 2005

CHECK PROCESSING INFORMATION

Service check reaper interval: 10 sec
Max concurrent service checks: 100

PERFORMANCE SUGGESTIONS

I have no suggestions - things look okay.



Here’s my services.cfg:
define service{
        host_name                       *
        service_description             PING
        contact_groups                  admins,on_call_egh
        notification_options            c,w,r
        check_command                   check_ping
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             0
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          0
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           3
        retry_check_interval            1
        notification_interval           0
        notification_period             24x7
        }



Here’s my “standard” host config (used as an include):
define host{
        name                    windows_default
        max_check_attempts      2
        notification_interval   0
        notification_period     24x7
        notification_options    d,r
        check_command           check-host-alive
        register                0
        }

Thanks for any help!

  • Tony

Anyone have any ideas on this? I’ve been doing a lot of reading, and I can find where the docs say that when a host check is triggered, no other checks are performed until it completes. However, they do not mention that other service checks get “left behind”. As it stands now, every time a host goes down I lose 5 to 10 service checks that do not get checked in time and then never get rescheduled.

Any help is appreciated.

  • Tony

Sounds to me like you just don’t have enough CPU or memory to handle this. Perhaps change your check interval to 5 minutes instead of 3 minutes.

This is running on a dual POWER4-based IBM box with 4 GB of RAM… tons of free RAM, and very idle CPUs.

I’ve played around with the interval, and no matter what I do, if a service check is scheduled during the time a host check runs, it gets pre-empted by the host check (which is normal according to the docs) and its “next check time” is therefore in the past. And then it never gets rescheduled.

For instance (a rough example):

Service A is scheduled to check at 8:00:00
Service B is scheduled to check at 8:00:05
Service C is scheduled to check at 8:00:10
Service D is scheduled to check at 8:00:15

Service A checks fine, gets rescheduled for 8:00:20.
Service B fails its check at 8:00:05 and triggers a host check that takes 6 seconds. The host check completes at 8:00:11, and host B is UP (or down, it doesn’t matter). Service B is checked based on its parameters in the services config.

Service D is checked at 8:00:15 and rescheduled.

Service C is still showing as scheduled to check at 8:00:10… a time in the past. It never gets checked again and never gets rescheduled.

For now, turn on service freshness checking and set the threshold to 6 minutes, if your checks are occurring every 3 minutes. At least until we get this figured out.
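A minimal sketch of what that could look like, assuming the Nagios 2.x directive names (worth verifying against the docs for the exact beta in use). The 60-second freshness check interval is only an illustrative value; the threshold is given in seconds, so 6 minutes is 360. The service block is abbreviated and stands in for the full PING definition posted earlier.

```
# --- nagios.cfg (main config file) ---
# Turn on the service freshness checker.
check_service_freshness=1
# How often, in seconds, to look for stale results (illustrative value).
service_freshness_check_interval=60

# --- services.cfg ---
define service{
        host_name               *
        service_description     PING
        check_freshness         1       ; was 0 in the posted definition
        freshness_threshold     360     ; seconds (6 minutes)
        # remaining directives unchanged from the posted services.cfg
        }
```

When a result is older than the threshold, Nagios forces an active check of the service using its normal check_command.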

I turned on freshness checks - doesn’t seem to help.


Host                   Service  Last Check            Next Check
west_wing_8600         PING     06-03-2005 10:48:45   06-03-2005 10:51:45
SW-FPA-172.16.208.33   PING     06-03-2005 12:00:33   06-03-2005 12:03:33

Enabling the “check_for_orphaned_services” option seems to have resolved the issue, going by my preliminary testing.
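For reference, this is a main config file setting. A sketch of the relevant line, assuming the directive is spelled the same way in this 2.0 beta as in the Nagios 2.x docs:

```
# nagios.cfg (main config file)
# Reschedule service checks whose results never came back.
check_for_orphaned_services=1
```

Nagios has to be restarted to pick up a change to the main config file.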

Seems I was too quick to declare victory… making progress, though.

Host                   Service  Last Check            Next Check
EGH_EXCH1              PING     06-07-2005 11:36:55   06-07-2005 13:03:30
13D_CCE-172.16.38.30   PING     06-07-2005 11:25:51   06-07-2005 13:03:30
13C_CC-172.16.32.36    PING     06-07-2005 11:25:51   06-07-2005 13:03:30

(Data is copied from the web page at 13:12)

As you can see, services scheduled hours before are now getting rescheduled, but the check is still failing.

I see the following in the event log:
[06-07-2005 13:03:30] Warning: The check of service ‘PING’ on host ‘EGH_EXCH1’ looks like it was orphaned (results never came back). I’m scheduling an immediate check of the service…
[06-07-2005 13:03:30] Warning: The check of service ‘PING’ on host ‘13D_CCE-172.16.38.30’ looks like it was orphaned (results never came back). I’m scheduling an immediate check of the service…
[06-07-2005 13:03:30] Warning: The check of service ‘PING’ on host ‘13C_CC-172.16.32.36’ looks like it was orphaned (results never came back). I’m scheduling an immediate check of the service…

I’m also occasionally seeing these:
[06-07-2005 12:41:37] SERVICE ALERT: 14A_BELEV-172.16.20.30;PING;UNKNOWN;SOFT;1;Could not open pipe: /etc/ping -n -c 2 172.16.20.30

But those don’t occur for every host whose check fails to run correctly or to get rescheduled.

I don’t see any events at all that appear to be related to the “last run time” of the services.

As an example, the last entry that was not an “orphaned” entry for the first host on the above list was:
[06-07-2005 10:45:03] SERVICE ALERT: EGH_EXCH1;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.00 ms

That’s well before the “last run time”, and again these run a check every 3 minutes.

Try changing the interval to 5 minutes. Someone else reported problems with ping, and it turned out to be a ping flood check that some hosts have, so the host you are trying to ping thinks that you are attempting to flood it. Perhaps changing the number of packets sent (and similar settings) in the definition of check_ping would help. You might also want to use check_fping instead of check_ping, since check_fping is actually a hair quicker.
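As a rough illustration of the packet-count suggestion, a check_ping command definition along these lines could be used. The thresholds and the packet count are placeholder values, not taken from this thread, and $USER1$ is assumed to point at the plugin directory as in the sample resource file. The -p option sets how many packets the plugin sends.

```
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 100.0,20% -c 500.0,60% -p 5
        }
```

The -w and -c arguments are the warning and critical thresholds, each written as round-trip average in milliseconds followed by packet loss percentage.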