Inconsistent freshness checking


#1

Hi guys,

I’ve run into some strangeness concerning freshness checking.

A few of our passively checked services have freshness checks enabled, but these report a “stale” status even though the data isn’t actually stale.

See the log snippet below. You’ll see messages coming in through NSCA, telling the Nagios server about the (new) status for a service. But for some reason Nagios still decides that no new results have come in, and declares the freshness limit broken.

[1142264152] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;box1;JBOSS;0;Datasource (JBOSS) Application server connection alive. Description: 'jboss/box1' (time=Mon Mar 13 16:35:51 CET 2006)
[1142264153] SERVICE ALERT: box1;JBOSS;OK;SOFT;2;Datasource (JBOSS) Application server connection alive. Description: 'jboss/box1' (time=Mon Mar 13 16:35:51 CET 2006)
[1142264178] SERVICE ALERT: box1;JBOSS;CRITICAL;SOFT;1;CRITICAL: Data does not appear to be fresh anymore
[1142264213] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;box1;JBOSS;0;Datasource (JBOSS) Application server connection alive. Description: 'jboss/box1' (time=Mon Mar 13 16:36:51 CET 2006)
[1142264215] SERVICE ALERT: box1;JBOSS;OK;SOFT;2;Datasource (JBOSS) Application server connection alive. Description: 'jboss/box1' (time=Mon Mar 13 16:36:51 CET 2006)
[1142264246] SERVICE ALERT: box1;JBOSS;CRITICAL;SOFT;1;CRITICAL: Data does not appear to be fresh anymore

Here’s the service definition for the service in question. The “no-freshness” check_command is basically a shell script which yells that the service’s status is stale.

define service{
       use                             generic-diensten
       host_name                       box1
       service_description             JBOSS
       contact_groups                  systems
       check_command                   no-freshness
       freshness_threshold             1800
       check_freshness                 1
       active_checks_enabled           0
       passive_checks_enabled          1
       }

There are other services which have the same problem. They’re also set up in the same way.

So the weird thing is that, while service check results came in at X152 and X153, Nagios still claims that the service is stale at X178. That’s a time difference of only 25 seconds! Not nearly enough to trigger the 1800-second limit.

Does anyone have a guess as to what could be wrong?


#2

from what i see the hostname is different… “name changed to protect the innocent” :smiley: vs box1… they need to be the same.

Luca

PS: post edited… the server name was no issue, as explained in the next post :slight_smile:


#3

My bad :slight_smile: that’s actually the name I was trying to hide, by renaming it to box1.

Hence: the names are the same…


#4

whoops. ok, i’ll edit my post too… just to keep it away… :slight_smile:

is the normal check interval high enough? to be honest i’m not really sure how the freshness checks work so i’m going a bit by trial here… but i got mine to work… :slight_smile:

Luca


#5

Wow! The layout for this forum just keeps on changing, doesn’t it? o_O

Anywho: yeah, as far as I know all of the settings are correct, but I’ll take another look at it to make sure. Thanks for the suggestion.

EDIT:
Anywho, I did some more checking and I really am quite stumped…

Once more, the service definitions (template and service) are as follows:

define service{
        name                            generic-diensten
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            c,r,w
        register                        0
        }

define service{
       use                             generic-diensten
       host_name                       box2
       service_description             service2
       contact_groups                  group2
       check_command                   no-freshness
       freshness_threshold             500
       check_freshness                 1
       active_checks_enabled           0
       passive_checks_enabled          1
       }

The point is: as far as I know, “normal_check_interval” is completely ignored if you supply your own freshness_threshold. Only when you omit it should Nagios fall back to the X_check_interval values.

Just to be safe, I also tried setting a normal_check_interval value larger than the freshness_threshold. This didn’t have any effect.
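The fallback behaviour I’m describing, as I understand it, boils down to the logic below. This is a sketch of my reading of the docs, not taken from the Nagios source; the interval_length of 60 seconds is the Nagios default.

```shell
#!/bin/sh
# Sketch of how I understand the threshold selection: an explicit
# freshness_threshold wins; only when it is 0/omitted should Nagios fall
# back to a value derived from normal_check_interval (in interval units,
# 60 seconds each by default). Assumed logic, not Nagios source code.
freshness_threshold=500       # from the definition above
normal_check_interval=5       # in interval units (minutes, by default)
interval_length=60            # seconds per interval unit

if [ "$freshness_threshold" -gt 0 ]; then
    threshold=$freshness_threshold
else
    threshold=$(( normal_check_interval * interval_length ))
fi
echo "effective threshold: ${threshold}s"
```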

Strangeness! :shock:


#6

Well, if this ain’t screwed up I don’t know what is :stuck_out_tongue:

The following seems to have helped:

  1. Remove the service definitions completely.
  2. Restart nagios
  3. Re-add the service definitions, but only as passive checks. Leave out the freshness bits.
  4. Restart nagios.
  5. Re-add the freshness bits.
  6. Restart nagios.

Fscked up, right? I’ll see how things hold up through the night…

This really is quite odd :confused:


#7

Yep, it seems that this re-shuffling of the definitions has fixed the whole bugginess… Weird… :]


#8

Fux0r…

The problem is back, with a vengeance… Now a huge number of services have started reporting as “stale”, even though they aren’t. For example:

I’ll keep on searching for the cause and solution to this problem. In the mean time all your suggestions are more than welcome!


#9

freshness_threshold 500 is in seconds, so that’s a bit over 8 minutes.
Is your remote data coming in faster than that?
If so, make sure the data is actually getting to the central server, and has the exact same hostname/service description.
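One way to verify that would be to grep the Nagios log for the external command lines with the exact host/service pair. Here’s a self-contained sketch: the log line is modeled on the snippet in the first post, and in practice you’d point LOG at your real nagios.log (the path varies by install).

```shell
#!/bin/sh
# Sketch: confirm submitted passive results arrive with the exact
# host/service pair Nagios expects. Uses a temp file standing in for
# nagios.log; substitute your real log path in practice.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[1142264152] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;box1;JBOSS;0;alive
EOF
grep -c 'PROCESS_SERVICE_CHECK_RESULT;box1;JBOSS;' "$LOG"   # prints 1
rm -f "$LOG"
```

If the count stays at zero while your sender claims to be submitting results, the hostname or service_description doesn’t match what’s defined on the central server.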


#10

Yah, the answers were coming in every five minutes. The log files even show them coming in. That’s what freaked me out.