Nsca spawning out of control

brian89gp · January 14, 2006, 12:58pm

About once a day on my central Nagios server, the nsca daemon suddenly starts spawning processes until it brings the server down. Up to that point everything works fine; then something happens, and about an hour later there are several thousand nsca processes running and the server locks up. Is there a known cause for this?

There are about 1000 services on 400 hosts, with 11 other Nagios servers reporting into this one. It is a dual P3 box with 4 GB of RAM. nsca and Nagios start via the init scripts; there is NOT a cron job that restarts them.

jakkedup · January 14, 2006, 2:10pm

1000 services is not too many, since I’m running more than that with no trouble. So, what does your /etc/xinetd.d/nsca file look like? Mine:

    # default: on
    # description: NSCA
    service nsca
    {
            flags           = REUSE
            socket_type     = stream
            wait            = no
            user            = nagios
            group           = nagios
            server          = /usr/local/nagios/bin/nsca
            server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
            cps             = 9000 30
            instances       = UNLIMITED
            log_on_failure  += USERID
            disable         = no
            only_from       = bunchofipaddy's
    }

Have you looked at your system log file for errors about nsca?
What is normal_check_interval set to on your service checks? You may want to back off on it if you have it under 5 and your interval_length is 60. What is your interval_length, if it’s not the default of 60?

jakkedup · January 14, 2006, 2:11pm

PS: what is command_check_interval set to in nagios.cfg?

brian89gp · January 15, 2006, 9:30am

I am running nagios 2.0b4

the command_check_interval on all of the remote nagios servers is set at 1.

I could not find the normal_check_interval option. I do know that the host and service checks are set to 5 minutes.

I don’t have an /etc/xinetd.d/nsca file. “nsca -c nsca.cfg -d” is run from an rc.d script, but that is it.

The only entries in the logs about nsca are “connection accepted” messages.

jakkedup · January 15, 2006, 6:32pm

normal_check_interval is in your services.cfg

Holy moly that’s what is wrong. In the nagios.cfg…

    # EXTERNAL COMMAND CHECK INTERVAL
    # This is the interval at which Nagios should check for external commands.
    # This value works off the interval_length you specify later.  If you leave
    # that at its default value of 60 (seconds), a value of 1 here will cause
    # Nagios to check for external commands every minute.  If you specify a
    # number followed by an "s" (i.e. 15s), this will be interpreted to mean
    # actual seconds rather than a multiple of the interval_length variable.
    # Note: In addition to reading the external command file at regularly
    # scheduled intervals, Nagios will also check for external commands after
    # event handlers are executed.
    # NOTE: Setting this value to -1 causes Nagios to check the external
    # command file as often as possible.

    #command_check_interval=1
    #command_check_interval=15s
    command_check_interval=-1

NOTICE: that is a MINUS ONE, not just a ONE. With yours set to 1, Nagios checks the file only once a minute. With a lot of passive checks going on, you want that file to be practically empty at all times, and yours must be tremendously full of data waiting to be processed by Nagios every minute.
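To make the three forms concrete, here is a minimal sketch that turns a command_check_interval value into seconds, following the rules in the nagios.cfg comments quoted above. It is not part of Nagios; the function name check_interval_seconds and the "asap" output for -1 are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios: translate a
# command_check_interval value into seconds.
#   $1 = command_check_interval value (e.g. 1, 15s, -1)
#   $2 = interval_length in seconds (defaults to 60)
check_interval_seconds() {
    value=$1
    interval_length=${2:-60}
    case "$value" in
        -1)
            # -1 means "check as often as possible"
            echo "asap"
            ;;
        *s)
            # a trailing "s" means literal seconds
            echo "${value%s}"
            ;;
        *)
            # a plain number is a multiple of interval_length
            echo $(( ${value} * ${interval_length} ))
            ;;
    esac
}

check_interval_seconds 1      # prints 60
check_interval_seconds 15s    # prints 15
check_interval_seconds -1     # prints asap
```

So a plain 1 with the default interval_length means the command file is only read once every 60 seconds.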

brian89gp · January 16, 2006, 6:40am

My mistake, that command_check_interval is a -1. Seems I can’t read straight anymore…

The normal_check_interval is set to 5

It happened again while I was watching. At first the web interface starts to time out randomly and the RAM jumps to full usage (4 gig). After about an hour the thing starts swapping real bad and the nsca processes start to pile up.

Just speculation: could the excessive disk swapping cause the nsca processes to back up, which in turn causes more memory usage and more swapping, in a vicious circle?

One strange thing: for a while this was running on a Mandrake 10 box and it never had this lockup problem. For other reasons we are now using Fedora 3, and both times we have used this OS it has done this.

This box is also running perfparse with a mysql backend, could this have anything to do with it?

g00s3m4n · January 16, 2006, 8:37am

hello all…

I have the same problem; my settings are the same as brian89gp’s.

I am running nagios 1.2 on Redhat 8.

jakkedup · January 16, 2006, 6:56pm

Maybe we need to change your logging. Below are some items that you really shouldn’t change from what I have, unless you have an absolute need to see a check show up in your system log, etc.

nagios.cfg
use_syslog=0
log_notifications=0
log_service_retries=0
log_host_retries=0
log_passive_service_checks=0
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
aggregate_status_updates=1
status_update_interval=15

brian89gp · January 17, 2006, 6:56am

the differences that I have:
log_notifications=1
log_service_retries=1
log_host_retries=1
sleep_time=0.25

did not have:
log_passive_service_checks

I will change those settings and see if it makes a difference

g00s3m4n · January 17, 2006, 5:48pm

my differences:

use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1

I am making the changes and will check it out…

your help is much appreciated.

g

jakkedup · January 19, 2006, 3:49pm

sleep_time=0.25 is, I think, a big mistake. Did you change that value because some EXPERT, like the developer of Nagios, told you to? The nagios.cfg file says that you shouldn’t change that value from the default of 1. Anyway, good luck. After you fix the problem, you may want to turn logging back on for any of the items you turned off. But I see no reason to log to syslog, since you already have logging in Nagios’ own log.

brian89gp · January 21, 2006, 3:37pm

No idea why it was set like that, but it is still crashing, though it takes a little longer.

On this next reinstall I am going to compile it with MySQL support. The nagios.log grows by approximately 1.4 GB a week!

What flavor and version of Linux are you using? I want to avoid going back to Mandrake if possible.

jakkedup · January 21, 2006, 6:33pm

CentOS, or if you prefer, a free rebuild of Red Hat ES.
You don’t have to log stuff, you know. Just turn it off in nagios.cfg.

brian89gp · January 22, 2006, 3:52am

Unfortunately I do have to log everything, for reporting purposes.

Hopefully CentOS will work; I was actually already downloading 4.2 to try.

Thanks again for the help
Brian

brian89gp · January 22, 2006, 11:54am

I think I finally found the cause of the problem. The Nagios process seems to be stopping, and when it does the nsca processes start stacking up in a backlog. After 30 minutes there are 20,000+ nsca processes and the machine stops.
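A minimal sketch of how this could be watched for, assuming a Linux /proc that exposes /proc/PID/comm. The names count_procs and judge_backlog and the limit of 200 are made up for illustration, not anything shipped with Nagios or nsca.

```shell
#!/bin/sh
# Sketch only: detect "nagios gone, nsca piling up".
# Assumes Linux with /proc/PID/comm; the default limit is arbitrary.

# count_procs NAME: print how many processes have exactly this command name.
count_procs() {
    n=0
    for f in /proc/[0-9]*/comm; do
        [ -r "$f" ] || continue
        # A process may exit between the glob and the read; ignore errors.
        if [ "$(cat "$f" 2>/dev/null)" = "$1" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# judge_backlog NAGIOS_COUNT NSCA_COUNT [LIMIT]
# Pure decision logic, kept separate so it can be tried with made-up numbers.
judge_backlog() {
    limit=${3:-200}
    if [ "$1" -eq 0 ]; then
        echo "CRITICAL: nagios is not running, $2 nsca processes waiting"
        return 2
    fi
    if [ "$2" -gt "$limit" ]; then
        echo "WARNING: $2 nsca processes (limit $limit)"
        return 1
    fi
    echo "OK: nagios running, $2 nsca processes"
}

# Example, e.g. from cron once a minute:
#   judge_backlog "$(count_procs nagios)" "$(count_procs nsca)"
```

On a box without /proc/PID/comm, something like `pgrep -x nsca | wc -l` could stand in for count_procs.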

jakkedup · January 22, 2006, 2:02pm

You most likely did this, but did you reboot lately?
The only time nagios will stop, kind of, is when a service check fails and it then starts to hammer the host with a ping, to see what is wrong with this host. During that host ping, nagios stops doing ANYTHING else. So take a look and you might see that you have a problem host. When a host DOES GO DOWN, in your situation you MUST DISABLE active checks of that service.

You might want to back off on how often you check things too, so that nagios has some extra time to work with. But you only have 1000 services, so really, that shouldn’t be a problem. But I still fail to see why you log to syslog, when you already are logging to nagios logs. You really need to see that a service has retry attempts for a check?

jakkedup · January 22, 2006, 2:48pm

PS: you stated something about the remote host services being checked every 1 interval? That seems like overkill. It could make it pretty hard on your system, since that would effectively increase your checks to 5 times what I have.

Try backing off on your check interval: make it 5 and see if that helps. Think about it. You have several hosts dumping hundreds of checks into your command file every minute.
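Rough arithmetic for that, as a sketch: it uses the 1000 services from the first post and assumes every service check produces exactly one passive result (host checks are ignored). results_per_minute is a made-up helper name.

```shell
#!/bin/sh
# Back-of-the-envelope load estimate (integer division).
#   $1 = total number of services being submitted
#   $2 = check interval in minutes
results_per_minute() {
    echo $(( $1 / $2 ))
}

results_per_minute 1000 5   # prints 200
results_per_minute 1000 1   # prints 1000
```

That is the difference between roughly 200 and roughly 1000 lines landing in the command file every minute.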

brian89gp · January 22, 2006, 4:35pm

Have been hard rebooting it at least once a day for the past 2 weeks as that is how often it locks up. This is a central nsca server so there are no active checks being run from it.

All of the remote Nagios servers do have their check interval set to 5 minutes. I turned off syslog logging earlier in the thread; it still crashed.

This all started at about the same time that perfparse was installed. Also, it now uses far less RAM (1 GB vs 4 GB), since I am restarting Nagios every 15 minutes. I have to let it run a couple of days to make sure it doesn’t continue to happen, since it seems to be rather random.

g00s3m4n · January 24, 2006, 2:31pm

Hello…

I am still at a loss as well… seems like I am having the same problem, too many nsca processes, box runs out of memory and falls over.

jakkedup · January 26, 2006, 4:11pm

OK, let’s start over. Post your configs for both the daemon and the client. Post your command definition used to submit the results.
Nagios Server files:
/usr/local/nagios/etc/nsca.cfg:
server_port=5667
server_address=xx.xx.xx.xx
nsca_user=nagios
nsca_group=nagios
debug=0
command_file=/usr/local/nagios/var/rw/nagios.cmd
alternate_dump_file=/usr/local/nagios/var/rw/nsca.dump
aggregate_writes=0
append_to_file=0
max_packet_age=30
password=xxxxx
decryption_method=0

/etc/xinetd.d/nsca:
service nsca
{
    flags           = REUSE
    socket_type     = stream
    wait            = no
    user            = nagios
    group           = nagios
    server          = /usr/local/nagios/bin/nsca
    server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
    cps             = 9000 30
    instances       = UNLIMITED
    log_on_failure  += USERID
    disable         = no
    only_from       = xx.xx.xx.xx yy.yy.yy.yy zz.zz.zz.zz
}
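
If Nagios itself dies, one thing worth checking on the server side is the command_file named in nsca.cfg, which Nagios normally sets up as a named pipe while it is running. A rough sketch that pulls the path out of the config and checks it; cfg_value and check_command_file are names made up here, not part of nsca:

```shell
#!/bin/sh
# Sketch only; cfg_value and check_command_file are hypothetical helpers.

# cfg_value FILE KEY
# Print the value of the last "KEY=value" line in FILE.
cfg_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# check_command_file NSCA_CFG
# Report whether the command_file named in the config is a named pipe.
check_command_file() {
    cmd_file=$(cfg_value "$1" command_file)
    if [ -p "$cmd_file" ]; then
        echo "OK: $cmd_file is a named pipe"
    else
        echo "PROBLEM: $cmd_file is missing or not a named pipe"
        return 1
    fi
}

# Example (path taken from the nsca.cfg listing above):
#   check_command_file /usr/local/nagios/etc/nsca.cfg
```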

Remote clients files:
/usr/local/nagios/etc/send_nsca.cfg:
password=xxx
encryption_method=0

/usr/local/nagios/etc/services.cfg:
active_checks_enabled=1
obsess_over_service=1

/usr/local/nagios/etc/nagios.cfg:
obsess_over_services=1
ocsp_command=submit_check_result

/usr/local/nagios/etc/checkcommands.cfg:
define command{
    command_name    submit_check_result
    command_line    $USER1$/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$|$PERFDATA$'
}

/usr/local/nagios/libexec/eventhandlers/submit_check_result:
#!/bin/sh

    # Arguments:
    #  $1 = host_name (Short name of host that the service is
    #       associated with)
    #  $2 = svc_description (Description of the service)
    #  $3 = state_string (A string representing the status of
    #       the given service - "OK", "WARNING", "CRITICAL"
    #       or "UNKNOWN")
    #  $4 = plugin_output (A text string that should be used
    #       as the plugin output for the service checks)
    #

    # Convert the state string to the corresponding return code
    return_code=-1

    case "$3" in
        OK)
            return_code=0
            ;;
        WARNING)
            return_code=1
            ;;
        CRITICAL)
            return_code=2
            ;;
        UNKNOWN)
            return_code=-1
            ;;
    esac

    # pipe the service check info into the send_nsca program, which
    # in turn transmits the data to the nsca daemon on the central
    # monitoring server

    /bin/echo "$1\\t$2\\t$return_code\\t$4\\n" | /usr/local/nagios/bin/send_nsca ipaddyofnagiosserverxx.xx.xx.xx -c /usr/local/nagios/etc/send_nsca.cfg

The above assumes, of course, that send_nsca is located in /usr/local/nagios/bin.
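
Since the backslashes in the echo line are easy to mangle, and echo implementations differ in how they treat them, here is a small sketch of the same tab-separated line built with printf, which interprets \t and \n in its format string in any POSIX shell. format_result is a made-up name and the host and service values are placeholders; the send_nsca call is left as a comment because it is the same as in the script above.

```shell
#!/bin/sh
# Sketch only: build the line send_nsca expects on stdin.
#   $1 = host_name, $2 = svc_description,
#   $3 = return_code, $4 = plugin_output
format_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Prints the four fields separated by tabs, ending in a newline.
format_result web01 HTTP 0 'HTTP OK'

# In the real script the output would be piped on, for example:
#   format_result "$1" "$2" "$return_code" "$4" | /usr/local/nagios/bin/send_nsca ...
```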

Also, when posting on this forum, you have to use 2 backslashes to equal one backslash. So to show one backslash like this \ I actually have to type 2 of them. Please do the same, or I’ll just complain about your syntax.