Too many nagios daemons - server crashing

drragon · July 6, 2005, 2:42pm

Hi all

I am running nagios 1.2 on a well-specced central server under RH EL3, which collects data from remote nagios servers using nsca/ocsp.

A total of 3000+ checks are carried out by 5 remote servers.

The nagios status.log on the central server is usually up to date, but these results are not shown in the CGI.

After a reboot (or each time nagios is restarted) the total number of “nagios -d” processes starts at 3 but gradually climbs until it reaches 1500+.

The server doesn’t show signs of a high load average, but eventually it keels over and dies, needing a reboot.

I’ve tried adjusting possibly relevant parameters in nagios.cfg with no luck at all.

Can anyone help please?

graemann · July 7, 2005, 3:01am

Do you have a script that automatically restarts your nagios daemon? Maybe it’s in a continuous loop.

luca · July 7, 2005, 2:39pm

Seems like some check is hanging and the child processes don’t return…
Try this:
take the PID of one of the nagios -d processes and grep the process list for it. Does it return any defunct processes?

I’m having a problem with nagios hanging in this situation, so it could be useful.

Thanks, Luca
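
The check luca describes can be sketched in Python as follows. This is only an illustration, not something posted in the thread: it assumes a Linux `ps` that accepts `-eo pid,ppid,stat,args`, and the helper names `read_ps`, `parse_ps` and `defunct_children` are made up for the example.

```python
import subprocess
from collections import Counter

def read_ps():
    """Return the raw process table (assumes a procps-style ps is installed)."""
    result = subprocess.run(["ps", "-eo", "pid,ppid,stat,args"],
                            capture_output=True, text=True, check=True)
    return result.stdout

def parse_ps(ps_output):
    """Parse ps output into (pid, ppid, stat, args) tuples, skipping the header."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, args = parts
        rows.append((int(pid), int(ppid), stat, args))
    return rows

def defunct_children(rows, marker="nagios"):
    """Count zombie (state Z) children per parent whose command line contains marker."""
    parents = {pid for pid, _, _, args in rows if marker in args}
    counts = Counter()
    for _, ppid, stat, _ in rows:
        if ppid in parents and stat.startswith("Z"):
            counts[ppid] += 1
    return dict(counts)
```

Calling `defunct_children(parse_ps(read_ps()))` on the central server would return a mapping from each nagios PID to the number of defunct children it has not yet reaped. A large and growing count would support the idea that checks are hanging.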

drragon · July 7, 2005, 4:12pm

The number of nagios daemons now seems under control by means of an hourly script.

The status.log file contains reasonably up-to-date information but this is not being displayed on the web page which still has entries 2 or 3 days old. So it is still not working correctly.

Thanks for the feedback so far though 8)
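
One way to confirm that the daemon really is rewriting the status file is to read the `status_file` directive from nagios.cfg and see how long ago that file was modified. The sketch below is hedged: the location of nagios.cfg is whatever your install uses, and the helper names `find_directive` and `status_file_age` are invented for illustration.

```python
import os
import time

def find_directive(cfg_path, name):
    """Return the last value assigned to name in a key=value config file, or None."""
    value = None
    with open(cfg_path) as cfg:
        for line in cfg:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, val = line.partition("=")
            if key.strip() == name:
                value = val.strip()
    return value

def status_file_age(cfg_path, now=None):
    """Return (path, seconds since last modification) for the configured status file."""
    path = find_directive(cfg_path, "status_file")
    if path is None:
        raise ValueError("no status_file directive found in " + cfg_path)
    now = time.time() if now is None else now
    return path, now - os.path.getmtime(path)
```

If the reported path is not the file you have been inspecting by hand, or the age is far larger than the status update interval, that narrows down where the web page and the log disagree.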

jakkedup · July 7, 2005, 4:22pm

Settings in nagios.cfg should be:
inter_check_delay_method=s
service_interleave_factor=s
and this one, that I suspect you should change from:

# MAXIMUM CONCURRENT SERVICE CHECKS
# This option allows you to specify the maximum number of
# service checks that can be run in parallel at any given time.
# Specifying a value of 1 for this variable essentially prevents
# any service checks from being parallelized. A value of 0
# will not restrict the number of concurrent checks that are
# being executed.

max_concurrent_checks=0
You might want to change that to:
max_concurrent_checks=100 or maybe 800 or so. Tune it to what works best for you.
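
A quick way to see whether a given nagios.cfg matches these suggestions is sketched below. It only looks at the three directives named in this post. The function names `read_settings` and `tuning_warnings` are hypothetical, and treating an unlimited `max_concurrent_checks` as worth a warning is this thread's advice rather than a rule from the Nagios documentation.

```python
def read_settings(cfg_text):
    """Collect key=value pairs from nagios.cfg text, ignoring comments and blanks."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def tuning_warnings(settings):
    """Return a list of messages for settings that differ from the advice above."""
    warnings = []
    for name in ("inter_check_delay_method", "service_interleave_factor"):
        if settings.get(name) != "s":
            warnings.append(f"{name} is {settings.get(name)!r}, suggested 's'")
    # A missing directive is treated like 0 here, which is an assumption.
    if settings.get("max_concurrent_checks", "0") == "0":
        warnings.append("max_concurrent_checks is 0 (no limit on parallel checks)")
    return warnings
```

For example, passing the contents of nagios.cfg through `tuning_warnings(read_settings(text))` returns an empty list when all three directives already follow the suggestions.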

drragon · July 7, 2005, 4:54pm

Thanks jakkedup

I have set the max_concurrent_checks=400 but the CGI is still not updating all of the service results

jakkedup · July 8, 2005, 4:49pm

Disable and then re-enable the checks using the web page. You might also try forcing a check.

drragon · July 11, 2005, 9:44am

Thanks for the posts.

I have taken 1800 of the checks off this server and built a second server to handle these. This solved all problems immediately.

I guess there is a limit coded somewhere but as I don’t know what it is I am guessing when I say it could be 2000 or 2500. Anyway each central server is now handling about 1800 checks and everything is fine.

Going to try to install multiple instances of nagios on the one server next…

Thanks again for the suggestions.

PS - does anyone know why one posting from me ends up displayed 3 times here?

jakkedup · July 11, 2005, 1:29pm

So tell us what you had before and what you have now. Just for clarity.
So you had 3000+ service checks; how many were passive? Could you have fixed it by using distributed servers? But it sounds like you are already using distributed servers. Seems like you have achieved what I have been trying to do: bring nagios to its knees and make it beg for mercy.

drragon · July 11, 2005, 3:51pm

For the time being I have built another server with another instance of nagios on it and spread the checks among the 2.

About 95% of the checks done by these 2 are passive checks.

I’m presently redesigning it all so that one server runs multiple instances of nagios, splitting them that way instead but this will take a little design effort.

FYI I have over 4500 active checks running on a nagios instance which services our internal data centre so the limit I experienced does seem to be related to passive checks.

Still getting 3 posts on here for the price of one!

jakkedup · July 11, 2005, 7:27pm

You got a virus or what?

Man, that’s one heck of a lot of nagios configuration. You must be at least in the top 5 users of nagios, if not the record holder.