Nagios service goes down

litos79 · February 14, 2010, 10:14am

Hello, I run Nagios under Linux. For the past few days the nagios service has started correctly (verifying the configuration shows no problems), but after one or two minutes the service goes down.

What must I do to solve this problem?

The MySQL service is up and there is plenty of free disk space on the computer running Nagios.

Bye
Manuel

luca · February 14, 2010, 1:26pm

What does the Nagios log say?

litos79 · February 15, 2010, 8:22am

Yes, I tried, but the problem persists.

Is there any log that I can check?

I am new to Nagios.

Thanks,
Manuel

litos79 · April 20, 2010, 9:10am

[1271752164] Nagios 3.2.0 starting… (PID=31781)
[1271752164] Local time is Tue Apr 20 10:29:24 CEST 2010
[1271752164] LOG VERSION: 2.0
[1271752164] ndomod: NDOMOD 1.4b8 (07-16-2009) Copyright © 2009 Nagios Core Development Team and Community Contributors
[1271752164] ndomod: Could not open data sink! I’ll keep trying, but some output may get lost…
[1271752164] Event broker module ‘/usr/local/nagios/bin/ndomod.o’ initialized successfully.
[1271752164] Warning: Host ‘Datacenter-link’ has no services associated with it!
[1271752164] Lockfile ‘/usr/local/nagios/var/nagios.lock’ looks like its already held by another instance of Nagios (PID 31673). Bailing out…
[1271752164] ndomod: Shutdown complete.
[1271752164] Event broker module ‘/usr/local/nagios/bin/ndomod.o’ deinitialized successfully.
[1271752190] SERVICE ALERT: AP0092IT;PING;OK;HARD;3;OK - 10.172.52.90: rta 54.9 ms, lost 0%
[1271752190] SERVICE ALERT: AP0089IT;PING;CRITICAL;SOFT;1;CRITICAL - 10.172.33.90 is down (lost 100%)
[1271752244] Error: Cannot open main configuration file ‘/usr/local/nagios/.v’ for reading!
[1271752244] Nagios 3.2.0 starting… (PID=31948)
[1271752244] Local time is Tue Apr 20 10:30:44 CEST 2010
[1271752244] LOG VERSION: 2.0
[1271752244] Bailing out due to one or more errors encountered in the configuration files. Run Nagios from the command line with the -v option to verify your config before restarting. (PID=31948)
[1271752865] Nagios 3.2.0 starting… (PID=32027)
[1271752865] Local time is Tue Apr 20 10:41:05 CEST 2010
[1271752865] LOG VERSION: 2.0
[1271752865] ndomod: NDOMOD 1.4b8 (07-16-2009) Copyright © 2009 Nagios Core Development Team and Community Contributors
[1271752865] ndomod: Could not open data sink! I’ll keep trying, but some output may get lost…
[1271752865] Event broker module ‘/usr/local/nagios/bin/ndomod.o’ initialized successfully.
[1271752866] Warning: Host ‘Datacenter-link’ has no services associated with it!
[1271752866] Finished daemonizing… (New PID=32028)
[1271752866] HOST FLAPPING ALERT: AP0139IT;STARTED; Host appears to have started flapping (58.6% change > 20.0% threshold)
[1271752866] SERVICE FLAPPING ALERT: s44k8570;Cpu Avg. Load;STARTED; Service appears to have started flapping (21.4% change >= 20.0% threshold)
[1271752866] SERVICE FLAPPING ALERT: srv_notespt;Mem Check;STARTED; Service appears to have started flapping (52.4% change >= 20.0% threshold)
[1271752866] Warning: Host performance command ‘process-host-perfdata’ was not found - host performance data will not be processed!
[1271752866] Warning: Service performance command ‘process-service-perfdata’ was not found - service performance data will not be processed!
[1271752876] HOST ALERT: AP0092IT;UP;HARD;10;OK - 10.172.52.90: rta 52.0 ms, lost 0%
[1271752876] SERVICE ALERT: AP0114IT;PING;CRITICAL;SOFT;1;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271752876] HOST ALERT: AP0089IT;DOWN;SOFT;1;CRITICAL - 10.172.33.90 is down (lost 100%)
[1271752886] HOST ALERT: SOL-PROGAS;UP;HARD;1;OK - 10.172.52.1: rta 51.1 ms, lost 0%
[1271752886] SERVICE ALERT: AP0114IT;PING;CRITICAL;SOFT;2;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271752896] HOST ALERT: AP0114IT;DOWN;SOFT;1;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271752956] HOST ALERT: AP0089IT;DOWN;SOFT;2;CRITICAL - 10.172.33.90 is down (lost 100%)
[1271752976] HOST ALERT: AP0114IT;DOWN;SOFT;2;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271753006] SERVICE ALERT: AP0092IT;PING;OK;HARD;3;OK - 10.172.52.90: rta 52.7 ms, lost 0%
[1271753006] SERVICE ALERT: AP0114IT;PING;CRITICAL;HARD;1;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271753006] SERVICE ALERT: AP0089IT;PING;CRITICAL;HARD;1;CRITICAL - 10.172.33.90 is down (lost 100%)
[1271753231] Nagios 3.2.0 starting… (PID=32343)
[1271753231] Local time is Tue Apr 20 10:47:11 CEST 2010
[1271753231] LOG VERSION: 2.0
[1271753231] ndomod: NDOMOD 1.4b8 (07-16-2009) Copyright © 2009 Nagios Core Development Team and Community Contributors
[1271753231] ndomod: Could not open data sink! I’ll keep trying, but some output may get lost…
[1271753231] Event broker module ‘/usr/local/nagios/bin/ndomod.o’ initialized successfully.
[1271753231] Warning: Host ‘Datacenter-link’ has no services associated with it!
[1271753231] Finished daemonizing… (New PID=32344)
[1271753231] HOST FLAPPING ALERT: AP0139IT;STARTED; Host appears to have started flapping (58.6% change > 20.0% threshold)
[1271753231] SERVICE FLAPPING ALERT: s44k8570;Cpu Avg. Load;STARTED; Service appears to have started flapping (21.4% change >= 20.0% threshold)
[1271753231] SERVICE FLAPPING ALERT: srv_notespt;Mem Check;STARTED; Service appears to have started flapping (52.4% change >= 20.0% threshold)
[1271753231] Warning: Host performance command ‘process-host-perfdata’ was not found - host performance data will not be processed!
[1271753231] Warning: Service performance command ‘process-service-perfdata’ was not found - service performance data will not be processed!
[1271753371] SERVICE ALERT: AP0092IT;PING;OK;HARD;3;OK - 10.172.52.90: rta 54.0 ms, lost 0%
[1271753371] SERVICE ALERT: AP0089IT;PING;CRITICAL;SOFT;1;CRITICAL - 10.172.33.90 is down (lost 100%)

luca · April 20, 2010, 5:59pm

If the service dies without writing a line to the logs, there is not much to go on for working out what is happening.

It does still look like there is a problem with NDO, though:
[1271752164] ndomod: Could not open data sink! I’ll keep trying, but some output may get lost…

groth · April 20, 2010, 6:34pm

There is more than one symptom in those logs, and I’m not sure whether they are all related.

For the NDO service, verify your login information.
Make sure you can log in to MySQL with the credentials you’ve supplied, and that the user can SELECT, INSERT, UPDATE and DELETE on the tables.

NDO shouldn’t be knocking over the main Nagios process. Seeing as Nagios falls over within a couple of minutes, try running it from the command line (non-daemon mode).
The output from that might give you a better hint as to what is going on.
Post the output back here for more help.

For the services, check that you can ping the hosts from the command line. If you can’t, I’d suggest temporarily removing them from your config (add “register 0” to their definitions). Simplify your problems, and work on one at a time.

GR

Dialkia · April 26, 2010, 9:01am

[quote=“litos79”][1271752164] Nagios 3.2.0 starting… (PID=31781)
…
[1271752164] Lockfile ‘/usr/local/nagios/var/nagios.lock’ looks like its already held by another instance of Nagios (PID 31673). Bailing out…
[1271752164] ndomod: Shutdown complete.
[1271752164] Event broker module ‘/usr/local/nagios/bin/ndomod.o’ deinitialized successfully.
[/quote]

This means you started Nagios while another instance of Nagios was still running.
Try service nagios restart instead of service nagios start.
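
If you want to confirm whether the PID recorded in the lock file still belongs to a live process before starting Nagios again, something like the Python below could help. It is only a minimal sketch and not part of Nagios: the helper names read_lock_pid and pid_is_running are made up for illustration, the lock file path is the one from the log, and it assumes the lock file holds just the PID as text and that the script runs on Linux.

```python
import os

# Path taken from the log message quoted above.
LOCK_FILE = "/usr/local/nagios/var/nagios.lock"


def read_lock_pid(path):
    """Return the PID stored in the lock file, or None if it is missing or unreadable.

    Assumption: the file contains only the PID as text.
    """
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_running(pid):
    """Return True if a process with this PID exists.

    Signal 0 delivers nothing; it only checks that the process can be signalled.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    pid = read_lock_pid(LOCK_FILE)
    if pid is None:
        print("No usable lock file found.")
    elif pid_is_running(pid):
        print(f"PID {pid} is still running; stop it before starting Nagios again.")
    else:
        print(f"PID {pid} is not running; the lock file looks stale.")
```

A live PID here only shows that some process has that number; it is still worth checking by hand that the process is actually Nagios.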

[quote=“litos79”][1271752244] Error: Cannot open main configuration file ‘/usr/local/nagios/.v’ for reading!
[1271752244] Nagios 3.2.0 starting… (PID=31948)
[1271752244] Local time is Tue Apr 20 10:30:44 CEST 2010
[1271752244] LOG VERSION: 2.0
[1271752244] Bailing out due to one or more errors encountered in the configuration files. Run Nagios from the command line with the -v option to verify your config before restarting. (PID=31948)
[/quote]

‘/usr/local/nagios/.v’? Is that a typo?
Have you tried what the log message suggests? Can you post the last few lines of output when you verify the config?

[quote=“litos79”][1271752865] Nagios 3.2.0 starting… (PID=32027)
[1271752865] Local time is Tue Apr 20 10:41:05 CEST 2010
[1271752865] LOG VERSION: 2.0
…
[1271753006] SERVICE ALERT: AP0114IT;PING;CRITICAL;HARD;1;CRITICAL - 10.172.57.90 is down (lost 100%)
[1271753006] SERVICE ALERT: AP0089IT;PING;CRITICAL;HARD;1;CRITICAL - 10.172.33.90 is down (lost 100%)
[1271753231] Nagios 3.2.0 starting… (PID=32343)
[1271753231] Local time is Tue Apr 20 10:47:11 CEST 2010
[1271753231] LOG VERSION: 2.0
…
[1271753371] SERVICE ALERT: AP0092IT;PING;OK;HARD;3;OK - 10.172.52.90: rta 54.0 ms, lost 0%
[1271753371] SERVICE ALERT: AP0089IT;PING;CRITICAL;SOFT;1;CRITICAL - 10.172.33.90 is down (lost 100%)[/quote]

Like luca said, if there is no log line recording Nagios exiting, it’s difficult to understand how it failed.
And by the way, how did you find out that Nagios had failed?
Also, looking through /var/log/messages may give you a clue if nagios.log doesn’t have any more useful hints.
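
To see roughly when Nagios stopped without logging an exit, one option is to convert the epoch timestamps in nagios.log and look at the last entry written before each “Nagios … starting” line. The Python below is only a sketch under assumptions: the function names last_entry_before_each_start and fmt are hypothetical, and it assumes every log line begins with [epoch] as in the excerpt you posted. The two sample lines are copied from that excerpt.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "[1271753231] Nagios 3.2.0 starting… (PID=32343)".
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")

# Two consecutive lines from the posted log, used as sample input.
SAMPLE = [
    "[1271753006] SERVICE ALERT: AP0089IT;PING;CRITICAL;HARD;1;CRITICAL - 10.172.33.90 is down (lost 100%)",
    "[1271753231] Nagios 3.2.0 starting… (PID=32343)",
]


def last_entry_before_each_start(lines):
    """Yield (last_seen, start, silent_seconds) for each startup line that follows an earlier entry."""
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log entry
        stamp, message = int(match.group(1)), match.group(2)
        if message.startswith("Nagios ") and "starting" in message and previous is not None:
            yield previous, stamp, stamp - previous
        previous = stamp


def fmt(epoch):
    """Format an epoch timestamp as UTC so the result does not depend on the local timezone."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    # To scan a real log, pass an open file object instead of SAMPLE.
    for last_seen, started, gap in last_entry_before_each_start(SAMPLE):
        print(f"last entry {fmt(last_seen)}, next start {fmt(started)}, silent for {gap}s")
```

The last entry before a start only gives an upper bound on when the process died, but it narrows down which part of /var/log/messages to read.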

The NDO problem should probably be looked at later since, as groth said, it shouldn’t be what makes Nagios fail.