Nagios uses 100% CPU

I have a box that I am running Nagios on. It is running OS X 10.5.3.

I just started having a problem where Nagios is using all of the CPU. I can verify my config file fine, and start and stop Nagios.

Any ideas on things to check out on this?

Thanks

Do you have any check plugins that you’ve developed on your own? I had a situation like that, where a plugin killed the Nagios server with endless loops and no answers from the checked hosts.

Try to run:
ps -aux
or
ps -aux | less
to see the running processes, and then work out which process is eating the CPU.
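
If the process list is long, it can help to sort it by CPU rather than read it by eye. The following is only a sketch: `top_by_cpu` is a made-up helper name, and it assumes `ps aux`-style output with a header row and %CPU in the third column.

```shell
# Print the N busiest rows from `ps aux`-style text read on stdin.
# Assumes the first line is a header and column 3 is %CPU.
top_by_cpu() {
  n="${1:-5}"
  # Drop the header, sort numerically on column 3 (highest first), keep N rows.
  awk 'NR > 1' | sort -k3,3nr | head -n "$n"
}
```

It would be used as `ps aux | top_by_cpu 3`.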

I have all the stock plugins, nothing customized.

The task using the CPU from ps is:

41767 ?? 3:37.11 /opt/local/bin/nagios -d /opt/local/etc/nagios/nagios.cfg

I went and checked the config file, and nagios -v reports no errors with the file.

Are you absolutely sure it is Nagios that is using such a high percentage of the CPU? Run
top
at the command line and post the results here. You can also post the results of
ps aux | grep nagios
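
A side note on the `grep` step: when it is fed from a live `ps`, the `grep` process itself usually shows up in the results. One common workaround, sketched here with a hypothetical helper called `find_proc`, is to put the first letter of the name in brackets so that the pattern text no longer contains the string it searches for.

```shell
# Filter `ps aux`-style text on stdin for a process name.
# Turns "nagios" into the pattern "[n]agios": it still matches "nagios",
# but a grep whose own arguments read "[n]agios" does not match itself.
find_proc() {
  first=$(printf '%s' "$1" | cut -c1)
  rest=$(printf '%s' "$1" | cut -c2-)
  grep "[$first]$rest"
}
```

It would be used as `ps aux | find_proc nagios`. This assumes the name starts with an ordinary letter, not a character that is special inside a bracket expression.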

Yes, I am pretty sure that when I see the process called “nagios” it is, indeed, nagios.

Here is top:

Processes: 98 total, 4 running, 94 sleeping... 349 threads 08:56:26
Load Avg: 5.23, 4.82, 4.83 CPU usage: 95.38% user, 4.62% sys, 0.00% idle
SharedLibs: num = 2, resident = 37M code, 0 data, 3540K linkedit.
MemRegions: num = 8861, resident = 590M + 22M private, 77M shared.
PhysMem: 151M wired, 513M active, 286M inactive, 951M used, 73M free.
VM: 4588M + 130M 41153(0) pageins, 457(0) pageouts

PID COMMAND %CPU TIME #TH #PRTS #MREGS RPRVT RSHRD RSIZE VSIZE
178 nagios 187.8% 17:57:30 2 28 26 436K 200K 720K 19M
74848 top 7.0% 0:00.73 1 18 29 476K 212K 1076K 18M
74812 Terminal 2.6% 0:01.38 3 98- 105 1768K 5024K 6416K 120M
7983 AppleVNCSe 2.1% 0:04.86 6 105 87 2472K 2336K 5316K 100M
104 WindowServ 0.8% 0:47.95 5 170 284 7044K 8316K 15M 98M
0 kernel_tas 0.4% 8:17.36 47 2 507 3868K 0 140M 172M
185 QuickTimeS 0.1% 2:51.33 6 36 39 1008K 924K 1992K 21M
245 ruby 0.1% 1:41.12 2 12 99 25M 1416K 26M 43M
194 ruby 0.1% 1:41.52 2 12 99 25M 1416K 26M 43M
80 ruby 0.1% 2:45.60 2 15 27 1480K 888K 2356K 19M
521 ruby 0.1% 1:43.82 2 19 141 30M 1904K 33M 57M
181 ruby 0.1% 1:41.66 2 12 99 25M 1416K 26M 43M
355 ruby 0.1% 1:41.65 2 12 99 25M 1416K 26M 43M
509 ruby 0.1% 1:44.33 2 19 141 30M 1904K 33M 57M
465 ruby 0.1% 1:42.86 2 19 141 30M 1904K 33M 57M
330 ruby 0.1% 1:41.49 2 12 99 25M 1416K 26M 43M


leopard:~ admin$ ps aux | grep nagios
nagios 178 100.0 0.1 76368 720 ?? Rs 12:57PM 1078:46.61 /opt/local/bin/nagios -d /opt/local/etc/nagios/nagios.cfg
root 77 0.0 0.1 75428 712 ?? Ss 12:57PM 0:00.01 /opt/local/bin/daemondo --label=nagios --start-cmd /opt/local/etc/LaunchDaemons/org.macports.nagios/nagios.wrapper start ; --stop-cmd /opt/local/etc/LaunchDaemons/org.macports.nagios/nagios.wrapper stop ; --restart-cmd /opt/local/etc/LaunchDaemons/org.macports.nagios/nagios.wrapper restart ; --pid=none
admin 75279 0.0 0.0 76024 4 s000 R+ 8:57AM 0:00.00 grep nagios
nagios 52001 0.0 0.0 75384 400 ?? S 8:03AM 0:00.01 /usr/bin/mail -s ** PROBLEM Host Alert: 10.20.1.13 is DOWN ** [email protected]
nagios 51999 0.0 0.1 75944 708 ?? S 8:03AM 0:00.01 sh -c /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: 10.20.1.13\nState: DOWN\nAddress: 10.20.1.13\nInfo: PING CRITICAL - Packet loss = 100%\n\nDate/Time: Tue Jul 8 08:03:07 CDT 2008\n" | /usr/bin/mail -s "** PROBLEM Host Alert: 10.20.1.13 is DOWN **" [email protected]

I didn’t mean to offend you. I only asked so I could understand the problem better, because it looks weird.

I haven’t run into this situation before. Maybe you could try reinstalling Nagios, or updating Mac OS X to the latest version. It looks more like a system problem or some weird incompatibility.

I have the exact same problem. I’m running OS X server 10.5.4 and nagios consistently hogs the CPU. I didn’t see this when I was running it on a Linux box. I have no strange plugins, and it’s the nagios command itself that is driving the CPU (so hard, in fact, that my box locked up last night). It’s been doing this for the last few weeks, ever since I initially set up the box.

Processes:  101 total, 3 running, 98 sleeping... 288 threads                                                                                  09:57:17
Load Avg:  1.15,  1.31,  1.46    CPU usage: 20.73% user, 20.73% sys, 58.54% idle
SharedLibs: num =    2, resident =   73M code,     0 data, 5296K linkedit.
MemRegions: num =  8005, resident =  416M +   19M private,  126M shared.
PhysMem:  326M wired,  518M active,   52M inactive,  938M used, 1622M free.
VM: 5244M + 132M   121585(0) pageins, 0(0) pageouts

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
 1662 nagios      99.6% 23:01.44   2    32     26  256K   200K   912K    19M
 1714 top          7.7%  1:50.26   1    29-    34 1136K   212K  1740K    19M

It’s a dual 1.8GHz G5 with 2.5GB RAM. I really really love nagios and have been using it for years, but this is completely strange and I’ve never seen it before. There is nothing crazy in the logs. The only plugins I’m using are: check_{ping,disk,load,procs,users,swap,ftp,http,ssh,dhcp,fping,smb.sh,ircd,imap,smtp,pop}. The config is (almost) identical to what I had on my Linux box before. Stopping and restarting nagios doesn’t help; it immediately starts sucking up as much CPU as it can.
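
To confirm that the CPU use really does jump straight after a restart, rather than building up over time, one could sample the daemon's %CPU a few times in a row. This is a rough sketch, not something from the Nagios docs: `sample_cpu` is a hypothetical helper, and the exact meaning of `pcpu` (instantaneous, averaged, or lifetime) differs between systems.

```shell
# Print COUNT samples of %CPU for PID, INTERVAL seconds apart.
sample_cpu() {
  pid="$1"; count="${2:-5}"; interval="${3:-1}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # `-o pcpu=` selects the %CPU column and suppresses the header line.
    ps -o pcpu= -p "$pid" || return 1   # stop if the process has gone away
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

With the PID from the listing below, that would be `sample_cpu 1662 10 2` for ten samples two seconds apart.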

annvix:etc root# ps aux|grep nagios|grep -v grep
nagios    1662  98.1  0.0    76368    912   ??  Rs    9:33AM  26:36.37 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
nagios    1185   0.0  0.0    75488    484   ??  Ss    9:26AM   0:00.18 /sbin/launchd
nagios     317   0.0  0.0    75384    400   ??  S     9:06AM   0:00.01 /usr/bin/mail -s ** PROBLEM Host Alert: titan is DOWN ** vdanen_at_linsec.com
nagios     315   0.0  0.0    75944    708   ??  S     9:06AM   0:00.01 sh -c /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: titan\nState: DOWN\nAddress: 10.10.2.30\nInfo: (Host Check Timed Out)\n\nDate/Time: Fri Jul 25 09:06:24 MDT 2008\n" | /usr/bin/mail -s "** PROBLEM Host Alert: titan is DOWN **" vdanen_at_linsec.com

Sanitized the above somewhat due to the form wanting to replace my email address with some javascript nonsense.