I am running Nagios 1.2 on a Fedora Core 3 box and facing the following problem. After about seven hours of uptime the CPU usage jumps to 99.9%, and the only thing I can do is reboot the machine with the good old reset or power switch. I can issue reboot, shutdown, or telinit, but nothing happens. Due to the high CPU usage, logging stops as well, so I cannot see a thing in the log files. I am also running mrtg (2.10.15) and apache (2.0.53). I found a message on a mailing list describing the same problem, but that sender was using Netsaint on RH9. There was no answer to that message.
I used Netsaint on Red Hat 7.2 earlier and the situation was the same, except that the failure occurred quite rarely. Does anyone know what is happening? I have already tried all of the tips I found on the web, but none of them helped.
Hardware: xSeries 235 eServer
Kernel: Linux 2.6.12-1.1381_FC3smp
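By the way, since reboot and shutdown do nothing once the box locks up, I am considering the Magic SysRq key as a gentler fallback than the power switch next time. A rough sketch of what I have in mind (assuming SysRq support is compiled into the Fedora kernel, which I believe it is by default):

    # enable the magic SysRq key
    echo 1 > /proc/sys/kernel/sysrq
    # or make it persistent across reboots
    echo "kernel.sysrq = 1" >> /etc/sysctl.conf

    # then, when the machine hangs, at the console:
    #   Alt+SysRq+S   sync all filesystems
    #   Alt+SysRq+U   remount filesystems read-only
    #   Alt+SysRq+B   reboot immediately

That should at least avoid filesystem damage from the hard reset.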
Are you seeing a gradual increase in CPU consumption, or is it instantaneous?
Possibly a failing check script of some sort is running away with the CPU. Is top working? Check what is consuming the CPU with top, then try to track down the offending process via ps (it will be the one with fast-growing CPU time). When you find it, kill it and see what happens; next, try to find out what calls this process and why it hangs.
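Something along these lines should do it; just a sketch, adjust to taste:

    # list the top CPU consumers, sorted by current CPU usage
    ps -eo pid,ppid,etime,time,pcpu,comm --sort=-pcpu | head -15
    # run it twice, a minute apart: the runaway process is the one
    # whose TIME column grows fastest; then trace its parent via PPID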
Unfortunately, there is no gradual increase in CPU consumption. The CPU usage reaches the mentioned level very quickly, so tracking the CPU consumption of individual processes is almost impossible. In the sar logs there are no signs of a sudden increase in system activity. I cannot kill any processes, and I cannot even run w or ps; only top still works. I reduced the number of check scripts; the only one I use now is check_ping, but nothing changed. The system behaves the same way. I am using an SMP kernel on a dual-Xeon machine. Could it be a bug in that kernel?
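Since logging dies along with everything else, I am thinking of keeping a running record on disk so there is at least something to examine after the crash. A sketch (writing the file to another machine over the network would survive a hard reset even better):

    # record top output in batch mode every 30 seconds; nohup so it
    # keeps running after I log out
    nohup top -b -d 30 > /var/tmp/top.log 2>&1 &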
I was running Nagios 1.2 + nagiosgraph on an FC3 box and had a similar problem. The system would run for about a week, then the CPU utilization would saturate and the system would become unresponsive. I started monitoring the box more closely and found that the nagios process was slowly using more and more memory. Once all the memory was used up, it would start thrashing the swap space hard.
I used Nagios and nagiosgraph to chart the CPU utilization, and it would climb logarithmically once available memory became low, so the system would appear to be fine until the last bit of memory was used up, at which point the CPU would reach saturation very rapidly.
To work around the problem temporarily, I restarted the nagios process every four days or so. When the nagios process was restarted, it released all the memory it was holding and started over.
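For what it's worth, the restart was just a cron job; something like the entry below (the schedule and the init script path are illustrative, use whatever your distribution provides):

    # /etc/crontab: restart nagios at 04:00 every fourth day of the month
    # (crude, but it kept the leak from saturating memory)
    0 4 */4 * * root /etc/init.d/nagios restart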
I'm currently testing nagios 2.0b4 on an FC4 box and I no longer see the problem. I'm not sure whether the culprit was something in FC3 or in nagios 1.2.
The two issues sound similar; however, yours occurs much faster, and there is a big difference in hardware. I was running nagios on a 600 MHz Via Epia system with only 128 MB of RAM; that xSeries eServer has much more power and likely more RAM. Since you hit the problem much faster, you could run something like 'vmstat -a 30 > vmstat.txt' for the 7-hour period; if it is a memory issue, you'll have a log of the memory and swap usage.
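If you want per-process numbers alongside vmstat, a small loop like this would log the nagios process's memory over time (just a sketch; the output file name is arbitrary):

    # sample nagios memory (RSS and VSZ in KB) once a minute
    while true; do
        date >> /var/tmp/nagios-mem.txt
        ps -C nagios -o pid,rss,vsz,pcpu,comm >> /var/tmp/nagios-mem.txt
        sleep 60
    done &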
I have been monitoring memory usage as well, and there seems to be no increase. Everything looks fine until the moment the failure occurs. The server has 4 GB of RAM and 8 GB of swap.
I have adjusted the kernel's network parameters, and the system uptime is nearly 14 hours now. Anyway, I will also try the latest version of Nagios.
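In case anyone wonders how: such parameters are set with sysctl. The examples below are for illustration only, not necessarily the parameters or values I actually changed:

    # hypothetical examples of network tuning; actual settings may differ
    sysctl -w net.ipv4.ip_local_port_range="1024 65000"
    sysctl -w net.core.netdev_max_backlog=2500
    # to make settings persistent, add them to /etc/sysctl.conf and reload:
    sysctl -p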