Scaling Nagios

Hi all,

I run a distributed Nagios environment that I’m trying to scale out. Unfortunately, I’m running into some severe performance issues.

Current environment: Nagios 2.3.2 installed by RPM on CentOS 5.5 x86_64 in VMware, with 4 virtual cores, 7GB RAM, and plenty of disk space.

On the central server: Nagios receives host and service check results via NSCA from various distributed hosts. We also have PNP4Nagios integrated on the central server.
Each distributed host sends between 50 and 150 check results, at intervals ranging from 30 seconds to 5 minutes.
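
For a rough sense of scale, the reporting intervals above bound the average rate at which results reach NSCA. The sketch below is only back-of-envelope arithmetic: it assumes results arrive evenly spread over the interval (in practice each host likely sends its results in bursts), it uses the ~500 and 1000 totals quoted in this post, and the function name is made up for illustration.

```python
def results_per_second(total_checks, interval_seconds):
    """Average arrival rate if every check reports once per interval."""
    return total_checks / float(interval_seconds)

# Totals are taken from this post; the intervals are the two extremes quoted above.
for total in (500, 1000):
    fastest = results_per_second(total, 30)    # every check reports each 30 s
    slowest = results_per_second(total, 300)   # every check reports each 5 min
    print("%d checks: %.1f to %.1f results/sec" % (total, slowest, fastest))
```

Under these assumptions, 1000 checks work out to somewhere between about 3.3 and 33.3 results per second on average, so burstiness rather than sustained volume may be worth a look.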

On a small scale (~500 checks received on the central Nagios server) everything works perfectly…

  1. The distributed nodes poll checks using NRPE.
  2. Results are correctly received by NSCA on the central server.
  3. Web UI results and perf data are updated in good time.
  4. PNP4Nagios graphs are updated in good time.

However, when we try to scale up (1000+ check results received on the central Nagios server) things start to break…

  1. Check results are delayed by several minutes (freshness checks are also delayed), although PNP4Nagios graphs are always up to date.
  2. NSCA (set up to run in daemon mode) eventually crashes the system. I can see several hundred (if not thousands of) nsca processes running, and eventually I run out of resources, with “unable to fork” messages etc., which requires a reboot of the machine. I’ve tried increasing the number of file descriptors for the nagios user, but that only delays the inevitable.
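
To put a number on the nsca process pile-up described in point 2, something like the following could be run periodically and logged. It is a minimal sketch, not a tested tool: it assumes a Linux /proc filesystem and a Python interpreter newer than the one CentOS 5 ships by default, and the function name and the `proc_root` parameter are hypothetical.

```python
import os

def count_processes(name, proc_root="/proc"):
    """Count processes whose command name equals `name` (Linux /proc only)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():          # only numeric directories are PIDs
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except (IOError, OSError):       # process exited while we were scanning
            continue
        # /proc/<pid>/stat looks like "1234 (nsca) S ..."; the command name
        # sits between the first "(" and the last ")".
        start, end = stat.find("("), stat.rfind(")")
        if start != -1 and end > start and stat[start + 1:end] == name:
            count += 1
    return count

if os.path.isdir("/proc"):
    print("nsca processes: %d" % count_processes("nsca"))
```

Logging this count alongside the time would show whether the processes accumulate steadily or spike when many hosts report at once.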

I have tried switching NSCA to standalone mode, but the process just dies after a while… As a stopgap, I’m currently running in standalone mode with a crontab entry that restarts the process every hour. It works, but it’s obviously only a temporary solution. With this setup I’m only using 17% of the overall RAM (3% if you take buffer/cache into account), and the load average is only 1.50. I haven’t yet tried running it under xinetd.
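
The hourly restart could be narrowed to “restart only when the process has actually died”. The following is a minimal sketch of that idea, not a fix for the underlying problem. It assumes Python 3 is available and that NSCA writes a PID file; the PID file path and restart command in the example call are hypothetical placeholders, as are the function names.

```python
import os
import subprocess

def pid_is_alive(pid_file):
    """Return True if the PID recorded in pid_file belongs to a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False                     # no pid file, or unreadable contents
    if pid <= 0:
        return False                     # not a valid single-process PID
    try:
        os.kill(pid, 0)                  # signal 0 only tests for existence
    except ProcessLookupError:
        return False                     # no such process
    except PermissionError:
        return True                      # exists, but owned by another user
    return True

def restart_if_dead(pid_file, restart_cmd):
    """Run restart_cmd only when the daemon is not running; True if restarted."""
    if pid_is_alive(pid_file):
        return False
    subprocess.check_call(restart_cmd)
    return True

# Example call (both arguments are assumptions about the local setup):
# restart_if_dead("/var/run/nsca.pid", ["/etc/init.d/nsca", "restart"])
```

Run from cron every minute or so, this would shorten the outage window compared with a blind hourly restart, though it still leaves the cause of the crashes unexplained.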

Can anyone suggest a way to approach scaling this?

We need to run at around double our current capacity with just the one central server. There’s no obvious resource bottleneck that I can see. I’ve seen in some threads people suggest running multiple Nagios instances on the master server. This may fix my issue with UI results being delayed, but all the examples I’ve seen suggest that I would need a different Apache alias for each Nagios instance. For me, that defeats the purpose of having everything go to a single box: I need everything displayed in a single UI. Is this an option I should be looking at? Can anyone explain how I might get this to do what I want?

As for NSCA, I would have thought that daemon mode would be better than standalone mode given the volume of incoming traffic/checks. Can anyone suggest a suitable setup?

Many Thanks.