Nagios Scalability

Tankarama · October 20, 2005, 5:06pm

I work for a large Australian ISP that uses netsaint.

Currently we have a netsaint ‘distribution’ server in every state here in Australia (7) and one in New Zealand, 8 in total.

Each one monitors approximately 200 hosts, with about 5 checks per host on average, each run roughly every 5 minutes.

Currently all the servers run as active and each server is monitored via the web interface. None of the servers have load issues, and they handle this monitoring level without problems. (All machines, central and distro, are dual 3gig IBM 335s running Debian-based Linux.)

Going forward we need all the netsaint/nagios checks and histories to be centralised. Using nsca (in daemon mode; other modes had even more problems) with netsaint presented all sorts of problems. The big ones:

Each nsca process waits until netsaint reads from the external command file, so the load arrives in massive, uneven bursts as all the nsca processes unload at once.
Netsaint is unreliable in its regular reading of the external command file.
Netsaint appears to have memory leaks and bugs with accepting the checks even if it does read them from the external command file.
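
For readers unfamiliar with what NSCA ultimately hands to the monitoring daemon: each passive result ends up as a single line in the external command file. The sketch below formats such a line using the PROCESS_SERVICE_CHECK_RESULT syntax described in the Nagios documentation. The function name, host name and service name are made up for illustration, and it only builds the string; it does not write to any command file.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line.

    Assumed format (from the Nagios external command documentation):
    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    # One command per line, so newlines in plugin output are flattened.
    clean_output = output.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{clean_output}\n")


# Hypothetical host and service names, fixed timestamp for readability.
print(passive_service_result("syd-router-01", "PING", 0,
                             "PING OK - rtt 12ms", now=1129800000), end="")
```

Every result from every distribution server becomes one such line, which is why how often the daemon reads the command file matters so much at this scale.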

Moving forward, the only options I have are to do away with netsaint/nagios completely or to try nagios (which will require a massive rewrite of internal code to build the changed nagios config style from our network provisioning database).

My real questions are:

What is the best proven way (assuming it has been done before) of scaling nagios to, say, a thousand checks per minute, particularly using the Central/Distribution NSCA model?
Does nagios actually use SQL databases in a proper/efficient manner, or is it still the same code as netsaint? (There's nothing in the changelog that would make me think there have been major changes.)
Does nagios reliably check external commands, even down to a granularity of once a second?
Does the NSCA intermittent extreme load issue still exist with nagios?
Can anyone confirm (I can’t see why it wouldn’t, based on the docs) that netsaint distribution servers will work fine via NSCA to a nagios server? (Helpful for temporary integration.)

Any advice is appreciated.

Clipper · October 21, 2005, 12:18pm

Hi,

I cannot answer all of your questions because I do not use NSCA yet, but what I can say is the following:

Nagios 2.0 received major improvements in all areas compared to Netsaint. I am pretty sure that you won’t have any scalability issues with it.

Nagios does not use MySQL. Version 1.x could, but that was abandoned due to various issues. There are plugins/addons, though, that make use of SQL (I think Monarch does, by the way; check out www.itgroundwork.com).

Read other posts in this forum, and in particular the posts from Jakkedup. He has the largest Nagios installation I have ever heard of, fully using NSCA.

In particular, you might want to read these two (very long and very interesting) posts:

“NSCA add on - how best to use?” This one is stickied at the top of the forum.

“Adding switches…” Another interesting one to read.

Hope this helps
Clipper

jakkedup · October 22, 2005, 4:16pm

Let’s see, you have 1000 service checks at each of 7 remote sites. That’s 7000 checks per 5 minutes, which works out to about 23 checks/sec, or roughly 0.04 seconds per check. With modern CPUs, that doesn’t look impossible, since 0.04 seconds is forever from a CPU’s nanosecond point of view.
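
The back-of-the-envelope arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate in this post: the function name is made up, and the inputs (1000 checks per site, 7 sites, a 5 minute interval) are the figures assumed here, not measured values.

```python
def check_rate(checks_per_site, sites, interval_seconds):
    """Return (total checks per interval, checks per second, seconds per check)."""
    total = checks_per_site * sites
    per_second = total / interval_seconds
    return total, per_second, 1.0 / per_second


total, per_second, budget = check_rate(1000, 7, 5 * 60)
print(f"{total} checks per interval, "
      f"{per_second:.1f} checks/sec, {budget:.3f} s per check")
# prints: 7000 checks per interval, 23.3 checks/sec, 0.043 s per check
```

Changing the site count or the interval shows how quickly the per-check time budget shrinks or grows.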

Since you have so many remote servers all sending to one Central nagios server, and you are having trouble with the performance of that Central Server, you should:

Remove any and all active checks from the central server.
Add instances = UNLIMITED to your /etc/xinetd.d/nsca file on the Central nagios server.
max_concurrent_checks=0
service_reaper_frequency=10 #Perhaps bump this higher. The checks will still be processed; they simply won’t show on the status page immediately. For example, with a value of 10, if 200 checks have come in during the last 10 seconds, all 200 results are processed in one pass. A lower value means smaller, more frequent passes (say, 10 checks made in the last 5 seconds and 10 checks processed); a higher value means larger, less frequent passes (say, 300 checks made in the last 20 seconds, all processed at once). So a higher value means a lower load on the server.
command_check_interval=-1 #That is a MINUS ONE. Nagios will check for external commands as often as possible.
Increase the normal_check_interval in services.cfg for any checks that are not that time sensitive. For example, a disk space check every 5 minutes seems a bit excessive, don’t you think? In normal cases, usage grows at a slow pace. So raise the interval of every check to the absolute maximum that you feel is acceptable.
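
As a quick way to compare an existing main config against the three nagios.cfg values suggested above, here is a small Python sketch. The helper names and the sample config text are made up for illustration; it assumes plain key=value lines with whole-line # comments, and it reads nothing from disk.

```python
# Values suggested in this post for the central server's nagios.cfg.
RECOMMENDED = {
    "max_concurrent_checks": "0",
    "service_reaper_frequency": "10",
    "command_check_interval": "-1",
}


def parse_cfg(text):
    """Parse key=value lines, skipping blanks and whole-line # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def audit(text, recommended=RECOMMENDED):
    """Return {key: (found, suggested)} for every setting that differs."""
    current = parse_cfg(text)
    return {key: (current.get(key), want)
            for key, want in recommended.items()
            if current.get(key) != want}


# Hypothetical excerpt of a central server's nagios.cfg.
SAMPLE = """\
# central server
max_concurrent_checks=50
service_reaper_frequency=10
command_check_interval=1
"""

for key, (have, want) in audit(SAMPLE).items():
    print(f"{key}: found {have}, thread suggests {want}")
```

A missing setting shows up with a found value of None, so the same check also catches options that were never set explicitly.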

That’s all I can think of for now. If I think of more, I’ll post back. But it doesn’t look impossible. Actually, I would really like to know what the limit is that nagios can handle on a powerful PC.

PS, would you please post back just how many checks the Central Server is handling now? If you do as you plan, I’d like to see just how many checks you have it processing and whether you are able to bring it to its knees.
For now, this is what I have.
575 active checks
644 passive checks
every 5 minutes.
But the CPU is quite bored and has plenty of time for more work.
[Edited Sat Oct 22 2005, 05:28AM]

jakkedup · October 22, 2005, 4:29pm

A pic of my cpu usage.

I hope to see yours flatlined at 100 %. I mean that in a good way.
[Edited Sat Oct 22 2005, 05:30AM]