I work for a large Australian ISP that uses netsaint.
Currently we have a netsaint ‘distribution’ server in every state here in Australia (7) and one in New Zealand, 8 in total.
Each one monitors approximately 200 hosts, with about 5 checks per host on average, each run roughly every 5 minutes.
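To put rough numbers on that, here is a quick back-of-envelope sketch in Python. It only uses the approximate figures above and assumes the checks are spread evenly over the interval, so treat the result as an estimate rather than a measurement.

```python
# Back-of-envelope check rate from the approximate figures quoted above.
# Assumes checks are spread evenly across the 5-minute interval.
servers = 8             # 7 in Australia + 1 in New Zealand
hosts_per_server = 200  # approximate
checks_per_host = 5     # average
interval_minutes = 5    # average check interval


def checks_per_minute(hosts, checks, interval):
    """Average number of check results produced per minute."""
    return hosts * checks / interval


per_server = checks_per_minute(hosts_per_server, checks_per_host, interval_minutes)
central = per_server * servers  # what a central server would have to accept

print(f"per distribution server: {per_server:.0f} checks/min")
print(f"central server: {central:.0f} checks/min ({central / 60:.1f}/sec)")
```

So each distribution server produces about 200 results per minute, and a central server would need to accept about 1600 per minute in total.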
Currently all the servers run active checks and each server is monitored via its own web interface. None of the servers have load issues and they handle this monitoring level without problems. (All machines, central and distro, are dual 3gig IBM 335s running Debian-based Linux.)
Going forward we need all the netsaint/nagios checks and histories to be centralised. Using nsca (in daemon mode; other modes had even more problems) with netsaint presented all sorts of problems. The big ones:
- Each nsca process waits until netsaint reads from the external command file, so load arrives in massive, uneven bursts as all the nsca processes unload at once.
- Netsaint is unreliable in its regular reading of the external command file.
- Netsaint appears to have memory leaks and bugs with accepting the checks even if it does read them from the external command file.
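For anyone not familiar with the mechanism behind these problems: passive results end up as one line each in the external command file, which the daemon then has to read. A minimal sketch of building such a line, assuming the Nagios `PROCESS_SERVICE_CHECK_RESULT` external command format (the host and service names are made up, and netsaint's own format may differ):

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Build one external-command line for a passive service check result.

    Assumes the Nagios format:
    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


# Hypothetical host/service names, for illustration only.
line = passive_service_result("router1", "PING", 0, "PING OK", now=1000000000)
print(line, end="")
```

Every result is a separate line like this, so the rate at which the daemon drains the file decides how much work backs up behind it.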
Moving forward, my only options are to do away with netsaint/nagios completely or to try nagios (which will require a massive rewrite of internal code to build the changed nagios config style from our network provisioning database).
My real questions are:
- What are the best proven ways (assuming it has been done before) of scaling nagios to, say, a thousand checks per minute, particularly using the Central/Distribution NSCA model?
- Does nagios actually use SQL databases in a proper/efficient manner, or is it still the same code as netsaint? (There's nothing in the changelog that would make me think there have been major changes.)
- Does nagios reliably check external commands, even up to the granularity of once a second?
- Does the NSCA intermittent extreme load issue still exist with nagios?
- Can someone confirm (I can’t see why it wouldn’t work, based on the docs) that netsaint distribution servers will work fine via NSCA to a nagios server? (This would be helpful for temporary integration.)
Any advice is appreciated.