Good point, and it’s the EXACT same battle I had to fight.
First, the short answer: you either want to be proactive or you don’t. The real question is, “What style of management do you/they practice?”
The manager of IT here didn’t want Nagios, since he had decided that he would “let the customer tell us when something is broken”. But my boss wanted it for the networks/servers that we control, so I installed it. We fix things that aren’t even down yet (proactive). We have found network problems 5 minutes after they broke, when it would normally take 45 minutes just to find the broken cable, etc. We have found dead power supplies in redundant power supply systems that would NEVER have been discovered until BOTH power supplies were dead, and then it would have been a crisis. Just off the top of my head, I could name about 50 items caught by Nagios that would each have cost an hour (or more) of production downtime.
Here are more points on how I overcame that obstacle.
First, download a MIB browser like mbrowse (kill-9.org/mbrowse/). Install it, and get the IP address of just about any switch (not a hub) on your network. Put the IP address in mbrowse and query the status of an interface on that switch using the correct OID and the default read-only community string (“password”) of “public”. Now show them that you can actually change the status of the port by using the default read/write community string of “secure” to down the port. If they are so concerned about security, then surely they must be concerned about this, yet they failed to change the “secure” password. Why?
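If you would rather show this from a command line than from mbrowse, here is a rough sketch of the same two steps. It is not what I used; it assumes the net-snmp tools (snmpget/snmpset) are installed and that the switch speaks SNMP v2c. The switch address, the interface index and the function names are made up for illustration. The two OIDs are the standard IF-MIB ifOperStatus and ifAdminStatus columns. The script only builds and prints the commands, so nothing touches the switch until you run them yourself.

```python
# Sketch only: builds net-snmp command lines; nothing is sent until you run them.
IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"   # IF-MIB::ifOperStatus (1 = up, 2 = down)
IF_ADMIN_STATUS = "1.3.6.1.2.1.2.2.1.7"  # IF-MIB::ifAdminStatus (1 = up, 2 = down)


def snmpget_argv(host, community, if_index):
    """Read the operational status of one interface (read-only community)."""
    return ["snmpget", "-v2c", "-c", community, host,
            f"{IF_OPER_STATUS}.{if_index}"]


def snmpset_argv(host, community, if_index, admin_status):
    """Set the administrative status of one interface (read/write community)."""
    return ["snmpset", "-v2c", "-c", community, host,
            f"{IF_ADMIN_STATUS}.{if_index}", "i", str(admin_status)]


if __name__ == "__main__":
    switch = "192.0.2.10"  # hypothetical switch address (documentation range)
    print(" ".join(snmpget_argv(switch, "public", 1)))
    print(" ".join(snmpset_argv(switch, "secure", 1, 2)))
```

If the second command works with a default community string, that is the security hole to point at.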
Another: if the objection is simply “no extra daemons” running on remote hosts, then open up nmap. Use nmap to scan the open ports on some Solaris server. Ask them why they need “finger” running on this server, or the multitude of other daemons that open up ports on it. If daemons are a concern, why do they allow “finger”?
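nmap is the right tool for the real scan, but if you can’t install it, a few lines of Python make the same point. This is only a sketch of a plain TCP connect check, not a replacement for nmap; the port list and the function name are my own choices for illustration, and you should only point it at servers you are allowed to scan.

```python
import socket

# Ports picked for illustration: ftp, telnet, finger.
SUSPECT_PORTS = {21: "ftp", 23: "telnet", 79: "finger"}


def open_tcp_ports(host, ports, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    found = []
    for port in sorted(ports):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


if __name__ == "__main__":
    host = "127.0.0.1"  # replace with the server you are allowed to scan
    for port in open_tcp_ports(host, SUSPECT_PORTS):
        print(f"{host}:{port} is open ({SUSPECT_PORTS[port]})")
```

Every line it prints is a daemon that somebody already allowed on that box.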
Another:
If you use the nrpe addon, there will be an nrpe daemon running on the remote hosts, and the Central Nagios server (and possibly malicious people) will be making connections to the remote hosts. In my opinion, that is not a good thing. With nsca, the nsca daemon runs on the Central Nagios server. Nobody is making any connections to the remote hosts, so it is as secure as it could possibly be. Connections are made from the remote hosts to the Central Nagios server. When you install Nagios on the remote hosts, you compile it without the CGIs and configure nagios.cfg so that it does not check for external commands, so there is no way anyone can force Nagios to execute a command by using the cmd file in the RW directory.
But if the only concern is running a daemon and they just don’t have the extra CPU power, ask them if they can actually measure the amount of CPU used by Nagios on that machine, and then ask them if that amount is simply not acceptable. Of course, you will have to configure some “test” box, just like you would if it were an actual remote host. Grab some junk PC, install Linux on it, and install Nagios as a remote host, as I’ve described (and as the Nagios docs describe). Ask them if they can actually tell that Nagios is running, and just how much of a burden it is placing on this box.
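To put a number on it for them, something like the following could be run on the test box. It is a minimal sketch, assuming a Unix-like host (the resource module is not available on Windows); the function name is made up, and “df -k” simply stands in for whatever plugin command you want to time.

```python
import resource
import subprocess
import sys


def cpu_seconds_for(argv):
    """Run a command and return the CPU seconds (user + system) it consumed."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                   check=False)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ((after.ru_utime - before.ru_utime)
            + (after.ru_stime - before.ru_stime))


if __name__ == "__main__":
    # Any plugin command line works here; "df -k" stands in for a disk check.
    argv = sys.argv[1:] or ["df", "-k"]
    print(f"{' '.join(argv)}: {cpu_seconds_for(argv):.3f} CPU seconds")
```

Multiply the result by the number of checks per hour and let them decide whether that figure is really unacceptable.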
A typical Solaris or NT install has many daemons and processes running that are just not needed. Why is there an snmpd running? Why are there so many open ports? Why do they allow telnet, when it should be set up with ssh? Why ftpd if nobody is using it (some servers need it, some don’t)?
The argument of “no extra daemons” is not based in practice, only in prejudice, since they are not enforcing it on the non-essential daemons they have allowed thus far, like snmpd or finger. Not allowing the Nagios daemon to run is like saying: we don’t want to be proactive in finding/fixing trouble, but we do want to be proactive in making sure that nobody installs anything that isn’t made by Microsoft, or something that doesn’t cost over $5000.
I’ve shown our Nagios application to the head of our department (VP/IT), and he is amazed that I’ve done such a great job with no money spent. He’s also very upset that the manager of IT spent over $5000 on an HP product (similar to Nagios), still hasn’t installed it over a year later, and most likely never will.
We have a boss from the “corporate” parent company’s tech department coming to see ME, to look at the Nagios installation here and how we have become a proactive tech shop.
The checks being performed on remote hosts (disk space, CPU, processes, etc.) are negligible, and you can’t even measure the amount of CPU time that this daemon is consuming. After all, how much CPU power is spent performing “df -k” or “ps -ef|grep ftpd”? Show them the commands that are performed by running them on the command line. Show them the source code, and that the command you just ran is nothing more than “df -k” and some extra data calculation.
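To show how little there is to a check, here is a toy disk check. It is NOT the real check_disk plugin source, just a sketch of the same idea: read the free space, compare it to two thresholds, and report one of the standard plugin states (0 = OK, 1 = WARNING, 2 = CRITICAL). The function names and the 20%/10% thresholds are made up for the example.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_free(percent_free, warn=20.0, crit=10.0):
    """Map a free-space percentage to a Nagios state (thresholds are examples)."""
    if percent_free < crit:
        return CRITICAL
    if percent_free < warn:
        return WARNING
    return OK


def check_disk(path="/", warn=20.0, crit=10.0):
    """Return (exit_code, one-line status text) for the filesystem at `path`."""
    usage = shutil.disk_usage(path)  # roughly the figures "df -k" shows
    percent_free = 100.0 * usage.free / usage.total
    state = classify_free(percent_free, warn, crit)
    label = ("OK", "WARNING", "CRITICAL")[state]
    return state, f"DISK {label} - {percent_free:.1f}% free on {path}"


def main():
    """Print the status line and return the exit code, as a plugin would."""
    code, text = check_disk("/")
    print(text)
    return code
```

To use it as a plugin, finish the file with sys.exit(main()) so the exit code carries the state. That is all the “extra data calculation” amounts to.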
The Central server is a different story, of course: it performs over 1,200 active/passive checks, so its CPU works very hard, and the box is dedicated to Nagios/Apache alone.
Bottom line is, I’ve dealt with Oracle DBAs, Sun/Solaris experts, Veritas experts, and others; they have all heard of Nagios, and they have no concerns about allowing Nagios to run on these Oracle, Sun, Solaris, etc. servers. Nagios is not some crap code that barely stays running. I’ve never seen it crash, on the Central server or on the remote hosts, and we are querying many Oracle/Sun boxes, hundreds of switches, and on and on.
I’m positive your production team is not much different from ours. So even when you have shown them just how lax they are in their policy, since they allow “finger” and other unneeded daemons, I’m sure it still won’t make much of a difference. They believe in either proactive management or crisis management, and you won’t be able to sway them either way. What you are battling is a management style issue, and it has nothing to do with “extra daemons”, since they already allow many that do nothing for them.
If all else fails, just set up cron to run the plugin commands every 5 minutes and pipe the output to the submit_check_result command. The nsca client will then pass the data to the Central Nagios server. This is perfectly OK, and is sufficient to gather the status that you want. What is not OK is that someone has adopted a “no extra daemons” policy but enforces it only on applications that are beneficial, and not on the unneeded daemons.
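For the cron route, the wrapper can be very small. The sketch below is my own illustration, not the submit_check_result script that ships with Nagios: it runs one plugin, keeps the exit code and the first line of output, and feeds send_nsca the tab-separated line it reads on stdin (host, service, return code, plugin output). The send_nsca and config paths are assumptions based on a default source install, so adjust them for your box; the function names are made up.

```python
import subprocess


def run_plugin(argv):
    """Run a plugin; return (exit_code, first line of its output)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, check=False)
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[0] if lines else "(no output)")


def nsca_line(host, service, code, output):
    """Format one passive result the way send_nsca expects it on stdin."""
    return f"{host}\t{service}\t{code}\t{output}\n"


def submit(central, host, service, plugin_argv,
           send_nsca="/usr/local/nagios/bin/send_nsca",
           config="/usr/local/nagios/etc/send_nsca.cfg"):
    """Run the plugin and hand its result to send_nsca (paths are assumptions)."""
    code, output = run_plugin(plugin_argv)
    subprocess.run([send_nsca, "-H", central, "-c", config],
                   input=nsca_line(host, service, code, output),
                   text=True, check=True)
```

Call submit() once per check from a small script and schedule that script in crontab with a “*/5 * * * *” entry. The host and service names you pass must match what is defined on the Central server, or the results will be ignored.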
Good luck, and tread lightly. They just might get pissed off at you for showing them that they aren’t all that they pretend to be. Pfft, allowing telnet but not Nagios? How ludicrous is that? It’s a great boss who can accept that they may be wrong in their judgment and actually allow you to be proactive and install Nagios on critical servers.
Neither you nor I are script-writing experts. Why should we have to spend time making scripts for cron when Nagios is a proven product? It’s like reinventing the wheel.
This is a noble battle you have, my friend, one which I have won, and one which has gained me much appreciation from the VP/IT and my peers. We support a great number of other smaller “tech groups”, and all of them allow Nagios on their servers and relish the fact that they never have to check the log files, disk space, Oracle tablespace free, etc. It’s the department I work in that has become so proactive that Nagios is now known/accepted by almost every employee in this company. Only those managers who have stayed steadfast in their “crisis management” policy don’t want anything to do with Nagios.
[ Edited Fri May 13 2005, 07:39AM ]