I have a distributed Nagios setup. Everything is working really well except the CGIs on the central server.
The central server has
1500 hosts
Active Checks: 348
Passive Checks: 9564 (all updated from distributed hosts)
Almost all of the CGIs on the central server run at 99% CPU for up to 2 minutes to display information.
An strace shows nothing but brk calls for very long periods of time. I’m no strace expert, but based on what I’ve read, that is related to allocating memory?
The configs total about 3 MB, so that should only take a few seconds to load into memory.
The machine is a dual Xeon 2.8 GHz with 4 GB of memory, with heaps of memory free.
Linux Debian Sarge.
Nagios Version 2.0b3
It’s not really possible to upgrade Nagios at this stage, as it is a critical production environment for an ISP, unless people can confirm this is definitely a version-related issue.
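To check whether the strace output really is dominated by brk calls, the calls can be tallied by name. The following is only a minimal Python sketch: it assumes plain strace output with no PID or timestamp prefixes, and the sample lines (including the `/path/to/status.dat` path) are made up for illustration, not taken from the system described above. Running `strace -c` gives a similar per-syscall summary directly.

```python
import re
from collections import Counter

# Assumes plain `strace` output where each line starts with the syscall
# name followed by "(" (no -f PID prefix, no -t timestamp prefix).
SYSCALL_RE = re.compile(r"^\s*([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Return a Counter mapping syscall name -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample only; not real output from the poster's machine.
SAMPLE = [
    'brk(0) = 0x8a3c000',
    'brk(0x8a5d000) = 0x8a5d000',
    'open("/path/to/status.dat", O_RDONLY) = 3',
    'read(3, "..."..., 4096) = 4096',
    'brk(0x8a7e000) = 0x8a7e000',
]

for name, count in count_syscalls(SAMPLE).most_common():
    print(name, count)
```

If brk dominates the tally while read and open are rare, that would be consistent with the CGI spending its time growing the heap rather than waiting on disk.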
When I view the Service Detail link, which lists every service, mine is a bit slow too. How can that be sped up? It sounds like an Apache issue.
I’m at a loss, except to possibly ditch the central server and make several Nagios websites. For example, one site for subnet a.a.a.a and one for subnet b.b.b.b, or whatever makes sense in your setup. I know that is a lousy solution, but how else can we read over 10,000 items and display them quickly? Surely someone with Apache knowledge would know how to cache this data.
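As a rough illustration of the split-by-subnet idea, the Python sketch below buckets hosts by network so that each bucket could feed its own Nagios site. The host names, addresses, and subnets are hypothetical (taken from the documentation address ranges), and `group_by_subnet` is a made-up helper, not part of Nagios.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(hosts, subnets):
    """Map each subnet (as a string) to the host names that fall inside it.

    hosts   -- dict of host name -> IP address string
    subnets -- iterable of network strings such as "192.0.2.0/24"
    Hosts matching no subnet are collected under "unassigned".
    """
    networks = [ipaddress.ip_network(s) for s in subnets]
    groups = defaultdict(list)
    for name, addr in hosts.items():
        ip = ipaddress.ip_address(addr)
        for net in networks:
            if ip in net:
                groups[str(net)].append(name)
                break
        else:
            # No subnet matched this host.
            groups["unassigned"].append(name)
    return dict(groups)


# Hypothetical inventory, for illustration only.
HOSTS = {
    "router-a": "192.0.2.1",
    "web-a": "192.0.2.10",
    "router-b": "198.51.100.1",
    "stray": "203.0.113.5",
}
SUBNETS = ["192.0.2.0/24", "198.51.100.0/24"]

for subnet, names in group_by_subnet(HOSTS, SUBNETS).items():
    print(subnet, names)
```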
I could definitely configure Apache to do web caching and all that jazz, but the problem isn’t the amount of content being served from the website. The content itself is well under 1 MB and Apache delivers it in a few ms. The problem is the actual Nagios CGI binaries consuming heaps and heaps of CPU, doing whatever they do, for up to 2 minutes.
Maybe I should just read some source code.
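One way to confirm that the time goes into the CGI binary rather than Apache is to run the CGI directly with a minimal CGI environment and time it. This is a hedged Python sketch: `time_cgi` is a hypothetical helper, the `nagiosadmin` user name is an assumption, and the demo uses the Python interpreter as a stand-in command so the snippet runs on a machine without Nagios installed.

```python
import os
import subprocess
import sys
import time


def time_cgi(command, query_string="", remote_user="nagiosadmin"):
    """Run a CGI program outside the web server and time it.

    Returns (elapsed_seconds, output_bytes, return_code).
    REQUEST_METHOD, QUERY_STRING and REMOTE_USER are standard CGI
    variables; the default user name is an assumption.
    """
    env = dict(
        os.environ,
        REQUEST_METHOD="GET",
        QUERY_STRING=query_string,
        REMOTE_USER=remote_user,
    )
    start = time.perf_counter()
    result = subprocess.run(
        command, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    elapsed = time.perf_counter() - start
    return elapsed, len(result.stdout), result.returncode


# Stand-in command so the sketch is runnable anywhere; on a Nagios box
# this would be the path to the real CGI binary instead.
STAND_IN = [sys.executable, "-c", "print('Content-type: text/html')"]

elapsed, size, code = time_cgi(STAND_IN, query_string="host=all")
print(f"exit={code} bytes={size} elapsed={elapsed:.3f}s")
```

On the central server the command would be something like `["/usr/lib/cgi-bin/nagios2/status.cgi"]`, though that path is only an example and varies by install. If the elapsed time there is close to the 2 minutes seen in the browser, Apache caching of the delivery step would not help.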