OK, now let’s get really creative here. In the following, I haven’t added the eth ports on the hosts yet, nor the switches they connect to, but those could easily be added.
What it does show is a sort of “service dependency”, drawn graphically in the status map. Not only is the dependency shown graphically, the parents are also defined so that the relationships are logically true.
So we have 2 Oracle hosts, Bus-dnmacdb1 and Bus-dnmacdb2. They are running Oracle RAC, so when customers connect to the database, they really don’t know which host or which path they are using, and the CPU loads are balanced.
In order for Oracle to start up, the first thing we need is for either Bus-dnmacdb1 or 2 to be up and operational as a plain Solaris server. Add all of your Unix-type checks to this host, e.g. disk space, CPU usage, etc.
After the OS is up and running, we fire up the RAID volumes and mount them (add your RAID volume checks to this host). Without the RAID volumes, there is no disk space to hold the Oracle tables. Each Solaris server has access to the same volumes through Oracle RAC, which is why you see a connection from each server to the Bus-dnmac-vol (for volumes) “host”.
Next we need an Oracle instance to connect to the database (add instance service checks here). There are 3 ways this could be done. A person could “bypass” the load balancing and connect to a “specific instance” by using the serv1 or serv2 instance. Or they could connect by using the serv instance, which balances the load and puts them on either one. So, if the volumes are mounted, we can start up the Oracle instances (the lines on the status map show this dependency). There is a line from -serv to -serv1 and -serv2 to show that we need at least one of those in order for the -serv instance to operate. Without either one, there is no -serv instance.
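A rough sketch of how the chain described so far might look in hosts.cfg. This is not taken from the real config: the `generic-host` template, the addresses, and the full instance names (the post only abbreviates them as -serv, -serv1 and -serv2) are assumptions for illustration. The `parents` directive is what draws the lines on the status map, and a host with several parents only becomes unreachable when all of them are down, which gives the “either one” behaviour.

```cfg
# The two real Solaris servers (addresses are placeholders)
define host{
    use         generic-host          ; assumed template supplying the required defaults
    host_name   Bus-dnmacdb1
    alias       Oracle RAC node 1 (Solaris)
    address     192.0.2.11            ; placeholder address
    }

define host{
    use         generic-host
    host_name   Bus-dnmacdb2
    alias       Oracle RAC node 2 (Solaris)
    address     192.0.2.12            ; placeholder address
    }

# Logical "host" for the shared RAID volumes: reachable if either server is up
define host{
    use         generic-host
    host_name   Bus-dnmac-vol
    alias       RAID volumes (logical host)
    address     127.0.0.1             ; placeholder, not a real machine
    parents     Bus-dnmacdb1,Bus-dnmacdb2
    }

# Logical "hosts" for the specific instances (full names are assumed)
define host{
    use         generic-host
    host_name   Bus-dnmac-serv1
    alias       Oracle instance serv1 (logical host)
    address     127.0.0.1
    parents     Bus-dnmac-vol
    }

define host{
    use         generic-host
    host_name   Bus-dnmac-serv2
    alias       Oracle instance serv2 (logical host)
    address     127.0.0.1
    parents     Bus-dnmac-vol
    }

# Load-balanced instance: needs at least one of serv1 / serv2
define host{
    use         generic-host
    host_name   Bus-dnmac-serv
    alias       Oracle load-balanced instance (logical host)
    address     127.0.0.1
    parents     Bus-dnmac-serv1,Bus-dnmac-serv2
    }
```

How each logical host gets its up/down state (its host check command) is left to the template here, since the post does not say how that is done.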
Now that they can connect to the Oracle instances, they probably want to access some tables. So show this as a host called -tables. If the instances are up, we can now access tables. Several service checks are performed on the “-tables” host, e.g. tablespace free, cache hits, etc., all of them graphed with nagiostat. These graphs have been a lifesaver and have given us many days’ warning of the impending disaster of a tablespace filling up.
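As a sketch only, the “-tables” host and one of its checks might be defined along these lines. The full host names, the `generic-host` and `generic-service` templates, the choice of the load-balanced instance as the parent, and the `check_tablespace_free` command with its thresholds are all hypothetical; substitute whatever command definition your site actually uses.

```cfg
# Logical "host" for the tables (name and parent are assumptions)
define host{
    use         generic-host          ; assumed template
    host_name   Bus-dnmac-tables
    alias       Oracle tables (logical host)
    address     127.0.0.1             ; placeholder, not a real machine
    parents     Bus-dnmac-serv        ; assumed: the load-balanced instance
    }

# One of the checks attached to the logical host
define service{
    use                  generic-service               ; assumed template
    host_name            Bus-dnmac-tables
    service_description  Tablespace free
    check_command        check_tablespace_free!85!95   ; hypothetical command and thresholds
    }
```

With the check hung on the logical host, a full tablespace shows up on the map against -tables rather than against the Solaris boxes.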
**I’m showing you this to drive home the idea that in Nagios, your hosts.cfg file does NOT have to contain only real hosts. It can also contain anything that is useful to show on the status map in a logical way, so people can see more easily what is broken and where.**
In my Oracle example, it would be quick to discover that there is nothing wrong with the Sun box, the Solaris OS, the RAID hardware, or the Oracle instances, and that in fact a tablespace has simply filled up or something like that. All the customer is going to tell you is, “my Oracle connection is broke and I can’t do anything with it”.
[ Edited Fri Oct 14 2005, 06:40AM ]