I am trying to determine if this is possible.
I am monitoring a Linux cluster consisting of 658 nodes.
Each node has a job queue and a queue status that I wish to report on. The kicker is that the master node is the one queried to provide the information, so if I run the check for each node I effectively run 658 queries against a single node, which is not very efficient.
What I wish to do is run the query only on the master node, but report the status and queue length for each system.
The output of the command is this:
HOST_NAME   STATUS       JL/U   MAX   NJOBS
host1       ok           2      2     0
host2       closed_Adm   -      -     0
…
host658     ok           2      2     2
where STATUS and NJOBS are the items of interest.
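For reference, here is a minimal sketch of how that output could be parsed into per-host data. The column layout is assumed from the sample above (fifth column = NJOBS); adjust the indexes if your `bhosts -w` output differs:

```python
# Sketch: parse `bhosts -w` output into {host: (status, njobs)}.
# Column positions are an assumption based on the sample output above.

def parse_bhosts(output):
    """Return a dict mapping host name to (status, njobs)."""
    results = {}
    for line in output.strip().splitlines()[1:]:  # skip the HOST_NAME header
        fields = line.split()
        if len(fields) < 5:                       # skip blank/short lines
            continue
        host, status, njobs = fields[0], fields[1], fields[4]
        results[host] = (status, int(njobs))
    return results

sample = """HOST_NAME STATUS JL/U MAX NJOBS
host1 ok 2 2 0
host2 closed_Adm - - 0
host658 ok 2 2 2"""

queues = parse_bhosts(sample)  # e.g. queues["host2"] -> ("closed_Adm", 0)
```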
Any input is greatly appreciated.
Thank You
Mike
[quote=“stanglinden”]I am trying to determine if this is possible.
…if I run the check …
what I wish to do is run the query …
…the output of the command is …
Mike[/quote]
What check? What query? The output of what command?
Sorry, but in order to tell whether that query/command can be used as you wish, it would help to know what it is. Even when we know and have the command, it can still be hard to figure out sometimes.
jakkedup,
the command is "lsfadm /lsf/linux24/bin/bhosts -w"
There are two ways to run this command. If I put a hostname after the -w (bhosts -w host1), I get the information for that host only, which is how I currently have the check implemented using the check_by_ssh plugin. This implementation puts a large overhead on the main cluster node, as it is the one that ultimately responds to each request.
What I wish to implement is running the bhosts -w command without a host specified, which gives a line of output for every host in the cluster. I want to parse that output (see my original post) and apply the data so that each host has a service item for the number of jobs in its queue and another for the status of the queue.
OK, you lost me. Sorry.
But your original post said: "each node has a job queue and a queue status that I wish to report the current status of."
I fail to see how checking a node on a cluster once every 5 minutes is a huge burden. I would define one check per node, for a total of 658 Nagios checks, and be done with it. Unless, of course, these nodes are running on 8080 CPUs.
This is not so much a problem for the Nagios server or the network, but rather for the single node that replies with this information.
Example:
host1 (master node)
host2 (system being monitored)
host3 (system being monitored)
host4 (nagios host)
host4 runs the command bhosts -w (by way of the check_by_ssh plugin) on host3, which in turn queries host1 (the master) to get the information. That results in two actual requests and two actual replies (host4 - host3, and host3 - host1). Multiplied by 600+ hosts, the status requests can cause the master node (host1) to experience issues and/or hang.
That is why I was asking whether there is a way to run the bhosts -w command on the Nagios server (it is a member of the cluster) and then parse the data returned for each host. Remember: if no hostname is passed to bhosts -w, the status for all hosts is returned; if you pass a hostname, only the status for that host is returned.
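One way to do what you describe is to run bhosts -w once on the Nagios server and feed the parsed results back as passive check results through the Nagios external command file. Here is a rough sketch of that idea; the command-file path and the "LSF Queue" service name are assumptions for your setup, and check_external_commands must be enabled in nagios.cfg:

```python
# Sketch: submit one passive check result per host to Nagios.
# CMD_FILE path and service name are assumptions; adjust for your install.
import time

CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # assumed default location

def submit_result(host, service, code, output, cmd_file=CMD_FILE):
    """Append a PROCESS_SERVICE_CHECK_RESULT line to the command file."""
    line = "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        int(time.time()), host, service, code, output)
    with open(cmd_file, "a") as f:
        f.write(line)

def report(host, status, njobs):
    # 0 = OK, 2 = CRITICAL; treating anything other than "ok" as critical
    # is an assumption -- you may want WARNING (1) for some LSF states.
    code = 0 if status == "ok" else 2
    submit_result(host, "LSF Queue", code,
                  "status={} njobs={}".format(status, njobs))
```

You would loop over the parsed bhosts output calling report() for each host, scheduled from cron or as a single active Nagios check, so the master is queried only once per cycle instead of 658 times.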
From what I see of the check_by_ssh command, you should be able to run it that way. But the output is going to be huge: you will have an $OUTPUT$ that is over 500 lines long, one line for each host. So I guess I understand that much about your problem; how to get around it, I don't know. Again, sorry, but I'm a programming dummy.