Check cluster status from the Nagios server for all nodes


I am trying to determine if this is possible.
I am monitoring a Linux cluster consisting of 658 nodes.
Each node has a job queue and a queue status whose current state I wish to report. The kicker is that the master node is queried to provide this information; if I run the check against each node, I effectively run 658 queries on a single node, which is not very efficient.
What I wish to do is run the query only on the master node but report the status and queue length for each system.

The output of the command looks like this:

host1 ok 2 2 0
host2 closed_Adm - - 0

host658 ok 2 2 2

where the Status and Njobs are the items of interest.
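
Assuming the second field of each line is the queue status and the fourth is Njobs (the column layout here is only a guess from the sample above and may differ on a real LSF installation), parsing that output could be sketched like this:

```python
# Sketch: parse bhosts-style output into per-host status and job counts.
# ASSUMPTION: field 2 is Status and field 4 is Njobs -- inferred from the
# sample output above, not from the real `bhosts -w` column layout.

def parse_bhosts(output):
    """Return {hostname: (status, njobs)} from bhosts-style output."""
    results = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        results[fields[0]] = (fields[1], fields[3])
    return results

sample = """host1 ok 2 2 0
host2 closed_Adm - - 0
host658 ok 2 2 2"""

print(parse_bhosts(sample))
```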

Any input is greatly appreciated.

Thank You



[quote=“stanglinden”]I am trying to determine if this is possible.
…if I run the check …
what I wish to do is run the query …
…the output of the command is …

What check, what query, and you get output from what command?

Sorry, but in order to tell whether that query/command can be used as you wish, it would help to know what it is. Even with the command in hand it can still be hard to figure things out sometimes.


The command is "lsfadm /lsf/linux24/bin/bhosts -w".
There are two ways to run this command: if I insert a hostname after the -w (bhosts -w host1), I get the information for that host only, which is how I have currently implemented the check using the check_by_ssh plugin. This implementation puts a large overhead on the main cluster node, as it is the one that ultimately responds to every request.

What I wish to implement is running the bhosts -w command without a host specified, which gives one line of output for each host in the cluster. I wish to parse that output (see original post) and apply the data so that each host has a service item for the number of jobs in its queue and for the status of the queue.


OK, you lost me. Sorry.
But your original post says: "each node has a job queue and a queue status that I wish to report the current status of."

I fail to see how checking a node on a cluster once every 5 minutes is a huge burden. I would define one check per node for a total of 658 Nagios checks and be done with it. Unless, of course, these nodes are running on 8080 CPU chips.


This is not so much a problem for the Nagios server or the network as it is for the single node that replies with this information.

host1 (master node)
host2 (system being monitored)
host3 (system being monitored)
host4 (nagios host)

host4 runs the command bhosts -w (by way of the check_by_ssh plugin) on host3, which in turn actually queries host1 (the master) for the information, resulting in two actual requests and two actual replies (host4 - host3, and host3 - host1). Multiplied across 600+ hosts, these status requests can cause the master node to experience issues and/or hang.

That is why I was asking whether there is a way to run the bhosts -w command on the Nagios server (it is a member of the cluster) and then parse the data returned for each host. Remember: if no hostname is passed to bhosts -w, the status of all hosts is returned; if you pass a hostname, only the status of that host is returned.
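
One standard Nagios pattern for this is passive checks: run bhosts -w once, then write one PROCESS_SERVICE_CHECK_RESULT line per host into the Nagios external command file, so each host still gets its own service status. A minimal sketch, with the command-file path, the service name "LSF Queue", and the field positions all being assumptions to adjust for your setup:

```python
import time

# ASSUMPTION: default Nagios external command file location.
CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

def passive_result(host, service, code, output, now=None):
    """Format one Nagios external-command line for a passive service result."""
    ts = int(now if now is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, code, output)

def submit_bhosts(bhosts_output, cmd_file=CMD_FILE):
    """Parse bhosts-style output and submit one passive result per host."""
    with open(cmd_file, "a") as fh:  # the command file is a named pipe
        for line in bhosts_output.splitlines():
            fields = line.split()
            if len(fields) < 4:
                continue
            host, status, njobs = fields[0], fields[1], fields[3]
            # ASSUMPTION: "ok" maps to OK (0), anything else to CRITICAL (2).
            code = 0 if status == "ok" else 2
            fh.write(passive_result(host, "LSF Queue", code,
                                    "status=%s njobs=%s" % (status, njobs)) + "\n")
```

Each monitored host would then have its "LSF Queue" service defined with passive checks enabled, and a single cron job (or one Nagios check) on the server would feed all 658 results from one query to the master.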


From what I see of the check_by_ssh command, you should be able to run that command, but the output is going to be huge. So I guess I understand that much of your problem: you are going to have a $OUTPUT$ that is over 500 lines long, one line for each host. How to get around that, I don't know. Again, sorry, but I'm a programming dummy.