Six Sigma with Nagios

malandrew · May 8, 2009, 7:42pm

As the project I’m hoping to use Nagios for is very complex, I was wondering if someone in the Nagios community could provide advice. I’ve tried to define the project in as much detail as I can. I appreciate any help anyone here can offer me.

thanks,
Andrew

=====

**Objective:**
Maximize the number of virtual machines per physical host using Six Sigma statistical process controls via a measuring tool such as Nagios.

Problem Scope:
I am the product manager for a web hosting company offering cloud computing services in Latin America. Our company’s aim is to be the “value” provider of virtualized machines, which means we are very cost driven.

One of the problems we encounter in my country is that hardware costs are exorbitant (70%+ more expensive than in the United States). As such, we must try to squeeze every ounce of processor, RAM, disk space and networking out of every machine we use.

The way I see it, failure rates could traditionally be estimated by combining the failure rates provided by the manufacturer for each component in the system (processor, RAM, HDD, etc.). This stopped being a concern when the MTBF of each part became high enough that, as a service provider, it no longer made sense to calculate the failure rates. Now, with virtualization and cloud computing, the situation has changed. On one hand, the pooling of physical resources reduces the risk of failure dramatically; on the other hand, how we use those resources can greatly impact the failure rate.
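As a rough illustration of the component-level estimate described above: if the parts are treated as independent and in series, each with a constant failure rate of 1/MTBF, the rates add (which is the same as multiplying the survival probabilities). The MTBF figures below are hypothetical placeholders, not datasheet values, and the function names are just for this sketch.

```python
import math

def series_failure_rate(mtbf_hours):
    """Combined failure rate (failures per hour) of components in series.

    Assumes independent components with constant failure rates, so each
    rate is 1 / MTBF and the rates simply add.
    """
    return sum(1.0 / m for m in mtbf_hours)

def survival_probability(rate_per_hour, hours):
    """Probability of running `hours` without a failure (exponential model)."""
    return math.exp(-rate_per_hour * hours)

# Hypothetical MTBF figures in hours; not taken from any datasheet.
example_mtbf = {"cpu": 1_000_000, "ram": 500_000, "hdd": 250_000}

rate = series_failure_rate(example_mtbf.values())
system_mtbf = 1.0 / rate
one_year = survival_probability(rate, 24 * 365)
print(f"system MTBF: {system_mtbf:.0f} h, 1-year survival: {one_year:.3f}")
```

This only covers hardware wear-out style failures. It says nothing about the load-dependent degradation that VM density introduces, which is the part that needs measuring.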

For example, now that we can stuff more than one VM on a physical machine, it would make sense that we stuff as many machines as possible without impacting quality (performance & stability). The questions now are “how do we measure this limit?” and “how do we control this?”

The only answer I can think of is to apply the same TQM management practices used by the hardware manufacturers when they push tolerances such as minimizing head fly height in hard drives, maximizing hard drive density or maximizing transistor density.

This is where Nagios, Zabbix, Pandora and other monitoring tools (with the default install) come up short. As far as I can tell, they are designed to be reactive. They are used to notify network and system administrators when hardware & systems don’t perform. As a cloud computing provider, the decisions I make about how to set up the network, how many VMs I put on a machine, how I set up the disk images, whether disk images have paravirtualization, etc. all can have a significant impact on performance. If we were the typical corporate data center, we would simply follow all the best practices and buy all the best hardware. Unfortunately, that’s not realistic given our product positioning. Since there are many parameters I am going to be manipulating, I need to be proactive, because making a change to a system no longer just impacts one client, it can adversely impact dozens to hundreds of clients all at once. I need to know the impact of any changes I make beforehand, preferably in a test environment or with beta testers.

The Experiment:
Our first goal is to use Nagios to help us maximize the number of virtual machines per physical host. We plan to do this using Nagios and Six Sigma methodology. First we want to use Nagios to analyze performance metrics such as packet loss, ping time, CPU % usage, RAM % usage, etc. over time. Each experiment will run for one week, so that we don’t have any issues with traffic variation from time of day or day of the week.
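For the "metrics over time" part, the kind of statistical process control calculation in question could look like the sketch below. It computes limits for an individuals (I-MR) control chart from a baseline series and flags later readings outside them. The ping times are made-up values, and this is plain Python, not something Nagios provides out of the box.

```python
from statistics import mean

def individuals_chart_limits(samples):
    """Return (lcl, center, ucl) for an individuals (I-MR) control chart.

    Uses the standard constant 2.66 (3 / d2, with d2 = 1.128 for moving
    ranges of size 2) applied to the average moving range.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    moving_ranges = [abs(b - a) for a, b in zip(samples, samples[1:])]
    center = mean(samples)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread

def out_of_control(samples, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lcl, _, ucl = limits
    return [(i, x) for i, x in enumerate(samples) if x < lcl or x > ucl]

# Hypothetical ping times in milliseconds.
baseline = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0]
limits = individuals_chart_limits(baseline)
new_samples = [11.0, 12.5, 25.0, 10.5]
print(limits)
print(out_of_control(new_samples, limits))
```

In practice the baseline would come from a period when the host is known to be healthy, and the limits would be recomputed whenever the configuration under test changes.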

Step 1: Measurement Systems Analysis:
The first step is to measure the repeatability and reproducibility of our measurements. To do this, we are going to isolate the machine running Nagios by placing it on a standalone 1U server in the same rack as the HP C-Class enclosures housing our virtualized machines. This machine will connect to the VMs via one common switch. Initially we will measure a basic VM image idling (no cron jobs, for example) for one day at a time, adding an identical VM every 4 hours and measuring the image on a variety of metrics. We will perform this MSA with Linux and with Windows.

Next we will repeat the above for each of the following stress tests: (1) Full CPU load, (2) Full RAM load, (3) Maximum network throughput. One final test will occur with all of the stress tests at once.

With these measurements in hand, we should be able to calculate the R&R and hopefully also establish an upper ceiling on the number of VMs on each physical host for each of the parameters stressed.
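A simplified way to split measurement variation is sketched below. This is a one-way random-effects ANOVA on a balanced design, not a full crossed Gauge R&R study. It assumes the readings are grouped into repeated measurement runs of the same idle VM, so the within-run variance stands in for repeatability and the run-to-run component stands in for reproducibility. The grouping and the sample numbers are assumptions for illustration.

```python
from statistics import mean, variance

def variance_components(groups):
    """One-way random-effects ANOVA on a balanced design.

    groups: list of equal-length lists, one list per measurement run.
    Returns (within_var, between_var), where within_var is the pooled
    within-run variance and between_var is the run-to-run component,
    floored at zero.
    """
    n = len(groups[0])
    if n < 2 or len(groups) < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 balanced groups with >= 2 readings each")
    within_var = mean(variance(g) for g in groups)        # MS_within
    ms_between = n * variance([mean(g) for g in groups])  # MS_between
    between_var = max(0.0, (ms_between - within_var) / n)
    return within_var, between_var

# Hypothetical ping times (ms) from three runs against the same idle VM.
runs = [[10.0, 12.0], [14.0, 16.0], [12.0, 14.0]]
within, between = variance_components(runs)
total = within + between
print(f"within-run share: {within / total:.0%}, run-to-run share: {between / total:.0%}")
```

If the measurement system itself accounts for a large share of the total variation, differences between VM densities will be hard to distinguish from noise, which is the reason for doing the MSA first.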

Step 2: Beta Testers
Next we plan to set up one HP C-Class enclosure with 14 blade servers. Each blade server will have a varying number of virtual machines from beta testers with real workloads, ranging from 4 to 48, in steps of 4 VMs. Initially the idea is to use 48 as the upper limit; however, this quantity may change depending on the idling upper limit we discover during the MSA. This test will run for one week to mitigate variance due to time of day or day of the week. Analyzing the data here, we should be able to establish a real production upper limit, plus a buffer, at which adding additional VMs to each physical host begins to have a significant impact on performance.
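One crude way to locate that limit from the beta data is sketched here: take the lowest-density blade as the baseline and find the first VM count whose mean metric exceeds the baseline mean by more than k baseline standard deviations. The threshold rule, the value of k, and the response times are all assumptions. A proper analysis would use a hypothesis test or regression across all densities.

```python
from statistics import mean, stdev

def first_degraded_density(samples_by_vm_count, k=3.0):
    """Return the smallest VM count whose mean metric exceeds
    baseline_mean + k * baseline_stdev, or None if none does.

    The baseline is the lowest VM count present in the data.
    """
    counts = sorted(samples_by_vm_count)
    baseline = samples_by_vm_count[counts[0]]
    limit = mean(baseline) + k * stdev(baseline)
    for count in counts[1:]:
        if mean(samples_by_vm_count[count]) > limit:
            return count
    return None

# Hypothetical response times (ms) per blade, keyed by VMs on that blade.
observed = {
    4: [20.0, 21.0, 19.0, 20.0],
    8: [20.0, 22.0, 21.0, 20.0],
    12: [21.0, 22.0, 21.0, 22.0],
    16: [25.0, 27.0, 26.0, 28.0],
}
print(first_degraded_density(observed))
```

The production ceiling would then be set some margin below the returned count, which corresponds to the buffer mentioned above.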

What I want Nagios to do:
The question now is how Nagios can help me accomplish the above tasks. The last time I did any quality control work was about 5 years ago, and I used SAS JMP 5.1. What I want to know is: first, are there Nagios add-ons that provide tools for statistical process control? Second, are there tools that allow me to define groups of VMs (i.e. the control group and the experimental group)? For example, I’d like to be able to group all the VMs on a particular physical machine, group all the VMs using a particular storage volume, group all the VMs in a particular blade enclosure, etc.

At the bare minimum I need to be able to export all this data as comma-separated values so I can analyze it in JMP against the attributes I’m manipulating, but ideally I’d like Nagios to take care of as much of this work as possible.
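For the CSV export, a minimal sketch follows. Nagios plugins report performance data as space-separated `label=value[UOM];[warn];[crit];[min];[max]` items, and the code below flattens that into one CSV row per metric. The input layout of (timestamp, host, service, perfdata) tuples is an assumption about how the data has been collected, and the host and service names are hypothetical. Only the label, value and unit are kept.

```python
import csv
import io
import re

# One metric in Nagios plugin performance data looks like
#   label=value[UOM];[warn];[crit];[min];[max]
# This pattern captures only the label, numeric value and unit.
PERFDATA_RE = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<plain>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;\s]*)"
)

def parse_perfdata(perfdata):
    """Return a list of (label, value, unit) tuples from a perfdata string."""
    metrics = []
    for m in PERFDATA_RE.finditer(perfdata):
        label = m.group("quoted") or m.group("plain")
        metrics.append((label, float(m.group("value")), m.group("unit")))
    return metrics

def rows_to_csv(rows):
    """rows: iterable of (timestamp, host, service, perfdata) tuples."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["timestamp", "host", "service", "metric", "value", "unit"])
    for timestamp, host, service, perfdata in rows:
        for label, value, unit in parse_perfdata(perfdata):
            writer.writerow([timestamp, host, service, label, value, unit])
    return out.getvalue()

# Hypothetical check results; host and service names are made up.
sample_rows = [
    (1241811720, "vm-01", "PING", "rta=0.123ms;100.000;500.000;0; pl=0%;20;60;;"),
    (1241811780, "vm-02", "PING", "rta=0.456ms;100.000;500.000;0; pl=0%;20;60;;"),
]
print(rows_to_csv(sample_rows))
```

Attributes being manipulated (VMs per host, storage volume, enclosure) could be added as extra columns keyed on the host name, so that JMP can group on them directly.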

Anyway, this is a work in progress. I’m still refreshing my knowledge of quality control practices, which means the above experiments are subject to change. I’m also open to suggestions.

Thanks for reading and thanks for any help you guys can offer me.

Andrew