Hi, we have just started monitoring Windows servers and we are getting tons of Memory warning and critical notifications throughout the day. One solution I thought of, which feels more like a "kludge" than a fix, was to adjust the "retry_check_interval" to something like "25", since the warning and recovery alerts seem to start and end about 25 minutes apart. My question is: what is the correct approach to fixing these issues, rather than just kludging around them?
So you could try to work out what process or scheduled task is running on your Windows servers that lasts for 25 minutes and put an end to its excessive memory usage(!) (maybe take a look at the Sysinternals process tools at technet.microsoft.com/en-us/sysi … fault.aspx to help with your diagnosis). Or, if that process must run and this is its "standard" behaviour, adjust your check warning/critical thresholds upwards a bit. Or thirdly, perhaps add more memory to the servers.
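If you go the threshold route and you happen to be using NSClient++ with the check_nt plugin (just a guess at your setup), the warning/critical percentages are passed on the check command. A rough sketch, assuming the stock check_nt command definition from the Nagios quickstart and a made-up host name:

define service{
        use                     generic-service
        host_name               winserver01
        service_description     Memory Usage
        check_command           check_nt!MEMUSE!-w 80 -c 90
        }

Here -w 80 and -c 90 are the warning and critical percentages; bump them up to whatever counts as "normal but worth a look" on your servers.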
You are right not to want to 'kludge' this, as you put it. To my mind, if you are monitoring memory usage and getting warning/critical notifications that indicate something is amiss, but then ignoring or 'working around' them, I would question whether there is any point in monitoring that usage at all.
I just wanted to add:
if you really want to "kludge" the problem (i.e. once you have exhausted all possibilities of finding what the actual problem is… after all, this is Windows we are talking about! :)), I'd recommend tweaking the options a little further:
instead of setting "retry_check_interval" to 25, do something like:
retry_check_interval 5
max_check_attempts 6
=> the result is the same (a notification is sent if the problem is still there after 25 minutes), but you'll have a few more checks in between that may help you diagnose your issue
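For reference, in the full service definition that would look roughly like this (host name, service name and thresholds are just placeholders):

define service{
        use                     generic-service
        host_name               sqlserver01
        service_description     Memory Usage
        check_command           check_nt!MEMUSE!-w 90 -c 95
        normal_check_interval   5
        retry_check_interval    5
        max_check_attempts      6
        }

The first failed check puts the service into a soft state, it then re-checks every 5 minutes, and only once all 6 attempts have failed (25 minutes of the problem persisting) does it go into a hard state and notify.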
Thanks for the replies, guys. I will look at bumping up the threshold. These are SQL database servers, so memory gets sucked up past 90% (our warning threshold) and then released again. Do any of you guys monitor memory? I feel that it's such a volatile metric, always going up and down, that it's kind of hard to monitor.
Hmm… I remember when I installed a CommVault Galaxy backup server back in the day that it had SQL 2000 Enterprise or something installed along with it, and part of the install process was to configure something in SQL to stop it gradually eating all the available memory, as the SQL bit was obviously not the server's primary role. So there would seem to be a way to limit SQL's memory usage; perhaps you could look into that, or, as previously mentioned, add more memory (though there's no way of telling whether SQL would just go and use up all the extra memory as well, I guess!)
I don't monitor memory usage on any Wintel boxes personally, though IMHO it's a valid check if there is a reason for it. So, say you make some changes as Loose recommends, so that you only get alerted when memory usage has been at an abnormal level (say > 98%) for an abnormal amount of time (say > 30 minutes, or 60, or whatever figure is abnormal for your situation): when it goes off you know you might want to jump on the box and find out whether you've got a stuck process, or perhaps half your memory is suddenly kaput and you've only got half the resource you should have, and so on. Perhaps suppress your warning notifications as well; that way you only get notified when it's critical and hard down, i.e. definitely having some long-term issue, and you won't be bothered by it fluctuating in and out of the warning threshold under normal operation.
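Suppressing the warning notifications is just the notification_options directive on the service. A minimal sketch (again with placeholder names, not your actual config):

define service{
        use                     generic-service
        host_name               sqlserver01
        service_description     Memory Usage
        check_command           check_nt!MEMUSE!-w 90 -c 95
        notification_options    c,r     ; notify on critical and recovery only, not warning
        }

The check still records warning states in the web interface and history; you just stop getting paged for them.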
But yeah, I can see that on an SQL server that one moment is running some easy query and the next is running some stupidly large one, memory usage will be up and down like a yo-yo, so you'll have to balance the annoyance factor of over-alerting against the fine-tuning of the thresholds, and then against the perceived importance of monitoring memory in the first place, and figure out whether it's "worth it" from both an administrative and a business perspective. What I can say for sure, though, is that over-alerting is the worst thing to my mind, and it only leads to situations where, when something IS seriously wrong, it usually just gets ignored because, "Well, we didn't really look into it because the server ALWAYS alerts like that…"