

Thwarting memory leaks and other out of memory errors with machine learning: Part 2

01.15.2014
This post was originally written by Sean at the Metafor blog.

In my previous blog I discussed why memory leaks are like scotch. In this post I’ll discuss various ways to identify memory leaks.

When an app has a memory leak and its memory usage is steadily increasing, there usually won't be an immediate symptom. Every system, virtual or physical, has a finite amount of memory (yes, there are fancy tricks to dynamically increase memory almost infinitely, but latency makes these impractical most of the time), and if the memory leak is not dealt with, it will eventually cause problems. Sure, you could just preventatively restart the app servers every day or however often you need to nip the leak in the bud … over and over. But let’s face it, that’s about as appealing as gluten-free beer, and will only stave off your pager duty for so long (if you know of a gluten-free beer that doesn’t suck, please do let me know!).

So what about the tools I mentioned earlier? Judging by Stack Overflow, profilers are a really popular solution. While they can definitely help identify memory leaks, you likely won’t know to use them until you get an out of memory error. They also come at a cost: they are guaranteed to slow down your app quite a bit. Analyzing memory dumps is another popular and effective way to track down memory leaks. Unfortunately, like profilers, you won’t know to analyze a memory dump until you are already experiencing an out of memory error. Working with memory dumps has a couple of other disadvantages: taking a dump from a live application makes it unresponsive for your customers, and if the dump is taken at the wrong time, it can contain a significant amount of noise.

Application performance management (APM) tools can help in the early identification department. Typically they use either manually set static thresholds, or dynamic thresholds based on simple predictive algorithms such as Holt-Winters. Unfortunately, if you set them too aggressively, you’ll be woken up at 3am by false positives, and if you set them too conservatively, the “early” part of early identification tends to go away. Finding the right threshold points over time in a dynamically scaling environment where deploys happen often is a huge pain, so most people set them quite conservatively with the knowledge that at least it’s better than nothing (and better than being woken up at 3am!). Identifying memory leaks well before they impact the business is the real trick, and that’s where machine learning based anomaly detection shines.
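
To make that concrete, here’s a rough Python sketch of the kind of dynamic threshold such a simple predictive algorithm produces: double exponential smoothing (Holt-Winters minus the seasonal term) forecasts the next memory sample and flags values that land far outside the forecast band. The smoothing factors and tolerance are illustrative guesses, not tuned values, and this is not any particular APM product’s implementation.

# Dynamic threshold via double exponential smoothing (Holt-Winters
# without the seasonal term). alpha, beta and tolerance are illustrative.
def holt_forecast_alerts(samples, alpha=0.5, beta=0.3, tolerance=3.0):
    """Yield (index, value, forecast) where a sample breaks the forecast band."""
    level, trend = samples[0], 0.0
    abs_dev = 0.0  # running mean absolute deviation of forecast errors
    for i, value in enumerate(samples[1:], start=1):
        forecast = level + trend
        error = value - forecast
        if i > 5 and abs(error) > tolerance * abs_dev:  # skip a short warm-up
            yield i, value, forecast
        abs_dev = alpha * abs(error) + (1 - alpha) * abs_dev
        # standard Holt update equations for level and trend
        new_level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level

# Heap-usage samples in MB; the 900 MB spike breaks the forecast band
usage = [512, 520, 515, 530, 525, 540, 535, 545, 900, 550]
for idx, value, forecast in holt_forecast_alerts(usage):
    print(f"sample {idx}: {value} MB vs forecast {forecast:.0f} MB")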

[Figure: Memory Leak] The “sawtooth” pattern of memory utilization: the sudden drop in used memory is a candidate symptom for a memory leak.

Ever-increasing memory usage is the most obvious indicator of an imminent out of memory error. Whether it’s caused by a memory leak, by an app storing ever-increasing amounts of information in memory as a cache (not a memory leak, since the information remains nominally in use), or by some other slow creep of memory consumption, a good trending-based anomaly detection algorithm will alert you long before an APM threshold will. A “sawtooth” pattern of memory utilization may also indicate a memory leak if the vertical drops coincide with reboots or application restarts. Note that garbage collection points could also cause such a pattern, in which case it simply shows healthy usage of the heap.
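
As a rough illustration of the trending-based idea, the Python sketch below fits a least-squares line to a window of recent memory samples and estimates how long until a memory limit would be hit; that estimate is the kind of signal you can alert on hours or days before a static threshold fires. The sample rate, window size, and memory limit here are all assumptions made up for the example.

# Trend-based early warning: fit a least-squares line to recent memory
# samples and estimate when usage would reach the limit.
def time_to_exhaustion(samples_mb, limit_mb, interval_s=60):
    """Return estimated seconds until the limit is hit, or None if not trending up."""
    n = len(samples_mb)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # MB per sample
    if slope <= 0:
        return None  # flat or falling: no exhaustion in sight
    remaining_mb = limit_mb - samples_mb[-1]
    return (remaining_mb / slope) * interval_s

# One sample per minute, heap creeping up ~2 MB/min toward a 4096 MB limit
window = [3000 + 2 * i for i in range(60)]
eta = time_to_exhaustion(window, limit_mb=4096)
if eta is not None and eta < 24 * 3600:
    print(f"memory may be exhausted in ~{eta / 3600:.1f} hours")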

In my next blog, “Thwarting memory leaks and other out of memory errors with machine learning: Part 3”, I’ll demonstrate how to use a simple machine learning technique for early detection of memory leak type anomalies.


 

Published at DZone with permission of Jenny Yang, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Peter Huber replied on Wed, 2014/01/15 - 6:35am

Could you please elaborate a bit more on why you think a sawtooth is a memory leak candidate?

Why do I ask? The sawtooth pattern is a very likely pattern if you watch the internal memory consumption of a Java VM with something like VisualVM. Okay, this is not completely comparable, because you're referring to the OS process memory consumption (outside view) whereas I'm speaking of used heap and garbage collection inside a JVM (inside view). But I guess the reason could be quite similar: work has to be done, memory consumption rises, and after the work is finished the memory can be freed again, like for instance in a web server.

What I've learnt as a pathological symptom is not the perfect sawtooth whose peak and base levels stay the same over time, but one where the levels go up slowly, i.e. every new peak/base is higher than the peak/base before, even if only slightly higher. In this case you have the typical work-has-to-be-done sawtooth pattern, but you could suspect that in the cleanup code somebody always forgets to free a portion, even if tiny, of the dynamically allocated memory. I've seen this well hidden on some occasions, meaning that we had to put considerable load on our process for something like days until the pattern clearly showed up in our plots, because the memory waste was too tiny to see the long-term trend when checking just a few minutes...

Edited:

About the detection of critical situations: I'd say it should never be, and mostly never is, as simple as putting it on one threshold. Usually you have a fatal threshold that tells you okay, here is something very, very, very wrong. Here it is actually simple. There is no maybe: get the call at 2am or the server is down... But mostly you also have a warning threshold that's quite a bit below the fatal threshold, and for that threshold things should be a bit trickier:

1.) I would not start a phone call if the warning threshold is touched by just one sample or for just a few seconds.

2.) If the value stays between the warning and fatal thresholds for more than x minutes and the 5-minute mean slope is always positive in this period, then I would say: raise the failure probability.

3.) And if you have other probes that also get worse and add up to the failure probability, then put out the call early, before memory consumption alone reaches the fatal level (see the sketch after this list). What could other probes be? Say you have been slashdotted and the page hits on your server skyrocket at the same time the memory consumption goes up... and by the way, in such a case you can see that the memory consumption is not due to a memory leak.

4.) I think most operations monitors like Nagios are able to use such combined metrics and some logic to do more complex things than just thresholding.
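
For what it's worth, here is a minimal Python sketch of the combined check described in points 2 and 3; the metric names, weights, and window lengths are hypothetical, not taken from Nagios or any other tool.

# Combine memory level, its recent slope, and a second probe (traffic)
# into one paging decision. All weights and thresholds are illustrative.
def should_page(mem_samples_mb, warn_mb, fatal_mb, req_rate_ratio):
    """Decide whether to page, given ~1 memory sample per minute."""
    latest = mem_samples_mb[-1]
    if latest >= fatal_mb:
        return True  # fatal threshold: no maybe, page immediately

    score = 0.0
    sustained = all(s >= warn_mb for s in mem_samples_mb[-5:])  # ~5 min above warning
    slope_up = all(b > a for a, b in zip(mem_samples_mb[-5:], mem_samples_mb[-4:]))
    if sustained:
        score += 0.4
    if sustained and slope_up:
        score += 0.3
    # second probe: traffic roughly flat while memory climbs -> leak more likely
    if req_rate_ratio < 1.2:
        score += 0.3
    return score >= 0.7

# Traffic is roughly flat (ratio ~1.0) while memory keeps climbing
samples = [3300, 3340, 3390, 3450, 3520]
print(should_page(samples, warn_mb=3200, fatal_mb=4000, req_rate_ratio=1.0))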

Deepu Roy replied on Wed, 2014/01/15 - 6:28am

I agree with Peter Huber. A sawtooth where the levels remain the same does not imply a memory leak. The increasing memory in the sawtooth might just indicate that the processing is data-intensive. When I see a sawtooth pattern in memory utilization, I look for the pattern's trend: does the free memory after the drop reduce over time? This needs to be looked at over a few cycles, since the application (or any utility within the application) caches information. The graph in this article wouldn't cause me to worry much. The steep drops correspond to GC cycles, and the free memory remains roughly the same at those points.
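
A quick Python sketch of the check both commenters describe: take the used-memory level right after each big drop (the troughs of the sawtooth, i.e. the mirror image of free memory after the drop) and see whether that baseline keeps creeping upward across cycles. The drop-detection heuristic and the sample series are made up for illustration.

# Extract the sawtooth troughs (memory right after each drop) and check
# whether that post-drop baseline creeps upward across GC/restart cycles.
def trough_baseline_rising(samples_mb, drop_ratio=0.2, min_troughs=4):
    """True if post-drop memory levels keep increasing cycle after cycle."""
    troughs = [
        samples_mb[i]
        for i in range(1, len(samples_mb))
        if samples_mb[i] < samples_mb[i - 1] * (1 - drop_ratio)  # a big drop
    ]
    if len(troughs) < min_troughs:
        return False  # not enough cycles to judge a trend
    rises = sum(1 for a, b in zip(troughs, troughs[1:]) if b > a)
    return rises == len(troughs) - 1  # every baseline higher than the last

# Healthy sawtooth: troughs return to roughly the same level each cycle
healthy = [200, 400, 600, 210, 410, 610, 205, 405, 605, 208, 408, 608, 200]
# Leaky sawtooth: each trough a bit higher than the one before
leaky = [200, 400, 600, 230, 430, 630, 260, 460, 660, 290, 490, 690, 320]
print(trough_baseline_rising(healthy), trough_baseline_rising(leaky))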
