Thwarting memory leaks and other out of memory errors with machine learning: Part 2
In my previous blog I discussed why memory leaks are like scotch. In this post I’ll discuss various ways to identify memory leaks.
When an app has a memory leak and its memory usage is steadily increasing, there usually won't be an immediate symptom. Every system, virtual or physical, has a finite amount of memory (yes, there are fancy tricks to dynamically increase memory almost infinitely, but latency makes these impractical most of the time), and if the memory leak is not dealt with, it will eventually cause problems. Sure, you could just preventatively restart the app servers every day, or however often you need to nip the leak in the bud … over and over. But let's face it, that's about as appealing as gluten-free beer, and will only stave off your pager duty for so long (if you know of a gluten-free beer that doesn't suck, please do let me know!).
So what about the tools I mentioned earlier? Judging by Stack Overflow, profilers are a popular solution. While they can definitely help identify memory leaks, you likely won't know to use one until you've already hit an out of memory error. They also come at a cost: they are guaranteed to slow your app down considerably. Analyzing memory dumps is another popular and effective way to track down memory leaks. Unfortunately, as with profilers, you won't know to analyze a memory dump until you are already experiencing an out of memory error. Working with memory dumps has a couple of other disadvantages as well: taking a dump from a live application makes it unresponsive for your customers, and if the dump is taken at the wrong time, it can contain a significant amount of noise.
Application performance management (APM) tools can help in the early identification department. Typically they use either manually set static thresholds, or dynamic thresholds based on simple predictive algorithms such as Holt-Winters. Unfortunately, if you set them too aggressively you'll be woken up at 3am by false positives, and if you set them too conservatively, the "early" part tends to go away. Finding the right threshold points over time in a dynamically scaling environment where deploys happen often is a huge pain, so most people set them quite conservatively, with the knowledge that at least it's better than nothing (and better than being woken up at 3am!). Identifying memory leaks well before they impact the business is the real trick, and that's where machine learning based anomaly detection shines.
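To make the Holt-Winters point concrete, here is a minimal sketch of a dynamic threshold in Python. It uses Holt's double exponential smoothing (the trend component of Holt-Winters, without seasonality) and flags any point whose residual exceeds k standard deviations of the residuals seen so far. The alpha, beta, k, and warmup values are illustrative assumptions, not anything a particular APM product actually uses.

```python
import statistics

def holt_forecast(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts via Holt's double exponential smoothing."""
    level, trend = series[0], series[1] - series[0]
    forecasts = [series[0]]
    for x in series[1:]:
        forecasts.append(level + trend)  # forecast made before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts

def dynamic_threshold_alerts(series, k=3.0, warmup=10):
    """Indices where the value strays more than k sigma from its forecast."""
    forecasts = holt_forecast(series)
    residuals = [x - f for x, f in zip(series, forecasts)]
    alerts = []
    for i in range(warmup, len(series)):
        sigma = statistics.pstdev(residuals[:i]) or 1e-9
        if abs(residuals[i]) > k * sigma:
            alerts.append(i)
    return alerts

# Flat memory usage with one sudden jump: the jump gets flagged.
usage = [50.0] * 30
usage[20] = 80.0
print(dynamic_threshold_alerts(usage))
```

Because the threshold adapts to the variance the series has actually shown, a noisy-but-stable series doesn't page you while a clean series gets tight bounds; a hand-set static threshold can't make that trade-off.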
(Figure: the “sawtooth” pattern of memory utilization. The sudden drop in used memory is a candidate symptom for a memory leak.)
An ever increasing memory usage is the most obvious indicator of an imminent out of memory error. Whether it's caused by a memory leak, by an app storing ever increasing amounts of information in memory as a cache (not a memory leak, since the information remains nominally in use), or by some other slowly growing memory consumption, a good trend-based anomaly detection algorithm will alert you long before an APM threshold will. A "sawtooth" pattern of memory utilization may also indicate a memory leak if the vertical drops coincide with reboots or application restarts. Note that garbage collection cycles can also produce such a pattern; in that case the sawtooth simply reflects healthy usage of the heap.
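The trend-based detection described above can be sketched as a least-squares fit over a window of recent memory samples: alert only when the slope is positive and the fit is tight, i.e. steady growth rather than noise or a benign sawtooth. This is a minimal pure-Python illustration; the sample values and the min_slope / min_r2 cutoffs are made-up assumptions.

```python
def steady_growth(samples, min_slope=0.1, min_r2=0.9):
    """True if samples (e.g. MB readings at fixed intervals) show a
    sustained linear increase: positive slope and a high R-squared."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    syy = sum((y - mean_y) ** 2 for y in samples)
    if sxx == 0 or syy == 0:
        return False  # flat series: nothing to alert on
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy)
    return slope > min_slope and r2 > min_r2

print(steady_growth([100 + 2 * i for i in range(30)]))  # True: steady climb
print(steady_growth([100, 200] * 15))                   # False: oscillation, no trend
```

The R-squared check is what separates a genuine leak-like climb from a sawtooth or noisy-but-stable usage; in the oscillating example above the slope alone would have tripped the alert.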
In my next blog, “Thwarting memory leaks and other out of memory errors with machine learning: Part 3”, I’ll demonstrate how to use a simple machine learning technique for early detection of memory leak type anomalies.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)