[CLJ-1152] PermGen leak in multimethods and protocol fns when evaled Created: 30/Jan/13 Updated: 10/Dec/13
|Affects Version/s:||Release 1.4|
There is a PermGen memory leak that we have tracked down to protocol methods and multimethods called inside an eval, because of the caches these methods use. The problem only arises when the value being cached is an instance of a class (such as a function or reify) that was defined inside the eval. Thus extending IFn or dispatching a multimethod on an IFn are likely triggers.
My fellow LonoClouder, Jeff Dik, describes how to reproduce and work around the problem:
The easiest way that I have found to test this is to set "-XX:MaxPermSize" to a reasonable value so you don't have to wait too long for the PermGen space to fill up, and to use "-XX:+TraceClassLoading" and "-XX:+TraceClassUnloading" to see the classes being loaded and unloaded.
You can use lein swank 45678 and connect with slime in emacs via M-x slime-connect.
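One way to pass these flags to the JVM that Leiningen starts is a :jvm-opts entry in project.clj. This is only a sketch: the project name, the Clojure version pin, and the 64m limit are assumptions chosen for illustration, and -XX:MaxPermSize only has an effect on JVMs that still have a PermGen (Java 7 and earlier).

```clojure
;; project.clj (hypothetical project name; the 64m limit is an arbitrary small value)
(defproject permgen-leak-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.4.0"]]
  ;; A small PermGen fills up quickly; the trace flags log class loading and unloading.
  :jvm-opts ["-XX:MaxPermSize=64m"
             "-XX:+TraceClassLoading"
             "-XX:+TraceClassUnloading"])
```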
To monitor the PermGen usage, you can find the Java process to watch with "jps -lmvV" and then run "jstat -gcold <pid> 1s", where <pid> is the process ID that jps reported. According to the jstat docs, the first column (PC) is the "Current permanent space capacity (KB)" and the second column (PU) is the "Permanent space utilization (KB)". VisualVM is also a nice tool for monitoring this.
Evaluating the following code will run a loop that evals (take* (fn foo [])).
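A minimal sketch of such a reproduction, assuming a take* multimethod that dispatches on the type of its argument and has a method for clojure.lang.Fn. The names stop and run-loop are hypothetical helpers introduced here, and the code is meant to be evaluated at a REPL so that eval resolves take* in the current namespace.

```clojure
;; Multimethod that dispatches on the class of its argument.
(defmulti take* (fn [a] (type a)))
(defmethod take* clojure.lang.Fn [a] '())

;; Flip to true to end the loop (hypothetical helper).
(def stop (atom false))

(defn run-loop
  "Evals a fresh fn and passes it to take* until stop is set.
  Each eval compiles a new class for foo, which becomes a new dispatch value."
  []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (recur)))

(comment
  ;; Start the loop in the background, then stop it when done watching.
  (def runner (future (run-loop)))
  (reset! stop true))
```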
In the lein swank session, you will see many lines like below listing the classes being created and loaded.
These lines will stop once the PermGen space fills up.
In the jstat monitoring, you'll see the amount of used PermGen space (PU) increase to the max and stay there.
A workaround is to run prefer-method before the PermGen space is all used up, e.g.
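A minimal sketch of such a call, assuming a take* multimethod with a method for clojure.lang.Fn. The preference pair used here is an assumption chosen for illustration; it is the call itself that resets the cache.

```clojure
;; Assumed multimethod, dispatching on the class of its argument.
(defmulti take* (fn [a] (type a)))
(defmethod take* clojure.lang.Fn [a] '())

;; Populate the dispatch cache with a class created by eval.
(eval '(take* (fn foo [])))

;; Declaring a preference resets the multimethod's dispatch cache as a side effect.
(prefer-method take* clojure.lang.Fn java.lang.Object)
```

The preference stays registered afterwards and can be inspected with (prefers take*).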
Then, when the used PermGen space is close to the max, in the lein swank session, you will see the classes created by the eval'ing being unloaded.
In the jstat monitoring, there will be a long pause when used PermGen space stays close to the max, and then it will drop down, and start increasing again when more eval'ing occurs.
The defmulti defines a cache that uses the dispatch values as keys. Each eval call in the loop defines a new foo class which is then added to the cache when take* is called, preventing the class from ever being GCed.
The prefer-method workaround works because it calls clojure.lang.MultiFn.preferMethod, which calls the private MultiFn.resetCache method, which resets the cache to just the method table and so drops every cached dispatch entry.
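The growth and reset of the cache can be observed directly. The sketch below reads MultiFn's non-public methodCache field through reflection; the field name is an implementation detail assumed from the MultiFn source and may differ between Clojure versions, and cache-field and cache-size are hypothetical helpers. It is for illustration at a REPL only.

```clojure
(defmulti take* (fn [a] (type a)))
(defmethod take* clojure.lang.Fn [a] '())

;; Assumption: MultiFn keeps its dispatch cache in a field named "methodCache".
(def ^java.lang.reflect.Field cache-field
  (doto (.getDeclaredField clojure.lang.MultiFn "methodCache")
    (.setAccessible true)))

(defn cache-size
  "Number of entries currently held in the multimethod's dispatch cache."
  [mf]
  (count (.get cache-field mf)))

(def before (cache-size take*))

;; Each eval defines a new foo class, adding one more dispatch value to the cache.
(dotimes [_ 5]
  (eval '(take* (fn foo []))))
(def grown (cache-size take*))

;; prefer-method triggers resetCache, discarding the entries added above.
(prefer-method take* clojure.lang.Fn java.lang.Object)
(def after-reset (cache-size take*))

(println {:before before :grown grown :after-reset after-reset})
```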
The leak with protocol methods similarly involves a cache. You see essentially the same behavior as the multimethod leak if you run the following code using protocols.
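A minimal sketch of a protocol-based reproduction, assuming a protocol with a single take* method that is extended to clojure.lang.Fn. The names ITake, stop, and run-loop are hypothetical, and the code is meant to be evaluated at a REPL.

```clojure
;; Hypothetical protocol with one method.
(defprotocol ITake
  (take* [a]))

;; Every fn implements clojure.lang.Fn, so each evaled fn dispatches here.
(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

;; Flip to true to end the loop (hypothetical helper).
(def stop (atom false))

(defn run-loop
  "Evals a fresh fn and passes it to the protocol method until stop is set."
  []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (recur)))

(comment
  (def runner (future (run-loop)))
  (reset! stop true))
```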
Again, the cache is in the take* method itself, using each new foo class as a key.
A workaround is to run -reset-methods on the protocol before the PermGen space is all used up, e.g.
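A minimal sketch of such a call, assuming a hypothetical protocol named ITake whose take* method is extended to clojure.lang.Fn:

```clojure
;; Hypothetical protocol, extended to all fns.
(defprotocol ITake
  (take* [a]))

(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

;; Populate the protocol method's cache with a class created by eval.
(eval '(take* (fn foo [])))

;; Rebuild the protocol's method fns, each with a fresh cache.
(-reset-methods ITake)

;; The protocol method keeps working after the reset.
(take* (fn foo []))
```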
This works because -reset-methods replaces the cache with an empty MethodImplCache.
|Comment by Chouser [ 30/Jan/13 9:10 AM ]|
I think the most obvious solution would be to constrain the size of the cache. Adding an item to the cache is already not the fastest path, so a bit more work could be done to prevent the cache from growing indefinitely large.
That does raise the question of what criteria to use. Keep the first n entries? Keep the n most recently used (which would require bookkeeping in the fast cache-hit path)? Keep the n most recently added?
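To make the last option concrete, here is a minimal sketch of a "keep the n most recently added" policy as a plain persistent data structure. It is not a proposed patch to MultiFn or MethodImplCache; empty-cache and cache-put are hypothetical names, and the cache-hit path needs no bookkeeping because only insertions touch the queue.

```clojure
;; Cache state: the entries plus the order in which keys were first added.
(def empty-cache
  {:entries {}
   :order   clojure.lang.PersistentQueue/EMPTY})

(defn cache-put
  "Returns a cache with k mapped to v, evicting the oldest-added key
  once the number of entries would exceed limit."
  [{:keys [entries order]} limit k v]
  (if (contains? entries k)
    ;; Existing key: update the value, keep its place in the order.
    {:entries (assoc entries k v) :order order}
    (let [entries (assoc entries k v)
          order   (conj order k)]
      (if (> (count entries) limit)
        {:entries (dissoc entries (peek order)) :order (pop order)}
        {:entries entries :order order}))))

;; Usage: with a limit of 3, adding a fourth key evicts the first one.
(def example
  (reduce (fn [c k] (cache-put c 3 k (str k)))
          empty-cache
          [1 2 3 4]))
```

A lookup is then just (get (:entries example) k), so the fast path stays a single map read.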
|Comment by Jamie Stephens [ 18/Oct/13 9:35 AM ]|
At a minimum, perhaps a switch to disable the caches – with obvious performance impact caveats.
Seems like expensive LRU logic is probably the way to go, but maybe don't have it kick in fully until some threshold is crossed.
|Comment by Alex Miller [ 18/Oct/13 4:28 PM ]|
A report of this being seen in production, from the mailing list:
|Comment by Adrian Medina [ 10/Dec/13 11:43 AM ]|
So this is why we've been running into PermGen space exceptions! This is a fairly critical bug for us - I'm making extensive use of multimethods in our codebase and this exception will creep in at runtime randomly.