[CLJ-1125] Clojure can leak memory when used in a servlet container Created: 11/Dec/12 Updated: 11/Jan/14 Resolved: 11/Jan/14
|Fix Version/s:||Release 1.6|
|Attachments:||threadlocal-removal-tcrawley-2012-12-11.diff threadlocal-removal-tcrawley-2013-06-14.diff threadlocal-removal-tcrawley-2013-11-24.diff|
When used within a servlet container (Jetty/Tomcat/JBossAS/Immutant/etc), the thread locals Var.dvals (used to store dynamic bindings) and LockingTransaction.transaction (used to store the currently active transaction(s)) prevent all of the classes loaded by an application's clojure runtime from being garbage collected, resulting in a memory leak.
Cause: The issue comes from threads living beyond the lifetime of a deployment - servlet containers use thread pools that are shared across all applications within the container. Currently, the dvals and transaction thread locals are not discarded when they are no longer needed, causing their contents to retain a hard reference to their classloaders, which, in turn, causes all of the classes loaded under the application's classloader to be retained until the thread exits (which is generally at JVM shutdown).
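Illustration of the cause, as a minimal sketch and not Clojure's actual code: the class name, the STATE field (a stand-in for Var.dvals or LockingTransaction.transaction), and the single-thread pool standing in for a container's shared thread pool are all assumptions made for the example. It shows a value left in a thread local by one task still being reachable from the pooled thread when the next task runs, until remove() is called.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PooledThreadLocalDemo {
    // Hypothetical stand-in for a thread local such as Var.dvals.
    static final ThreadLocal<Object> STATE = new ThreadLocal<>();

    // Returns { value retained before remove(), value retained after remove() }.
    static boolean[] demo() {
        // One worker thread, reused for every task, like a container's pooled thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<Boolean> hasValue = () -> STATE.get() != null;
        try {
            // "Application" work stores a value and never cleans it up.
            // In a container, the value's class would pin the app's classloader.
            Runnable appWork = () -> STATE.set(new Object());
            pool.submit(appWork).get();

            // A later, unrelated task on the same thread still sees the value.
            boolean retainedBefore = pool.submit(hasValue).get();

            // Explicitly discarding the entry releases the reference.
            Runnable cleanup = STATE::remove;
            pool.submit(cleanup).get();
            boolean retainedAfter = pool.submit(hasValue).get();

            return new boolean[] { retainedBefore, retainedAfter };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("retained before remove: " + r[0]); // true
        System.out.println("retained after remove: " + r[1]);  // false
    }
}
```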
Solution: I've attached a patch that does the following:
There is still the opportunity for memory leaks if agents or futures are used, and the executors used for them are not shut down when the app is undeployed. That's a solvable problem, but should probably be solved by the containers themselves (and/or the war generation tools) instead of in Clojure itself.
This patch has a small performance impact: its use of a try/finally around running transactions to remove the outer transaction adds 4-6 microseconds to each transaction call on my hardware.
Providing an automated test for this patch is difficult - I've tested it locally with repeated deployments to a container while monitoring GC and permgen. All of clojure's tests pass with it applied.
The above is a condensation of:
Screened by: Alex Miller - the new patch (since prior screening) has no changes in the LockingTransaction code but has been updated in Var to address the regression logged in
|Comment by Colin Jones [ 13/May/13 7:30 PM ]|
This patch works great for me to avoid OOM/PermGen crashes from classloaders being retained [mine is a non-servlet use case].
|Comment by Stuart Halloway [ 24/May/13 9:43 AM ]|
Does Tomcat create warnings for Clojure, as described e.g. here?
If so, does this patch make the warnings go away?
|Comment by Toby Crawley [ 24/May/13 9:56 AM ]|
Stu: that's a good question. I'll take a look at Tomcat this afternoon.
|Comment by Stuart Halloway [ 24/May/13 10:04 AM ]|
The code that calls transaction.remove() seems unnecessarily subtle. There are two exits from the method, and only one is protected by the finally block.
If the "outer" case was a top-level if, the logic would be more clear, and only the "outer" case would need try/finally, which might reduce the performance penalty in the case of deeply nested dosyncs.
Did your transaction overhead of 4-6 microseconds test only one level of dosync, or many?
|Comment by Stuart Halloway [ 24/May/13 10:13 AM ]|
Because the unwind code calls remove at the top (as opposed to set(null)), the code should now be safe for use with Clojure-defined ThreadLocal subclasses.
Therefore, Var's use of an initialValue should be irrelevant to this patch, and it should be possible to fix this bug with a patch half the size of the current patch, touching only LockingTransaction.runInTransaction and Var.popThreadBindings.
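To illustrate the remove versus set(null) distinction with plain java.lang.ThreadLocal (a sketch only; the class name and the FRAME field are hypothetical and are not Var's real fields): after set(null) the thread keeps a map entry and get() returns null, whereas after remove() the entry is gone and the next get() calls initialValue() again.

```java
class RemoveVsSetNull {
    // Counts how often initialValue() runs (single-threaded demo only).
    static int initCalls = 0;

    // Hypothetical ThreadLocal subclass that supplies an initial value.
    static final ThreadLocal<Object> FRAME = new ThreadLocal<Object>() {
        @Override
        protected Object initialValue() {
            initCalls++;
            return new Object();
        }
    };

    // Returns { get() is null after set(null), get() is non-null after remove() }.
    static boolean[] demo() {
        FRAME.remove();
        initCalls = 0;

        FRAME.get();                       // first access: initialValue() runs once
        FRAME.set(null);                   // entry stays in the thread's map, holding null
        boolean nullAfterSetNull = FRAME.get() == null && initCalls == 1;

        FRAME.remove();                    // entry is deleted from the thread's map
        boolean freshAfterRemove = FRAME.get() != null && initCalls == 2;

        FRAME.remove();                    // leave the calling thread clean
        return new boolean[] { nullAfterSetNull, freshAfterRemove };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("null after set(null): " + r[0]);
        System.out.println("re-initialized after remove(): " + r[1]);
    }
}
```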
|Comment by Toby Crawley [ 14/Jun/13 7:38 AM ]|
With Clojure 1.5.1 using my test app (linked below), I see:
With the original patch (threadlocal-removal-tcrawley-2012-12-11.diff) and the one attached today (threadlocal-removal-tcrawley-2013-06-14.diff), I no longer see these warnings.
In today's patch (threadlocal-removal-tcrawley-2013-06-14.diff), I modified runInTransaction to have one exit point, and only wrap a call to run with a try/finally in the outer transaction case. It does introduce two locations where run can be called to preserve the case where an inner transaction has null info:
However, this will likely not reduce the speed penalty I observed in my testing, as I was only using a single level of dosync when capturing timing data.
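For readers following along, the control flow described in this comment can be sketched roughly as below. This is an illustration under stated assumptions, not the attached diff: TxnSketch, CURRENT, and the simplified run method (which only marks the transaction as running, with no retry or commit logic) are hypothetical stand-ins for LockingTransaction and its internals.

```java
import java.util.concurrent.Callable;

class TxnSketch {
    // Hypothetical stand-in for LockingTransaction.transaction.
    static final ThreadLocal<TxnSketch> CURRENT = new ThreadLocal<>();

    // Non-null only while this transaction is running (stand-in for t.info).
    Object info;

    // Simplified stand-in for run: marks the transaction as running around fn.
    <T> T run(Callable<T> fn) throws Exception {
        info = new Object();
        try {
            return fn.call();
        } finally {
            info = null;
        }
    }

    // Single exit point; only the outer case pays for try/finally.
    static <T> T runInTransaction(Callable<T> fn) throws Exception {
        TxnSketch t = CURRENT.get();
        T ret;
        if (t == null) {
            // Outer transaction: install the thread local, always remove it on exit.
            t = new TxnSketch();
            CURRENT.set(t);
            try {
                ret = t.run(fn);
            } finally {
                CURRENT.remove();
            }
        } else if (t.info != null) {
            // Nested inside a running transaction: just call the body.
            ret = fn.call();
        } else {
            // Existing transaction object with null info: second place run is called.
            ret = t.run(fn);
        }
        return ret;
    }

    // Runs a nested transaction and checks that nothing is left on the thread.
    static boolean demo() {
        try {
            Callable<Integer> inner = () -> 42;
            Callable<Integer> outer = () -> runInTransaction(inner) + 1;
            int result = runInTransaction(outer);
            return result == 43 && CURRENT.get() == null;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("nested result ok and thread local cleared: " + demo());
    }
}
```

Because the finally block exists only in the t == null branch, nested dosync calls take the cheaper paths, so any try/finally cost is paid once per outermost transaction.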
My original solution kept initialValue, but I then apparently discovered cases where the leak still occurred (see the mailing list thread).
Unfortunately, I can neither recreate that case, nor find in my notes, test code, or the clojure code a reason why keeping initialValue would allow the ThreadLocals to leak when popThreadBindings is patched (assuming one doesn't call Var.getThreadBindings from Java without calling Var.popThreadBindings).
Therefore, I've attached a simpler patch (threadlocal-removal-tcrawley-2013-06-14.diff) that just patches LockingTransaction.runInTransaction and Var.popThreadBindings.
The patched version of 1.6.0-master is available as [org.clojars.tcrawley/clojure "1.6.0-clearthreadlocals"] if anyone wants to give it a try in their own projects. Note that since its group isn't 'org.clojure', you may need to add exclusions to your project to prevent another version of clojure being included.
|Comment by Andy Fingerhut [ 14/Jun/13 10:56 AM ]|
Presumptuously changing ticket approval from Incomplete back to its former Vetted state, since Toby's comments and new patch seem to address the comments that led Stu to change it to Incomplete.
|Comment by Toby Crawley [ 02/Aug/13 10:30 AM ]|
Is there anything else you need from me for this to be applied?
|Comment by Chas Emerick [ 04/Aug/13 5:52 PM ]|
FWIW, using Toby's Clojure dep with Immutant has eliminated the out-of-permgen errors I used to occasionally get after N app redeployments.
|Comment by Alex Miller [ 23/Aug/13 11:08 AM ]|
I looked at the updated patch and it seems good to me. In the LockingTransaction.runInTransaction code the cases are driven by whether t == null and whether t.info == null. Of those 4 cases, I believe the same call is being made in all but the case of t == null (where a new LockingTransaction is created) and t.info != null. However, I believe since a new txn is created and t.info should start as null, this case does not actually exist in practice.
Greatly appreciate Chas's experience feedback and all of Toby and Stu's work to make this change solid!
|Comment by Alex Miller [ 22/Nov/13 7:59 PM ]|
Reverted in 1.6.0-alpha3 based on
|Comment by Toby Crawley [ 24/Nov/13 5:22 PM ]|
I just attached a new patch (threadlocal-removal-tcrawley-2013-11-24.diff) that achieves the same ThreadLocal removal as the previous patch, but addresses the issues with binding conveyance reported in