
[CLJ-1125] Clojure can leak memory when used in a servlet container Created: 11/Dec/12  Updated: 11/Jan/14  Resolved: 11/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: None
Fix Version/s: Release 1.6

Type: Defect Priority: Critical
Reporter: Toby Crawley Assignee: Unassigned
Resolution: Completed Votes: 14
Labels: memory

Attachments: File threadlocal-removal-tcrawley-2012-12-11.diff     File threadlocal-removal-tcrawley-2013-06-14.diff     File threadlocal-removal-tcrawley-2013-11-24.diff    
Patch: Code
Approval: Ok

 Description   

When used within a servlet container (Jetty/Tomcat/JBossAS/Immutant/etc), the thread locals Var.dvals (used to store dynamic bindings) and LockingTransaction.transaction (used to store the currently active transaction(s)) prevent all of the classes loaded by an application's clojure runtime from being garbage collected, resulting in a memory leak.

Cause: The issue comes from threads living beyond the lifetime of a deployment - servlet containers use thread pools that are shared across all applications within the container. Currently, the dvals and transaction thread locals are not discarded when they are no longer needed, causing their contents to retain a hard reference to their classloaders, which, in turn, causes all of the classes loaded under the application's classloader to be retained until the thread exits (which is generally at JVM shutdown).
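The mechanism described above can be reproduced outside a container. The following is a minimal sketch, not code from Clojure or from the patch: the class name PooledThreadLocalLeak and the STATE thread local are hypothetical stand-ins for Var.dvals and LockingTransaction.transaction, and a single-thread executor stands in for the container's shared thread pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PooledThreadLocalLeak {
    // Hypothetical stand-in for Var.dvals / LockingTransaction.transaction.
    static final ThreadLocal<Object> STATE = new ThreadLocal<>();

    /**
     * Runs two tasks on the same pooled thread and reports whether the value
     * set by the first task is still reachable from the second one.
     */
    static boolean valueSurvivesTask(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First task: plays the role of a request handled by the application.
            pool.submit(() -> {
                STATE.set(new Object());
                if (cleanUp) {
                    STATE.remove(); // discard the entry once it is no longer needed
                }
            }).get();
            // Second task: same thread, later in time (e.g. after an undeploy).
            return pool.submit(() -> STATE.get() != null).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("value survives without remove(): " + valueSurvivesTask(false));
        System.out.println("value survives with remove():    " + valueSurvivesTask(true));
    }
}
```

In a real container the retained value would be an instance of a class loaded by the application's classloader, which is what keeps that classloader and its classes reachable for as long as the pooled thread lives.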

Solution: I've attached a patch that does the following:

  • Var.dvals is initialized to a canonical TOP Frame
  • Var.dvals is now removed when the thread bindings are popped to the TOP
  • The outer transaction in LockingTransaction.transaction now removes the thread local when it is finished
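The first two bullets can be sketched as follows. This is an illustrative approximation under stated assumptions, not the attached patch: BindingStackSketch, Frame, push, pop, and the initCalls counter are hypothetical names, and the counter exists only so that removal can be observed (ThreadLocal has no "is set" query).

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BindingStackSketch {
    /** One level of dynamic bindings, linked to the level below it. */
    static final class Frame {
        final Object bindings;
        final Frame prev;
        Frame(Object bindings, Frame prev) { this.bindings = bindings; this.prev = prev; }
    }

    // Canonical sentinel marking "no bindings pushed on this thread".
    static final Frame TOP = new Frame(null, null);

    // Counts initialValue() calls so the sketch can detect that an entry was removed.
    static final AtomicInteger initCalls = new AtomicInteger();

    static final ThreadLocal<Frame> dvals = new ThreadLocal<Frame>() {
        @Override protected Frame initialValue() {
            initCalls.incrementAndGet();
            return TOP;
        }
    };

    static void push(Object bindings) {
        dvals.set(new Frame(bindings, dvals.get()));
    }

    static void pop() {
        Frame f = dvals.get();
        if (f == TOP) {
            throw new IllegalStateException("pop without matching push");
        }
        if (f.prev == TOP) {
            dvals.remove();     // back at TOP: drop the thread-local entry entirely
        } else {
            dvals.set(f.prev);  // still nested: just unwind one level
        }
    }

    /** Pushes and pops two levels, then checks that the entry was really removed. */
    static boolean entryRemovedAfterBalancedPops() {
        dvals.remove();                 // start from a clean state on this thread
        push("outer");
        push("inner");
        pop();
        pop();
        int before = initCalls.get();
        dvals.get();                    // re-runs initialValue() only if the entry is gone
        boolean removed = initCalls.get() == before + 1;
        dvals.remove();                 // leave no entry behind
        return removed;
    }
}
```

The design point is that unwinding to TOP calls remove() rather than storing TOP back into the thread local, so an idle pooled thread holds no frame at all.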

There is still the opportunity for memory leaks if agents or futures are used and the executors used for them are not shut down when the app is undeployed. That's a solvable problem, but should probably be solved by the containers themselves (and/or the war generation tools) instead of in Clojure itself.
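As a rough illustration of what such container-side or tool-side cleanup might look like, here is a minimal sketch of an undeploy hook that shuts down an application-owned executor. The class UndeployHookSketch and the method shutDownOnUndeploy are hypothetical; how a real container would invoke such a hook is outside the scope of this ticket. (For Clojure's own agent pools, the existing clojure.core function shutdown-agents serves this purpose.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class UndeployHookSketch {
    /**
     * Hypothetical hook to be called when the application is undeployed.
     * Returns true if all executor threads terminated within the timeouts.
     */
    static boolean shutDownOnUndeploy(ExecutorService appExecutor, long timeoutMillis)
            throws InterruptedException {
        appExecutor.shutdown(); // stop accepting new tasks, let queued ones finish
        if (appExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        appExecutor.shutdownNow(); // interrupt tasks that are still running
        return appExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Once the executor's threads have exited, they no longer reference the application's classes, so they cannot keep its classloader alive.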

This patch has a small performance impact: its use of a try/finally around running transactions to remove the outer transaction adds 4-6 microseconds to each transaction call on my hardware.

Providing an automated test for this patch is difficult - I've tested it locally with repeated deployments to a container while monitoring GC and permgen. All of clojure's tests pass with it applied.

The above is a condensation of:
https://groups.google.com/d/topic/clojure-dev/3CXDe8_9G58/discussion

Patch: threadlocal-removal-tcrawley-2013-11-24.diff

Screened by: Alex Miller - the new patch (since prior screening) has no changes in the LockingTransaction code but has been updated in Var to address the regression logged in CLJ-1299.



 Comments   
Comment by Colin Jones [ 13/May/13 7:30 PM ]

This patch works great for me to avoid OOM/PermGen crashes from classloaders being retained [mine is a non-servlet use case].

Comment by Stuart Halloway [ 24/May/13 9:43 AM ]

Does Tomcat create warnings for Clojure, as described e.g. here?

If so, does this patch make the warnings go away?

Comment by Toby Crawley [ 24/May/13 9:56 AM ]

Stu: that's a good question. I'll take a look at Tomcat this afternoon.

Comment by Stuart Halloway [ 24/May/13 10:04 AM ]

The code that calls transaction.remove() seems unnecessarily subtle. There are two exits from the method, and only one is protected by the finally block.

If the "outer" case was a top-level if, the logic would be more clear, and only the "outer" case would need try/finally, which might reduce the performance penalty in the case of deeply nested dosyncs.

Did your transaction overhead of 4-6 microseconds test only one level of dosync, or many?

Comment by Stuart Halloway [ 24/May/13 10:13 AM ]

Because the unwind code calls remove at the top (as opposed to set(null)), the code should now be safe for use with Clojure-defined ThreadLocal subclasses.
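The difference between remove() and set(null) matters for any ThreadLocal that defines an initial value. The sketch below shows the standard java.lang.ThreadLocal behavior; the class name RemoveVsSetNull and the "initial" value are illustrative only.

```java
public class RemoveVsSetNull {
    static final ThreadLocal<String> local = ThreadLocal.withInitial(() -> "initial");

    /** After set(null) the entry still exists, so get() returns the stored null. */
    static String afterSetNull() {
        local.set("x");
        local.set(null);
        return local.get();
    }

    /** After remove() the entry is gone, so get() recomputes the initial value. */
    static String afterRemove() {
        local.set("x");
        local.remove();
        return local.get();
    }

    public static void main(String[] args) {
        System.out.println(afterSetNull());
        System.out.println(afterRemove());
        local.remove(); // leave no entry behind on this thread
    }
}
```

With set(null), later readers would have to handle a null that the initial value was meant to rule out; with remove(), the next get() on that thread starts from the initial value again.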

Therefore, Var's use of an initialValue should be irrelevant to this patch, and it should be possible to fix this bug with a patch half the size of the current patch, touching only LockingTransaction.runInTransaction and Var.popThreadBindings.

Comment by Toby Crawley [ 14/Jun/13 7:38 AM ]

re: Tomcat ThreadLocal warnings

With Clojure 1.5.1 using my test app (linked below), I see:

Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [clojure.lang.Var$1] (value [clojure.lang.Var$1@4902919]) and a value of type [clojure.lang.Var.Frame] (value [clojure.lang.Var$Frame@147a2aa6]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@608602ca]) and a value of type [clojure.lang.LockingTransaction] (value [clojure.lang.LockingTransaction@7e214d47]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

With the original patch (threadlocal-removal-tcrawley-2012-12-11.diff) and the one attached today (threadlocal-removal-tcrawley-2013-06-14.diff), I no longer see these warnings.

re: the LockingTransaction.runInTransaction changes

In today's patch (threadlocal-removal-tcrawley-2013-06-14.diff), I modified runInTransaction to have one exit point, and only wrap a call to run with a try/finally in the outer transaction case. It does introduce two locations where run can be called to preserve the case where an inner transaction has null info:

static public Object runInTransaction(Callable fn) throws Exception{
    LockingTransaction t = transaction.get();
    Object ret;
    if(t == null) {
        transaction.set(t = new LockingTransaction());
        try {
            ret = t.run(fn);
        } finally {
            transaction.remove();
        }
    } else {
        if(t.info != null) {
            ret = fn.call();
        } else {
            ret = t.run(fn);
        }
    }

    return ret;
}
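The cleanup pattern in that method can be exercised in isolation. The following is a simplified, self-contained sketch and not the real LockingTransaction: OuterTxnCleanupSketch, Txn, runInTxn, and inTxn are hypothetical names, and nested calls here simply join the enclosing transaction without modelling the t.info check.

```java
import java.util.concurrent.Callable;

public class OuterTxnCleanupSketch {
    /** Hypothetical placeholder for per-thread transaction state. */
    static final class Txn { }

    static final ThreadLocal<Txn> current = new ThreadLocal<>();

    static <T> T runInTxn(Callable<T> body) throws Exception {
        if (current.get() != null) {
            return body.call();   // nested call: join the enclosing transaction
        }
        current.set(new Txn());   // outer call: install the transaction
        try {
            return body.call();
        } finally {
            current.remove();     // runs on normal return and on exception
        }
    }

    /** True while the calling thread is inside runInTxn. */
    static boolean inTxn() {
        return current.get() != null;
    }
}
```

Only the outer call pays for the try/finally; nested calls take the first branch and return directly.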

However, this will likely not reduce the speed penalty I observed in my testing, as I was only using a single level of dosync when capturing timing data.

re: removing initialValue from dvals

My original solution kept initialValue, but I then apparently discovered cases where the leak still occurred (see the mailing list thread).

Unfortunately, I can neither recreate that case, nor find in my notes, test code, or the clojure code a reason why keeping initialValue would allow the ThreadLocals to leak when popThreadBindings is patched (assuming one doesn't call Var.getThreadBindings from Java without calling Var.popThreadBindings).

Therefore, I've attached a simpler patch (threadlocal-removal-tcrawley-2013-06-14.diff) that just patches LockingTransaction.runInTransaction and Var.popThreadBindings.

I've also created a project that demonstrates the leak with 1.5.1, and that the leak does not appear with this patch applied to 1.6.0-master. See its README for usage details.

The patched version of 1.6.0-master is available as [org.clojars.tcrawley/clojure "1.6.0-clearthreadlocals"] if anyone wants to give it a try in their own projects. Note that since its group isn't 'org.clojure', you may need to add exclusions to your project to prevent another version of clojure being included.

Comment by Andy Fingerhut [ 14/Jun/13 10:56 AM ]

Presumptuously changing ticket approval from Incomplete back to its former Vetted state, since Toby's comments and new patch seem to address the comments that led Stu to change it to Incomplete.

Comment by Toby Crawley [ 02/Aug/13 10:30 AM ]

Stu:

Is there anything else you need from me for this to be applied?

Comment by Chas Emerick [ 04/Aug/13 5:52 PM ]

FWIW, using Toby's Clojure dep with Immutant has eliminated the out-of-permgen errors I used to occasionally get after N app redeployments.

Comment by Alex Miller [ 23/Aug/13 11:08 AM ]

I looked at the updated patch and it seems good to me. In the LockingTransaction.runInTransaction code the cases are driven by whether t == null and whether t.info == null. Of those 4 cases, I believe the same call is being made in all but the case of t == null (where a new LockingTransaction is created) and t.info != null. However, I believe since a new txn is created and t.info should start as null, this case does not actually exist in practice.

Greatly appreciate Chas's experience feedback and all of Toby and Stu's work to make this change solid!

Marking screened.

Comment by Alex Miller [ 22/Nov/13 7:59 PM ]

Reverted in 1.6.0-alpha3 based on CLJ-1299 report.

Comment by Toby Crawley [ 24/Nov/13 5:22 PM ]

I just attached a new patch (threadlocal-removal-tcrawley-2013-11-24.diff) that achieves the same ThreadLocal removal as the previous patch, but addresses the issues with binding conveyance reported in CLJ-1299.

Generated at Wed Nov 26 11:38:44 CST 2014 using JIRA 4.4#649-r158309.