<< Back to previous view

[CLJ-1465] SubVector leaks memory by design (?) Created: 14/Jul/14  Updated: 14/Jul/14  Resolved: 14/Jul/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Szymon Witamborski Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: memory


 Description   

While reading through the code of SubVector class, I've noticed that the only thing that it does is creating a "subview" of an existing vector without doing anything to elements that are not accessible any more. Please correct me if I'm wrong but I think this leads to memory leaks in this scenario:

(let [a [1 2 3 4 5]
      b (subvec a 2 3)]
  b)
[3]

In this example we no longer have any reference to a once the (let...) expression returned. Elements of a that are no longer accessible will not be garbage collected until b is garbage collected because b still holds a reference to a (the v field in SubVector class). Ideally, these elements should be garbage collectible as soon as a is gone.



 Comments   
Comment by Alex Miller [ 14/Jul/14 2:40 PM ]

The subvec docstring says:

"Returns a persistent vector of the items in vector from start (inclusive) to end (exclusive). If end is not supplied, defaults to (count vector). This operation is O(1) and very fast, as the resulting vector shares structure with the original and no trimming is done."

The implementation is intentional to make this a constant-time operation. If you are willing to make the tradeoff re shared structure and object retention, this constant operation has better performance. In other words: working as intended.





[CLJ-1366] The empty map literal is read as a different map each time Created: 01/Mar/14  Updated: 02/Mar/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5, Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: OHTA Shogo Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: memory, reader

Attachments: Text File 0001-make-the-reader-return-the-same-empty-map-when-it-re.patch     Text File 0002-make-the-reader-return-the-same-empty-map-when-it-re.patch    
Patch: Code

 Description   

As reported here (https://groups.google.com/forum/?hl=en#!topic/clojure-dev/n83hlRFsfHg), the empty map literal is read as a different map each time.

user=> (identical? (read-string "{}") (read-string "{}"))
false

Making the reader return the same empty map when it reads an empty map is expected to improve some memory efficiency, and also lead to consistency with the way other collection literals are read in.

user=> (identical? (read-string "()") (read-string "()"))
true
user=> (identical? (read-string "[]") (read-string "[]"))
true
user=> (identical? (read-string "#{}") (read-string "#{}"))
true

Cause: LispReader calls RT.map() with an empty array when it reads an empty map, and RT.map() in turn makes a new map unless its argument given is null.

Approach: make RT.map() return the same empty map when the argument is an empty array as well, not only when null



 Comments   
Comment by OHTA Shogo [ 01/Mar/14 2:59 AM ]

Sorry, the patch 0001-make-the-reader-return-the-same-empty-map-when-it-re.patch didn't work.

The updated patch 0002-make-the-reader-return-the-same-empty-map-when-it-re.patch works, but I'm afraid it'd be beyond the scope of this ticket since it modifies RT.map() behavior a bit.





[CLJ-1250] Reducer (and folder) instances hold onto the head of seqs Created: 03/Sep/13  Updated: 27/Jun/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: Release 1.7

Type: Defect Priority: Critical
Reporter: Christophe Grand Assignee: Unassigned
Resolution: Unresolved Votes: 8
Labels: memory, reducers

Attachments: Text File after-change.txt     Text File before-change.txt     Text File CLJ-1250-20131211.patch     Text File CLJ-1250-AllInvokeSites-20140204.patch     Text File CLJ-1250-AllInvokeSites-20140320.patch    
Patch: Code
Approval: Vetted

 Description   

Problem Statement
A shared function #'clojure.core.reducers/reducer holds on to the head of a reducible collection, causing it to blow up when the collection is a lazy sequence.

Cause: #'reducer closes over a collection when in order to reify CollReduce, and the closed-over is never cleared. When code attempts to reduce over this anonymous transformed collection, it will realize the tail while the head is stored in the closed-over.

Reproduction steps:
Compare the following calls:

(time (reduce + 0 (map identity (range 1e8))))
(time (reduce + 0 (r/map identity (range 1e8))))

The second call should fail on a normal or small heap.

(If reducers are faster than seqs, increase the range.)

Patch
CLJ-1250-AllInvokeSites-20140320.patch takes Approach #2

Approaches:

1) Reimplement the #'reducer (and #'folder) transformation fns similar to the manner that Christophe proposes here:

(defrecord Reducer [coll xf])

(extend-protocol 
  clojure.core.protocols/CollReduce
  Reducer
      (coll-reduce [r f1]
                   (clojure.core.protocols/coll-reduce r f1 (f1)))
      (coll-reduce [r f1 init]
                   (clojure.core.protocols/coll-reduce (:coll r) ((:xf r) f1) init)))

(def rreducer ->Reducer) 

(defn rmap [f coll]
  (rreducer coll (fn [g] 
                   (fn [acc x]
                     (g acc (f x))))))

Advantages: Relatively non-invasive change.
Disadvantages: Not evident code. Additional protocol dispatch, though only incurred once

2) Clear the reference to 'this' on the stack just before a tail call occurs

When a callee is in return position, clear the local variable reference to 'this' in the caller's stack frame, which will make the caller and all its closed-overs eligible for reclamation.

Patch 1211 takes this approach for InvokeExpr call sites in return position. Patch 1214 takes the same approach for InvokeExprs and also static and instance interop calls.

Here is the code that performs the clearing excerpted from the 1214 patch:

void emitClearThis(GeneratorAdapter gen) {
		gen.visitInsn(Opcodes.ACONST_NULL);
		gen.visitVarInsn(Opcodes.ASTORE, 0);
	}

Tail calls wrapped inside a try/catch/finally clause cannot have 'this' cleared, because closed-overs/locals may need to be emitted for exception handling blocks. Both patches consider and handle this edge case.

Advantages: Fixes this case with no user code changes. Enables GC to do reclaim closed-overs references earlier.
Disadvantages: A compiler change.

3) Alternate approach

from Christophe Grand:
Another way would be to enhance the local clearing mechanism to also clear "this" but it's complex since it may be needed to keep a reference to "this" for caches long after obvious references to "this" are needed.

Advantages: Fine-grained
Disadvantages: Complex, invasive, and the compiler is hard to hack on.

Mitigations
Avoid reducing on lazy-seqs and instead operate on vectors / maps, or custom reifiers of CollReduce or CollFold. This could be easier with some implementations of common collection functions being available (like iterate and partition).

See https://groups.google.com/d/msg/clojure-dev/t6NhGnYNH1A/2lXghJS5HywJ for previous discussion.



 Comments   
Comment by Gary Fredericks [ 03/Sep/13 8:53 AM ]

Fixed indentation in description.

Comment by Ghadi Shayban [ 11/Dec/13 11:08 PM ]

Adding a patch that clears "this" before tail calls. Verified that Christophe's repro case is fixed.

Will upload a diff of the bytecode soon.

Any reason this juicy bug was taken off 1.6?

Comment by Ghadi Shayban [ 11/Dec/13 11:17 PM ]

Here's the bytecode for the clojure.core.reducers/reducer reify before and after the change... Of course a straight diff isn't useful because all the line numbers changed. Kudos to Gary Trakhman for the no.disassemble lein plugin.

Comment by Christophe Grand [ 12/Dec/13 6:58 AM ]

Ghadi, I'm a bit surprised by this part of the patch: was the local clearing always a no-op here?

-		if(context == C.RETURN)
+		if(shouldClear)
 			{
-			ObjMethod method = (ObjMethod) METHOD.deref();
-			method.emitClearLocals(gen);
+                            gen.visitInsn(Opcodes.ACONST_NULL);
+                            gen.visitVarInsn(Opcodes.ASTORE, 0);
 			}

The problem with this approach (clear this on tail call) is that it adds yet another special case. To me the complexity stem from having to keep this around even if the user code doesn't refer to it.

Comment by Ghadi Shayban [ 12/Dec/13 7:19 AM ]

Thank you - I failed to mention this in the commit message: it appears that emitClearLocals() belonging to both ObjMethod and FnMethod (its child) are empty no-ops. I believe the actual local clearing is on line 4855.

I agree re: another special case in the compiler.

Comment by Alex Miller [ 12/Dec/13 8:56 AM ]

Ghadi re 1.6 - this ticket was never in the 1.6 list, it has not yet been vetted by Rich but is ready to do so when we open up again after 1.6.

Comment by Ghadi Shayban [ 12/Dec/13 8:59 AM ]

Sorry I confused the critical list with the Rel1.6 list.

Comment by Ghadi Shayban [ 14/Dec/13 11:16 AM ]

New patch 20131214 that handles all tail invoke sites (InvokeExpr + StaticMethodExpr + InstanceMethodExpr). 'StaticInvokeExpr' seems like an old remnant that had no active code path, so that was left as-is.

The approach taken is still the same as the original small patch that addressed only InvokeExpr, except that it is now using a couple small helpers. The commit message has more details.

Also a 'try' block with no catch or finally clause now becomes a BodyExpr. Arguably a user error, historically accepted, and still accepted, but now they are a regular BodyExpr, instead of being wrapped by a the no-op try/catch mechanism. This second commit can be optionally discarded.

With this patch on my machine (4/8 core/thread Ivy Bridge) running on bare clojure.main:
Christophe's test cases both run i 3060ms on a artificially constrained 100M max heap, indicating a dominant GC overhead. (But they now both work!)

When max heap is at a comfortable 2G the reducers version outpaces the lazyseq at 2100ms vs 2600ms!

Comment by Ghadi Shayban [ 13/Jan/14 10:48 AM ]

Updating stale patch after latest changes to master. Latest is CLJ-1250-AllInvokeSites-20140113

Comment by Ghadi Shayban [ 04/Feb/14 3:50 PM ]

Updating patch after murmur changes

Comment by Tassilo Horn [ 13/Feb/14 4:52 AM ]

Ghadi, I suffer from the problem of this issue. Therefore, I've applied your patch CLJ-1250-AllInvokeSites-20140204.patch to the current git master. However, then I get lots of "java.lang.NoSuchFieldError: array" errors when the clojure tests are run:

     [java] clojure.test-clojure.clojure-set
     [java] 
     [java] java.lang.NoSuchFieldError: array
     [java] 	at clojure.core.protocols$fn__6026.invoke(protocols.clj:123)
     [java] 	at clojure.core.protocols$fn__5994$G__5989__6003.invoke(protocols.clj:19)
     [java] 	at clojure.core.protocols$fn__6023.invoke(protocols.clj:147)
     [java] 	at clojure.core.protocols$fn__5994$G__5989__6003.invoke(protocols.clj:19)
     [java] 	at clojure.core.protocols$seq_reduce.invoke(protocols.clj:31)
     [java] 	at clojure.core.protocols$fn__6017.invoke(protocols.clj:48)
     [java] 	at clojure.core.protocols$fn__5968$G__5963__5981.invoke(protocols.clj:13)
     [java] 	at clojure.core$reduce.invoke(core.clj:6213)
     [java] 	at clojure.set$difference.doInvoke(set.clj:61)
     [java] 	at clojure.lang.RestFn.invoke(RestFn.java:442)
     [java] 	at clojure.test_clojure.clojure_set$fn__1050$fn__1083.invoke(clojure_set.clj:109)
     [java] 	at clojure.test_clojure.clojure_set$fn__1050.invoke(clojure_set.clj:109)
     [java] 	at clojure.test$test_var$fn__7123.invoke(test.clj:704)
     [java] 	at clojure.test$test_var.invoke(test.clj:704)
     [java] 	at clojure.test$test_vars$fn__7145$fn__7150.invoke(test.clj:721)
     [java] 	at clojure.test$default_fixture.invoke(test.clj:674)
     [java] 	at clojure.test$test_vars$fn__7145.invoke(test.clj:721)
     [java] 	at clojure.test$default_fixture.invoke(test.clj:674)
     [java] 	at clojure.test$test_vars.invoke(test.clj:718)
     [java] 	at clojure.test$test_all_vars.invoke(test.clj:727)
     [java] 	at clojure.test$test_ns.invoke(test.clj:746)
     [java] 	at clojure.core$map$fn__2665.invoke(core.clj:2515)
     [java] 	at clojure.lang.LazySeq.sval(LazySeq.java:40)
     [java] 	at clojure.lang.LazySeq.seq(LazySeq.java:49)
     [java] 	at clojure.lang.Cons.next(Cons.java:39)
     [java] 	at clojure.lang.RT.boundedLength(RT.java:1655)
     [java] 	at clojure.lang.RestFn.applyTo(RestFn.java:130)
     [java] 	at clojure.core$apply.invoke(core.clj:619)
     [java] 	at clojure.test$run_tests.doInvoke(test.clj:761)
     [java] 	at clojure.lang.RestFn.applyTo(RestFn.java:137)
     [java] 	at clojure.core$apply.invoke(core.clj:617)
     [java] 	at clojure.test.generative.runner$run_all_tests$fn__527.invoke(runner.clj:255)
     [java] 	at clojure.test.generative.runner$run_all_tests$run_with_counts__519$fn__523.invoke(runner.clj:251)
     [java] 	at clojure.test.generative.runner$run_all_tests$run_with_counts__519.invoke(runner.clj:251)
     [java] 	at clojure.test.generative.runner$run_all_tests.invoke(runner.clj:253)
     [java] 	at clojure.test.generative.runner$test_dirs.doInvoke(runner.clj:304)
     [java] 	at clojure.lang.RestFn.applyTo(RestFn.java:137)
     [java] 	at clojure.core$apply.invoke(core.clj:617)
     [java] 	at clojure.test.generative.runner$_main.doInvoke(runner.clj:312)
     [java] 	at clojure.lang.RestFn.invoke(RestFn.java:408)
     [java] 	at user$eval564.invoke(run_tests.clj:3)
     [java] 	at clojure.lang.Compiler.eval(Compiler.java:6657)
     [java] 	at clojure.lang.Compiler.load(Compiler.java:7084)
     [java] 	at clojure.lang.Compiler.loadFile(Compiler.java:7040)
     [java] 	at clojure.main$load_script.invoke(main.clj:274)
     [java] 	at clojure.main$script_opt.invoke(main.clj:336)
     [java] 	at clojure.main$main.doInvoke(main.clj:420)
     [java] 	at clojure.lang.RestFn.invoke(RestFn.java:408)
     [java] 	at clojure.lang.Var.invoke(Var.java:379)
     [java] 	at clojure.lang.AFn.applyToHelper(AFn.java:154)
     [java] 	at clojure.lang.Var.applyTo(Var.java:700)
     [java] 	at clojure.main.main(main.java:37)
Comment by Ghadi Shayban [ 13/Feb/14 8:23 AM ]

Can you give some details about your JVM/environment that can help reproduce? I'm not encountering this error.

Comment by Tassilo Horn [ 13/Feb/14 9:41 AM ]

Sure. It's a 64bit ThinkPad running GNU/Linux.

% java -version
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.5) (ArchLinux build 7.u51_2.4.5-1-x86_64)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
Comment by Ghadi Shayban [ 13/Feb/14 10:19 AM ]

Strange, that is exactly my mail env, OpenJDK7 on Arch, 64-bit. I have also tested on JDK 6/7/8 on OSX mavericks. Are you certain that the git tree is clean besides the patch? (Arch users unite!)

Comment by Tassilo Horn [ 14/Feb/14 1:13 AM ]

Yes, the tree is clean. But now I see that I get the same error also after resetting to origin/master, so it's not caused by your patch at all. Oh, now the error vanished after doing a `mvn clean`! So problem solved.

Comment by Nicola Mometto [ 19/Feb/14 12:32 PM ]

Ghandi, FnExpr.parse should bind IN_TRY_BLOCK to false before analyzing the fn body, consider the case

(try (do something (fn a [] (heap-consuming-op a))) (catch Exception e ..))

Here in the a function the this local will never be cleared even though it's perfectly safe to.
Admittedly this is an edge case but we should cover this possibility too.

Comment by Ghadi Shayban [ 19/Feb/14 2:06 PM ]

You may have auto-corrected my name to Ghandi instead of Ghadi. I wish I were that wise =)

I will update the patch for FnExpr (that seems reasonable), but maybe after 1.6 winds down and the next batch of tickets get scrutiny. It would be nice to get input on a preferred approach from Rich or core after it gets vetted – or quite possibly not vetted.

Comment by Nicola Mometto [ 19/Feb/14 6:11 PM ]

hah, sorry for the typo on the name

Seems reasonable to me, in the meantime I just pushed to tools.analyzer/tools.emitter complete support for "this" clearing, I'll test this a bit in the next few days to make sure it doesn't cause unexpected problems.

Comment by Andy Fingerhut [ 24/Feb/14 12:13 PM ]

Patch CLJ-1250-AllInvokeSites-20140204.patch no longer applies cleanly to latest master as of Feb 23, 2014. It did on Feb 14, 2014. Most likely some of its context lines are changed by the commit to Clojure master on Feb 23, 2014 – I haven't checked in detail.

Comment by Ghadi Shayban [ 20/Mar/14 4:39 PM ]

Added a patch that 1) applies cleanly, 2) binds the IN_TRY_EXPR to false initially when analyzing FnExpr and 3) uses RT.booleanCast





[CLJ-1152] PermGen leak in multimethods and protocol fns when evaled Created: 30/Jan/13  Updated: 26/Jul/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.4
Fix Version/s: Release 1.7

Type: Defect Priority: Critical
Reporter: Chouser Assignee: Unassigned
Resolution: Unresolved Votes: 5
Labels: memory, protocols

Attachments: File naive-lru-method-cache-for-multimethods.diff    
Approval: Vetted

 Description   

There is a PermGen memory leak that we have tracked down to protocol methods and multimethods called inside an eval, because of the caches these methods use. The problem only arises when the value being cached is an instance of a class (such as a function or reify) that was defined inside the eval. Thus extending IFn or dispatching a multimethod on an IFn are likely triggers.

My fellow LonoClouder, Jeff Dik describes how to reproduce and work around the problem:

The easiest way that I have found to test this is to set "-XX:MaxPermSize" to a reasonable value so you don't have to wait too long for the PermGen space to fill up, and to use "-XX:+TraceClassLoading" and "-XX:+TraceClassUnloading" to see the classes being loaded and unloaded.

leiningen project.clj
(defproject permgen-scratch "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.0-RC1"]]
  :jvm-opts ["-XX:MaxPermSize=32M"
             "-XX:+TraceClassLoading"
             "-XX:+TraceClassUnloading"])

You can use lein swank 45678 and connect with slime in emacs via M-x slime-connect.

To monitor the PermGen usage, you can find the Java process to watch with "jps -lmvV" and then run "jstat -gcold <PROCESS_ID> 1s". According to the jstat docs, the first column (PC) is the "Current permanent space capacity (KB)" and the second column (PU) is the "Permanent space utilization (KB)". VisualVM is also a nice tool for monitoring this.

Multimethod leak

Evaluating the following code will run a loop that eval's (take* (fn foo [])).

multimethod leak
(defmulti take* (fn [a] (type a)))

(defmethod take* clojure.lang.Fn
  [a]
  '())

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)

In the lein swank session, you will see many lines like below listing the classes being created and loaded.

[Loaded user$eval15802$foo__15803 from __JVM_DefineClass__]
[Loaded user$eval15802 from __JVM_DefineClass__]

These lines will stop once the PermGen space fills up.

In the jstat monitoring, you'll see the amount of used PermGen space (PU) increase to the max and stay there.

-    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 31616.0  31552.7    365952.0         0.0      4     0    0.000    0.129
 32000.0  31914.0    365952.0         0.0      4     0    0.000    0.129
 32768.0  32635.5    365952.0         0.0      4     0    0.000    0.129
 32768.0  32767.6    365952.0      1872.0      5     1    0.000    0.177
 32768.0  32108.2    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32470.4    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258

A workaround is to run prefer-method before the PermGen space is all used up, e.g.

(prefer-method take* clojure.lang.Fn java.lang.Object)

Then, when the used PermGen space is close to the max, in the lein swank session, you will see the classes created by the eval'ing being unloaded.

[Unloading class user$eval5950$foo__5951]
[Unloading class user$eval3814]
[Unloading class user$eval2902$foo__2903]
[Unloading class user$eval13414]

In the jstat monitoring, there will be a long pause when used PermGen space stays close to the max, and then it will drop down, and start increasing again when more eval'ing occurs.

-    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  17891.3    283776.0     17243.9      6     2   50.589   50.813
 32768.0  18254.2    283776.0     17243.9      6     2   50.589   50.813

The defmulti defines a cache that uses the dispatch values as keys. Each eval call in the loop defines a new foo class which is then added to the cache when take* is called, preventing the class from ever being GCed.

The prefer-method workaround works because it calls clojure.lang.MultiFn.preferMethod, which calls the private MultiFn.resetCache method, which completely empties the cache.

Protocol leak

The leak with protocol methods similarly involves a cache. You see essentially the same behavior as the multimethod leak if you run the following code using protocols.

protocol leak
(defprotocol ITake (take* [a]))

(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)

Again, the cache is in the take* method itself, using each new foo class as a key.

A workaround is to run -reset-methods on the protocol before the PermGen space is all used up, e.g.

(-reset-methods ITake)

This works because -reset-methods replaces the cache with an empty MethodImplCache.



 Comments   
Comment by Chouser [ 30/Jan/13 9:10 AM ]

I think the most obvious solution would be to constrain the size of the cache. Adding an item to the cache is already not the fastest path, so a bit more work could be done to prevent the cache from growing indefinitely large.

That does raise the question of what criteria to use. Keep the first n entries? Keep the n most recently used (which would require bookkeeping in the fast cache-hit path)? Keep the n most recently added?

Comment by Jamie Stephens [ 18/Oct/13 9:35 AM ]

At a minimum, perhaps a switch to disable the caches – with obvious performance impact caveats.

Seems like expensive LRU logic is probably the way to go, but maybe don't have it kick in fully until some threshold is crossed.

Comment by Alex Miller [ 18/Oct/13 4:28 PM ]

A report seeing this in production from mailing list:
https://groups.google.com/forum/#!topic/clojure/_n3HipchjCc

Comment by Adrian Medina [ 10/Dec/13 11:43 AM ]

So this is why we've been running into PermGen space exceptions! This is a fairly critical bug for us - I'm making extensive use of multimethods in our codebase and this exception will creep in at runtime randomly.

Comment by Kevin Downey [ 17/Apr/14 9:52 PM ]

it might be better to split this in to two issues, because at a very abstract level the two issues are the "same", but concretely they are distinct (protocols don't really share code paths with multimethods), keeping them together in one issue seems like a recipe for a large hard to read patch

Comment by Kevin Downey [ 26/Jul/14 5:49 PM ]

naive-lru-method-cache-for-multimethods.diff replaces the methodCache in multimethods with a very naive lru cache built on PersistentHashMap and PersistentQueue





[CLJ-1125] Clojure can leak memory when used in a servlet container Created: 11/Dec/12  Updated: 11/Jan/14  Resolved: 11/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: None
Fix Version/s: Release 1.6

Type: Defect Priority: Critical
Reporter: Toby Crawley Assignee: Unassigned
Resolution: Completed Votes: 14
Labels: memory

Attachments: File threadlocal-removal-tcrawley-2012-12-11.diff     File threadlocal-removal-tcrawley-2013-06-14.diff     File threadlocal-removal-tcrawley-2013-11-24.diff    
Patch: Code
Approval: Ok

 Description   

When used within a servlet container (Jetty/Tomcat/JBossAS/Immutant/etc), the thread locals Var.dvals (used to store dynamic bindings) and LockingTransaction.transaction (used to store the currently active transaction(s)) prevent all of the classes loaded by an application's clojure runtime from being garbage collected, resulting in a memory leak.

Cause: The issue comes from threads living beyond the lifetime of a deployment - servlet containers use thread pools that are shared across all applications within the container. Currently, the dvals and transaction thread locals are not discarded when they are no longer needed, causing their contents to retain a hard reference to their classloaders, which, in turn, causes all of the classes loaded under the application's classloader to be retained until the thread exits (which is generally at JVM shutdown).

Solution: I've attached a patch that does the following:

  • Var.dvals is initialized to a canonical TOP Frame
  • Var.dvals is now removed when the thread bindings are popped to the TOP
  • The outer transaction in LockingTransaction.transaction now removes the thread local when it is finished

There is still the opportunity for memory leaks if agents or futures are used, and the executors used for them are not shutdown when the app is undeployed. That's a solvable problem, but should probably be solved by the containers themselves (and/or the war generation tools) instead of in clojure itself.

This patch has a small performance impact: its use of a try/finally around running transactions to remove the outer transaction adds 4-6 microseconds to each transaction call on my hardware.

Providing an automated test for this patch is difficult - I've tested it locally with repeated deployments to a container while monitoring GC and permgen. All of clojure's tests pass with it applied.

The above is a condensation of:
https://groups.google.com/d/topic/clojure-dev/3CXDe8_9G58/discussion

Patch: threadlocal-removal-tcrawley-2013-11-24.diff

Screened by: Alex Miller - the new patch (since prior screening) has no changes in the LockingTransaction code but has been updated in Var to address the regression logged in CLJ-1299.



 Comments   
Comment by Colin Jones [ 13/May/13 7:30 PM ]

This patch works great for me to avoid OOM/PermGen crashes from classloaders being retained [mine is a non-servlet use case].

Comment by Stuart Halloway [ 24/May/13 9:43 AM ]

Does Tomcat create warnings for Clojure, as described e.g. here?

If so, does this patch make the warnings go away?

Comment by Toby Crawley [ 24/May/13 9:56 AM ]

Stu: that's a good question. I'll take a look at Tomcat this afternoon.

Comment by Stuart Halloway [ 24/May/13 10:04 AM ]

The code that calls transaction.remove() seems unncessarily subtle. There are two exits from the method, and only one is protected by the finally block.

If the "outer" case was a top-level if, the logic would be more clear, and only the "outer" case would need try/finally, which might reduce the performance penalty in the case of deeply nested dosyncs.

Did your transaction overhead of 4-6 microseconds test only one level of dosync, or many?

Comment by Stuart Halloway [ 24/May/13 10:13 AM ]

Because the unwind code calls remove at the top (as opposed to set(null)), the code should now be safe for use with Clojure-defined ThreadLocal subclasses.

Therefore, Var's use of an initialValue should be irrelevant to this patch, and it should be possible to fix this bug with a patch half the size of the current patch, touching only LockingTransaction.runInTransaction and Var.popThreadBindings.

Comment by Toby Crawley [ 14/Jun/13 7:38 AM ]

re: Tomcat ThreadLocal warnings

With Clojure 1.5.1 using my test app (linked below), I see:

Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [clojure.lang.Var$1] (value [clojure.lang.Var$1@4902919]) and a value of type [clojure.lang.Var.Frame] (value [clojure.lang.Var$Frame@147a2aa6]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
Jun 14, 2013 6:35:22 AM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/leak] created a ThreadLocal with key of type [java.lang.ThreadLocal] (value [java.lang.ThreadLocal@608602ca]) and a value of type [clojure.lang.LockingTransaction] (value [clojure.lang.LockingTransaction@7e214d47]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

With the original patch (threadlocal-removal-tcrawley-2012-12-11.diff) and the one attached today (threadlocal-removal-tcrawley-2013-06-14.diff), I no longer see these warnings.

re: the LockingTransaction.runInTransaction changes

In today's patch (threadlocal-removal-tcrawley-2013-06-14.diff), I modified runInTransaction to have one exit point, and only wrap a call to run with a try/finally in the outer transaction case. It does introduce two locations where run can be called to preserve the case where an inner transaction has null info:

static public Object runInTransaction(Callable fn) throws Exception{
	LockingTransaction t = transaction.get();
        Object ret;
	if(t == null) {
            transaction.set(t = new LockingTransaction());
            try {
                ret = t.run(fn);
            } finally {
                transaction.remove();
            }
        } else {
            if(t.info != null) {
                ret = fn.call();
            } else {
                ret = t.run(fn);
            }
        }

        return ret;
}

However, this will likely not reduce the speed penalty I observed in my testing, as I was only using a single level of dosync when capturing timing data.

re: removing initialValue from dvals

My original solution kept initialValue, but I then apparently discovered cases where the leak still occurred (see the mailing list thread).

Unfortunately, I can neither recreate that case, nor find in my notes, test code, or the clojure code a reason why keeping initialValue would allow the ThreadLocals to leak when popThreadBindings is patched (assuming one doesn't call Var.getThreadBindings from Java without calling Var.popThreadBindings).

Therefore, I've attached a simpler patch (threadlocal-removal-tcrawley-2013-06-14.diff) that just patches LockingTransaction.runInTransaction and Var.popThreadBindings.

I've also created a project that demonstrates the leak with 1.5.1, and that the leak does not appear with this patch applied to 1.6.0-master. See its README for usage details.

The patched version of 1.6.0-master is available as [org.clojars.tcrawley/clojure "1.6.0-clearthreadlocals"] if anyone wants to give it a try in their own projects. Note that since its group isn't 'org.clojure', you may need to add exclusions to your project to prevent another version of clojure being included.

Comment by Andy Fingerhut [ 14/Jun/13 10:56 AM ]

Presumptuously changing ticket approval from Incomplete back to its former Vetted state, since Toby's comments and new patch seem to address the comments that led Stu to change it to Incomplete.

Comment by Toby Crawley [ 02/Aug/13 10:30 AM ]

Stu:

Is there anything else you need from me for this to be applied?

Comment by Chas Emerick [ 04/Aug/13 5:52 PM ]

FWIW, using Toby's Clojure dep with Immutant has eliminated the out-of-permgen errors I used to occasionally get after N app redeployments.

Comment by Alex Miller [ 23/Aug/13 11:08 AM ]

I looked at the updated patch and it seems good to me. In the LockingTransaction.runinTransaction code the cases are driven by where t=null and t.info=null. Of those 4 cases, I believe the same call is being made in all but the case of t == null (where a new LockingTransaction is created) and t.info != null. However, I believe since a new txn is created and t.info should start as null, this case does not actually exist in practice.

Greatly appreciate Chas's experience feedback and all of Toby and Stu's work to make this change solid!

Marking screened.

Comment by Alex Miller [ 22/Nov/13 7:59 PM ]

Reverted in 1.6.0-alpha3 based on CLJ-1299 report.

Comment by Toby Crawley [ 24/Nov/13 5:22 PM ]

I just attached a new patch (threadlocal-removal-tcrawley-2013-11-24.diff) that achieves the same ThreadLocal removal as the previous patch, but addresses the issues with binding conveyance reported in CLJ-1299.





[ASYNC-32] onto-chan retains head of input sequence causing OutOfMemoryError Created: 06/Nov/13  Updated: 23/Jun/14

Status: Open
Project: core.async
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Brian Lubeski Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: memory
Environment:

org.clojure/core.async 0.1.242.0-44b1e3-alpha


Attachments: File patch.clj    
Patch: Code
Approval: Triaged

 Description   
(require '[clojure.core.async :as a])
(let [c (a/chan)]
  (a/onto-chan c (iterate inc 0))
  (while true
    (a/<!! c)))

onto-chan is holding on to the head of the input sequence as it is unfolded, resulting in an (eventual) OutOfMemoryError.

I've attached a diff showing changes I made to onto-chan that fixed the problem for me.






Generated at Mon Jul 28 05:55:29 CDT 2014 using JIRA 4.4#649-r158309.