<< Back to previous view

[CLJ-1330] Class name clash between top-level functions and defn'ed ones Created: 22/Jan/14  Updated: 13/Sep/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: None
Fix Version/s: Release 1.7

Type: Defect Priority: Critical
Reporter: Nicola Mometto Assignee: Alex Miller
Resolution: Unresolved Votes: 7
Labels: aot, compiler

Attachments: Text File 0001-CLJ-1093-v2.patch     Text File 0001-Fix-CLJ-1330-make-top-level-named-functions-classnam.patch     Text File 0001-Fix-CLJ-1330-refactored.patch     File demo1.clj    
Patch: Code
Approval: Vetted

 Description   

Named anonymous fn's are not guaranteed to have unique class names when AOT-compiled.

For example:

(defn g [])
(def xx (fn g []))

When AOT-compiled both functions will emit user$g.class, the latter overwriting the former.

Demonstration script: demo1.clj

Patch: 0001-CLJ-1093-v2.patch

Approach: Generate unique class names for named fn's the same way as for unnamed anonymous fn's.

See also: This patch also fixes the issue reported in CLJ-1227.



 Comments   
Comment by Ambrose Bonnaire-Sergeant [ 22/Jan/14 11:12 AM ]

This seems like the reason why jvm.tools.analyzer cannot analyze clojure.core. On analyzing a definline, there is an "attempted duplicate class definition" error.

This doesn't really matter, but I thought it may or may not be useful information to someone.

Comment by Nicola Mometto [ 22/Jan/14 11:35 AM ]

Attached a fix.

This also fixes AOT compiling of code like:

(def x (fn foo []))
(fn foo [])
Comment by Nicola Mometto [ 22/Jan/14 11:39 AM ]

Cleaned up patch

Comment by Alex Miller [ 22/Jan/14 12:43 PM ]

It looks like the patch changes indentation of some of the code - can you fix that?

Comment by Nicola Mometto [ 22/Jan/14 3:57 PM ]

Updated patch without whitespace changes

Comment by Alex Miller [ 22/Jan/14 4:15 PM ]

Thanks, that's helpful.

Comment by Alex Miller [ 24/Jan/14 10:03 AM ]

There is consensus that this is a problem, however this is an area of the code with broad impacts as it deals with how classes are named. To that end, there is some work that needs to be done in understanding the impacts before we can consider it.

Some questions we would like to answer:

1) According to Rich, naming of (fn x []) function classes used to work in the manner of this patch - with generated names. Some code archaeology needs to be done on why that was changed and whether the change to the current behavior was addressing problems that we are likely to run into.

2) Are there issues with recursive functions? Are there impacts either in AOT or non-AOT use cases? Need some tests.

3) Are there issues with dynamic redefinition of functions? With the static naming scheme, redefinition causes a new class of the same name which can be picked up by reload of classes compiled to the old definition. With the dynamic naming scheme, redefinition will create a differently named class so old classes can never pick up a redefinition. Is this a problem? What are the impacts with and without AOT? Need some tests.

Comment by Nicola Mometto [ 24/Jan/14 11:39 AM ]

Looks like the current behaviour has been such since https://github.com/clojure/clojure/commit/4651e60808bb459355a3a5d0d649c4697c672e28

My guess is that Rich simply forgot to consider the (def y (fn x [] ..)) case.

Regarding 2 and 3, the dynamic naming scheme is no different than what happens for anonymous functions so I don't see how this could cause any issue.

Recursion on the fn arg is simply a call to .invoke on "this", it's classname unaware.

I can add some tests to test that

(def y (fn x [] 1))
and
(fn x [] 2)
compile to different classnames but other than that I don't see what should be tested.

Comment by Stuart Halloway [ 27/Jun/14 2:17 PM ]

incomplete pending the answers to Alex Miller's questions in the comments

Comment by Nicola Mometto [ 27/Jun/14 3:20 PM ]

I believe I already answered his questions, I'll try to be a bit more explicit:
I tracked the relevant commit from Rich which added the dynamic naming behaviour https://github.com/clojure/clojure/commit/4651e60808bb459355a3a5d0d649c4697c672e28#diff-f17f860d14163523f1e1308ece478ddbL3081 which clearly shows that this bug was present since then so.

Regarding redefinitions or recursive functions, both of those operations never take in account the generated fn name so they are unaffected.

Comment by Alex Miller [ 12/Sep/14 4:32 PM ]

Summarizing some cases here from before/after the patch:

1) top-level fn (always has name)
	1.6 - namespace$name
	patch - namespace$name
2) non-top-level fn with name
	1.6 - namespace$name (collides with prior case)
	patch - namespace$topname__x__name  	<-- CHANGED
3) anonymous fn (no name)
	1.6 - namespace$name$fn__x
	patch - namespace$name$fn__x
4) top-level anonymous fn (no name, not at all useful :)
	1.6 - namespace$fn__x
	patch - namespace$fn__x

The key problem is that the first 2 cases produce the identical class name on 1.6. The patch alters the non-top-level named fn so there is no conflict.

Prior to the referenced old commit, I believe cases 1 and 2 would both produce namespace$name__x (where x is unique) so they would not collide. The change was made to prevent the top-level name from changing ("don't append numbers on top-level fn class names"). While the similar change was made on non-top-level fn names, I do not think it needed to be.

I've thought through (and tried) a bunch of the implications of this with the help of Nicola's comments above and I do not see an issue with any of the things I've considered. From a binary compatibility point of view with existing AOT code, old code compiled together should be self-consistent and continue to work. New compiled code will also be consistent. I can't think of a way that new code would expect to know the old name of a non-top-level function such that there could be an issue.

One question - why change the code such that the new class name is namespace$name$topname__x__name instead of namespace$name$topname_name__x (or something else?). And relatedly, while the diff is small, could we refactor a couple more lines to make the intent and cases clearer?

I am 90% ok with this patch but want a little thought into that question before I mark screened.

Comment by Nicola Mometto [ 12/Sep/14 4:47 PM ]

Alex, the attached patch munges into ns$topname__name__x, not into ns$topname__x__name.

Comment by Nicola Mometto [ 12/Sep/14 5:22 PM ]

The attached patch 0001-Fix-CLJ-1330refactored.patch contains the same fix from 0001-FixCLJ-1330-make-top-level-named-functions-classnam.patch but also refactors the code that deals with fn name munging

Comment by Alex Miller [ 12/Sep/14 6:22 PM ]

Hmmm.. I will double-check. That's not why I recall seeing when I did AOT.

Comment by Nicola Mometto [ 12/Sep/14 7:26 PM ]

New patch 0001-CLJ-1093-v2.patch improves the fn naming scheme a lot.
I've threw together a number of test cases that show the improvement + bug fixes:

pre patch:

Clojure 1.7.0-master-SNAPSHOT
user=> (fn [])
#<user$eval1$fn__2 user$eval1$fn__2@4e13aa4e>
user=> (fn a [])
#<user$eval5$a__6 user$eval5$a__6@6946a317>
user=> (let [a (fn [])] a)
#<user$eval9$a__10 user$eval9$a__10@15fdf894>
user=> (let [a (fn a [])] a)
#<user$eval13$a__14 user$eval13$a__14@1d87a604>
user=> (let [a (fn x [])] a)
#<user$eval17$x__18 user$eval17$x__18@7f0cd67f>
user=> (def a (fn [])) a
#'user/a
#<user$a user$a@33e1ccbc>
user=> (def a (fn x [])) a
#'user/a
#<user$x user$x@59a04a1b> ;; <== BUG!!
user=> (def ^{:foo (fn [])} a) (-> (meta #'a) :foo)
#'user/a
#<user$fn__23 user$fn__23@d9c21c6>
user=> (def ^{:foo (fn a [])} a) (-> (meta #'a) :foo)
#'user/a
#<user$a user$a@420dd874> ;; <== BUG!!
user=> (def a (fn [] (fn []))) (a)
#'user/a
#<user$a$fn__30 user$a$fn__30@6f57be76>
user=> (def a (fn [] (fn x []))) (a)
#'user/a
#<user$a$x__35 user$a$x__35@79930089>
user=> (let [x (fn [])] (def a (fn [] x))) a (a)
#'user/a
#<user$eval40$a__43 user$eval40$a__43@6db1694e>
#<user$eval40$x__41 user$eval40$x__41@20bd16bb>
user=> (let [x (fn a [])] (def a (fn [] x))) (a)
#'user/a
#<user$eval48$a__49 user$eval48$a__49@75d6d1d4> ;; <== the local binding name doesn't appear in the class name
user=> (let [x (fn x [])] (def a (fn [] x))) (a)
#'user/a
#<user$eval56$x__57 user$eval56$x__57@16d81c91>

post patch:

Clojure 1.7.0-master-SNAPSHOT
user=> (fn [])
#<user$eval1$fn__3 user$eval1$fn__3@3c92218c>
user=> (fn a [])
#<user$eval6$a__8 user$eval6$a__8@6f85c59c>
user=> (let [a (fn [])] a)
#<user$eval11$a__13 user$eval11$a__13@4d051922>
user=> (let [a (fn a [])] a)
#<user$eval16$a__a__18 user$eval16$a__a__18@138a92e7>
user=> (let [a (fn x [])] a)
#<user$eval21$a__x__23 user$eval21$a__x__23@528ef256>
user=> (def a (fn [])) a
#'user/a
#<user$a user$a@6bef63f9>
user=> (def a (fn x [])) a
#'user/a
#<user$a__x__28 user$a__x__28@5f0bebef>
user=> (def ^{:foo (fn [])} a) (-> (meta #'a) :foo)
#'user/a
#<user$fn__30 user$fn__30@4cf0f2eb>
user=> (def ^{:foo (fn a [])} a) (-> (meta #'a) :foo)
#'user/a
#<user$a__35 user$a__35@37ff95a9>
user=> (def a (fn [] (fn []))) (a)
#'user/a
#<user$a$fn__41 user$a$fn__41@fd34eac>
user=> (def a (fn [] (fn x []))) (a)
#'user/a
#<user$a$x__48 user$a$x__48@6fc334de>
user=> (let [x (fn [])] (def a (fn [] x))) a (a)
#'user/a
#<user$eval54$a__58 user$eval54$a__58@7c721de>
#<user$eval54$x__56 user$eval54$x__56@43f7b41b>
user=> (let [x (fn a [])] (def a (fn [] x))) (a)
#'user/a
#<user$eval64$x__a__66 user$eval64$x__a__66@460d4> ;; <== the local binding name is included in the class name
user=> (let [x (fn x [])] (def a (fn [] x))) (a)
#'user/a
#<user$eval74$x__x__76 user$eval74$x__x__76@77cab87f>
user=>

As you can see, this last patch not only fixes the two bugs, but also improves fn naming in let contexts by preserving the name of the local binding in the class name, this I believe will be a great improvement in the understandability of stacktraces.





[CLJ-1152] PermGen leak in multimethods and protocol fns when evaled Created: 30/Jan/13  Updated: 15/Sep/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.4
Fix Version/s: Release 1.7

Type: Defect Priority: Critical
Reporter: Chouser Assignee: Unassigned
Resolution: Unresolved Votes: 6
Labels: memory, protocols

Attachments: File naive-lru-for-multimethods-and-protocols.diff     File protocol_multifn_weak_ref_cache.diff    
Patch: Code
Approval: Vetted

 Description   

There is a PermGen memory leak that we have tracked down to protocol methods and multimethods called inside an eval, because of the caches these methods use. The problem only arises when the value being cached is an instance of a class (such as a function or reify) that was defined inside the eval. Thus extending IFn or dispatching a multimethod on an IFn are likely triggers.

Reproducing: The easiest way that I have found to test this is to set "-XX:MaxPermSize" to a reasonable value so you don't have to wait too long for the PermGen spaaaaace to fill up, and to use "-XX:+TraceClassLoading" and "-XX:+TraceClassUnloading" to see the classes being loaded and unloaded.

leiningen project.clj
(defproject permgen-scratch "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.0-RC1"]]
  :jvm-opts ["-XX:MaxPermSize=32M"
             "-XX:+TraceClassLoading"
             "-XX:+TraceClassUnloading"])

You can use lein swank 45678 and connect with slime in emacs via M-x slime-connect.

To monitor the PermGen usage, you can find the Java process to watch with "jps -lmvV" and then run "jstat -gcold <PROCESS_ID> 1s". According to the jstat docs, the first column (PC) is the "Current permanent space capacity (KB)" and the second column (PU) is the "Permanent space utilization (KB)". VisualVM is also a nice tool for monitoring this.

Multimethod leak

Evaluating the following code will run a loop that eval's (take* (fn foo [])).

multimethod leak
(defmulti take* (fn [a] (type a)))

(defmethod take* clojure.lang.Fn
  [a]
  '())

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)

In the lein swank session, you will see many lines like below listing the classes being created and loaded.

[Loaded user$eval15802$foo__15803 from __JVM_DefineClass__]
[Loaded user$eval15802 from __JVM_DefineClass__]

These lines will stop once the PermGen space fills up.

In the jstat monitoring, you'll see the amount of used PermGen space (PU) increase to the max and stay there.

-    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 31616.0  31552.7    365952.0         0.0      4     0    0.000    0.129
 32000.0  31914.0    365952.0         0.0      4     0    0.000    0.129
 32768.0  32635.5    365952.0         0.0      4     0    0.000    0.129
 32768.0  32767.6    365952.0      1872.0      5     1    0.000    0.177
 32768.0  32108.2    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32470.4    291008.0     23681.8      6     2    0.827    1.006
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258
 32768.0  32767.2    698880.0     24013.8      8     4    1.073    1.258

A workaround is to run prefer-method before the PermGen space is all used up, e.g.

(prefer-method take* clojure.lang.Fn java.lang.Object)

Then, when the used PermGen space is close to the max, in the lein swank session, you will see the classes created by the eval'ing being unloaded.

[Unloading class user$eval5950$foo__5951]
[Unloading class user$eval3814]
[Unloading class user$eval2902$foo__2903]
[Unloading class user$eval13414]

In the jstat monitoring, there will be a long pause when used PermGen space stays close to the max, and then it will drop down, and start increasing again when more eval'ing occurs.

-    PC       PU        OC          OU       YGC    FGC    FGCT     GCT
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  32767.9    159680.0     24573.4      6     2    0.167    0.391
 32768.0  17891.3    283776.0     17243.9      6     2   50.589   50.813
 32768.0  18254.2    283776.0     17243.9      6     2   50.589   50.813

The defmulti defines a cache that uses the dispatch values as keys. Each eval call in the loop defines a new foo class which is then added to the cache when take* is called, preventing the class from ever being GCed.

The prefer-method workaround works because it calls clojure.lang.MultiFn.preferMethod, which calls the private MultiFn.resetCache method, which completely empties the cache.

Protocol leak

The leak with protocol methods similarly involves a cache. You see essentially the same behavior as the multimethod leak if you run the following code using protocols.

protocol leak
(defprotocol ITake (take* [a]))

(extend-type clojure.lang.Fn
  ITake
  (take* [this] '()))

(def stop (atom false))
(def sleep-duration (atom 1000))

(defn run-loop []
  (when-not @stop
    (eval '(take* (fn foo [])))
    (Thread/sleep @sleep-duration)
    (recur)))

(future (run-loop))

(reset! sleep-duration 0)

Again, the cache is in the take* method itself, using each new foo class as a key.

Workaround: A workaround is to run -reset-methods on the protocol before the PermGen space is all used up, e.g.

(-reset-methods ITake)

This works because -reset-methods replaces the cache with an empty MethodImplCache.

Patch: multifn_weak_method_cache.diff

Screened by:



 Comments   
Comment by Chouser [ 30/Jan/13 9:10 AM ]

I think the most obvious solution would be to constrain the size of the cache. Adding an item to the cache is already not the fastest path, so a bit more work could be done to prevent the cache from growing indefinitely large.

That does raise the question of what criteria to use. Keep the first n entries? Keep the n most recently used (which would require bookkeeping in the fast cache-hit path)? Keep the n most recently added?

Comment by Jamie Stephens [ 18/Oct/13 9:35 AM ]

At a minimum, perhaps a switch to disable the caches – with obvious performance impact caveats.

Seems like expensive LRU logic is probably the way to go, but maybe don't have it kick in fully until some threshold is crossed.

Comment by Alex Miller [ 18/Oct/13 4:28 PM ]

A report seeing this in production from mailing list:
https://groups.google.com/forum/#!topic/clojure/_n3HipchjCc

Comment by Adrian Medina [ 10/Dec/13 11:43 AM ]

So this is why we've been running into PermGen space exceptions! This is a fairly critical bug for us - I'm making extensive use of multimethods in our codebase and this exception will creep in at runtime randomly.

Comment by Kevin Downey [ 17/Apr/14 9:52 PM ]

it might be better to split this in to two issues, because at a very abstract level the two issues are the "same", but concretely they are distinct (protocols don't really share code paths with multimethods), keeping them together in one issue seems like a recipe for a large hard to read patch

Comment by Kevin Downey [ 26/Jul/14 5:49 PM ]

naive-lru-method-cache-for-multimethods.diff replaces the methodCache in multimethods with a very naive lru cache built on PersistentHashMap and PersistentQueue

Comment by Kevin Downey [ 28/Jul/14 7:09 PM ]

naive-lru-for-multimethods-and-protocols.diff creates a new class clojure.lang.LRUCache that provides an lru cache built using PHashMap and PQueue behind an IPMap interface.

changes MultiFn to use an LRUCache for its method cache.

changes expand-method-impl-cache to use an LRUCache for MethodImplCache's map case

Comment by Kevin Downey [ 30/Jul/14 3:10 PM ]

I suspect my patch naive-lru-for-multimethods-and-protocols.diff is just wrong, unless MethodImplCache really is being used as a cache we can't just toss out entries when it gets full.

looking at the deftype code again, it does look like MethidImplCache is being used as a cache, so maybe the patch is fine

if I am sure of anything it is that I am unsure so hopefully someone who is sure can chime in

Comment by Nicola Mometto [ 31/Jul/14 11:02 AM ]

I haven't looked at your patch, but I can confirm that the MethodImplCache in the protocol function is just being used as a cache

Comment by dennis zhuang [ 08/Aug/14 6:21 AM ]

I developed a new patch that convert the methodCache in MultiFn to use WeakReference for dispatch value,and clear the cache if necessary.

I've test it with the code in ticket,and it looks fine.The classes will be unloaded when perm gen is almost all used up.

Comment by Alex Miller [ 22/Aug/14 4:55 PM ]

I don't know which to evaluate here. Does multifn_weak_method_cache.diff supersede naive-lru-for-multimethods-and-protocols.diff or are these alternate approaches both under consideration?

Comment by Kevin Downey [ 22/Aug/14 8:26 PM ]

the most straight forward thing, I think, is to consider them as alternatives, I am not a huge fan of weakrefs, but of course not using weakrefs we have to pick some bounding size for the cache, and the cache has a strong reference that could prevent a gc, so there are trade offs. My reasons to stay away from weak refs in general are using them ties the behavior of whatever you are building to the behavior of the gc pretty strongly. that may be considered a matter of personal taste

Comment by Andy Fingerhut [ 29/Aug/14 4:31 PM ]

All patches dated Aug 8 2014 and earlier no longer applied cleanly to latest master after some commits were made to Clojure on Aug 29, 2014. They did apply cleanly before that day.

I have not checked how easy or difficult it might be to update the patches.

Comment by Kevin Downey [ 29/Aug/14 7:00 PM ]

I've updated naive-lru-for-multimethods-and-protocols.diff to apply to the current master

Comment by Andy Fingerhut [ 29/Aug/14 7:34 PM ]

Thanks, Kevin. While JIRA allows multiple attachments to a ticket with the same filename but different contents, that can be confusing for people looking for a particular patch, and for a program I have that evaluates patches for things like whether they apply and build cleanly. Would you mind removing the older one, or in some other way making all the names unique?

Comment by Kevin Downey [ 29/Aug/14 8:43 PM ]

I deleted all of my attachments accept for my latest and greatest

Comment by dennis zhuang [ 30/Aug/14 9:51 AM ]

I updated multifn_weak_method_cache2.diff patch too.

I think using weak reference cache is better,because we have to keep one cache per multifn.When you have many multi-functions, there will be many LRU caches in memory,and they will consume too much memory and CPU for evictions. You can't choose a proper threshold for LRU cache in every environment.
But i don't have any benchmark data to support my opinion.

Comment by Alex Miller [ 10/Sep/14 2:38 PM ]

I'm going to set the LRU cache patch aside. I don't think it's possible to find a "correct" size for it and it seems weird to me to extend APersistentMap to build such a thing anyways.

I think it makes more sense to follow the same strategy used for other caches (such as the Keyword cache) - a combination ConcurrentHashMap with WeakReferences and a ReferenceQueue for clean-up. I don't see any compelling reason not to take the same path as other internal caches.

Comment by Alex Miller [ 10/Sep/14 3:44 PM ]

Stepping back a little to think about the problem.... our requirements are:
1) cache map of dispatch value (could be any Object) to multimethod function (IFn)
2) do we want keys to be compared based on equality or identity? identity-based opens up more reference-based caching options and is fine for most common dispatch types (Class, Keyword), but reduces (often eliminates?) cache hits for all other types where values are likely to be equiv but not identical (vector of strings for example)
3) concurrent access to cache
4) cache cannot grow without bound
5) cache cannot retain strong references to dispatch values (the cache keys) because the keys might be instances of classes that were loaded in another classloader which will prevent GC in permgen

multifn_weak_method_cache.diff uses a ConcurrentHashMap (#3) that maps RefWrapper around keys to IFn (#1). The patch uses Util.equals() (#2) for (Java) equality-based comparisons. The RefWrapper wraps them in WeakReferences to avoid #5. Cache clearing based on the ReferenceQueue is used to prevent #4.

A few things definitely need to be fixed:

  • Util.equals() should be Util.equiv()
  • methodCache and rq should be final
  • Why does RefWrapper have obj and expect rq to possibly be null?
  • RefWrapper fields should all be final
  • Whitespace errors in patch

Another idea entirely - instead of caching dispatch value, cache based on hasheq of dispatch value then equality check on value. Could then use WeakHashMap and no RefWrapper.

This patch does not cover the protocol cache. Is that just waiting for the multimethod case to look good?

Comment by dennis zhuang [ 10/Sep/14 7:18 PM ]

Hi, alex, thanks for your review.But the latest patch is multifn_weak_method_cache2.diff. I will update the patch soon by your review, but i have a few questions to be explained.

1) I will use Util.equiv() instead of Util.equals().But what's the difference of them?
2) When the RefWrapper is retained as key in ConcurrentHashMap, it wraps the obj in WeakReference.But when trying to find it in ConcurrentHashMap, it uses obj directly as strong reference, and create it with passing null ReferenceQueue.Please look at the multifn_weak_method_cache2.diff line number 112. It short, the patch stores the dispatch value as weak reference in cache,but uses strong reference for cache getting.

3) If caching dispatch value based on hasheq , can we avoid hasheq value conflicts? If two different dispatch value have a same hasheq( or why it doesn't happen?), we would be in trouble.

Sorry, the patch doesn't cover the protocol cache, i will add it ASAP.

Comment by dennis zhuang [ 11/Sep/14 2:02 AM ]

The new patch 'protocol_multifn_weak_ref_cache.diff' is uploaded.

1) Using Util.equiv() instead of Util.equals()
2) Moved the RefWrapper and it's associated methods to Util.java, and refactor the code based on alex's review.
3) Fixed whitespace errors.
4) Fixed PermGen leak in protocol fns.





Generated at Fri Sep 19 09:00:02 CDT 2014 using JIRA 4.4#649-r158309.