<< Back to previous view

[CLJ-1260] ConcurrentModificationException thrown during action dispatching after commit in LockingTransaction.run() Created: 12/Sep/13  Updated: 21/Jan/14  Resolved: 21/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.4, Release 1.5, Release 1.6
Fix Version/s: Release 1.6

Type: Defect Priority: Major
Reporter: Brandon Ibach Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: STM
Environment:

Discovered on Ubuntu 12.04 with Oracle JDK 1.7.0_25 running Clojure 1.4. Test case produced on Windows 8 with Oracle JDK 1.7.0_09 running Clojure 1.4 and Clojure 1.5.1. Analysis seems to indicate that OS and Java version are not critical. Clojure 1.6 pre-release code has not been tested, but since clojure.lang.LockingTransaction has not changed since Clojure 1.4, it seems likely the defect is still present.


Attachments: File clj-1260.diff     File clj-1260-fixws.diff     Java Source File STMAgentInitBug.java    
Patch: Code
Approval: Vetted

 Description   

Summary

Using the Clojure 1.4 library strictly from Java code, a simple transaction dispatches an action to an Agent. When called from a simple driver, such as a unit test, where there is no interaction with the Clojure library/runtime (specifically, clojure.lang.RT), a ConcurrentModificationException is thrown from inside LockingTransaction.run() while it is iterating through the actions list, dispatching each action to its Agent after committing the transaction.

While the circumstances under which this occurs are probably fairly rare and a simple workaround exists (see final paragraph), thus the "Minor" priority, it seems like it would not be very complicated to fix LockingTransaction to handle the actions list more safely.

Analysis

Based on some debugging, here's what seems to be happening:

  1. Transaction A is run, dispatching action Z, which gets added, via LockingTransaction.enqueue(), to the actions list, which is a java.util.ArrayList<Agent.Action>.
  2. Transaction A completes and is successfully committed.
  3. LockingTransaction.run() does post-commit cleanup, freeing locks and putting a stop() to transaction A, which nulls the transaction's Info object reference.
  4. Notifications are sent and we start iterating the list of actions to be dispatched.
  5. The run() method calls Agent.dispatchAction(). Because the thread's transaction object is no longer considered to be "running" (due to the Info object being null) and no action is being processed on the thread (so its nested vector is null), the action is enqueue()-ed with the Agent.
  6. As part of the enqueue() process, the action is cons()-ed onto the Agent's ActionQueue. Here's where the unique circumstances come into play.
    1. At this point, we haven't really interacted with the Clojure runtime, specifically the clojure.lang.RT class, so its initiation machinery kicks in.
    2. Down in the depths, it executes transaction B to add a library to its list of loaded libraries.
    3. The still-existing-but-not-running thread-local transaction object fires up, runs and commits, with its existing, intact actions list, still containing action Z, enqueued during transaction A, which has not yet finished its post-commit process.
    4. The post-commit process for transaction B runs, including a nested attempt to dispatch action Z, again, which succeeds.
    5. The actions list is cleared before exiting the run() method.
  7. Upon returning way back up the stack to our not-quite-finished-post-processing transaction A, we continue iterating the now-cleared actions list, which promptly throws the ConcurrentModificationException.

A quick perusal of the LockingTransaction code shows that the only interaction with the actions list is adding an item to it in the LockingTransaction.enqueue() method, iterating it in the post-processing section of run() and clearing it in the finally clause of that section, so it's easy to see how a transaction started by any of the action-dispatching machinery would cause problems. Any such activity in the actions themselves would not be an issue, since they'd occur on other threads, but the dispatch stuff all runs on the same thread. The few moving parts that occur in this code seem fairly safe, as long as the runtime, clojure.lang.RT, is already initialized, but if that occurs during this phase, all bets appear to be off.

Test Case

The attached Java class can be compiled and run with just the Clojure 1.4 JAR on the class path. With the change described near the end of the file (comment one line and uncomment another), the Clojure 1.5.1 JAR can be used, instead, producing the same result.

A single Agent named count is created, holding an Integer value of 1. A transaction is run which dispatches an action (referred to as Z in the above description) that will increment the value of count to 2. Following this, another action is dispatched to count to enable monitoring the completion of the incrementing action. Lastly, the final value of count is printed before the application exits.

Running the class with no command-line arguments produces the above-mentioned exception and prints an incorrect final result, due to action Z being run a second time as described in step 6.4. Running with any command-line argument triggers a simple workaround that just references a static value from the clojure.lang.RT class, which invokes the class initialization before anything else happens, such that the exception is not thrown and the correct result is produced.

Patch: clj-1260-fixws.diff

Screened by:



 Comments   
Comment by Alex Miller [ 12/Sep/13 3:29 PM ]

Rich asked me to move this one to Vetted/Release 1.6 when it came in.

Comment by Alex Miller [ 12/Sep/13 3:31 PM ]

FYI: submitter does not have a CA

Comment by Christophe Grand [ 18/Oct/13 4:09 AM ]

Would a patch as simple as triggering the load of RT from the static init of LockingTransaction be ok? (assuming the analysis is correct)

Comment by Guillermo Winkler [ 18/Oct/13 1:07 PM ]

The ConcurrentModificationException is raised because the list of `actions` protocol is being violated.

Last clojure version happens in this loop in LockingTransaction.run(Callable) line: 361

for(Agent.Action action : actions)

{ Agent.dispatchAction(action); }

}

When RT hasn't been initialized before, first time it's used is in PersistentQueue's cons RT.list(o)

public PersistentQueue cons(Object o){ if(f == null) //empty return new PersistentQueue(meta(), cnt + 1, RT.list(o), null); else return new PersistentQueue(meta(), cnt + 1, f, (r != null ? r : PersistentVector.EMPTY).cons(o)); }

With the following call stack.

RT.doInit() line: 447
RT.<clinit>() line: 329
PersistentQueue.cons(Object) line: 127
PersistentQueue.cons(Object) line: 24
Agent.enqueue(Agent$Action) line: 264
Agent.dispatchAction(Agent$Action) line: 255
LockingTransaction.run(Callable) line: 369
LockingTransaction.runInTransaction(Callable) line: 231
STMAgentInitBug.main(String[]) line: 26

Problem is RT initialization triggers a new runInTransaction for new class loading (protocols in this callstack)

Thread [main] (Suspended (breakpoint at line 361 in LockingTransaction))
LockingTransaction.run(Callable) line: 361
LockingTransaction.runInTransaction(Callable) line: 231
protocols__init.load() line: 9
protocols__init.<clinit>() line: not available
Class<T>.forName0(String, boolean, ClassLoader) line: not available [native method]
Class<T>.forName(String, boolean, ClassLoader) line: 247
RT.loadClassForName(String) line: 2099
RT.load(String, boolean) line: 430
RT.load(String) line: 411
core.clj line: 5546
core.clj line: 5545
core$load(RestFn).invoke(Object) line: 408
core__init.load() line: 6174
core__init.<clinit>() line: not available
Class<T>.forName0(String, boolean, ClassLoader) line: not available [native method]
Class<T>.forName(String, boolean, ClassLoader) line: 247
RT.loadClassForName(String) line: 2099
RT.load(String, boolean) line: 430
RT.load(String) line: 411
RT.doInit() line: 447
RT.<clinit>() line: 329
PersistentQueue.cons(Object) line: 127
PersistentQueue.cons(Object) line: 24
Agent.enqueue(Agent$Action) line: 264
Agent.dispatchAction(Agent$Action) line: 255
LockingTransaction.run(Callable) line: 369
LockingTransaction.runInTransaction(Callable) line: 231
STMAgentInitBug.main(String[]) line: 26

So LockingTransaction:run depends indeed on RT to be loaded, since the method re-enters the function if loading is triggered from the inside.

Maybe actions:

final ArrayList<Agent.Action> actions = new ArrayList<Agent.Action>();

Should be better protected to prevent enqueuing during the dispatch loop?

Comment by Brandon Ibach [ 18/Oct/13 2:02 PM ]

The additional detail provided in the last comment is correct and matches what I observed. However, protecting the "actions" list against enqueuing isn't really the issue, since this problem would occur even if the "nested" transaction didn't involve any actions. The problem is that committing a transaction clears the "actions" list.

I think the suggestion to trigger initialization of RT in the static initializer of LockingTransaction would address this issue, though in a somewhat roundabout way. A more direct approach might be to make a temporary copy of the contents of the "actions" list prior to iterating it. Obviously, the latter would make extra work on every transaction commit, so it may not be desirable. I don't know what impact the former approach would have, other than the long-term effect of it being non-obvious to a later observer why that code is needed.

I haven't thought this through completely or done any tests, so my reasoning may be wrong, but one more argument for the approach of making a copy of the "actions" list is that a Ref notification/watch could run a transaction, which, being on the same thread, would also clear the "actions" list. This case wouldn't cause the exception, since the iteration of the list would not have started, yet, but it would cause actions enqueued by the "outer" transaction to not be dispatched.

Comment by Guillermo Winkler [ 18/Oct/13 2:26 PM ]

You're right, problem is not on enqueue are inner transactions clearing the list.

Anyway the protection can be a smarter way of iteration.

Wouldn't a possible solution be treating the action list as a Queue? Pop'in each action would also clear only that action.

Comment by Guillermo Winkler [ 18/Oct/13 2:57 PM ]

Attached patch seems to solve the problem in a cleaner way.

Comment by Andy Fingerhut [ 19/Oct/13 5:32 PM ]

Patch clj-1260-fixws.diff is identical to Guillermo's clj-1260.diff, except it eliminates warnings when applying the patch that were due to trailing white space.

Comment by Stuart Sierra [ 17/Jan/14 1:27 PM ]

Here's my understanding of what happens:

1. User's Java code calls LockingTransaction.runInTransaction.
2. In the transaction's Callable, user calls Agent.dispatch.
3. LockingTransaction adds the agent action to its actions list.
4. The transaction commits.
5. LockingTransaction begins iterating over actions and dispatching them.
6. Agent.dispatchAction checks to see if there is a running transaction, which there is not.
7. Agent.dispatchAction line 255 calls Agent.enqueue.
8. Agent.enqueue line 264 calls PersistentQueue.cons.
9. PersistentQueue.cons line 127 calls RT.list, triggering initization of RT.
10. RT's static initializer calls doInit.
11. doInit loads core.clj.
12. core.clj loads other files which contain ns declarations.
13. ns expands to code which commutes loaded-libs in a transaction.
14. The new transaction gets the same ThreadLocal LockingTransaction.
15. The new transaction executes and clears actions.
16. Back up the stack in the original transaction, LockingTransaction tries to continue iterating over actions and throws ConcurrentModificationException.

In short, this situation can only occur when the mere act of
enqueueing agent actions, not executing them, can create another
transaction on the same thread. This can happen when clojure.lang.RT
has not been initialized.

This would never occur in a Clojure program because RT is already
initialized when the Clojure program starts, and none of the Java code
involved in enqueueing agent actions will create new transactions.

The workaround for Java is trivial: just make sure clojure.lang.RT is
initialized before using any other Clojure classes.

However, the patch clj-1260-fixws.diff is a small and safe change. Its
author, Guillermo Winkler, has a CA.

The approach is to treat the actions list in LockingTransaction
as a queue, popping actions off one at a time, and never to
explicitly clear the queue.

The more important question is if/when RT should be initialized
automatically.

Comment by Guillermo Winkler [ 17/Jan/14 1:46 PM ]

Solving this problem by initializing RT feels to me like complecting two things not necessarily related, if PersistentQueue stops using RT, relationship disappears and you may still have the bug.

Anyway RT could be initialized safely so anyone that may depend on it finds it already there, avoiding potential clashes on initialization.

Comment by Alex Miller [ 17/Jan/14 1:49 PM ]

And following up on that, Clojure 1.6 has a new public Java API in clojure.java.api.Clojure. Loading that class or using any method on the class will serve to load clojure.lang.RT as a side effect.

Comment by Guillermo Winkler [ 21/Jan/14 9:11 AM ]

Alex, do you want a patch for the safe RT initialization here? Or do you prefer to handle it using a different ticket?

Comment by Alex Miller [ 21/Jan/14 9:32 AM ]

When you are using Clojure from Java, it is expected that you should properly initialize the Clojure runtime by (at least) loading the RT class (for Clojure <1.6) or the Clojure class (for Clojure >= 1.6).





Generated at Wed Apr 23 09:52:24 CDT 2014 using JIRA 4.4#649-r158309.