Obviously I didn't write the original code, so I'm not the ideal
person to explain this stuff. But I did work with it a bit recently,
so in the hopes that I can be helpful, I'm writing down my
understanding of the code as I worked with it. Since I came to the
code and sort of reverse engineered my way to this understanding,
hopefully laying this all out will make any mistakes or
misunderstandings I may have made easier to catch and correct. To
ensure some stability, I'll talk about the original MultiFn code as it
stands at this commit:
https://github.com/clojure/clojure/blob/8fda34e4c77cac079b711da59d5fe49b74605553/src/jvm/clojure/lang/MultiFn.java
There are four pieces of state that change as the multimethod is either
populated with methods or called.
- methodTable: A persistent map from a dispatch value (Object) to
a function (IFn). This is the obvious thing you think it is,
determining which dispatch values call which function.
- preferTable: A persistent map from a dispatch value (Object) to
another value (Object), where the key "is preferred" to the value.
- methodCache: A persistent map from a dispatch value (Object) to
function (IFn). By default, the methodCache is assigned the same
map value as the methodTable. If values are calculated out of the
hierarchy during a dispatch, the originating value and the
ultimate method it resolves to are inserted as additional items in
the methodCache so that subsequent calls can jump right to the
method without recursing down the hierarchy and preference table.
- cachedHierarchy: An Object that refers to the hierarchy that is
reflected in the latest cached values. It is used to check if the
hierarchy has been updated since we last updated the cache. If it
has been updated, then the cache is flushed.
I think reset(), addMethod(), removeMethod(), preferMethod(),
prefers(), isA(), dominates(), and resetCache() are extremely
straightforward in both the original code and the patch. In the
original code, the first four of those are synchronized, and the other
four are only called from methods that are synchronized (or from
methods only called from methods that are synchronized).
Invoking a multimethod through its invoke() function will call
getFn(). getFn() will call the getMethod() function, which is
synchronized. This means any call of the multimethod will wait for and
take a lock as part of method invocation. The goal of the patch in
this issue is to remove this lock on calls into the multimethod. It in
fact removes the locks on all operations, and instead keeps its
internal mutable state by atomically swapping a private immutable
State object held in an Atom called state.
The biggest change in the patch is to the
getFn()>getMethod()>findAndCacheBestMethod() complex from the
original code. I'll describe how that code works first.
In the original code, getFn() does nothing but bounce through
getMethod(). getMethod() tries three times to find a method to call,
after checking that the cache is up to date and flushing it if it
isn't:
1. It checks if there's a method for the dispatch value in the
methodCache.
2. If not, it calls findAndCacheBestMethod() on the
dispatch value. findAndCacheBestMethod() does the following:
1. It iterates through every map entry in the method table,
keeping at each step the best entry it has found so far
(according to the hierarchy and preference tables).
2. If it did not find a suitable entry, it returns null.
3. Otherwise, it checks if the hierarchy has been changed since the cache
was last updated. If it has not changed, it inserts the method
into the cache and returns it. If it has been changed, it
resets the cache and calls itself recursively to repeat the process.
3. Failing all else, it will return the method for the default
dispatch value.
Again, remember everything in the list above happens in a call to a
synchronized function. Also note that as it is currently written,
findAndCacheBestMethod() uses recursion for iteration in a way that
grows the stack. This seems unlikely to cause a stack overflow unless
the multimethod is getting its hierarchy updated very rapidly for a
sustained period while someone else tries to call it. Nonetheless, the
hierarchy is held in an IRef that is updated independently of the
locking of the MultiFn. Finally, note that the multimethod is locked
only while the method is being found. Once it is found, the lock is
released and the method actually gets called afterwards without any
synchronization, meaning that by the time the method actually
executes, the multimethod may have already been modified in a way that
suggests a different method should have been called. Presumably this
is intended, understood, and not a big deal.
Moving on now to the patch in this issue. As mentioned, the main
change is updating this entire apparatus to work with a single atomic
swap to control concurrency. This means that all updates to the
multimethod's state have to happen at one instant in time. Where the
original code could make multiple changes to the state at different
times, knowing it was safely protected by an exclusive lock, rewriting
for atom swaps requires us to reorganize the code so that all updates
to the state happen at the same time with a single CAS.
To implement this change, I pulled the implicit looping logic from
findAndCacheBestMethod() up into getMethod() itself, and broke the
"findBestMethod" part into its own function, findBestMethod(), which
makes no update to any state while implementing the same
logic. getMethod() now has an explicit loop to avoid stack-consuming
recursion on retries. This infinite for loop covers all of the logic
in getMethod() and retries until a method is successfully found and a
CAS operation succeeds, or we determine that the method cannot be
found and we return the default dispatch value's implementation.
I'll now describe the operation of the logic in the for loop. The
first two steps in the loop correspond to things getMethod() does
"before" its looping construct in the original code, but we have to do
in the loop to get the latest values.
1. First we dereference our state, and assign this value to both
oldState and newState. We also set a variable called needWrite to
false; this is so we can avoid doing a CAS (they're not free) when
we have not actually updated the state.
2. Check if the cache is stale, and flush it if so. If the cache
gets flushed, set needWrite to true, as the state has changed.
3. Check if the methodCache has an entry for this dispatch
value. If so, we are "finished" in the sense that we found the
value we wanted. However, we may need to update the state. So,
- If needWrite is false, we can return without a CAS, so just
break out of the loop and return the method.
- Otherwise, we need to update the state object with a CAS. If
the CAS is successful, break out of the loop and return the
target function. Otherwise, continue on the next iteration
of the loop, skipping any other attempts to fetch the method
later in the loop (useless work, at this point).
4. The value was not in the methodCache, so call the function
findBestMethod() to attempt to find a suitable method based on the
hierarchy and preferences. If it does find us a suitable method,
we now need to cache it ourselves. We create a new state object
with the new method cache and attempt to update the state atom
with a CAS (we definitely need a write here, so no need to check
needWrite at this point).
The one thing that is possibly questionable is the check at this
point to make sure the hierarchy has not been updated since the
beginning of this method. I inserted this here to match the
identical check at the corresponding point in
findAndCacheBestMethod() in the original code. That is also a
second check, since the cache is originally checked for freshness
at the very beginning of getMethod() in the original code. That
initial check happens at the beginning of the loop in the
patch. Given that there is no synchronization with the method
hierarchy, it is not clear to me that this second check is needed,
since we are already proceeding with a snapshot from the beginning
of the loop. Nonetheless, it can't hurt as far as I can tell, it
is how the original code worked, and I assume there was some
reason for that, so I kept the second check.
5. Finally, if findBestMethod() failed to find us a method for the
dispatch value, find the method for the default dispatch value and
return that by breaking out of the loop.
So the organization of getMethod() in the patch is complicated by two
factors: (1) making the retry looping explicit and stackless, (2)
skipping the CAS when we don't need to update state, and (3) skipping
needless work later in the retry loop if we find a value but are
unable to succeed in our attempt to CAS. Invoking a multimethod that
has a stable hierarchy and a populated cache should not even have a
CAS operation (or memory allocation) on this code path, just a cache
lookup after the dispatch value is calculated.
http://groups.google.com/group/clojure-dev/browse_thread/thread/6a8219ae3d4cd0ae?hl=en