Error formatting macro: pagetree: java.lang.NullPointerException

# Better hashing

Go to start of metadata
You are viewing an old version of this page. View the current version.

Mark Engelberg wrote the following in late Oct 2013, originally published here.

## Summary

Clojure’s hashing strategy for numbers, sequences/vectors, sets, and maps mimics Java’s.  In Clojure, however, it is far more common than in Java to use longs, vectors, sets, maps and compound objects comprised of those components (e.g., a map from vectors of longs to sets) as keys in other hash maps.  It appears that Java’s hash strategy is not well-tuned for this kind of usage.  Clojure’s hashing for longs, vectors, sets, and maps each suffer from some weaknesses that can multiply together to create a crippling number of collisions.

## History

Paul Butcher posted a Scala and Clojure version of a varation of the N-Queens problem (discussion thread here, Paul's github repo here, Andy Fingerhut's statistics-enhanced version here).  On a complex problem, the Scala version completed in 2 minutes, whereas the Clojure version would run overnight without completion.  Profiling the Clojure code demonstrated that the time was spent almost entirely in hash collisions.  He modeled an arrangement of pieces as a set of [piece-code [row col]].  This is a perfectly reasonable, idiomatic way to represent data in Clojure.  Andy Fingerhut posted a sample of the collisions that were occurring in the code:

97392280 = (hash #{[:N [2 0]] [:R [4 2]] [:K [0 0]] [:Q [1 3]] [:B [3 0]]})

97392280 = (hash #{[:N [3 0]] [:K [0 0]] [:R [1 2]] [:Q [4 3]] [:B [2 0]]})

97392280 = (hash #{[:N [3 0]] [:K [0 0]] [:R [2 1]] [:Q [1 4]] [:B [4 0]]})

97392280 = (hash #{[:N [4 0]] [:K [0 0]] [:R [3 3]] [:Q [1 2]] [:B [2 0]]})

97392280 = (hash #{[:N [3 0]] [:K [1 0]] [:R [0 2]] [:Q [2 3]] [:B [4 0]]})

97392280 = (hash #{[:N [3 0]] [:K [1 0]] [:R [2 3]] [:Q [0 2]] [:B [4 0]]})

97392280 = (hash #{[:N [4 0]] [:K [1 0]] [:R [0 2]] [:Q [2 3]] [:B [3 0]]})

97392280 = (hash #{[:N [4 0]] [:K [1 0]] [:R [2 3]] [:Q [0 2]] [:B [3 0]]})

97392280 = (hash #{[:K [3 0]] [:N [0 0]] [:Q [4 2]] [:R [2 3]] [:B [1 0]]})

97392280 = (hash #{[:K [3 0]] [:N [0 0]] [:R [4 2]] [:Q [2 3]] [:B [1 0]]})

97392280 = (hash #{[:N [1 0]] [:K [3 0]] [:Q [4 2]] [:R [2 3]] [:B [0 0]]})

97392280 = (hash #{[:N [1 0]] [:K [3 0]] [:R [4 2]] [:Q [2 3]] [:B [0 0]]})

97392280 = (hash #{[:K [4 0]] [:N [0 0]] [:Q [3 2]] [:R [1 3]] [:B [2 0]]})

97392280 = (hash #{[:K [4 0]] [:N [1 0]] [:R [3 2]] [:Q [0 3]] [:B [2 0]]})

97392280 = (hash #{[:K [4 0]] [:N [1 0]] [:R [2 1]] [:Q [3 4]] [:B [0 0]]})

97392280 = (hash #{[:N [2 0]] [:K [4 0]] [:R [0 2]] [:Q [3 3]] [:B [1 0]]})

What you see here is that the pieces, rows, and columns are getting shuffled around with no impact on the overall hash value.  When I saw this, I realized that it has also been a factor in my own code, and became interested in helping to solve the problem.  I have several programs that I have profiled and found hash collisions to be a significant bottleneck (not enough to make the program unusable, but enough that it is dominating running time, making it difficult to improve the running time without addressing the collision problem).  Let’s take a look at some of the factors that contribute to this and other types of common hash collisions.

## Longs

Java’s hash codes are all ints.  This lends itself to a rather obvious hashCode method for ints -- the identity function!  It’s fast, and it guarantees that no two ints hash to the same value.

For longs, we need to crunch it down to an int.  Ideally, the hashCode should utilize all 64 bits of the long.  So Java (and by extension Clojure) uses the simplest possible scheme that does that: it xors together the upper 32 bits and the lower 32 bits to crunch the long into an int.

The only downside of this is that int-sized negative numbers just get converted into int-sized positive numbers by this process.

=> (hash -1)

0

=> (hash -2)

1

=> (hash -3)

2

=> (hash 0)

0

=> (hash 1)

1

=> (hash 2)

2

The problem here is that when you have two numbers that hash to the same value, they will be interchangeable in all hashing contexts.  So for example, there is no difference between -1 and 0 in any collection.  This means that in applications that use a mixture of negative and positive numbers distributed roughly evenly around 0, we’re doubling the number of hash collisions.  Note that this was not a factor in Paul’s application, but it potentially could be.

It would be nice if hashing of longs more closely matched the hashing of ints, i.e., the int-range negative numbers should hash to themselves.  It is easy to make this happen.

The current way of doing things corresponds to:

public static final int longHash(long a) {

return (int) (a ^ (a >>> 32));

}

We can improve the situation with a single increment:

public static final int longHashWithExtraIncrement(long a) {

return (int) (a ^ ((a >>> 32) + 1));

}

With the latter scheme, (hash -2) would be -2, just like for ints, which is an improvement.

Mikera posted a clever algorithm that munges a 64-bit long in a way that widely disperses the resulting int.  It looks like this:

public static final int longHashMunge(long a) {

a ^= (a << 13);

a ^= (a >>> 7);

a ^= (a << 17);

a ^= (a >>> 32);

return (int) a;

}

The nice thing about this function is that you get widely distributed hash numbers.  Using the munge version, you’d get behavior like this:

=> (hash -2)

62914590

=> (hash -1)

31457295

=> (hash 0)

0

=> (hash 1)

35651601

=> (hash 2)

71303202

=> (hash 3)

106954803

The advantage of the munge version is that not only does it separate the small negative longs from the small positive longs, but it scatters the hash values widely enough that it also has a beneficial effect on the hashing of vectors/sequences, as we’ll see in a moment.

Recommendation:

Replace numeric hashing with one of the two improvements listed above.

Expected performance cost:

No obvious performance cost observed in micro-benchmarks.  Test both versions on larger examples to see whether one is better than the other.

Benefit:

In applications with a mixture of positive and negative numbers, potentially cuts hash collisions in half.  Munge version also fixes issue with vectors, and helps a little with sets and maps.

## Vectors

The current hashing scheme hashes:

[a_n a_n-1 a_n-2 … a1 a0] into

31^(n+1) + 31^n * a_n + 31^(n-1) * a_n-1 + … + 31 * a1 + a0

using wraparound int multiplication and addition

One way to think about it is as if there were a orderedCollectionHashCombine function that works as follows:

public static final int orderedCollectionHashCombine(int collectionHash, int itemHash) {

return 31*collectionHash+itemHash;

}

and then a collection’s hash is computed by starting with a collectionHash of 1 for the empty [] and combining in the hashes for the items one at a time, in order, using the above method.  (In the actual Clojure code, this is all done in-line, rather than calling a separate combine function, but that’s essentially what is going on).

The main problem with vectors is that the magic constant of 31 results in a lot of collisions for small vectors of small numbers.  For example, a common way to represent a 2D array is as a map from [x y] pairs to values.  Imagine a 200x200 2D array implemented in this way.  The keys will be [0 0], [0 1], [0 2], … [199 198], [199 199].

Since 31 is a small number, a change in the first component of the vector is easily matched by a change in the second component of the vector.  Consider:

=> (hash [6 0])

1147

=> (hash [5 31])

1147

=> (hash [4 62])

1147

=> (hash [3 93])

1147

=> (hash [2 124])

1147

=> (hash [1 155])

1147

=> (hash [0 186])

1147

On the 40000 vectors [0 0] … [199 199], there are only 6369 possible hash values, resulting in 6-7 times more collisions than there should be.  For a [0 0] … [299 299] it’s more like a factor of 10, etc.

So is there an easy way to improve this without increasing the hashing time of vectors?  Yes.

The simplest possible way is to just make the number 31 bigger.  A good choice is 524287 which, like 31, is also a prime number and one less than a power of 2 (known as a Mersenne prime).  One reason it is popular to pick such numbers is that multiplying by a Mersenne prime can be rewritten as a bitshift-left followed by a subtraction, which on some architectures is faster.  524287 is the largest Mersenne prime that can work as a multiplier for 32-bit ints (the next biggest Mersenne prime is 2^31-1, which is too close to the wrap-around point, creating non-ideal patterns).  Nevertheless, I believe that on modern architectures, there is no particular performance advantage to using Mersenne primes, and any large prime number could suffice.  Picking one at random, I chose 122,949,829 which seems to work quite well.

We can go further, and ask, “Why do we even need to use a prime number here?”  As far as I can tell, the answer is “It’s traditional to use a prime.”  Further analysis suggests that the only factor that really matters is that the multiplier must be odd, so that it is relatively prime to 2^32.  Beyond that, we’ll get the best results if we pick a number whose binary representation contains a fairly random pattern of 0s and 1s.  A popular way to choose such a number is to take an irrational constant, like pi or e, and use the first 32 digits of its binary representation.  If we do that with the golden number, we get the following int: -1640531527.  In my testing, this golden number as the multiplier outperforms the random large prime, which in turn outperforms the large Mersenne prime, which outperforms the current choice of 31.  (“Outperforms” in the sense of minimizing hash collisions -- the work performed computing the hash should be the same regardless of the multiplication constant.)

Note that if the hash values of longs are munged, no change is necessary to vectors.

Recommendation:

Change magic number from 31 to -1640531527 (or use 524287 if we find through benchmarks that the bit-shift/subtraction combination is noticeably faster than a multiplication).

Expected performance cost: None

Same amount of work is being done, but the constant has changed.

Benefit:

Potentially reduces hash collisions by an order of magnitude when dealing with small vectors of small numbers, a common use-case for vectors.

## Sets

Because sets are an unordered collection, the hashing algorithm needs to come out to the same result, regardless of the order of the items.  Right now, sets are hashed simply by adding up the hash values of its items.  This creates several issues:

1. 0 doesn’t matter.

=> (hash #{1})

1

=> (hash #{0 1})

1

1. Nested structure doesn’t matter.

=> (hash #{1 2 3 4})

10

=> (hash #{#{1} 2 3 4})

10

=> (hash #{#{1 2} #{3 4}})

10

=> (hash #{#{1 3} #{2 4}})

10

This is especially problematic in applications that involve partitioning a set of items into subsets, as all partitions would have the same hash value.

1. Sums collide easily for small numbers.

Consider all the sets comprised of numbers from 0-15.  There are 65536 such sets.  But there are only 121 possible hash values for those sets, potentially resulting in a tremendous number of collisions.

1. Sets of similar vectors easily collide.

Consider the following vectors, which all hash to different values:

=> (hash [1 1])

993

=> (hash [1 2])

994

=> (hash [2 1])

1024

=> (hash [2 2])

1025

But watch what happens when you start combining them in a set:

=> (hash #{[1 2] [2 1]})

2018

=> (hash #{[1 1] [2 2]})

2018

What’s going on here is that:

(hash [a b]) is 31^2 + 31*a + b and

(hash [c d]) is 31^2 + 31*c + d

When you add them up you get:

2* 31^2 + 31 * (a+c) + (b+d)

Similarly, you get the same sum when you add (hash [c b]) and (hash [a d]).

In general, if the numbers in each vector position add to the same sum, the hash of the overall set will be the same.

Back when discussing numeric hashing, we talked about the possibility of munging the longs to get a wide distribution of hashes.  That would help with point 3 above, and help a little bit with point 4, but doesn’t address points 1 and 2.

The simplest way to comprehensively improve the behavior of set hashing is to munge the hash value of each item prior to adding them.  Here is a function that munges an int:

public static final int xorShift32(int a) {

a ^= (a << 13);

a ^= (a >>> 17);

a ^= (a << 5);

return a;

}

Simply apply this function to the hash value of each item in the set, and add them together.  This takes care of all the issues discussed above.

Recommendation:

Munge the 32-bit hashes of the set items prior to adding them together.

Expected performance cost:  Negligible

It is clear that this does perform more operations than the existing set hashing scheme, but the cost of set hashing is dominated by other factors, such as traversing the set.  Also, sets cache their hash values, so this is not a repeated cost.  I’m hard-pressed to see any performance differences in benchmarks.  I think it is also possible that the performance could be made even better than the current scheme by switching to an incremental hashing scheme, updating the set’s cached hash whenever an item is conj’ed or disj’ed from a set, thus avoiding the need for a costly traversal when the set is hashed.

Benefit:

Prevents an explosion of collisions between similar sets.

## Maps

Maps currently hash as follows:  for each [key value] pair, xor together the (hash key) and the (hash value), then add them all together.  This suffers from similar problems as sets.  Permuting keys and values often gives back close, if not exactly the same, hash values.

To fix this, we can employ the same techniques as with sets.  After xor’ing together (hash key) and (hash value), we munge that result before adding it in.

However, there are a couple of issues that remain if we munge the result after xor’ing together (hash key) and (hash value).

1. Because xor is a commutative operation, {1 2} hashes to the same value as {2 1}

2. All entries where the key equals the value xor to 0, so {0 0}, {0 0 1 1}, {0 0 1 1 2 2} all yield the same result (0).

3. Also, since (munge 0) is 0, this means that inserting any (key=value) pair into a map will not change the hash value.  For example, whatever {:a 100 :b 200} hashes to, then {:a 100, :b 200, 5 5} will hash to the same thing.

We can address nearly all of these issues with one small modification: we munge just the (hash value), and xor it with the (hash key).  By munging one part of the map entry before the xor, we introduce an asymmetry that prevents commutative clashes like {1 2} hashing to the same value as {2 1}, and prevents key=value map entries like {2 2} from hashing to 0 (with the minor exception of {0 0} which would still hash to 0).

Recommendation:

Change map hashing to be the sum of all (xor (hash key) (munge (hash value))).

Expected performance cost: Negligible

Similar to sets, the munge is definitely an extra operation, but negligible relative to other hashing costs.

Benefit:

Maps are used pervasively in Clojure as the language’s “objects”, and frequently used as keys.  Reducing collisions between similar maps is worthwhile.

## Records

The type of the record is xor’ed with the map’s hash value, but that doesn’t really have any bearing on the collision issue.  Same change as maps is recommended.

## Overall

These collisions multiply with compound data structures.  Fixing just numbers with the munging algorithm has the greatest “ripple effect” that will be felt at all levels.  One benefit of changing hashing only at the numeric level is that it won’t require any reimplementation of any other data structure that has been engineered to match Clojure’s hashing strategies.

However, addressing only numbers won’t entirely fix all classes of common collisions in sets and maps.  For example, set partitions will still all hash to the same value.  The only way to address those issues is to also strengthen hashing at the collection level.

## Java Interop

A big question is whether changing Clojure’s hashing strategy will have any effect on java interop.

Right now, Clojure provides two separate hash methods: hashCode for Java compatibility, and hasheq for internal Clojure hashing.  As long as we only modify hasheq, I don’t think there will be a problem.  When a Clojure object is passed to Java world, Java will call the hashCode method, which matches Java’s expectations.

Clearly this is an important issue that merits a thorough investigation, but I cannot currently imagine a scenario where it poses a problem to have Clojure’s hasheq method behave differently than hashCode.

## Xorshift

The improvements outlined in this document are mostly small, incremental improvements to the basic underlying Java algorithm.  Sequences still multiply by a constant and add in the hash of the new element.  Sets still add up the hashes, and maps still xor together the hashes of the map entries and add them together.

The main difference is the strategic use of a “munging” operation which scrambles the hash values enough to get better distribution, and to destroy things like commutativity and distributivity which can cause an inordinate number of collisions.

The munging operations discussed above come from a paper, describing a family of Xorshift RNG algorithms: http://www.jstatsoft.org/v08/i14/paper

These algorithms are typically of the form:

a ^= a << magicNumber1;

a ^= a >>> magicNumber2;

a ^= a << magicNumber3;

return a;

The paper contains a whole table of magic numbers that work for 64-bit numbers, and another table for 32-bit numbers.  Any of the combinations listed in the table work well, but I simply picked the combinations that the author listed as his favorite.

The big idea is that these algorithms create an enormous cycle of all the ints/longs (depending on which version you are using) other than 0 (which maps to itself).  In Clojure pseudocode, (iterate xorshift32 1) will give you a sequence of all 2^32-1 positive ints before looping back to 1.

The original purpose of the algorithm is as a random number generator: you start with some arbitrary seed and just keep calling xorshift32 repeatedly to get new numbers.  But for our purposes, the interesting aspect of the algorithm is that xorshift32 is a one-to-one mapping from ints onto ints.  It’s a great way to scramble the ints and map ints that are close to one another to ones that are not, getting better distribution of the hash values.

xorshift32 is perfect for munging int hashes of items in a collection before using the standard Java combining strategy.  But one of the other possible uses of xorshifting is as a way to get better distribution of numeric hashes while crunching longs down to ints.  The strategy described under the section Longs works as follows:  the first three lines are straight out of the table for 64-bit numbers in the xorshift article.  This gives us better distribution of the hashes, but we still need to crunch it down to 32 bits.  That’s what the fourth line does, using the standard a^=a>>>32 technique used currently in Clojure to combine the higher order and lower order bits of longs.

## Alternative hashing strategies

I think there’s a lot to be said for preserving the basic shape of the existing hashing algorithms, simply inserting a few key xors and shifts to give better distribution.  However, it must be acknowledged that there are many different hashing algorithms out there.

All hashing algorithms sit somewhere on the spectrum between fast and scrambles the inputs so well that it is cryptographically secure.  An ideal hashing function scrambles the inputs beyond recognition, with no easy way to recover the input from the output.  But the choice of hash function is dependent on its application, and for the mere purpose of preventing collisions in a hash table, this kind of heavy-duty scrambling is overkill.

Clearly the Java philosophy was to choose things at the very fast end of the spectrum, choosing algorithms that were just effective enough to prevent collisions for the kinds of inputs that are common as keys in Java (but unfortunately, not quite effective enough for the kinds of inputs that are common as keys in Clojure).  For hashing aficionados, it must be appalling that in Java, numbers hash to themselves -- in some sense, that defies the notion of what hashing is all about.  But as long as it stops collisions, that’s all we really care about in this context.

Some languages use hashing algorithms that scramble more thoroughly than the solutions described above.  One such algorithm is known as Murmur3.  When hashing collections, for example, Murmur3 uses several shifts, bitwise ors and xors, additions, and multiplications to thoroughly munge each item and combine it with the total hash for the collection.  Then, at the end of processing all the items, it performs several more operations to mix up the combined collection hash.

It’s amazingly effective, and fast at what it does.  But it’s definitely slower than the Java+xorshift techniques described above, and if the simpler, faster technique is good enough to prevent collisions, is there any reason to mix up the hashes more thoroughly?

Furthermore, there are a couple aspects of Murmur3 (and other state-of-the-art robust hashing algorithms) that are downright inconvenient.  First, as mentioned before, Murmur3 does additional mixing operations on the collection as a whole (effectively, sealing the collection).  Also, Murmur3’s mixing operations are not easily reversible.

With the current Java way of doing things (even with the modifications discussed above), it is easy to make a collection that hashes incrementally as you go along, updating the hash when adding, changing, and removing items from the collection.  That’s because the collection’s hash is always in a state that is receptive to incorporating hashes of more items, and we can reverse certain mixing operations to reflect the removal of an item.  This is especially beneficial when hashing several variations derived from the same immutable data structure.  Similarly, when Clojure hashes a sequence, if it gets to a tail that has already been hashed and cached, the mathematical simplicity of sequence hashing means it can straightforwardly incorporate the cached hash of the rest of the sequence without actually traversing those elements.  This can be a huge savings when hashing multiple sequences that have a shared structure resulting from consing different fronts onto a common tail.  Actually, this general principle applies to other sorts of structural sharing as well: the math is quite transparent, and one can easily incorporate the cached hashes of any substructures.

Murmur3’s mixing is too good in a sense, making it difficult to implement these kinds of time-saving strategies.  For example, if you wanted to do incremental hashing, you’d have to cache the hash of the combined items before the collection is “sealed” by the final mixing operations performed on the overall collection.  That way, when you wanted to extend the collection in some way, you could go back to that cached hash.  But this means that every time you want the hash of the collection, you have to repeat those final mixing steps, or have a second cached value reflecting the final hash after the collection mixing.  As you can see, techniques for incremental hashing and structural caching with Murmur3 may exist, but they are markedly more complicated and less efficient than the simplicity of the current approach.

Labels: