### Problem

Clojure’s hashing strategy for numbers, sequences/vectors, sets, and maps mimics Java’s. In Clojure, however, it is far more common than in Java to use longs, vectors, sets, maps, and compound objects composed of those components (e.g., a map from vectors of longs to sets) as keys in other hash maps. It appears that Java’s hash strategy is not well-tuned for this kind of usage. Clojure’s hashing for longs, vectors, sets, and maps each suffers from weaknesses that can multiply together to create a crippling number of collisions.

For example, Paul Butcher wrote a simple Clojure program that produces a set of solutions to a chess problem. Each solution in the set was itself a set of vectors of the form [piece-keyword [row-int col-int]]. Clojure 1.5.1's current hash function hashed about 20 million different solutions to about 20 thousand different hash values, for an average of about 1000 solutions per unique hash value. This causes PersistentHashSet and PersistentHashMap to use long linear searches for testing set/map membership or adding new elements/keys. There is nothing intentionally pathological about these values – they simply happened to expose this behavior in a dramatic way. Others have come across similarly bad performance without any obvious reason why, but some of those cases are likely due to this same root cause.
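The weaknesses described above can be reproduced with the `java.util` collections, whose documented `hashCode` contracts are the ones this page says Clojure's hashing mimics. The following is a minimal sketch, not code from Clojure itself; the class and method names are made up for illustration, and it uses `Integer` elements rather than Clojure longs (for small non-negative values the hash is the value either way).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;

// Illustration only: hypothetical helper class showing Java-style collisions.
final class JavaStyleCollisions {

    // List contract: h = 31*h + e.hashCode(), starting from h = 1.
    static int listHash(Integer... elems) {
        return Arrays.asList(elems).hashCode();
    }

    // Set contract: the sum of the element hash codes.
    static int setHash(Object... elems) {
        return new HashSet<>(Arrays.asList(elems)).hashCode();
    }

    // Map contract: the sum of key.hashCode() ^ value.hashCode() per entry.
    static int singleEntryMapHash(Object key, Object val) {
        return Collections.singletonMap(key, val).hashCode();
    }

    public static void main(String[] args) {
        // Two different [row col] style pairs with the same hash.
        System.out.println(listHash(0, 31) + " vs " + listHash(1, 0));
        // Two different sets whose element hashes have the same sum.
        System.out.println(setHash(1, 2) + " vs " + setHash(0, 3));
        // Any entry whose key and value hash alike contributes 0.
        System.out.println(singleEntryMapHash(1, 1) + " vs " + singleEntryMapHash(2, 2));
        // The collisions carry over when the colliding values are nested.
        System.out.println(setHash(Arrays.asList(0, 31)) + " vs " + setHash(Arrays.asList(1, 0)));
    }
}
```

Each printed pair shows equal hash values for unequal collections, which is the kind of overlap that compounds once such values are nested inside sets and maps.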

### Proposed solutions

Mark Engelberg's document about Clojure's hash function, its behavior, and potential improvements, is here:

https://docs.google.com/document/d/10olJzjNHXn1Si1LsSvjNqF_X1O7TPZLuv3Uo_duY5o

A summary of his proposed hash function modifications is:

- Change the hash of integers that fit within a long to the return value of longHashMunge (see Longs section of doc for more details)
- Change the current multiplier of 31 used for vectors, sequences, and queues to a different constant such as -1640531535 or 524287 (see Vectors section). Also applies to Cons, LazySeq.
- For sets, instead of adding together the hash value of the elements, add together the return value of a function xorShift32 called on the hash value of each element (see Sets section)
- For maps and records, instead of adding together hash(key) ^ hash(val) for each key,val pair, add together hash(key) ^ xorShift32(hash(val)) (see Maps section)
- No change for any other types, e.g. strings, keywords, symbols, floats, doubles, BigInts outside of long range, BigDecimal, etc.
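The combining rules in the summary above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch: the body of `xorShift32` is assumed to be Marsaglia's (13, 17, 5) xorshift step, which may differ from what the proposal uses, and `longHashMunge` is left out entirely because its definition is only given in Mark's document. The class name and method signatures are hypothetical.

```java
// Sketch only: hypothetical class; see the assumptions in the text above.
final class ProposedHashSketch {

    static final int OLD_MULTIPLIER = 31;
    static final int NEW_MULTIPLIER = 524287; // one of the two constants suggested

    // ASSUMPTION: Marsaglia's (13, 17, 5) xorshift step on a 32-bit int.
    static int xorShift32(int x) {
        x ^= x << 13;
        x ^= x >>> 17;
        x ^= x << 5;
        return x;
    }

    // Vectors, sequences, queues: h = multiplier*h + hash(element), from h = 1.
    static int seqHash(int multiplier, int... elemHashes) {
        int h = 1;
        for (int e : elemHashes) {
            h = multiplier * h + e;
        }
        return h;
    }

    // Sets: sum of xorShift32(hash(element)).
    static int setHash(int... elemHashes) {
        int h = 0;
        for (int e : elemHashes) {
            h += xorShift32(e);
        }
        return h;
    }

    // Maps: sum of hash(key) ^ xorShift32(hash(val)).
    // Arguments alternate key hash, value hash.
    static int mapHash(int... keyValHashes) {
        int h = 0;
        for (int i = 0; i + 1 < keyValHashes.length; i += 2) {
            h += keyValHashes[i] ^ xorShift32(keyValHashes[i + 1]);
        }
        return h;
    }

    public static void main(String[] args) {
        // [0 31] and [1 0] collide with multiplier 31 but not with 524287.
        System.out.println(seqHash(OLD_MULTIPLIER, 0, 31) == seqHash(OLD_MULTIPLIER, 1, 0));
        System.out.println(seqHash(NEW_MULTIPLIER, 0, 31) == seqHash(NEW_MULTIPLIER, 1, 0));
        // {1 1} and {2 2} no longer both hash to 0.
        System.out.println(mapHash(1, 1) + " vs " + mapHash(2, 2));
    }
}
```

Element hashes are passed in as plain ints here. In the proposal, long elements would first go through longHashMunge, which this sketch omits, so it says nothing about how well sets of small longs are spread.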

Below is a link to a modified version of Paul Butcher's N-queens solver, with extra code for printing stats with several different hash functions. The README has instructions for retrieving and installing locally a version of Clojure modified with one of Mark's proposed alternate hash functions. After that is a link to a patch that implements the proposal above.

https://github.com/jafingerhut/chess-clojure

http://dev.clojure.org/download/attachments/8192765/better-hashing-2013-11-18.diff

Here is a summary of results for some program elapsed times and how spread out the hash values are. Below the table are a few details on how these measurements were made.

Problem | Using Clojure 1.6.0-alpha3 hash (same as Clojure 1.5.1) | Using Mark Engelberg's 2013-11-18 proposed hash |
---|---|---|
Paul Butcher's N-queens problem with 6x6 board | Elapsed time: 6.7 min (~33 times slower than with 2013-11-18 proposed hash); 180,568 solutions hash to 3,013 distinct hash values; average of 59.9 solutions per hash value (max 2,492) | Elapsed time: 12.2 sec; 180,568 solutions hash to 180,566 distinct hash values; average of 1.0 solutions per hash value (max 2) |
Paul Butcher's N-queens problem with 6x9 board | Elapsed time: > 8 hours (did not wait for it to finish); 20,136,752 solutions hash to 17,936 distinct hash values; average of 1,122.7 solutions per hash value (max 81,610) | Elapsed time: 11.4 min; 20,136,752 solutions hash to 20,089,766 distinct hash values; average of 1.0 solutions per hash value (max 4) |
Compile Clojure 1.6.0-alpha3 source with "time ant jar", so no tests run | Elapsed time: avg 20.1 sec (min 19.6, max 20.9); raw measurements: 19.6, 20.2, 20.9, 19.8, 19.8 | Elapsed time: avg 20.0 sec (min 19.6, max 20.7); raw measurements: 19.9, 20.2, 19.6, 20.7, 19.7 |
Compile Clojure 1.6.0-alpha3 source with "time ant", which includes running tests, but with generative test duration reduced to 1.0 sec | Elapsed time: avg 48.0 sec (min 47.3, max 49.5); raw measurements: 47.6, 49.5, 47.3, 47.4, 48.3; 120,353 unique values hash to 113,405 distinct hash values; average of 1.06 values per hash value | Elapsed time: avg 48.0 sec (min 47.1, max 49.5); raw measurements: 47.1, 48.2, 47.6, 47.7, 49.5; 119,811 unique values hash to 114,329 distinct hash values; average of 1.05 values per hash value |
Calc hashes of all integers in (range 1000000000) and return sum, using: (time (hash-range-n 1000000000)) See here for definition of hash-range-n | Time: avg 29.8 sec; raw measurements, sorted: 29.6, 29.6, 29.9, 30.0, 30.1 | Time: avg 38.7 sec (30% longer); raw measurements, sorted: 38.6, 38.6, 38.7, 38.7, 38.7; verified that hash values of first 500 million integers (those in (range 500000000)) are all different |
Calc hashes of 30,001 vectors [] [0] [0 1], etc. up to [0 1 ... 29999] and return sum, using: (let [cs (doall (reductions conj [] (range 30000)))] (time (total-hash cs))) See here for definition of total-hash | Time: avg 10.7 sec; raw measurements, sorted: 10.6, 10.6, 10.7, 10.7, 10.7; 30,000 unique hash values, avg 1.00, max 2 | Time: avg 10.8 sec (pretty much same); raw measurements, sorted: 10.5, 10.7, 10.7, 10.7, 11.6; 30,000 unique hash values, avg 1.00, max 2 |
Calc hashes of 30,001 sets #{} #{0} #{0 1}, etc. up to #{0 1 ... 29999} and return sum, using: (let [cs (doall (reductions conj #{} (range 30000)))] (time (total-hash cs))) | Time: avg 70.2 sec; raw measurements, sorted: 66.4, 69.1, 69.4, 71.3, 74.8; 30,000 unique hash values, avg 1.00, max 2 | Time: avg 71.4 sec (1.7% longer); raw measurements, sorted: 70.9, 71.0, 71.2, 72.0, 72.1; 29,308 unique hash values, avg 1.02, max 2 |
Calc hashes of 30,001 maps {} {0 1} {0 1, 1 2} ... up to {0 1, 1 2, ..., 29999 30000} and return sum, using: (let [cs (doall (reductions (fn [m i] (assoc m i (inc i))) {} (range 30000)))] (time (total-hash cs))) | Time: avg 71.7 sec; raw measurements, sorted: 68.4, 69.0, 69.1, 75.7, 76.3; 30,001 unique hash values, avg 1.00, max 1 | Time: avg 78.5 sec (9.5% longer); raw measurements, sorted: 74.1, 75.0, 77.0, 81.2, 85.4; 30,001 unique hash values, avg 1.00, max 1 |

Notes on measurements:

Only the N-queens measurements used Leiningen. The compilation measurements used the ant commands shown. The rest were measured using the expressions shown after starting a JVM with the command line:

java -cp clojure.jar clojure.main

with the version of clojure.jar given in the column heading. Five measurements were taken for each case; the raw measurements and their average are given. Each individual run is intended to be long enough to avoid concerns about misleading measurements from microbenchmarks.

Benchmark environment: hardware is a MacPro 2,1 with two 3 GHz quad-core Intel Xeons and 32 GB RAM. OS is Mac OS X 10.6.8. JVM is Apple/Oracle-supplied version 1.6.0_65, 64-bit Server VM.

### Open questions

Possible small changes to the proposal for consideration:

- Add in some magic constant to the hash of all sets and maps (a different constant for sets than for maps), so that the empty set and empty map have non-0 hash values, and thus nested data structures like #{{}} and {#{} #{}} will hash to non-0 values that are different from each other.
- Consider doing something besides the current hash(numerator) ^ hash(denominator) for Ratio values, perhaps hash(numerator) + xorShift32(hash(denominator)). That would avoid the hash of 3/7 and 7/3 being the same, and also avoid the current behavior that hash(3/7) "undoes" the longHashMunge proposed for both the numerator and denominator long values.
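Both open questions can be illustrated with a small sketch. It uses `java.util` collections as a stand-in for the sum-based set and map hashing described on this page, and it again assumes that `xorShift32` is Marsaglia's (13, 17, 5) xorshift step, which may not match the proposal. The class and method names are hypothetical, and the numerator and denominator hashes are passed in as plain ints without any longHashMunge step.

```java
import java.util.Collections;

// Sketch only: hypothetical class illustrating the two open questions.
final class OpenQuestionSketch {

    // ASSUMPTION: Marsaglia's (13, 17, 5) xorshift step on a 32-bit int.
    static int xorShift32(int x) {
        x ^= x << 13;
        x ^= x >>> 17;
        x ^= x << 5;
        return x;
    }

    // Current Ratio rule: symmetric in numerator and denominator.
    static int ratioHashXor(int numHash, int denHash) {
        return numHash ^ denHash;
    }

    // Alternative floated above: asymmetric, so n/d and d/n can differ.
    static int ratioHashAlt(int numHash, int denHash) {
        return numHash + xorShift32(denHash);
    }

    // Sum-based hashing of a set whose only element is an empty map, like #{{}}.
    static int nestedEmptySetHash() {
        return Collections.singleton(Collections.emptyMap()).hashCode();
    }

    // Sum-based hashing of a map from empty set to empty set, like {#{} #{}}.
    static int nestedEmptyMapHash() {
        return Collections.singletonMap(Collections.emptySet(), Collections.emptySet()).hashCode();
    }

    public static void main(String[] args) {
        System.out.println(ratioHashXor(3, 7) == ratioHashXor(7, 3)); // same hash for 3/7 and 7/3
        System.out.println(ratioHashAlt(3, 7) == ratioHashAlt(7, 3));
        System.out.println(nestedEmptySetHash() + " " + nestedEmptyMapHash());
    }
}
```

Because empty collections hash to 0 and both sums and XORs of zeros stay 0, the two nested structures share a hash value; adding distinct constants for sets and maps, as suggested above, would break that tie.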

### Tradeoffs and alternatives

These are discussed throughout Mark's document. A few of these are called out below.

- Nearly all of the proposals involve additional operations on ints or longs. This is expected to require little additional elapsed time in most hash computations, given that the most significant cost in hashing collections is usually traversing the parts of the data structure, especially if that involves cache misses. Measurements are given above for one proposed set of hash function changes.
- Murmur3 is widely used, but does not lend itself well to incremental updates of hash calculations of collections.

### References

Some useful references:

- Xorshift RNGs - source for the "munge" algorithm
- Knuth, section 6.4 - popular multiplier for multiplicative hashing of sequences
- MurmurHash3
- Which hashing algorithm is better for uniqueness and speed
- Scala's MurmurHash3 implementation (used for collection hashing) and more info on how to use it